Google Genkit
Streamlining prompt refinement and evaluations in AI development workflows.
Role
Product Designer
Timeline
September 2025 - December 2025
Team
2 Project Managers,
6 Product Designers
Toolkit
Figma, Figjam
OVERVIEW
I worked with Google's Genkit team to improve the developer experience for building AI-powered applications. Genkit is a model-agnostic framework that helps developers build, test, and deploy AI features. However, the framework had only launched in February 2025, and many of its tools were still rough around the edges.
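For readers unfamiliar with Genkit, the sketch below shows roughly what a flow looks like. It's a minimal example based on my understanding of the JS API; the exact plugin and model names are assumptions and may differ from the current release.

```ts
// Minimal Genkit sketch: plugin and model names are assumptions, check the official docs.
import { genkit, z } from 'genkit';
import { googleAI } from '@genkit-ai/googleai';

const ai = genkit({
  plugins: [googleAI()],               // swapping this plugin is how you change model providers
  model: 'googleai/gemini-2.5-flash',  // assumed model reference string
});

// A flow is a testable unit that shows up in Genkit's Developer UI for running and evaluating.
export const summarizeFlow = ai.defineFlow(
  { name: 'summarizeFlow', inputSchema: z.string(), outputSchema: z.string() },
  async (text) => {
    const { text: summary } = await ai.generate(`Summarize in one sentence: ${text}`);
    return summary;
  }
);
```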
THE CORE PROBLEM
Developers were frustrated by fragmented workflows.
Building, testing, and evaluating AI experiences required constant context-switching between different tools and platforms, slowing iteration and making debugging painful.
MY IMPACT
To address these pain points, I shipped the design of two features:
Autocomplete for Prompt Writing
A Gemini-powered suggestion system triggered by typing a backslash. As developers write prompts, contextual suggestions appear to help them refine phrasing, reducing the need to switch to external tools for prompt optimization.
Evaluation Panel in Prompt Runner
A panel that lets developers run evaluations directly from the prompt editor, without navigating away. Users select evaluators, choose how many times to run their prompt, see summary metrics immediately, and can navigate to detailed results when needed.
Presenting our final designs to the Genkit team at Google!
COMPETITIVE ANALYSIS
Understanding the AI development landscape
We started by analyzing 11 competitors to understand what developers expect from AI development tools and where Genkit could differentiate.
After mapping each competitor's strengths, weaknesses, and opportunities, we identified 4 key insights:
1. AI-assisted prompt optimization is becoming expected.
OpenAI's optimizer with reasoning, PromptLayer's before/after highlighting, and Prompt Perfect's dedicated optimization panel all signal that developers expect AI assistance in improving their prompts.
2. Evaluation tooling is a competitive expectation.
Built-in evaluators, A/B testing (PromptLayer), and model comparison views (LangSmith, OpenAI) are becoming standard, as teams shipping production AI need to measure prompt performance systematically.
3. Model agnosticism is valuable.
Most competitors lock users into their ecosystems: LangSmith is tied to LangChain, and Claude Console and the OpenAI Platform only work with their respective models. Genkit's model-agnostic positioning is a key differentiator.
4. Simplicity wins adoption.
Claude Console's beginner-friendly UI and PromptLayer's clean interface stand in contrast to complaints about LangSmith's density and Microsoft Foundry's complexity.
Fun tidbit: As of December 2025, Graphite (one of the competitors I analyzed) has been acquired by Cursor, a reminder of how fast this space is moving!
USER RESEARCH
Interviewing 20 AI developers
To ground our designs in real developer experiences, we conducted semi-structured interviews with both Genkit users and developers using other frameworks. We wanted to understand their workflows, pain points, and mental models around prompt engineering and evaluation.
Through affinity mapping, my team and I synthesized the following key insights:
1. Prompt refinement often happens outside the tool.
"I'll usually rewrite the prompt from scratch, test it, drop it into a different AI to refine and tweak until it hits exactly what I want, then re-paste the new version back in."
2. Iteration cycles are slow and manual.
"When I make a change, I want to immediately go see if it's working… I spend more time waiting than actually working."
3. Evaluation is necessary, but it's not structured or integrated.
"I used to build my own tracking system, but it was a pain because I switched between multiple AI providers, and all of them have different data structures."
USER PERSONAS
Who are we designing for?
Based on interview patterns, we developed two personas representing distinct developer archetypes Genkit serves:
JOURNEY MAPPING
Piecing together the developer journey
We then mapped their typical prompt engineering workflow to identify where developers lose momentum.
DESIGN PROCESS
Ideation
To encourage divergent thinking, each team member sketched ideas individually before presenting to the group. I anchored my pen-and-paper explorations in the user needs from research: faster iteration, AI-powered guidance, and reduced tool-switching.
From there, we grouped all our features into 6 key areas. During biweekly syncs with the Genkit team (designers, PMs, and engineers), we presented these concepts and received feedback on balancing vision with feasibility. Some ideas, like node-based visual programming, were exciting but not feasible in the near term.
Client direction: We received feedback to focus on features that improve the prompting workflow. This became our guiding question: How might we improve the developer prompt engineering experience to streamline AI-powered workflows?
Feature 1: Autocomplete for Prompts
I led the design of a Gemini-powered autocomplete feature to address the pain point of constant tool-switching during prompt authoring. The feature provides contextual suggestions as developers type, reducing the need to leave Genkit for optimization help.
KEY DESIGN DECISION
Entry point: Backslash trigger
I explored several trigger mechanisms: keyboard shortcuts, always-on suggestions, explicit buttons, and highlighting after typing. I landed on the backslash (\) as the trigger for a few reasons (a rough sketch of the interaction follows the list):
Familiar pattern from Slack, Notion, and other productivity tools developers already use
Intentional: it doesn't interrupt developers who don't want suggestions
Easily accessible in the middle of a workflow: no need to pause to activate the feature
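To make the trigger concrete, here's a hypothetical TypeScript sketch of the interaction; none of this is Genkit's actual implementation, and the function names are made up for illustration.

```ts
// Hypothetical sketch of the backslash trigger (illustrative only, not Genkit's code):
// ignore every keystroke except "\", then open the suggestion popover with the
// prompt written so far as context for Gemini-powered suggestions.
function attachSuggestionTrigger(
  editor: HTMLTextAreaElement,
  openSuggestions: (promptSoFar: string) => void
) {
  editor.addEventListener('keydown', (event: KeyboardEvent) => {
    if (event.key !== '\\') return;   // suggestions never appear uninvited
    event.preventDefault();           // treat the backslash as a command, not as prompt text
    const cursor = editor.selectionStart;
    openSuggestions(editor.value.slice(0, cursor));
  });
}
```

The code makes the same point as the list above: the developer opts in with a single keystroke, and the suggestions arrive with the prompt-so-far as context.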
FINAL PROTOTYPE
Autocomplete for Prompt Writing
THE PIVOT
Shifting focus to evaluation
Midway through the project, our client asked us to shift focus from prompting features to the evaluation phase. This wasn't a rejection of our earlier work; it reflected the product's evolving priorities and a recognition that evaluation was a bigger pain point than we'd initially scoped.
New design challenge: How might we make evaluation a seamless part of the Genkit workflow so developers can reliably measure LLM quality over time?
I went back to the research insights, particularly the feedback about workflow disruption and context-switching, and applied them to this new problem space.
Feature 2: Evaluation panel in prompt runner
User research revealed slow feedback loops as a major frustration. After mapping the current workflow, I saw why: developers were forced to leave their prompt and set up a full dataset just to test one iteration.
Current workflow (8 steps)
Proposed workflow with evaluation panel (5 steps)
My design, an evaluation panel directly accessible in the prompt runner page, removes that barrier to support quick, single-input evaluations within the natural iteration flow.
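Conceptually, the panel's quick-evaluation loop is simple. The hypothetical sketch below (illustrative only, not Genkit's actual API, with made-up type and function names) shows the shape of it: run the prompt the chosen number of times, score each output with the selected evaluators, and average the scores into the summary metrics the panel displays.

```ts
// Hypothetical sketch of the quick-evaluation loop (not Genkit's actual API):
// run the prompt N times, score every output with each selected evaluator,
// and average the scores into the panel's summary metrics.
type Evaluator = {
  name: string;                                              // e.g. "faithfulness"
  score: (input: string, output: string) => Promise<number>; // 0-1 quality score
};

async function runQuickEvaluation(
  runPrompt: (input: string) => Promise<string>,             // the prompt as configured in the runner
  input: string,
  evaluators: Evaluator[],
  runs: number
): Promise<Record<string, number>> {
  const totals: Record<string, number> = {};
  for (let i = 0; i < runs; i++) {
    const output = await runPrompt(input);
    for (const evaluator of evaluators) {
      const score = await evaluator.score(input, output);
      totals[evaluator.name] = (totals[evaluator.name] ?? 0) + score;
    }
  }
  // Per-evaluator averages become the color-coded summary scores in the panel;
  // the full inputs and outputs live on the existing Datasets page.
  return Object.fromEntries(evaluators.map((e) => [e.name, totals[e.name] / runs]));
}
```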
KEY DESIGN DECISIONS
Metric selection: Dropdown over checkboxes
My initial approach didn't account for scalability: I hadn't factored in that a teammate was designing a feature for custom evaluators, which meant the metrics list could grow indefinitely.
Checkboxes
All custom metric options visible at once
Overwhelms the UI as more metrics are added
Unclear what other metrics will be evaluated
Dropdown
Scales gracefully as more evaluators are added
Consolidates options into a compact, scrollable view
Also shows default metrics upfront so users know what to expect
Results display: Summary metrics over tables
I went through several iterations on how to display evaluation results within the panel:
Table Within Panel
Visually dense and overwhelming
Competes with the prompt editor for attention
Forces developers to parse detailed data
Table in Modal Dialog
Cleaner separation from the prompt editor
Feels disconnected from the workflow
The extra effort of understanding a new view breaks momentum
User testing clarified what developers actually needed.
They wanted key metrics first and foremost: quick signals to inform their next iteration. Detailed tables matter for deep debugging, not for fast iteration cycles.
This insight led me to rethink not just the UI, but the entire flow:
Summary with Link to Table
Surfaces key metrics immediately without overwhelming the panel
Color-coded scores help developers scan results at a glance
"View inputs and outputs" links to the existing Datasets page for detailed analysis when needed, keeping complex tables out of the panel
Final workflow (Summary first)
Cross-feature coordination: These summary metrics appear across multiple features my team designed (dataset evaluation page, evaluation history). I coordinated with teammates to ensure we were using consistent patterns, and learned to communicate proactively when I updated shared components.
FINAL PROTOTYPE
Evaluation Panel in Prompt Runner
REFLECTIONS
What I learned…
1. Navigating an unfamiliar domain.
I'm not a developer, and before this project, I had no exposure to AI development workflows. Understanding how developers iterate on prompts and evaluate outputs required learning fast; I immersed myself in research, ran user interviews, and asked a lot of questions. This project taught me how to design for complex technical domains I don't initially understand, and gave me the confidence to do it again.
2. Staying flexible through pivots.
When the client shifted focus from prompting to evaluation, I initially felt the earlier work was wasted. But the research insights translated directly: the evaluation panel design is grounded in the same pain points I'd identified early on. Pivots aren't restarts; they're redirections.
3. Designing in an unsettled space.
AI development tools don't have decades of established patterns to draw from; this space is still new enough that we're shaping what feels intuitive. Decisions like the backslash entry point were based on limited user testing and on patterns from adjacent products. Given more time and participants, I'd validate these patterns further.
What I'd do differently…
Establish a shared component library from day one, rather than reconciling components after the fact
Run more usability tests throughout iteration, not just at the end of design phases
What's next?
I've worked with clients before, but never at this level of complexity! I learned how business priorities shape design decisions, how to present work to stakeholders, and how to support other designers while owning my own features.
The experience inspired me to take on a Project Manager role for Design Consulting at Cornell in Spring 2026. I'm excited to apply what I learned about client collaboration and navigating ambiguity to leading projects of my own.
Team photo from Google NYC office visit!