Google Genkit
Streamlining prompt refinement and evaluations in AI development workflows.
Role
Product Designer
Timeline
September 2025 - December 2025
Team
2 Project Managers,
6 Product Designers
Toolkit
Figma, Figjam
OVERVIEW
I worked with Google's Genkit team to improve the developer experience for building AI-powered applications. Genkit is a model-agnostic framework that helps developers build, test, and deploy AI features. However, the framework had only launched in February 2025, and many of its tools were still rough around the edges.
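For readers unfamiliar with Genkit, the sketch below shows roughly what a flow looks like. It's a minimal example based on my understanding of the JS API; the exact plugin and model names are assumptions and may differ from the current release.

```ts
// Minimal Genkit sketch: plugin and model names are assumptions, check the official docs.
import { genkit, z } from 'genkit';
import { googleAI } from '@genkit-ai/googleai';

const ai = genkit({
  plugins: [googleAI()],               // swapping this plugin is how you change model providers
  model: 'googleai/gemini-2.5-flash',  // assumed model reference string
});

// A flow is a testable unit that shows up in Genkit's Developer UI for running and evaluating.
export const summarizeFlow = ai.defineFlow(
  { name: 'summarizeFlow', inputSchema: z.string(), outputSchema: z.string() },
  async (text) => {
    const { text: summary } = await ai.generate(`Summarize in one sentence: ${text}`);
    return summary;
  }
);
```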
THE CORE PROBLEM
Developers were frustrated by fragmented workflows.
Building, testing, and evaluating AI experiences required constant context-switching between different tools and platforms, slowing iteration and making debugging painful.
MY IMPACT
To address these pain points, I shipped the design of two features:
Autocomplete for Prompt Writing
A Gemini-powered suggestion system triggered by typing a backslash. As developers write prompts, contextual suggestions appear to help them refine phrasing, reducing the need to switch to external tools for prompt optimization.
Evaluation Panel in Prompt Runner
A panel that lets developers run evaluations directly from the prompt editor, without navigating away. Users select evaluators, choose how many times to run their prompt, see summary metrics immediately, and can navigate to detailed results when needed.
Presenting our final designs to the Genkit team at Google!
COMPETITIVE ANALYSIS
Understanding the AI development landscape
We started by analyzing 11 competitors to understand what developers expect from AI development tools and where Genkit could differentiate.
After mapping each competitor's strengths, weaknesses, and opportunities, we identified 4 key insights:
1. AI-assisted prompt optimization is becoming expected.
OpenAI's optimizer with reasoning, PromptLayer's before/after highlighting, and Prompt Perfect's dedicated optimization panel all signal that developers expect AI assistance in improving their prompts.
2. Evaluation tooling is a competitive expectation.
Built-in evaluators, A/B testing (PromptLayer), and model comparison views (LangSmith, OpenAI) are becoming standard, as teams shipping production AI need to measure prompt performance systematically.
3. Model agnosticism is valuable.
Most competitors lock users into their ecosystems: LangSmith is tied to LangChain, and Claude Console and the OpenAI Platform only work with their respective models. Genkit's model-agnostic positioning is a key differentiator.
4. Simplicity wins adoption.
Claude Console's beginner-friendly UI and PromptLayer's clean interface stand in contrast to complaints about LangSmith's density and Microsoft Foundry's complexity.
Fun tidbit: As of December 2025, Graphite (one of the competitors I analyzed) has been acquired by Cursor, a reminder of how fast this space is moving!
USER RESEARCH
Interviewing 20 AI developers
To ground our designs in real developer experiences, we conducted semi-structured interviews with both Genkit users and developers using other frameworks. We wanted to understand their workflows, pain points, and mental models around prompt engineering and evaluation.
Through affinity mapping, my team and I synthesized the following key insights:
1. Prompt refinement often happens outside the tool.
"I'll usually rewrite the prompt from scratch, test it, drop it into a different AI to refine and tweak until it hits exactly what I want, then re-paste the new version back in."
2. Iteration cycles are slow and manual.
"When I make a change, I want to immediately go see if it's working… I spend more time waiting than actually working."
3. Evaluation is necessary, but it's not structured or integrated.
"I used to build my own tracking system, but it was a pain because I switched between multiple AI providers, and all of them have different data structures."
USER PERSONAS
Who are we designing for?
Based on interview patterns, we developed two personas representing distinct developer archetypes Genkit serves:
JOURNEY MAPPING
Piecing together the developer journey
We then mapped their typical prompt engineering workflow to identify where developers lose momentum.
DESIGN PROCESS
Ideation
To encourage divergent thinking, each team member sketched ideas individually before presenting to the group. I anchored my pen-and-paper explorations in the user needs from research: faster iteration, AI-powered guidance, and reduced tool-switching.
From there, we grouped all our features into 6 key areas. During biweekly syncs with the Genkit team (designers, PMs, and engineers), we presented these concepts and received feedback on balancing vision with feasibility. Some ideas, like node-based visual programming, were exciting but not feasible in the near term.
Client direction: We received feedback to focus on features that improve the prompting workflow. This became our guiding question: How might we improve the developer prompt engineering experience to streamline AI-powered workflows?
Feature 1: Autocomplete for Prompts
I led the design of a Gemini-powered autocomplete feature to address the pain point of constant tool-switching during prompt authoring. The feature provides contextual suggestions as developers type, reducing the need to leave Genkit for optimization help.
KEY DESIGN DECISION
Entry point: Backslash trigger
I explored several trigger mechanisms: keyboard shortcuts, always-on suggestions, explicit buttons, and highlighting after typing. I landed on the backslash (\) as the trigger for a few reasons (a rough sketch of the interaction follows the list):
Familiar pattern from Slack, Notion, and other productivity tools developers already use
Intentional: it doesn't interrupt developers who don't want suggestions
Easily accessible in the middle of a workflow: no need to pause to activate the feature
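To make the trigger concrete, here's a hypothetical TypeScript sketch of the interaction; none of this is Genkit's actual implementation, and the function names are made up for illustration.

```ts
// Hypothetical sketch of the backslash trigger (illustrative only, not Genkit's code):
// ignore every keystroke except "\", then open the suggestion popover with the
// prompt written so far as context for Gemini-powered suggestions.
function attachSuggestionTrigger(
  editor: HTMLTextAreaElement,
  openSuggestions: (promptSoFar: string) => void
) {
  editor.addEventListener('keydown', (event: KeyboardEvent) => {
    if (event.key !== '\\') return;   // suggestions never appear uninvited
    event.preventDefault();           // treat the backslash as a command, not as prompt text
    const cursor = editor.selectionStart;
    openSuggestions(editor.value.slice(0, cursor));
  });
}
```

The code makes the same point as the list above: the developer opts in with a single keystroke, and the suggestions arrive with the prompt-so-far as context.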
FINAL PROTOTYPE
Autocomplete for Prompt Writing
THE PIVOT
Shifting focus to evaluation
Midway through the project, our client asked us to shift focus from prompting features to the evaluation phase. This wasn't a rejection of our earlier work; it reflected the product's evolving priorities and a recognition that evaluation was a bigger pain point than we'd initially scoped.
New design challenge: How might we make evaluation a seamless part of the Genkit workflow so developers can reliably measure LLM quality over time?
I went back to the research insights, particularly the feedback about workflow disruption and context-switching, and applied them to this new problem space.
Feature 2: Evaluation panel in prompt runner
User research revealed slow feedback loops as a major frustration. After mapping the current workflow, I saw why: developers were forced to leave their prompt and set up a full dataset just to test one iteration.
Current workflow (8 steps)
Proposed workflow with evaluation panel (5 steps)
My design, an evaluation panel directly accessible in the prompt runner page, removes that barrier to support quick, single-input evaluations within the natural iteration flow.
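Conceptually, the panel's quick-evaluation loop is simple. The hypothetical sketch below (illustrative only, not Genkit's actual API, with made-up type and function names) shows the shape of it: run the prompt the chosen number of times, score each output with the selected evaluators, and average the scores into the summary metrics the panel displays.

```ts
// Hypothetical sketch of the quick-evaluation loop (not Genkit's actual API):
// run the prompt N times, score every output with each selected evaluator,
// and average the scores into the panel's summary metrics.
type Evaluator = {
  name: string;                                              // e.g. "faithfulness"
  score: (input: string, output: string) => Promise<number>; // 0-1 quality score
};

async function runQuickEvaluation(
  runPrompt: (input: string) => Promise<string>,             // the prompt as configured in the runner
  input: string,
  evaluators: Evaluator[],
  runs: number
): Promise<Record<string, number>> {
  const totals: Record<string, number> = {};
  for (let i = 0; i < runs; i++) {
    const output = await runPrompt(input);
    for (const evaluator of evaluators) {
      const score = await evaluator.score(input, output);
      totals[evaluator.name] = (totals[evaluator.name] ?? 0) + score;
    }
  }
  // Per-evaluator averages become the color-coded summary scores in the panel;
  // the full inputs and outputs live on the existing Datasets page.
  return Object.fromEntries(evaluators.map((e) => [e.name, totals[e.name] / runs]));
}
```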
KEY DESIGN DECISIONS
Metric selection: Dropdown over checkboxes
My initial approach didn't account for scalability: I hadn't factored in that a teammate was designing a feature for custom evaluators, which meant the metrics list could grow indefinitely.
Checkboxes
All custom metric options visible at once
Overwhelms the UI as more metrics are added
Unclear what other metrics will be evaluated
Dropdown
Scales gracefully as more evaluators are added
Consolidates options into a compact, scrollable view
Also shows default metrics upfront so users know what to expect
Results display: Summary metrics over tables
I went through several iterations on how to display evaluation results within the panel:
Table Within Panel
Visually dense and overwhelming
Competes with the prompt editor for attention
Forces developers to parse detailed data
Table in Modal Dialog
Cleaner separation from the prompt editor
Feels disconnected from the workflow
The extra effort of understanding a new view breaks momentum
User testing clarified what developers actually needed.
They wanted key metrics first and foremost: quick signals to inform their next iteration. Detailed tables matter for deep debugging, not for fast iteration cycles.
This insight led me to rethink not just the UI, but the entire flow:
Summary with Link to Table
Surfaces key metrics immediately without overwhelming the panel
Color-coded scores help developers scan results at a glance
"View inputs and outputs" links to the existing Datasets page for detailed analysis when needed, keeping complex tables out of the panel
Final workflow (Summary first)
Cross-feature coordination: These summary metrics appear across multiple features my team designed (dataset evaluation page, evaluation history). I coordinated with teammates to ensure we were using consistent patterns, and learned to communicate proactively when I updated shared components.
FINAL PROTOTYPE
Evaluation Panel in Prompt Runner
REFLECTIONS
What I learned…
1. Navigating an unfamiliar domain.
I'm not a developer, and before this project, I had no exposure to AI development workflows. Understanding how developers iterate on prompts and evaluate outputs required learning fast; I immersed myself in research, ran user interviews, and asked a lot of questions. This project taught me how to design for complex technical domains I don't initially understand, and gave me the confidence to do it again.
2. Staying flexible through pivots.
When the client shifted focus from prompting to evaluation, I initially felt the earlier work was wasted. But the research insights translated directly: the evaluation panel design is grounded in the same pain points I'd identified early on. Pivots aren't restarts; they're redirections.
3. Designing in an unsettled space.
AI development tools don't have decades of established patterns to draw from; this space is still new enough that we're shaping what feels intuitive. Decisions like the backslash entry point were based on limited user testing and on patterns from adjacent products. Given more time and participants, I'd validate these patterns further.
What I'd do differently…
Establish a shared component library from day one, rather than reconciling components after the fact
Run more usability tests throughout iteration, not just at the end of design phases
What's next?
I've worked with clients before, but never at this level of complexity! I learned how business priorities shape design decisions, how to present work to stakeholders, and how to support other designers while owning my own features.
The experience inspired me to take on a Project Manager role for Design Consulting at Cornell in Spring 2026. I'm excited to apply what I learned about client collaboration and navigating ambiguity to leading projects of my own.
Team photo from Google NYC office visit!