Tim Frenzel

// Insight

GEPA: improving an agent by reading its traces, not its gradients

prompt-optimization, DSPy, evolutionary

The expensive way to make an LLM system better is to fine-tune it with reinforcement learning: run it thousands of times, score the runs, and nudge the weights toward the ones that scored well. GEPA proposes a cheaper path that is also, on its benchmarks, a better one. Instead of nudging weights from scalar rewards, it reads the system’s own execution traces, reflects on them in natural language to work out what went wrong, and edits the prompt accordingly. GEPA beats a reinforcement-learning baseline by up to 20% while using up to 35 times fewer rollouts, by treating each failed run as something to be understood rather than just scored.

How it works

The loop is the idea. GEPA runs the system on a task and collects the full trajectory: the reasoning steps, the tool calls, the intermediate outputs, the final answer. It then feeds that trace back to a language model and asks it to reflect, in words, on where the run went wrong and why. That reflection becomes a proposed edit to the prompt, which is tested empirically. Edits that help are kept, edits that do not are discarded, and the process repeats.

GEPA: the reflective optimization loop
1. Run the system on a task
2. Collect the full trace: reasoning, tool calls, outputs
3. Reflect in natural language: what failed, and why?
4. Propose a prompt edit that addresses the diagnosis
5. Test it, keep what helps on the Pareto frontier of candidates
6. Loop, evolving the prompt
The signal is not a scalar reward. It is a written diagnosis of the trace, which carries far more information per run than a single number. That is why far fewer runs are needed.
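The loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not GEPA's implementation: `Trace`, `optimize_prompt`, and the `toy_*` functions are hypothetical names, the "system" and the "reflection" are hard-coded stand-ins for LLM calls, and the keep-if-better rule is a simplified greedy version of the candidate selection.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trace:
    """One recorded run: the task, intermediate steps, output, and a score."""
    task: str
    steps: List[str]
    output: str
    score: float


def optimize_prompt(
    prompt: str,
    tasks: List[str],
    run_system: Callable[[str, str], Trace],
    reflect: Callable[[Trace], str],
    propose_edit: Callable[[str, str], str],
    rounds: int = 3,
) -> Tuple[str, float]:
    """Run, diagnose a failed trace in words, propose an edit, keep it if it helps."""

    def evaluate(p: str) -> Tuple[float, List[Trace]]:
        traces = [run_system(p, t) for t in tasks]
        return sum(tr.score for tr in traces) / len(traces), traces

    best_score, traces = evaluate(prompt)
    for _ in range(rounds):
        failures = [tr for tr in traces if tr.score < 1.0]
        if not failures:
            break  # nothing left to diagnose
        diagnosis = reflect(failures[0])            # written diagnosis, not a scalar
        candidate = propose_edit(prompt, diagnosis)
        cand_score, cand_traces = evaluate(candidate)
        if cand_score > best_score:                 # empirical test: keep only what helps
            prompt, best_score, traces = candidate, cand_score, cand_traces
    return prompt, best_score


# Toy stand-ins (assumptions): a real setup would call an LLM in each of these.
def toy_run(prompt: str, task: str) -> Trace:
    ok = "ISO" in prompt or not task.startswith("date:")
    return Trace(task, [f"parse {task}", "answer"], "ok" if ok else "wrong", 1.0 if ok else 0.0)


def toy_reflect(trace: Trace) -> str:
    return f"Run on '{trace.task}' failed: the date format was misread."


def toy_propose(prompt: str, diagnosis: str) -> str:
    return prompt + " Treat dates as ISO 8601."


best_prompt, best_score = optimize_prompt(
    "Answer the question.",
    ["date:2024-01-02", "name:Acme"],
    toy_run,
    toy_reflect,
    toy_propose,
)
print(best_prompt, best_score)
```

In this toy, one diagnosed failure is enough to produce the edit that fixes the date task, which is the point of the loop: the diagnosis tells the optimizer what to change rather than only how badly the run scored.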

The reason this is sample-efficient is the information content of the signal. A reinforcement-learning reward is one number per run: better or worse, by this much. A natural-language reflection on the same run is a paragraph: it failed because it misread the date format in step two; it confused the parent company with the subsidiary; it stopped before checking the second source. That diagnosis carries far more information than a scalar, and more information per run means fewer runs to converge. The 35x rollout reduction is not a trick. It is what you get when each trial teaches a sentence instead of a number.

A race engineer rather than a dyno sweep

The distinction that makes GEPA click for me is one I know from motorsport. There are two ways to improve a car. One is to sweep the setup space: try a thousand combinations of wing, camber, and tire pressure, lap each one, and keep whatever is fastest. The other is to bring the car in, read the telemetry with the engineer, and reason about why it is understeering into the slow corners, then change the one thing the data says to change. The first is reinforcement learning. The second is GEPA. A good race engineer does not randomly perturb the car and time the result. They read the trace of the lap, form a hypothesis about the cause, and make a directed change. It converges in a handful of runs because each run is diagnosed rather than merely scored. The diagnosis is what tells you what to change next. GEPA is that engineer, applied to a prompt.

The other piece worth naming is what GEPA keeps between rounds. Rather than greedily holding the single best prompt, it maintains a Pareto frontier of candidates, the ones that are best at different parts of the task. A prompt that nails the numeric questions and one that nails the textual ones are both kept, and GEPA synthesizes their complementary strengths instead of forcing a choice. That is what stops the search collapsing onto a prompt that is good on average and excellent at nothing.
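A minimal sketch of that idea, assuming each candidate prompt has one score per part of the task. The candidate names and scores are invented for illustration, and the plain non-dominance filter below is a textbook Pareto rule, not necessarily the exact selection GEPA uses.

```python
from typing import Dict, List


def dominates(a: List[float], b: List[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(scores: Dict[str, List[float]]) -> List[str]:
    """Keep every candidate that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(o, s) for other, o in scores.items() if other != name)
    ]


# Hypothetical per-subtask scores: [numeric questions, textual questions]
candidates = {
    "numeric_specialist": [0.9, 0.4],
    "text_specialist": [0.5, 0.9],
    "generalist": [0.6, 0.6],
    "weak": [0.4, 0.3],
}

frontier = pareto_frontier(candidates)
greedy_best = max(candidates, key=lambda n: sum(candidates[n]) / len(candidates[n]))
print(frontier)
print(greedy_best)
```

A greedy search would carry forward only the candidate with the best average and lose the numeric specialist. The frontier keeps both specialists and the generalist, and drops only the candidate that is worse everywhere.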

The results

GEPA: reported gains over the method it replaces (%)
vs GRPO, best case: 20
on AIME-2025 vs MIPROv2: 12
vs MIPROv2 (prompt optimizer): 10
vs GRPO (RL), average: 6
GEPA beats a reinforcement-learning baseline (GRPO) by 6% on average and up to 20%, and the leading prompt optimizer (MIPROv2) by over 10%, including a 12-point gain on AIME-2025. It does this with up to 35x fewer rollouts, which is the number that matters when labeled data and compute are both scarce.

Across six tasks, GEPA beats the GRPO reinforcement-learning baseline by 6% on average and as much as 20% at the high end, and beats MIPROv2, a strong prompt optimizer, by more than 10%, including a 12-point gain on the AIME-2025 math problems. The same reflective search also works as an inference-time strategy for optimizing code. The accuracy gains are real. The headline I would put on it is the 35x, because efficiency is what decides whether a method is usable on the data a desk actually has.

Why a desk should care

Here is the regime where this matters. Reinforcement-learning fine-tuning needs a lot of labeled examples and a lot of rollouts, and most financial-analysis tasks have neither. You are tuning a pipeline to spread a metric, classify a disclosure, or extract a figure, and your labeled set is a few dozen carefully checked examples, not the tens of thousands RL wants. In that regime RL is simply not an option, and GEPA’s sample efficiency is exactly what makes optimization feasible at all. When you have dozens of examples rather than thousands, a method that learns from a written diagnosis per run rather than a scalar reward per thousand runs is the difference between being able to tune the system and not.

There is a governance benefit that comes free with the approach. What GEPA produces is a prompt, in plain language, that you can read, diff, and put in front of a reviewer. An RL-fine-tuned model hides its improvement in weight updates nobody can inspect. A GEPA-optimized system hides nothing: the change is text, the reasoning behind it is text, and so is the trace that motivated it. For a desk that has to defend why its pipeline behaves the way it does, an optimizer whose entire output is auditable is worth more than a few points of benchmark score.

The honest limits

GEPA optimizes the prompt rather than the model. It cannot add a capability the base model lacks, in the same way that reinforcement learning sharpens rather than expands what a model can do. It moves the model’s existing ability to where the task needs it. It also leans on the quality of the reflection: a weak model writing the diagnoses produces weak edits, and the method inherits whatever blind spots the reflecting model has. And it still needs an evaluation signal to test candidates against. The same discipline applies as everywhere else: a prompt tuned against a flawed metric is tuned to the flaw. None of that dents the core result. Reading the trace beats scoring the run, and on the small-data tasks that fill a research desk’s actual workload, reading the trace is often the only option that works.

GEPA improves an LLM system by reflecting on its execution traces in natural language and editing the prompt, beating a reinforcement-learning baseline by up to 20% with up to 35x fewer rollouts. A written diagnosis per run carries far more than a scalar reward, which is why it converges on the dozens of examples a desk actually has. The optimized prompt is plain text a reviewer can read.
