Tim Frenzel

// Insight

Does RL really incentivize reasoning? A caution for the backtest

10 min read
reinforcement-learning, reasoning, evaluation, model-risk

The reasoning-model story has largely been a story about reinforcement learning. DeepSeek-R1 and Kimi k1.5 both used RL to turn a base model into a reasoner. This paper asks the awkward question underneath that story: does the RL add reasoning the model did not have, or does it just make the model better at using what was already there? The answer is humbling. RL improves the first guess without expanding what the model can ultimately solve, and sampled enough times, the base model matches or beats its RL-tuned version.

The finding

The measurement is the elegant part. The standard way to score a reasoning model is pass@1: give it one attempt and check whether it is right. You can also measure pass@k: give it k attempts and count the problem as solved if any attempt succeeds. pass@k, popularized by the Codex code-generation work, measures something closer to capability: whether the model can find the answer at all given enough tries. The paper compares base models against their RL-tuned versions on both metrics, across math, coding, and visual-reasoning benchmarks.
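For reference, here is a minimal sketch of how pass@k is usually estimated in practice, using the unbiased estimator from the Codex paper: draw n samples per problem, count the c correct ones, and compute 1 - C(n-c, k) / C(n, k). The function names are illustrative, and it is an assumption, not something confirmed here, that the paper's evaluation follows exactly this recipe.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: samples that were correct, k: attempt budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong samples: every size-k subset contains a correct one.
        return 1.0
    # 1 minus the share of size-k subsets made up only of wrong samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimate over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With four samples and one correct, `pass_at_k(4, 1, 2)` is 0.5: half of the two-sample subsets contain the correct sample.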

Omni-MATH, Qwen2.5-7B: pass@1 vs pass@256 (%)
RL (GRPO): pass@1 25.1, pass@256 68.3
Base: pass@1 10.2, pass@256 69.1
At pass@1, RL more than doubles the base model, 25.1 against 10.2. At pass@256, the base model edges ahead, 69.1 against 68.3. RL sharpens the first guess; it does not widen what the model can ever solve.

The result is consistent and counterintuitive. At pass@1, the RL-tuned model wins, as advertised. As k grows, the gap closes, and at large k the base model overtakes it. On the Minerva benchmark with a 32B model, the base model beats the RL-trained one by about 9% at k=128. The pattern holds across the benchmarks: the base model’s coverage, the set of problems it can solve given enough attempts, is broader than the RL model’s. RL did not add new solvable problems. It concentrated the model’s attempts on the ones it already could solve.

It is worth being precise about what pass@k at large k measures. If the base model gets a problem right even once in 256 tries, the reasoning path to that answer was already in it. pass@k traces the boundary of what the model can reach at all. The finding is that RL does not push that boundary outward. It moves probability mass around inside it. The authors check this across six RLVR algorithms and several model families. It is not an artifact of one training recipe. The shape is robust: RL ahead at k=1, the base model level or ahead at large k.
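A toy calculation shows how both halves of that shape can hold at once. The per-problem success probabilities below are invented for illustration, not taken from the paper, and attempts are assumed independent, so a problem with single-try success rate p is solved within k tries with probability 1 - (1 - p)^k.

```python
def expected_pass_at_k(success_probs: list[float], k: int) -> float:
    """Expected pass@k over a benchmark, assuming independent attempts."""
    return sum(1.0 - (1.0 - p) ** k for p in success_probs) / len(success_probs)


# Hypothetical 10-problem benchmark.
# Base model: rarely right on any single try, but has some chance on 8 problems.
base = [0.05] * 8 + [0.0] * 2
# Tuned model: usually right on 5 problems, no chance at all on the other 5.
tuned = [0.60] * 5 + [0.0] * 5

for k in (1, 4, 16, 64, 256):
    print(k, round(expected_pass_at_k(base, k), 3), round(expected_pass_at_k(tuned, k), 3))
```

The tuned model wins at k=1 because its mass sits on problems it reliably solves. The base model wins at large k because it has nonzero probability on more problems. The size of the gap here is exaggerated on purpose; only the crossing is the point.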

Reweighting, not expanding

This is the mechanism the paper lands on. It is worth stating precisely. RL with verifiable rewards takes the base model’s distribution over possible reasoning paths and reweights it, pushing probability toward the paths that lead to correct answers. That makes the first sample far more likely to be right, which is exactly what pass@1 rewards. It does not create paths the base model never had. The reasoning, in the paper’s words, originates from and is bounded by the base model.

Reweight versus expand: two different changes to the model
Base model's reasoning paths
RL: upweight the paths that already work (narrower and sharper)
Distillation: import new paths from a stronger teacher (genuinely broader)
RL concentrates probability on correct paths the base model already had. Distillation can add paths the base model lacked. Only one of the two raises the ceiling.

The contrast with distillation is the constructive half of the finding. Distilling a stronger model’s reasoning into a weaker one can introduce genuinely new paths, raising the ceiling rather than sharpening the aim. That matches what R1’s distilled models showed: the small models gained reasoning they did not have, because it was imported from a bigger one. RL sharpens a model against its own ceiling. Distillation can move the ceiling.

Why can RL only reweight?

The reason is structural. It follows from how the training works. RL with verifiable rewards learns from the model’s own samples. It generates attempts, keeps the ones that score, and upweights them. A reasoning path the model never samples, because it was never in its repertoire, cannot be rewarded: it never shows up in a sample to be reinforced. The method amplifies what the base model already produces, however rarely. That is why the ceiling is fixed by the base model: RL works entirely within the distribution it started from. Sharpening a distribution is not widening it, and only widening adds genuinely new capability. This also predicts a side effect the paper observes. The RL model’s outputs become less diverse, because probability has been concentrated, which is exactly why its pass@k stops improving and eventually falls behind.
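The structural argument can be sketched with a toy distribution over four named reasoning paths. The update below is a deliberately simplified stand-in (multiply each path's probability by exp(step * reward), then renormalize). It is not GRPO or any algorithm from the paper, and the paths, probabilities, and rewards are all hypothetical. What it shares with sample-based RL is the property that matters: the update is proportional to the probability a path already has.

```python
from math import exp, log


def reweight(probs: dict[str, float], rewards: dict[str, float], step: float = 1.0) -> dict[str, float]:
    """One multiplicative update: p_i <- p_i * exp(step * r_i), renormalized."""
    scaled = {path: p * exp(step * rewards[path]) for path, p in probs.items()}
    total = sum(scaled.values())
    return {path: value / total for path, value in scaled.items()}


def entropy(probs: dict[str, float]) -> float:
    """Shannon entropy in nats, a rough proxy for output diversity."""
    return -sum(p * log(p) for p in probs.values() if p > 0)


# path_c is correct but rare; path_d is correct but was never in the repertoire.
base = {"path_a": 0.70, "path_b": 0.25, "path_c": 0.05, "path_d": 0.0}
rewards = {"path_a": 0.0, "path_b": 0.0, "path_c": 1.0, "path_d": 1.0}

policy = dict(base)
for _ in range(5):
    policy = reweight(policy, rewards)

print({path: round(p, 3) for path, p in policy.items()})
print(round(entropy(base), 3), round(entropy(policy), 3))
```

The rare correct path ends up dominating, the never-sampled correct path stays at exactly zero no matter how many steps are run, and entropy falls. That is reweighting without expansion, plus the loss of diversity that comes with it.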

What this means for the reasoning-model story

It would be easy to over-read this as deflating R1, Kimi, and s1. That would be a mistake. The distinction matters. pass@1 is what you actually deploy. A model that is right on the first try far more often is genuinely more useful, whatever the pass@k story says, because in production you usually get one shot rather than 128. RL earns its place as an efficiency. It makes the model’s existing capability reliably accessible. What the finding corrects is the inflation around it, the claim that RL is teaching models to reason in ways they never could before. It is not. It is making them reliably use reasoning they already had.

There is a deeper reason pass@1 is the right metric for most deployment, which keeps this from being an academic distinction. In production you rarely sample a model 128 times and take the best. You ask once and act on the answer, or at most sample a handful and vote. The whole value of a model is how good its typical attempt is. That is exactly what RL improves. The base model’s broader coverage at k=256 is real. It is also mostly unreachable in practice, because you cannot afford 256 tries on every query and you have no oracle to pick the winner among them. So RL converts a latent, unreachable capability into a usable one. That is genuinely valuable. It is just not the same as creating capability that was never there.
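The "no oracle" point can be made concrete with a small sketch comparing two ways of using k samples: pass@k, which needs something to pick the winning sample, and majority voting, which does not. The helper names are hypothetical, and the voting formula assumes independent tries, odd k, and a single competing wrong answer. That last assumption is pessimistic for voting, since real wrong answers tend to scatter.

```python
from math import comb


def oracle_pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent tries is correct."""
    return 1.0 - (1.0 - p) ** k


def majority_vote_accuracy(p: float, k: int) -> float:
    """Chance that correct answers form a strict majority of k independent tries.

    Assumes one competing wrong answer and odd k, so ties cannot occur.
    """
    if k % 2 == 0:
        raise ValueError("use odd k to avoid ties")
    return sum(
        comb(k, j) * p**j * (1.0 - p) ** (k - j)
        for j in range(k // 2 + 1, k + 1)
    )


# Hypothetical single-try success rates on one problem.
for label, p in (("base", 0.2), ("tuned", 0.7)):
    print(label, round(oracle_pass_at_k(p, 5), 3), round(majority_vote_accuracy(p, 5), 3))
```

Under these assumptions, a model that is right 20% of the time looks strong on pass@5 and weak under a five-way vote, while a model that is right 70% of the time does well under both. Coverage only pays off if something can recognize the winning sample.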

The quant translation

For a quant, this maps onto a familiar and expensive mistake. A backtest that improves after you add an overlay can mean one of two things. It can be genuine new edge, or it can be the same underlying signal reweighted to look smoother, variance reduction dressed up as alpha. Telling the two apart is most of the job.

The quant translation: an RL-tuned agent's better backtest
Ask: new edge, or a sharper draw from the same distribution?
Two possibilities: variance reduction dressed as alpha, or a genuinely new signal
Test which one before you believe it
A higher pass@1 from RL is like a smoother equity curve from a fitted overlay. It can be real improvement, or the same capability reweighted. The discipline is telling them apart.

An RL-tuned trading agent’s better backtest deserves exactly that scrutiny. If RL mostly reweights the policy toward what already worked in-sample, the improvement is the pass@1 illusion in another costume: a sharper draw from the same distribution rather than a wider one. The test is the same one this paper applied to reasoning. Does the tuned agent actually do something the base policy could not, or is it just more confident about the same bets? A backtest alone cannot tell you. An honest out-of-sample comparison, base policy against RL-tuned, can.

Make it concrete with a familiar pattern. A trading policy is trained, then RL-tuned. The tuned version posts a higher backtest Sharpe. The tempting read is that RL found new edge. The pass@k lesson says check the other possibility first: that RL concentrated the policy on the trades that happened to work in the sample, lifting the in-sample Sharpe the way it lifts pass@1, while the underlying opportunity set, the policy’s coverage, did not grow at all. The two are indistinguishable in-sample by construction. They come apart out of sample, where reweighting toward yesterday’s winners tends to disappoint and genuine edge tends to persist. That is the test. It is the one a backtest cannot run for you.
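A minimal sketch of that comparison, under stated assumptions: both policies produce per-period excess returns over the same dates, the split index was fixed before any tuning, and Sharpe is annualized with 252 periods per year. The function names and the return series are made up for illustration.

```python
from statistics import mean, stdev


def sharpe(returns: list[float], periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of per-period excess returns."""
    spread = stdev(returns)
    if spread == 0:
        raise ValueError("returns have zero variance")
    return mean(returns) / spread * periods_per_year ** 0.5


def sharpe_uplift(base_returns: list[float], tuned_returns: list[float], split: int) -> tuple[float, float]:
    """Tuned-minus-base Sharpe before the split (in-sample) and after it (out-of-sample)."""
    in_sample = sharpe(tuned_returns[:split]) - sharpe(base_returns[:split])
    out_of_sample = sharpe(tuned_returns[split:]) - sharpe(base_returns[split:])
    return in_sample, out_of_sample


# Invented data: the tuned policy looks better in-sample and reverts to the base afterwards.
base_returns = [0.01, -0.01, 0.02, -0.02, 0.01, 0.00, 0.01, -0.01, 0.02, -0.02, 0.01, 0.00]
tuned_returns = [0.01, 0.00, 0.02, -0.01, 0.01, 0.01, 0.01, -0.01, 0.02, -0.02, 0.01, 0.00]

in_sample, out_of_sample = sharpe_uplift(base_returns, tuned_returns, split=6)
print(in_sample > 0, abs(out_of_sample) < 1e-12)
```

An uplift that is positive in-sample and vanishes after the split is the reweighting signature. A real test needs far more data than twelve periods and some estimate of the noise in the Sharpe difference itself. This only shows the shape of the check.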

What this does not say

The finding is narrow on purpose, and over-claiming it is its own mistake. It does not say RL is useless. The opposite: making capability reliably accessible at the first attempt is most of what deployment needs. It does not say the base model is secretly as good, because at pass@1, the metric that matters in production, the RL model genuinely wins. And it does not say no method can expand reasoning. Distillation can, and future methods may.

So is RLVR a dead end? No. The nuance matters for planning. The result is about RL applied to a fixed base model with verifiable rewards, in the form practiced through early 2025. It says that recipe sharpens rather than expands. It does not bound what a richer reward or a co-evolving base-and-RL loop might achieve, and research is pushing on exactly those. For now the practical reading is conservative. When you see an RL-tuned model or agent, assume the gain is sharpened access to existing capability until an out-of-sample test shows otherwise. That assumption is right far more often than the breathless one, and being wrong in the cautious direction costs far less.

What to actually conclude

Three takeaways. First, RL with verifiable rewards is real and useful, as an efficiency that makes a model’s capability reliably accessible on the first attempt. Treat it as that rather than as a capability-creator. Second, if you want to raise the ceiling, the lever is the base model or distillation from a stronger one, rather than more RL on the same base. Third, and most important for a desk, apply the same skepticism to any RL-tuned agent whose backtest improved. Ask whether the gain is new edge or reweighted old edge, and demand the out-of-sample evidence that separates them.

For a desk, that becomes a simple rule for vetting any tuned model or agent someone brings you. Ask two questions. First, did the base model ever solve this, given enough tries? That tells you whether the capability was there to begin with. Second, does the improvement survive out of sample, or only in the data it was tuned on? If the gain is the base model’s capability made reliable, deploy it and value it honestly, because reliable access to real capability is worth paying for. If it is the same capability reweighted to flatter a backtest, you have found a wrapper rather than an edge. The out-of-sample test is what tells the two apart before capital does.

The reasoning models are still a genuine advance. This paper insists only that we credit the right thing for it, which is the same discipline a quant owes every backtest that suddenly looks better.

The habit this rewards is worth naming. Be precise about what a technique changes. RL changes accessibility. Distillation changes capability. Scale changes both. Conflating them is how a field talks itself into believing a tuning trick is a breakthrough, and how a desk talks itself into believing a reweighted backtest is a new strategy. The same scalpel that separates those for a reasoning model separates them for a signal. Wield it and you stop paying breakthrough prices for efficiency gains, in models and in strategies alike. It is among the cheapest forms of edge a desk has: the discipline of crediting the right cause, a habit too often left on the table while everyone chases the next tuning trick.

RL makes a reasoning model better at the first try without expanding what it can ultimately solve, the way an overlay can smooth a backtest without adding edge. Credit RL for efficiency, look to the base model or distillation for capability, and treat any RL-tuned agent’s backtest with the same suspicion.
