// Insight

DeepSeek-Prover-V2: when correctness is machine-checkable

May 6, 20256 min read

formal-verificationreasoningreinforcement-learning

Most of what an LLM produces is plausible. A formal proof is different. It is either correct, or the checker rejects it. DeepSeek-Prover-V2 proves mathematical theorems in Lean 4, a language where a machine verifies every step. The interesting part for a quant is not the theorems, it is the template: subgoal decomposition plus a reward that cannot be gamed, because a proof checker decides it.

The method has two ideas worth taking. The first is recursive subgoal decomposition. A hard theorem is broken by DeepSeek-V3 into a chain of subgoals, each easier to prove. The subgoal proofs are then assembled into the whole, the same divide-and-conquer a human prover uses, made explicit and automatic. The second is the reward. Each candidate proof is run through the Lean 4 checker, which accepts it or does not. That binary, ungameable verdict is the reinforcement-learning signal. There is no reward model to fool, no judge to charm. The proof works or it does not.

The subgoal idea is the transferable one. It is worth dwelling on. Hard problems are hard partly because they are large. Breaking a theorem into subgoals turns one intractable proof into several tractable ones, each of which the model can attempt and the checker can verify independently. The decomposition is recursive: a subgoal too hard to prove directly is itself decomposed. This is how a human mathematician works, and how a good engineer attacks a large system, and seeing it formalized into a training loop is the part a quant should file away.

DeepSeek-Prover-V2: decompose, prove, verify

A hard theorem is split into subgoals; each formal proof is checked by Lean, giving a reward that cannot be gamed. Correctness is verified, not estimated.

The results

The numbers say the approach works on the benchmark it targets.

DeepSeek-Prover-V2-671B, % of problems solved

The 671B model solves 88.9% of miniF2F, the standard formal-proving benchmark, close to saturating it. On the much harder PutnamBench it solves 49 of 658, about 7%, and on a set of recent AIME problems it proves 6 of 15. The pattern is the familiar frontier shape: near-solved on the established benchmark, far from solved on the hardest problems. The contribution is the method that got there rather than a claim of solved mathematics.

This is the appeal of a checkable reward. It reminds me of technical scrutineering in motorsport. A car either passes inspection or it does not. The ride height, the plank wear, the fuel sample meet the rule or fail it, with no room for a steward’s opinion. A Lean proof is scrutineered the same way. It passes the checker or it does not. A model trained against that verdict cannot win by being persuasive. It can only win by being right.

Why this matters for a quant

Most LLM output in finance is plausible-looking and has to be checked by a human, which is the whole burden of the extraction and analyst-assist problems. The Prover-V2 template points at the cases where that burden lifts, because correctness is formally checkable. Where in quant work can a machine verify the answer? More places than you might assume. A set of risk constraints can be checked for consistency by a solver. Auto-generated pricing or risk code can be type-checked and run against known test cases. A derived identity can be verified symbolically. Anywhere the answer has a machine-checkable definition of correct, you can train and trust a model the way Prover-V2 does, against a verdict rather than a vibe.

Take the type-checked pricing code example seriously, because it is the most immediately usable. A model that writes a pricing or risk function produces code, and code can be type-checked, unit-tested, and run against analytic benchmarks where they exist: a put-call parity check, a known closed-form limit, a Greek that must carry a certain sign. Each of those is a machine-checkable reward in miniature. You are not proving a theorem. You are doing the same thing Prover-V2 does: optimizing the model against a verdict a machine returns rather than a human’s impression that the code looks right.

The gap between this and ordinary LLM use is worth stating plainly. Verifying after the fact catches an error once. A verified reward shapes the model so it makes fewer errors in the first place. The first is a safety net. The second changes what the model learns to produce. Where you can build the checker into the loop, you get a model trained toward correctness rather than merely policed toward it afterward, which is a stronger guarantee than any downstream review. For a function that has to defend its outputs, that distinction is not academic. A model that fails the check less often produces fewer wrong numbers to catch, which is the whole game.

A concrete instance lives in any quant codebase: the test suite. Code that must pass a battery of unit and property tests is already trained-and-checked in spirit, the suite playing Lean’s role as the arbiter. Prover-V2 shows how far that idea reaches when the checker is rich enough, all the way to a reward the model is optimized against rather than a gate it merely has to pass. The lesson for a desk is to make the checker as expressive as the problem allows, because the more a machine can verify, the more of the work you can safely automate.

How you would use this

The move is to find the verifiable core of a problem and wall it off. Most quant tasks are a mix: a judgment part that needs a human, plus a mechanical part that has a checkable answer. Auto-generating a pricing library, checking a set of risk limits for internal consistency, validating that a data transformation preserves an invariant, these have machine-checkable cores a model can optimize against directly. The judgment around them, whether the model is the right one, whether the limits are the right limits, stays with a person. The discipline Prover-V2 demonstrates is to separate the two cleanly, and to apply the verified-reward approach exactly where a machine can grade, no further.

The honest limits

Formal verification has a narrow domain. Most financial questions do not have a machine-checkable answer. Whether a thesis is sound, whether a price is fair, whether a risk is acceptable, these are judgments rather than theorems, and no checker settles them. The Prover-V2 approach applies to the formally-specifiable slice of the work, which is real but small. Building the formal specification is itself hard, often harder than the problem felt before you tried to make it precise. A wrong specification gives you a proof of the wrong thing.

Not everything should be formalized. The point is that the parts that can be are the parts where a model can run with far less human checking, and identifying those parts is worth doing deliberately. The instinct to find the machine-checkable core of a problem, and to wall it off from the parts that need judgment, is a useful one well beyond theorem proving.

DeepSeek-Prover-V2 trains against a reward a proof checker verifies. The model wins by being right rather than persuasive. The template fits the formally-checkable slice of quant work, risk-constraint consistency or type-checked pricing code, where a machine can grade the answer.

Working on AI that needs to ship?

I help funds, fintechs, and data teams take AI from prototype to production.

Get in touch Read the book →