Granular metric extraction from filings: traceability and verification beyond summarization

March 9, 2025 · 10 min read

extraction, financial-documents, verification, model-risk

The request is always the same. Read the 10-K and give me the number. It sounds simple. It is the hard problem in financial AI. Extraction, not summarization, is where models fail. The benchmarks say they fail it more than half the time. A model that writes a fluent summary of a filing can still pull the wrong revenue figure. A wrong number in a memo is worse than no number at all. This is the build guide for extraction you can defend: a source attached to every figure, with a verification gate that catches the bad ones before they reach a decision.

Why is extraction harder than summarization?

Because the two tasks forgive different things. A summary has many acceptable forms. If the model emphasizes the right themes and reads sensibly, a little looseness in the wording costs nothing, and fluency papers over small gaps. Extraction has exactly one correct answer. Revenue for the cloud segment in fiscal 2023 is a single value, with a specific entity, period, and line item attached, and close is simply wrong. There is no partial credit for a number.

Why extraction is the hard half

Summarization tolerates approximation; extraction does not. Revenue for the cloud segment in fiscal 2023 is one value, and close is wrong. That is why a model that summarizes well can still fail extraction.

The numbers behind this are sobering. FinanceQA, a benchmark built to mirror real on-the-job analysis at hedge funds and investment banks, found that current models fail roughly 60% of its realistic tasks. The failures cluster in exactly the work that matters: spreading metrics by hand, applying accounting and valuation conventions, and reasoning under incomplete information where the analyst has to generate an assumption. These are not lookup questions. They are the multi-step, judgment-laden tasks that make up most of the job. That is where the model folds.

It is worth being specific about what those tasks look like, because the difficulty is concrete. Spreading a metric means pulling the right line items from the statements and combining them correctly, the kind of work where one misread cell or one wrong sign quietly corrupts the result. Applying a convention means knowing which definition of operating income the question intends, or how a lease should be treated, judgments a junior analyst learns and a model often guesses. Working under incomplete information means generating a reasonable assumption when the filing does not state a figure, and being explicit that you did. None of these is retrieval. All of them are where the 60% failures live, invisible in a fluent answer that happens to be wrong.

What do the benchmarks actually say?

Two findings frame the problem. The first is that even finding the number is hard. FinanceBench showed that GPT-4-Turbo, left to retrieve from a filing on its own, answered 19% of questions correctly, against 85% when handed the right page.

[Figure: FinanceBench, GPT-4-Turbo, finding vs reading the number (%). Retrieving on its own: 19. Handed the right page: 85.]

The model can read a number off the right page. It struggles to locate that page in the first place. That is the retrieval half of the problem, covered in the field guide to RAG on filings. Extraction sits on top of retrieval: even with the right page in hand, the model has to identify the correct line item, read the value without transposing a digit, attach the right period, and, for anything derived, do the arithmetic correctly. FinanceQA’s 60% failure rate is what happens when all of those have to go right at once.

The second finding is that the failures are not where you would hope. As the work on LLMs as analysts showed, a model’s numerical reasoning is uneven and its confident narrative is no guarantee the number underneath is sound. The output that reads most authoritative is not the one most likely to be right, which is the precise trap an extraction pipeline has to defend against.

The build guide: extraction you can defend

The architecture that survives a model-risk review has three properties the naive version lacks: locating with a citation, verifying the arithmetic outside the model, and gating what it cannot stand behind.

Extraction you can defend: locate, extract, verify, gate

Every number carries its source and a verification result. Arithmetic is done in code, not by the model. A figure that fails a check is held back before it reaches a memo.

Walk the stages. Locating with a citation means the system returns the number together with the page and table it came from, which a human can confirm in seconds. Extracting means reading the specific value with the entity and period bound to it; that binding is the context contextual chunking exists to preserve. Verifying means recomputing and cross-checking the number outside the model before trusting it. Gating means the system is allowed to say it is unsure and route the case to a person, rather than emitting a confident guess.

What a defensible extraction returns

The shape of the output is part of the design rather than an afterthought. A bare number is not enough. Each extracted figure should arrive as a small record: the value, the source it came from (document, page, table), the method (read directly, or derived, with the calculation if derived), a verification result, and a confidence. That record is what makes the rest of the system work. Traceability is the source field. Verification is the checks field. Gating reads the confidence and the checks and decides whether the figure passes or goes to a human. A pipeline that returns numbers without this scaffolding cannot be audited, cannot be gated, and cannot be trusted with anything that matters. The discipline is to treat every figure as a claim with its evidence attached, never as a bare value floating in a memo.
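The record and the gate described above could be sketched roughly as follows. This is a minimal illustration, not a reference schema: the class names (`Source`, `Check`, `ExtractedFigure`), the `gate` function, the 0.9 confidence threshold, and the example company, page, and table are all hypothetical.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Source:
    document: str                  # filing identifier
    page: int
    table: Optional[str] = None


@dataclass(frozen=True)
class Check:
    name: str
    passed: bool
    detail: str = ""


@dataclass
class ExtractedFigure:
    entity: str
    period: str
    line_item: str
    value: Decimal
    unit: str
    source: Source
    method: str                          # "read" or "derived"
    calculation: Optional[str] = None    # expected when method == "derived"
    checks: list[Check] = field(default_factory=list)
    confidence: float = 0.0


def gate(figure: ExtractedFigure, min_confidence: float = 0.9) -> str:
    """Return "pass" or "review". The threshold is a placeholder, not a recommendation."""
    if figure.method == "derived" and not figure.calculation:
        return "review"              # a derived figure must show its working
    if not figure.checks:
        return "review"              # nothing verified, nothing to stand behind
    if any(not check.passed for check in figure.checks):
        return "review"
    if figure.confidence < min_confidence:
        return "review"
    return "pass"


# Hypothetical example record.
fig = ExtractedFigure(
    entity="ExampleCo",
    period="FY2023",
    line_item="Cloud segment revenue",
    value=Decimal("1240"),
    unit="USD millions",
    source=Source(document="ExampleCo 10-K FY2023", page=57, table="Segment results"),
    method="read",
    checks=[Check("value_found_on_cited_page", True)],
    confidence=0.95,
)
print(gate(fig))  # pass
```

The gate defaults to "review" whenever evidence is missing, so a figure has to earn its way into a memo rather than being allowed through by omission.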

Why traceability is non-negotiable

A number without a source is unusable in a regulated setting, however right it happens to be. The reason is not pedantry. It is that nobody can act on a figure they cannot check, and on a desk that has to defend its work, an unverifiable number is a liability whether or not it is correct. Traceability turns the system from an oracle into an assistant. The model proposes a number and points to where it found it. A human confirms the ones that matter in seconds rather than re-deriving them from scratch.

This also changes the failure mode for the better. When every figure carries its citation, a wrong extraction is a wrong pointer you can see and correct, rather than a plausible number floating free in a memo with no way to trace it back. The discipline that feels like overhead in a demo is the thing that makes the system usable in production, where the cost of an unsourced wrong number is measured in more than embarrassment.
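A citation also makes one cheap automated check possible before a human looks: confirm that the extracted value actually appears on the cited page. The sketch below assumes the page text is already available as a string. The function names and the sample page text are hypothetical, and the matching is deliberately crude (it does not handle parenthesized negatives or scaled units).

```python
import re


def normalize_number(text: str) -> str:
    """Strip thousands separators, dollar signs and whitespace: '$1,240' -> '1240'."""
    return re.sub(r"[,\s$]", "", text)


def citation_supports_value(page_text: str, value: str) -> bool:
    """True if the value appears as a standalone number in the cited page text."""
    target = normalize_number(value)
    # Numeric tokens on the page, such as "1,240", "9,600.5" or "12.9".
    tokens = re.findall(r"\d[\d,]*(?:\.\d+)?", page_text)
    return any(normalize_number(token) == target for token in tokens)


# Hypothetical page text for illustration.
page_text = "Operating income 1,240 1,105\nRevenue 9,600 8,950"
print(citation_supports_value(page_text, "1,240"))  # True
print(citation_supports_value(page_text, "1,420"))  # False
```

A failed match does not prove the figure is wrong, and a successful one does not prove the right line item was read. It only shows whether the pointer and the value are consistent with each other.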

Why you never trust the model’s arithmetic

The single most important rule in the verification stage: do the math in code rather than in the model. Language models are unreliable at multi-step arithmetic. Hard math benchmarks such as FrontierMath show how far their quantitative reasoning still has to go, and the weakness shows up the moment a task requires spreading or deriving a metric. The fix is well-proven. Program-aided language models, introduced in 2022, have the model write the calculation as code and let an interpreter execute it, which moves the arithmetic onto a machine that does not make slips. For financial extraction that is the right division of labor. The model reads and locates. Code computes and checks.
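One way to apply that division of labor is sketched below, in the spirit of program-aided models but narrower. Instead of executing arbitrary model-written code, the model is assumed to return a formula over named inputs, and a small evaluator computes it. The `evaluate` function and the revenue figure of 9,600 are assumptions for illustration only.

```python
import ast
import operator
from decimal import Decimal

_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression: str, inputs: dict[str, Decimal]) -> Decimal:
    """Evaluate a formula that uses only + - * /, numeric literals and named inputs."""

    def walk(node: ast.AST) -> Decimal:
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.Name):
            return inputs[node.id]   # KeyError if the formula names an unknown input
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return Decimal(str(node.value))
        # Calls, attributes, subscripts and everything else are rejected.
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return walk(ast.parse(expression, mode="eval").body)


# The formula is what the model would return; the values are extracted figures.
inputs = {"operating_income": Decimal("1240"), "revenue": Decimal("9600")}
margin = evaluate("operating_income / revenue * 100", inputs)
print(round(margin, 1))  # 12.9
```

Restricting the grammar trades flexibility for safety: the model can express a ratio or a sum, but it cannot call functions or touch the environment.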

Verification is more than recomputation, though that is the core. Cross-check a derived figure against a stated one where the filing provides both. Apply sanity checks on units and signs, because a margin of 1400% or a negative share count is a parsing error announcing itself. Reconcile totals against their components. Each check is cheap, and together they catch the class of error that a fluent model produces most confidently: the number that looks exactly right and is quietly wrong.
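The sanity checks on units and signs can be as simple as a table of bounds. The sketch below is illustrative only: the metric keys and the bounds themselves are hypothetical, and a real desk would set them per metric and per sector (a loss-making company can legitimately report a margin below -100%).

```python
from decimal import Decimal
from typing import Optional

# Hypothetical bounds: (lowest plausible value, highest plausible value).
# None means no bound on that side.
BOUNDS: dict[str, tuple[Optional[Decimal], Optional[Decimal]]] = {
    "operating_margin_pct": (Decimal("-100"), Decimal("100")),
    "share_count": (Decimal("0"), None),
}


def sanity_problems(metric: str, value: Decimal) -> list[str]:
    """Return a list of problems; an empty list means the value is within bounds."""
    low, high = BOUNDS[metric]
    problems = []
    if low is not None and value < low:
        problems.append(f"{metric}={value} is below the plausible minimum {low}")
    if high is not None and value > high:
        problems.append(f"{metric}={value} is above the plausible maximum {high}")
    return problems


print(sanity_problems("operating_margin_pct", Decimal("1400")))
print(sanity_problems("share_count", Decimal("-5")))
print(sanity_problems("operating_margin_pct", Decimal("12.9")))  # []
```

These bounds only catch gross errors. A value that passes them has cleared the lowest bar, nothing more.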

A worked example shows why the layers earn their place. Ask for a company’s operating margin. The model locates operating income and revenue, reads both, and divides. Suppose it transposes two digits while reading operating income, turning 1,240 into 1,420. It returns a margin of 14.8%, fluent and specific and wrong. A recompute in code from the two extracted figures does not catch this, because it faithfully divides the wrong input. A cross-check does. The same margin reconciled against a different stated figure, or against the prior year, disagrees, which flags the figure for review. That is why verification is layered. Recomputation catches arithmetic slips, cross-checks catch extraction slips, and sanity bounds catch the gross errors. No single check is enough. Together they cover the ways a number goes wrong.
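The same example in code makes the gap between the two checks concrete. The operating income figures (1,240 and 1,420) and the 14.8% margin come from the example above. The revenue of 9,600 and the stated margin of 12.9% are assumed values chosen to be consistent with it, and the 0.1-point tolerance is a placeholder.

```python
from decimal import Decimal

revenue = Decimal("9600")                    # assumed for illustration
misread_operating_income = Decimal("1420")   # the transposed reading of 1,240
model_reported_margin = Decimal("14.8")      # what the model returned
stated_margin = Decimal("12.9")              # assumed: a margin stated elsewhere in the filing


def margin_pct(operating_income: Decimal, revenue: Decimal) -> Decimal:
    """Operating margin in percent, rounded to one decimal place."""
    return (operating_income / revenue * 100).quantize(Decimal("0.1"))


def agrees(a: Decimal, b: Decimal, tolerance: Decimal = Decimal("0.1")) -> bool:
    return abs(a - b) <= tolerance


# Layer 1: recompute from the extracted inputs. It agrees with the model,
# because the arithmetic was never the problem.
recomputed = margin_pct(misread_operating_income, revenue)
print(recomputed, agrees(model_reported_margin, recomputed))   # 14.8 True

# Layer 2: cross-check against an independently stated figure. It disagrees,
# which is the signal to route the figure to review.
print(agrees(recomputed, stated_margin))                       # False
```

The recompute passes and the cross-check fails, which is exactly the division of labor between the layers: one guards the arithmetic, the other guards the inputs.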

One more verification layer is worth building: consistency across the document. A filing states many figures that must agree, subtotals with their components, a cash-flow line with its balance-sheet counterpart, a segment total with the consolidated number. A model that extracts several related figures can check them against each other. A disagreement is a strong signal that one of them is wrong. This is free evidence the document hands you, and most pipelines ignore it. Reconciliation across the filing catches errors no single-figure check can see, the closest thing extraction has to a built-in audit.
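Reconciliation across the filing can be expressed as a list of relations that the extracted figures must satisfy. The sketch below covers only the simplest kind, a total against the sum of its components. The relation names, the figures, and the rounding tolerance of 1 unit are hypothetical.

```python
from decimal import Decimal

# Hypothetical relations: each total should equal the sum of its components.
RELATIONS: dict[str, list[str]] = {
    "consolidated_revenue": ["segment_a_revenue", "segment_b_revenue", "segment_c_revenue"],
    "total_assets": ["current_assets", "noncurrent_assets"],
}


def inconsistencies(
    figures: dict[str, Decimal],
    relations: dict[str, list[str]] = RELATIONS,
    tolerance: Decimal = Decimal("1"),
) -> list[tuple[str, Decimal]]:
    """Return (total_name, gap) for every relation that fails to reconcile."""
    found = []
    for total_name, part_names in relations.items():
        if any(name not in figures for name in [total_name, *part_names]):
            continue  # a relation with missing figures cannot be tested
        gap = figures[total_name] - sum((figures[n] for n in part_names), Decimal("0"))
        if abs(gap) > tolerance:
            found.append((total_name, gap))
    return found


# Hypothetical extraction in which one segment figure was misread.
figures = {
    "consolidated_revenue": Decimal("9600"),
    "segment_a_revenue": Decimal("4100"),
    "segment_b_revenue": Decimal("3300"),
    "segment_c_revenue": Decimal("2020"),   # should have been 2,200
}
print(inconsistencies(figures))  # [('consolidated_revenue', Decimal('180'))]
```

A failed relation says that at least one of the figures involved is wrong, not which one, so every figure in the relation should go to review together.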

Why not just wait for a better model?

The tempting response to a 60% failure rate is to wait for the next model, which will surely fail less. It will. It will not solve the problem, for two reasons. First, the failures that matter are not random noise that scale sands away. They are systematic, the model confidently producing a wrong number on exactly the multi-step tasks it finds hard. A better model fails the same way less often rather than differently. Second, and more important, you cannot tell which answers the model got right without checking. A model that is 70% accurate instead of 40% is better. It still hands you a pile of numbers where some unknown subset is wrong, with no marker on the bad ones. For anything feeding a decision, an unmarked 30% error rate is as unusable as an unmarked 60% one. The architecture, traceability and verification and gating, is what turns an accuracy figure into a trustworthy output, and no accuracy short of perfect removes the need for it.

The bottom line

Extraction is the hard half of reading a filing. The benchmarks are blunt about it: current models fail the majority of realistic financial-analysis tasks. The answer is not a bigger model that fails slightly less often. It is an architecture that assumes the model will be wrong some of the time and is built to catch it: locate with a citation, extract with the entity and period bound in, verify the arithmetic in code, and gate what cannot be stood behind. Do that and an agent reading filings becomes a genuine accelerant, drafting the numbers a human confirms rather than a black box producing figures nobody can check. Skip it and you have built a fast way to put a wrong number in front of a client. The model is the easy part. The traceability and the verification are the work, the part that turns a model into something a desk can actually use.

Reading a filing is easy; pulling the right number is hard, and models fail it more often than not. Build for it: a source on every figure, the arithmetic done in code, a gate that holds back what the system cannot verify.
