// Insight

FinanceBench: can RAG actually answer questions about a 10-K?

August 19, 2024 · 6 min read

RAG · financial-QA · benchmark · filings

Show a client one number before they greenlight a RAG project on filings. On FinanceBench, Patronus AI’s benchmark of 10,231 questions over real 10-Ks, 10-Qs, and earnings releases, GPT-4-Turbo wired to a vector store answered incorrectly or refused on 81% of a 150-question sample. The same model, handed the right pages, scored 85%. The model can do the work. The retrieval cannot find the evidence.

That gap is the most useful result in financial RAG. It points the blame precisely. Read the accuracy across the settings. The diagnosis writes itself.

FinanceBench: GPT-4-Turbo accuracy by setting (%)

Closed book: 9 · Shared vector store: 19 · Long context: 79 · Oracle: 85

Closed book, with no documents, the model gets 9%, which tells you these answers are not in its memory. With a shared vector store across all filings, it reaches 19%. Given the long-context version of the right document, it jumps to 79%. In the oracle setting, with the exact evidence pages provided, it hits 85%. The capability is there the whole time. What fails, over and over, is finding the right paragraph in a pile of near-identical filings.

Why retrieval is the hard part on filings

This is the result I wish every team building a filing assistant started from. The instinct is to reach for a bigger or smarter model. The data says the model is rarely the constraint. The constraint is that financial documents are a worst case for retrieval.

Three properties make them so. They are long: the right paragraph is a needle in a very large haystack. They are dense with boilerplate, which makes many passages look almost identical to a vector search ranking by surface similarity. And they are full of tables whose structure repeats across companies and periods, which means a query about one firm’s segment revenue can retrieve a structurally identical table from the wrong firm or the wrong year. The vector search then confidently returns a passage that looks right and belongs to the wrong segment, period, or entity. The answer the model gives is fluent, specific, and wrong.

The task makes it harder still. Most FinanceBench questions need the model to retrieve a specific number and then do arithmetic on it. A retrieval miss does not produce a vague answer. It produces a precise answer built on the wrong figure, which is the single most dangerous output a financial system can generate, because it looks exactly like a right one. A human reviewer scanning outputs for obvious nonsense will wave it straight through, since nothing about it looks off except the value.

What the long-context result does and does not say

Long context scoring 79% is encouraging and a trap in equal measure. It proves the model can answer when the evidence is present, which is the half of the result that should make you optimistic. It is also impractical at scale: you cannot stuff every filing into context on every query, the latency is unworkable for a universe of names, and real filings overflow even large windows once you move past a single document. Long context is the proof that retrieval is the bottleneck. It is not the production fix.

The right way to read the four bars is as a decomposition. The closed-book number says the knowledge is not memorized, so retrieval is mandatory. The oracle number says the reasoning is already good enough. The model is not the problem. The vector-store number, sitting far below both long-context and oracle, is the size of the retrieval gap you have to close. On FinanceBench that gap is the difference between 19% and 85%, which is almost the entire task.
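The decomposition is simple arithmetic, and it helps to write it down once. A minimal sketch using the four accuracy figures quoted above; the dictionary keys and the `decompose` helper are my own labels, not anything FinanceBench defines.

```python
# GPT-4-Turbo accuracy (%) by setting, as quoted in this article.
accuracy = {
    "closed_book": 9,
    "shared_vector_store": 19,
    "long_context": 79,
    "oracle": 85,
}


def decompose(acc: dict) -> dict:
    """Express the settings as gaps, in percentage points."""
    return {
        # What a naive vector store adds over the model's memory alone.
        "naive_retrieval_lift": acc["shared_vector_store"] - acc["closed_book"],
        # What better retrieval could still recover: the gap to close.
        "retrieval_gap": acc["oracle"] - acc["shared_vector_store"],
        # What stays wrong even with the exact evidence pages provided.
        "reasoning_shortfall": 100 - acc["oracle"],
    }


gaps = decompose(accuracy)
for name, points in gaps.items():
    print(f"{name}: {points} points")
```

The retrieval gap comes out at 66 points, against 15 points lost to reasoning with perfect evidence. That ratio is the argument for spending on the retriever first.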

What closing the gap takes

The encouraging half of the result is that the gap is an engineering problem with known levers rather than a wall. Each lever attacks a specific way retrieval fails on filings.

Metadata on every chunk comes first. A chunk tagged with its company, filing type, period, and section lets the retriever filter before it ranks. A query about one firm’s 2022 segment revenue then never surfaces a structurally identical table from a different firm or year. Hybrid retrieval comes next. Pairing keyword search with vector search recovers the exact-match signal that pure embeddings lose, which matters when the right answer hinges on a specific entity name or a precise line-item label. Reranking tuned on financial queries reorders the shortlist with a model that has actually seen this kind of question. And contextual chunking, where each chunk is prefixed with a short description of where it sits in the document, stops a passage from losing the entity and period it belongs to.
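Three of these levers fit in a few dozen lines. What follows is a toy sketch, not a production retriever: the `Chunk` fields, the company names, and the figures are invented for illustration, the keyword and vector scorers are simple stand-ins for BM25 and an embedding model, and reciprocal rank fusion is one common way to combine two rankings, chosen here as an assumption. Reranking is left out.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    company: str
    filing_type: str
    period: str
    section: str
    text: str

    def contextual_text(self) -> str:
        # Contextual chunking: prefix the passage with where it sits.
        return (f"{self.company} {self.filing_type} {self.period}, "
                f"{self.section}: {self.text}")


def tokens(s: str) -> list:
    return re.findall(r"[a-z0-9]+", s.lower())


def keyword_score(query: str, text: str) -> int:
    # Exact-term overlap count; a stand-in for BM25.
    wanted = set(tokens(query))
    return sum(1 for t in tokens(text) if t in wanted)


def vector_score(query: str, text: str) -> float:
    # Bag-of-words cosine; a stand-in for embedding similarity.
    a, b = Counter(tokens(query)), Counter(tokens(text))
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, *, company, period, top_k=3, rrf_k=60):
    # 1. Filter on metadata before ranking anything.
    pool = [c for c in chunks if c.company == company and c.period == period]
    # 2. Rank the surviving pool by each signal separately.
    by_kw = sorted(pool, key=lambda c: keyword_score(query, c.contextual_text()),
                   reverse=True)
    by_vec = sorted(pool, key=lambda c: vector_score(query, c.contextual_text()),
                    reverse=True)
    # 3. Fuse the two rankings with reciprocal rank fusion.
    fused = Counter()
    for ranking in (by_kw, by_vec):
        for rank, chunk in enumerate(ranking, start=1):
            fused[chunk] += 1.0 / (rrf_k + rank)
    return [chunk for chunk, _ in fused.most_common(top_k)]


# Made-up corpus: three near-identical segment tables and one unrelated passage.
chunks = [
    Chunk("Acme", "10-K", "FY2022", "Segment results",
          "Cloud segment revenue was 410 million"),
    Chunk("Acme", "10-K", "FY2021", "Segment results",
          "Cloud segment revenue was 355 million"),
    Chunk("Globex", "10-K", "FY2022", "Segment results",
          "Cloud segment revenue was 980 million"),
    Chunk("Acme", "10-K", "FY2022", "Risk factors",
          "Competition may reduce revenue growth"),
]
hits = retrieve("cloud segment revenue", chunks, company="Acme", period="FY2022")
```

Without the filter, all three segment chunks score identically on the passage text and the retriever has no basis for choosing between them. With it, the wrong-firm and wrong-year tables never enter the ranking at all.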

None of these is exotic. Together they are how you close the distance between the 19% a naive vector store scores and the 85% the model reaches once the right evidence is in front of it. The work is unglamorous. It is where the accuracy lives.

The point for a buyer is that these levers are testable. You can measure retrieval recall before and after each one on a set built from your own filings. The answer accuracy tracks it. A vendor who cannot show you that curve is selling you a model, when the thing that determines whether the system works is the retrieval stack around it.
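Measuring that curve needs nothing more than a labeled set and a recall function. A minimal sketch, assuming each question has a known set of gold evidence chunk IDs; the question IDs, chunk IDs, and the two runs below are invented for illustration.

```python
def recall_at_k(retrieved_ids: list, gold_ids: list, k: int) -> float:
    """Fraction of the gold evidence chunks found in the top-k results."""
    if not gold_ids:
        raise ValueError("each question needs at least one gold evidence chunk")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(gold_ids)) / len(gold_ids)


def mean_recall_at_k(runs: dict, gold: dict, k: int) -> float:
    """Average recall@k over every question in the gold set."""
    return sum(recall_at_k(runs[q], gold[q], k) for q in gold) / len(gold)


# Gold evidence per question, then ranked results from two retriever variants.
gold = {"q1": ["a"], "q2": ["b", "c"]}
baseline = {"q1": ["x", "a", "y"], "q2": ["b", "x", "y"]}
with_metadata_filter = {"q1": ["a", "x", "y"], "q2": ["b", "c", "x"]}

before = mean_recall_at_k(baseline, gold, k=2)
after = mean_recall_at_k(with_metadata_filter, gold, k=2)
print(f"recall@2 before: {before:.2f}, after: {after:.2f}")
```

Run it once per lever, on the same question set, and you have the before-and-after curve to ask a vendor for.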

How I would build this

Retrieval-first, and measured. On a model-risk governance review, the question I would ask of any filing assistant is not which model it uses but what its retrieval recall is on a FinanceBench-style set built from your own documents. A vendor who answers with the name of their language model has not understood the problem. Three rules follow.

Invest in retrieval before you invest in the model, because the FinanceBench gap says that is where the accuracy lives. That means metadata on every chunk, hybrid keyword-and-vector search so an exact entity or period match is not lost to semantic similarity, and reranking tuned on financial queries. Measure retrieval and generation separately. A wrong answer should trace to the step that caused it. Then you know whether to fix the retriever or the prompt.
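Separating the two measurements can be as simple as tagging every wrong answer with the step that failed. A sketch under stated assumptions: each evaluated question records the gold evidence pages, the pages actually retrieved, and whether the final answer was graded correct. The `Result` fields and the rule that any missing gold page counts as a retrieval miss are my choices for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Result:
    question_id: str
    gold_pages: set
    retrieved_pages: set
    answer_correct: bool


def blame(result: Result) -> str:
    """Attribute a wrong answer to the step that caused it."""
    if result.answer_correct:
        return "ok"
    # Assumption: if any gold page is missing, the retriever is at fault.
    if not result.gold_pages <= result.retrieved_pages:
        return "retrieval"
    # The evidence was all there and the answer was still wrong.
    return "generation"


results = [
    Result("q1", {12}, {12, 40}, True),
    Result("q2", {7, 8}, {7, 31}, False),  # page 8 never retrieved
    Result("q3", {55}, {55, 56}, False),   # evidence present, answer wrong
]
tally = Counter(blame(r) for r in results)
print(dict(tally))
```

A tally dominated by "retrieval" says to work on the retriever; one dominated by "generation" says to look at the prompt or the arithmetic.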

Gate every numerical answer behind a check that the cited figure actually appears on the cited page, because a precise wrong number is worse than a refusal. The cheapest, most reliable guardrail on a financial RAG system is a verifier that confirms the number in the answer is present in the retrieved evidence, and abstains when it is not. The model is good enough already. The job is getting the right evidence in front of it and proving that you did.
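A first version of that verifier is short. The sketch below is deliberately naive and rests on assumptions: it treats every number in the answer as a cited figure, compares values after stripping thousands separators, and returns `None` to signal abstention. The function names and sample text are invented, and it does not handle unit scaling (thousands versus millions), negative values in parentheses, or figures the model derived by arithmetic, where you would check the inputs rather than the result.

```python
import re
from decimal import Decimal

# Digits with optional thousands separators and an optional decimal part.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def numbers_in(text: str) -> set:
    """Extract numeric values so that 1,410 and 1410.0 compare equal."""
    return {Decimal(match.replace(",", "")) for match in _NUMBER.findall(text)}


def verify_or_abstain(answer: str, evidence_page: str):
    """Return the answer only if every number in it appears in the evidence."""
    missing = numbers_in(answer) - numbers_in(evidence_page)
    if missing:
        return None  # abstain: a cited figure is not on the cited page
    return answer


page = "Cloud segment revenue was $1,410.0 million in fiscal 2022."
supported = verify_or_abstain(
    "Cloud segment revenue was $1,410 million in 2022.", page)
unsupported = verify_or_abstain(
    "Cloud segment revenue was $1,401 million in 2022.", page)
```

The second answer differs from the first by two transposed digits, exactly the kind of error a human reviewer waves through, and the check rejects it.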

The standing lesson

FinanceBench is from late 2023. The retrieval techniques have improved since. The lesson it teaches has not. Whenever someone shows you a polished filing assistant, the question is not how good the model is. It is how often the system retrieves the exact evidence the answer depends on, measured on documents like yours. Until that number is high, a better model only makes the wrong answers more fluent.

On financial filings, the model is rarely the bottleneck; retrieval is. FinanceBench shows the same model jumping from 19% to 85% once it is handed the right pages.
