FinSage: compliance-grade filing QA, built from unglamorous parts
The interesting thing about FinSage is not that it tops a benchmark. It is how it gets there. The system answers compliance questions over multi-modal financial filings, the kind where a retrieval miss is not an inconvenience but a violation waiting to happen, and beats the published baselines by a wide margin. None of that margin comes from a bigger or cleverer language model. It comes from chunk metadata, four retrieval paths instead of one, and a reranker tuned to surface the chunks a regulator would care about. That lesson is worth more than the leaderboard position.
Why compliance QA is the unforgiving case
Most retrieval-augmented generation tolerates a near miss. Ask a research assistant to summarize a company’s strategy and a slightly incomplete retrieval still produces a usable answer. Compliance QA does not work that way. The question is often whether a specific disclosure exists, whether a covenant was breached, whether a figure was reported where the rules require it. If the retriever fails to surface the one paragraph that answers the question, the system does not give a slightly worse answer. It gives a confidently wrong one, and in a compliance setting a confidently wrong answer is the expensive kind.
That raises the stakes on recall specifically. You can tolerate a retriever that pulls some irrelevant chunks, because the reader or the model can ignore them. You cannot tolerate a retriever that misses the relevant chunk, because nothing downstream can recover what was never retrieved. FinSage is built around that asymmetry, and its design reads as a catalogue of the unglamorous things you do when a miss is unacceptable.
The retrieval result
Start with the headline retrieval number, because it carries the argument. On the filing QA set, FinSage reaches 92.51% recall. The interesting part is what it beats, which is each of its own components run alone.
Read the bars and the thesis is right there. Dense retrieval alone, the default in most RAG stacks, recalls 81.94%. Sparse BM25 does slightly better at 84.52%, which is itself a useful reminder that the old keyword method still earns its place on documents full of exact terms and figures. Metadata-aware search and a fine-tuned HyDE query expansion each land near 85.7%. Every single path leaves more than one question in seven unanswerable. Run all four and merge the candidates, and recall jumps past 92%. The combination recovers the documents that any single method would have missed, because the methods miss different things.
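The "methods miss different things" argument is easy to check on your own evaluation set. Here is a minimal sketch, assuming one gold chunk per question and a plain set union as the merge step. The four path names mirror the ones above, but the chunk ids and hits are made-up toy data, not FinSage's results, and the paper's actual merge logic may differ.

```python
from typing import Dict, List, Set

Retrieved = Dict[str, Set[str]]  # question id -> chunk ids returned by one path


def recall(retrieved: Retrieved, gold: Dict[str, str]) -> float:
    """Fraction of questions whose gold chunk appears in the retrieved set."""
    hits = sum(1 for q, chunk in gold.items() if chunk in retrieved.get(q, set()))
    return hits / len(gold)


def union_paths(paths: List[Retrieved]) -> Retrieved:
    """Merge several retrieval paths into one candidate set per question."""
    merged: Retrieved = {}
    for path in paths:
        for q, chunks in path.items():
            merged.setdefault(q, set()).update(chunks)
    return merged


# Toy data (hypothetical): each path finds half the answers, but a different half.
gold = {"q1": "c1", "q2": "c2", "q3": "c3", "q4": "c4"}
dense = {"q1": {"c1"}, "q2": {"c2"}, "q3": {"c9"}, "q4": {"c8"}}
sparse = {"q1": {"c1"}, "q2": {"c7"}, "q3": {"c3"}, "q4": {"c8"}}
metadata = {"q1": {"c6"}, "q2": {"c2"}, "q3": {"c3"}, "q4": {"c8"}}
hyde = {"q1": {"c1"}, "q2": {"c7"}, "q3": {"c9"}, "q4": {"c4"}}

for name, path in [("dense", dense), ("sparse", sparse),
                   ("metadata", metadata), ("hyde", hyde)]:
    print(name, recall(path, gold))
print("merged", recall(union_paths([dense, sparse, metadata, hyde]), gold))
```

The merged recall can never be lower than the best single path, and it only rises above it when the paths fail on different questions. That overlap is the number worth measuring before adding a path.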
The architecture
The full pipeline is worth walking, because every stage is doing recall insurance.
It begins before retrieval, in preprocessing. Filings are not clean text. They are text wrapped around tables, charts, and footnotes. A naive chunker shreds a table into nonsense. FinSage processes the multi-modal content and tags each chunk with metadata: which section it came from, what type of content it is, what period it covers. That metadata is not decoration. It becomes a retrieval path of its own and a filter the later stages can use, which is why metadata-aware search scores as well as it does in the chart.
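To make the metadata idea concrete, here is a rough sketch of a tagged chunk and a filter over it. The three fields follow the description above (section, content type, period), but the class, the field names, and the sample rows are my own illustration and not FinSage's schema.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    section: str       # which part of the filing the chunk came from
    content_type: str  # e.g. "text", "table", "chart", "footnote"
    period: str        # reporting period the chunk covers


def metadata_search(chunks: Iterable[Chunk],
                    section: Optional[str] = None,
                    content_type: Optional[str] = None,
                    period: Optional[str] = None) -> List[Chunk]:
    """Keep chunks matching every metadata constraint that was supplied."""
    wanted = {"section": section, "content_type": content_type, "period": period}
    return [c for c in chunks
            if all(v is None or getattr(c, f) == v for f, v in wanted.items())]


# Hypothetical sample chunks.
chunks = [
    Chunk("c1", "Total debt | 1,200 | 1,050", "Notes to Financial Statements", "table", "FY2023"),
    Chunk("c2", "We expect leverage to decline.", "MD&A", "text", "FY2023"),
    Chunk("c3", "Total debt | 1,050 | 980", "Notes to Financial Statements", "table", "FY2022"),
]
hits = metadata_search(chunks, content_type="table", period="FY2023")
print([c.chunk_id for c in hits])
```

Keeping a table as one chunk with its own type tag is what lets a question about a reported figure skip the narrative sections entirely.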
Then the four-path retrieval, merged into one candidate set, exactly the combination the recall numbers reward. The query side adds HyDE, where the system drafts a hypothetical answer and retrieves against that rather than the bare question, a trick that helps when the question and the document use different vocabulary, which in finance they constantly do.
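The vocabulary-mismatch point is easiest to see in a toy. The sketch below assumes a bag-of-words similarity in place of a real embedding model and a stub in place of the LLM that drafts the hypothetical answer. Every name here (bow_embed, hyde_retrieve, fake_llm) is hypothetical, and FinSage's fine-tuned HyDE component is certainly more involved.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def bow_embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase bag of words."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def hyde_retrieve(question: str,
                  draft_answer: Callable[[str], str],
                  chunks: Dict[str, str],
                  k: int = 1) -> List[str]:
    """Retrieve against a drafted hypothetical answer, not the bare question."""
    query_vec = bow_embed(draft_answer(question))
    ranked = sorted(chunks,
                    key=lambda cid: cosine(query_vec, bow_embed(chunks[cid])),
                    reverse=True)
    return ranked[:k]


def fake_llm(question: str) -> str:
    """Stub for the LLM call; a real system would generate this text."""
    return "current maturities of long-term borrowings were reported in the balance sheet"


chunks = {
    "c1": "current maturities of long-term borrowings were 120 million",
    "c2": "how the company thinks about debt and when it is due",
}
question = "how much debt is due soon"

print(hyde_retrieve(question, lambda q: q, chunks))  # bare question as the query
print(hyde_retrieve(question, fake_llm, chunks))     # hypothetical answer as the query
```

The bare question shares words with the chunk that merely talks about debt, while the drafted answer speaks the filing's language and lands on the chunk that holds the figure.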
The reranker is the part to steal
The stage I would lift wholesale into a production system is the reranker. After retrieval pulls a broad candidate set, a reranker reorders it so the best chunks land on top, because the model downstream only reads the first few. FinSage fine-tunes its reranker with Direct Preference Optimization, training it on preferences that favor compliance-critical chunks. The effect is concrete: in the top-5 configuration, the domain-specialized reranker lifts precision from 5.6% to 18.5% over a general-purpose reranker.
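For readers who want to reproduce that kind of comparison on their own data, precision at k is simple to compute. This is a generic sketch of the metric, not the paper's evaluation code, and the chunk ids are invented.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Share of the top-k ranked chunks that are relevant."""
    return sum(1 for chunk_id in ranked[:k] if chunk_id in relevant) / k


# Hypothetical rankings of the same candidates by two rerankers.
relevant = {"c7"}
general = ["c1", "c2", "c3", "c4", "c5", "c7"]      # answer chunk sits at rank 6
specialized = ["c7", "c1", "c2", "c3", "c4", "c5"]  # answer chunk promoted to rank 1

print(precision_at_k(general, relevant))
print(precision_at_k(specialized, relevant))
```

One thing to keep in mind when reading top-5 precision: if a question has exactly one relevant chunk, the ceiling is 20%, so a low-sounding figure can still mean the answer chunk is almost always in the top five. Whether that one-chunk assumption matches the paper's setup is something I have not verified.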
A general reranker optimizes for topical relevance. A compliance reranker has to optimize for something narrower and more useful: the chunk that actually settles the regulatory question, which is often not the most topically central one.
DPO is a clean way to teach that distinction, because it learns from pairs of better-and-worse rather than from hand-built labels. You collect examples where one chunk answers the compliance question and another merely discusses the topic, and tune the reranker to prefer the first. That is a transferable recipe. Any desk running RAG over a domain where some chunks matter far more than their topical relevance suggests, which describes most regulated work, can build the same preference-tuned reranker on top of whatever retriever it already has.
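For a sense of what the tuning step optimizes, here is the standard DPO loss for a single preference pair in plain Python. Treating the reranker's score for a chunk as the log-probability term is my assumption about how the objective maps onto a reranker, and the paper's exact formulation may differ. The beta value is an arbitrary illustrative default.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are log-probability-like scores from the model being tuned
    (policy) and from a frozen reference copy of it.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Before tuning the policy equals the reference, so the loss is log(2).
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
# Raising the chosen chunk's score relative to the reference lowers the loss.
print(dpo_loss(2.0, 0.0, 0.0, 0.0))
```

The loss only cares about how far the tuned model has moved the chosen chunk above the rejected one relative to the reference, which is why pairs alone are enough and no absolute relevance label is needed.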
The recipe is cheaper to run than it sounds, because the preference pairs are mostly already lying around. Every answered compliance question is a labeled example: the chunk that held the answer is the preferred one, while the other chunks the retriever returned beside it are the rejected ones. The retriever’s own near-misses, past analyst work, and resolved tickets all become training pairs without a separate labeling project. The harder and more valuable pairs are the confusable ones, where a chunk discusses the right topic but stops short of the disclosure that actually answers the question. Those are precisely the cases a topical reranker mishandles, and mining the retriever’s high-scoring misses for them is where most of the gain lives.
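Mining those pairs can be sketched in a few lines. The version below assumes you have, for each answered question, the id of the chunk that held the answer and the retriever's scored candidate list. It keeps the highest-scoring misses as the rejected side, since those are the confusable ones. The function and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PreferencePair:
    question: str
    chosen: str    # chunk id that held the answer
    rejected: str  # high-scoring chunk id that did not


def mine_pairs(question: str,
               answer_chunk_id: str,
               retrieved: List[Tuple[str, float]],
               max_negatives: int = 2) -> List[PreferencePair]:
    """Pair the answer chunk with the retriever's highest-scoring misses."""
    misses = [(cid, score) for cid, score in retrieved if cid != answer_chunk_id]
    misses.sort(key=lambda item: item[1], reverse=True)  # hardest negatives first
    return [PreferencePair(question, answer_chunk_id, cid)
            for cid, _ in misses[:max_negatives]]


# Hypothetical retriever output: (chunk id, retriever score).
retrieved = [("c_topic", 0.91), ("c_answer", 0.88), ("c_other", 0.40), ("c_noise", 0.12)]
pairs = mine_pairs("Was the leverage covenant breached in FY2023?", "c_answer", retrieved)
for pair in pairs:
    print(pair.chosen, ">", pair.rejected)
```

Capping the negatives per question keeps the easy rejects from swamping the training set, though the right cap is something to tune rather than a number I can vouch for.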
There is a governance dividend too. A preference-tuned reranker is auditable in a way a bigger model is not, because the thing that shaped its behavior is a set of preference pairs a validator can read, sample, and challenge. When it promotes the wrong chunk, you can point to the pairs that taught it and add the counterexample, rather than shrugging at a black box you rented. For model-risk work that traceability is worth as much as the precision gain.
The model-risk reading
Set against the end-to-end accuracy, the picture sharpens. FinSage answers 49.66% of the QA set correctly under automated scoring and 57.05% under human review. The paper reports a 24.06-point accuracy gain over the best published baseline. Those absolute numbers are the honest part of the story. A system that answers a little over half of realistic compliance questions correctly is a strong research result and not yet an autonomous compliance officer. The point is not that FinSage solves the problem. It is that it moves the number a long way using parts a desk can actually assemble, and moves retrieval recall, the metric that decides whether the answering chunk is ever seen, into the low nineties.
For anyone running model-risk governance, that distribution of effort is the takeaway. The instinct when a RAG system underperforms is to reach for a bigger model. FinSage is evidence that the cheaper and more durable wins are upstream of the model: in how you chunk and tag the documents, in running several retrieval paths so their blind spots do not overlap, and in a reranker that knows what your domain actually cares about. Those are auditable, testable components, each one something a validator can inspect and a desk can improve in isolation. A bigger model is a black box you rent. A four-path retriever with a preference-tuned reranker is an architecture you own and can defend, which in a regulated setting is most of the value.
The system has been deployed, serving more than 1,200 users by the paper’s account, which matters because it means these choices survived contact with real questions rather than only a benchmark. The benchmark tells you the design is good. The deployment tells you it is buildable. Both point at the same unglamorous conclusion: in compliance QA, the engineering around the model is where the reliability lives.
FinSage tops compliance filing QA without a bigger model. The recall comes from running four retrieval paths instead of one. The reliability comes from a DPO-tuned reranker that surfaces compliance-critical chunks, lifting top-5 precision from 5.6% to 18.5%. The reranker recipe is the part to steal.