
LongRAG: bigger retrieval units, fewer detached numbers

July 30, 2024 · 3 min read

RAG, long-context, retrieval

Here is the LongRAG result worth sitting with. By making each retrieval unit about thirty times longer, roughly 4K tokens instead of a 100-word passage, LongRAG lifts answer recall at the top rank (answer recall@1) from 52% to 71% on Natural Questions. The retriever's job gets easier the moment you stop chopping documents into tiny pieces. That is a simple lever with a direct payoff for financial text.

The mechanism is straightforward. Standard RAG splits a corpus into short passages, which leaves the retriever sorting through 22 million units and forces it to pull the top 100 or 200 to find the answer. LongRAG groups related material into long units. The same corpus collapses to about 600,000 units. The system pulls only the top 4 to 8. A long-context reader, GPT-4o in the paper, handles the larger context that comes back. No training is required.
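The grouping step can be sketched in a few lines. This is a minimal illustration, not the paper's grouping procedure: it assumes the passages arrive ordered by document, packs consecutive passages from the same document into one unit, and uses whitespace word counts as a stand-in for tokens. The names `Passage`, `count_tokens`, and `build_long_units` are hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Passage:
    doc_id: str
    text: str


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for tokens.
    # Swap in a real tokenizer for production use.
    return len(text.split())


def build_long_units(passages: list[Passage], max_tokens: int = 4000) -> list[str]:
    """Pack consecutive passages of the same document into units of up to max_tokens.

    Assumes `passages` is already ordered so that passages of one
    document are contiguous and in reading order.
    """
    units: list[str] = []
    for _, group in groupby(passages, key=lambda p: p.doc_id):
        current: list[str] = []
        current_len = 0
        for passage in group:
            n = count_tokens(passage.text)
            # Close the current unit if adding this passage would overflow it.
            if current and current_len + n > max_tokens:
                units.append("\n".join(current))
                current, current_len = [], 0
            current.append(passage.text)
            current_len += n
        if current:
            units.append("\n".join(current))
    return units
```

A unit never crosses a document boundary in this sketch, so a retrieved unit always belongs to a single source. A passage longer than `max_tokens` is kept whole as its own unit, not truncated.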

[Figure: answer recall at top rank, Natural Questions (%)]

The recall jump is the whole argument. When the retriever only has to surface a handful of large units instead of a hundred small ones, the right evidence lands at the top far more often. Recall at the top rank is the ceiling on everything downstream: if the evidence never makes it into the context, no amount of model quality recovers it. Lifting that ceiling from 52% to 71% is a larger gain than most reranking tricks deliver. It costs nothing to train.
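To make the metric concrete, here is a minimal sketch of answer recall at k. It assumes the common substring definition: a question counts as a hit if any gold answer string appears in the text of the top k retrieved units. The function name and the case-insensitive matching are assumptions for illustration, not the paper's evaluation code.

```python
def answer_recall_at_k(
    ranked_units: list[list[str]],
    gold_answers: list[list[str]],
    k: int,
) -> float:
    """Fraction of questions whose top-k retrieved units contain a gold answer.

    ranked_units[i] holds the retrieved units for question i, best first.
    gold_answers[i] holds the acceptable answer strings for question i.
    """
    if not ranked_units:
        return 0.0
    hits = 0
    for units, golds in zip(ranked_units, gold_answers):
        # Assumption: case-insensitive substring match counts as a hit.
        top_text = " ".join(units[:k]).lower()
        if any(gold.lower() in top_text for gold in golds):
            hits += 1
    return hits / len(ranked_units)
```

Tracking this number separately from end-task accuracy shows whether a wrong answer came from the retriever or from the reader.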

Why coarser units help on a filing

The failure mode LongRAG fixes is one a quant meets constantly. A number in a 10-K means nothing without its surrounding qualification. A revenue figure sits next to its segment, its reporting period, and footnotes that restate it. Chop the document into 100-word passages and the retriever can hand the model a figure stripped of the footnote that changes its meaning. A 4K-token unit keeps the number and its context together. The model sees the qualification at the same time as the value.

This is the same reason a careful analyst reads the surrounding note before trusting the line item. A restated figure, a one-off charge, a change in segment definition: each of these lives in text adjacent to the number, and each can flip the meaning of the value. Tiny chunks sever exactly those links. Larger units preserve them by construction, which is a more reliable fix than hoping a reranker re-assembles the context after the fact.
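A toy check shows the severing effect. The sketch below uses made-up placeholder text, not a real filing: `VALUE_X` stands for a reported figure and `FOOTNOTE_X` for the note that qualifies it, separated by filler. Chunk sizes are counted in words as a stand-in for tokens, and both helper names are hypothetical.

```python
def chunk_words(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def chunks_with_both(chunks: list[str], value_marker: str, note_marker: str) -> list[str]:
    """Return the chunks that contain both the value and its qualifying note."""
    return [c for c in chunks if value_marker in c and note_marker in c]


# Synthetic document: a figure, 150 words of filler, then its footnote.
filler = " ".join(["filler"] * 150)
doc = (
    "Segment revenue was VALUE_X see note . "
    f"{filler} "
    "FOOTNOTE_X restates the figure after a change in segment definition ."
)

small = chunk_words(doc, 100)   # passage-sized chunks
large = chunk_words(doc, 4000)  # one long unit

print(len(chunks_with_both(small, "VALUE_X", "FOOTNOTE_X")))
print(len(chunks_with_both(large, "VALUE_X", "FOOTNOTE_X")))
```

With the small chunks, no single chunk holds both markers, so a retriever that returns the figure cannot also return its footnote in the same unit. The long unit holds both.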

End-task accuracy stays competitive: 62.7% exact match on Natural Questions, against 64.0% for a strong trained baseline. The point is that a coarser retrieval unit buys most of the accuracy with far less retriever strain and no training, which is the kind of cheap, robust win a research pipeline should prefer. The cost is context length: the reader has to handle the larger units, so the approach depends on the long-context models that arrived this year.

How I would use it

Use it as a default, not a special case. For retrieval over filings, transcripts, and research notes, start with large units and pull few of them, rather than many tiny passages reranked into a guess. Size the unit to the natural boundary of the document: a full note, a full section, a full transcript answer. The retrieved span is then something a human would also treat as one piece. Keep the reader long-context so the units fit.
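Sizing units to natural boundaries can be sketched as a split on headings. The heading pattern below (lines starting with "Item" or "Note" followed by a number) is an assumption chosen for illustration. Real filings vary, and a body line that happens to start the same way would be split incorrectly, so treat this as a starting point only.

```python
import re

# Assumed heading pattern: "Item 1", "Item 1A", "Note 2" at the start of a line.
HEADING = re.compile(r"^(Item|Note)\s+\d+[A-Z]?\b", re.MULTILINE)


def split_on_headings(text: str) -> list[str]:
    """Split a document into one unit per section, keeping each heading with its body."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts:
        stripped = text.strip()
        return [stripped] if stripped else []

    units: list[str] = []
    # Any text before the first heading becomes its own unit.
    preamble = text[:starts[0]].strip()
    if preamble:
        units.append(preamble)

    bounds = starts + [len(text)]
    for begin, end in zip(bounds, bounds[1:]):
        units.append(text[begin:end].strip())
    return units
```

Each returned unit is a whole section, so a footnote inside a note stays with the figures that note discusses. Sections that exceed the reader's context would still need a further split, which this sketch does not handle.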

The discipline carries a familiar lesson. The cheap structural change, larger units, often beats the elaborate one, a heavier reranker. It leaves you a system that is easier to reason about when a retrieved number turns out wrong, because there are fewer pieces to audit and each one carries its own context.

On financial filings, retrieve large units and few of them. A number then arrives carrying the footnote that qualifies it.
