Tim Frenzel

// Insight

Why generic embeddings cap your financial RAG

3 min read
embeddings, retrieval, finance-NLP

The silent ceiling on a financial RAG system is the embedding model. Most teams never touch it. BAM embeddings, from the EMNLP 2024 paper with the memorable title about finance being a jungle, show how much that default costs. Fine-tuned on finance text, they reach Recall@1 of 62.8%, against 39.2% for the best general-purpose embedding model. That is the gap between ranking the right passage first in roughly five queries out of eight and in fewer than two out of five.

Finance retrieval, Recall@1 (%)
Finance-tuned (BAM): 62.8
Best general-purpose embedding: 39.2

The model was trained on 14.3 million finance query-passage pairs. The payoff shows downstream. On FinanceBench, the benchmark where retrieval over filings mostly fails, swapping in the finance-tuned embeddings lifts answer accuracy by 8 points. The largest gains come on exactly the queries a desk cares about: date-specific, company-specific, and forward-looking ones.

Why generic embeddings struggle on finance

A general embedding model is trained to place semantically similar text near each other. That is the wrong objective for financial retrieval, where the distinctions that matter are precise rather than semantic. “Revenue in fiscal 2023” and “revenue in fiscal 2022” are semantically almost identical and financially completely different. A model tuned on general text treats them as neighbors. A model tuned on finance learns that the year is the load-bearing token.

The same is true for entities and forward-looking language. A generic model blurs two companies with similar business descriptions. It misses the difference between what a filing reports and what it projects. Finance is full of these near-identical-but-critical distinctions. An embedding space that does not encode them hands the retriever a blurred map. The 39.2% Recall@1 is what that blur costs.
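One way to probe whether an embedding space encodes these distinctions is to build minimal pairs: the same query with only the year changed. The sketch below is an illustration, not something from the paper. The function name `year_minimal_pairs` and the regex are my own assumptions, and it only covers years. Entity swaps would need a list of company names from your own corpus.

```python
import re
from typing import List, Tuple

# Matches four-digit years from 1900 to 2099 (an assumed, deliberately simple pattern).
YEAR = re.compile(r"\b(19|20)\d{2}\b")


def year_minimal_pairs(text: str, offsets: Tuple[int, ...] = (-1, 1)) -> List[Tuple[str, str]]:
    """Return (original, variant) pairs that differ only in one shifted year."""
    pairs = []
    for match in YEAR.finditer(text):
        year = int(match.group())
        for offset in offsets:
            # Rebuild the string with just this one year replaced.
            variant = text[:match.start()] + str(year + offset) + text[match.end():]
            pairs.append((text, variant))
    return pairs


for original, variant in year_minimal_pairs("Revenue in fiscal 2023"):
    print(original, "|", variant)
```

Embed both sides of each pair with the model under test. If the variant sits almost on top of the original, the model is treating the year as noise.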

This connects to the rest of the retrieval stack. Contextual Retrieval fixes the chunk that lost its context. Finance-tuned embeddings fix the space those chunks are mapped into. They are complementary upgrades, both aimed at the same target: getting the right evidence in front of a model that can already use it.

The cheapest upgrade in the stack

Put the economics plainly. An embedding swap is among the cheapest changes you can make to a retrieval system. You re-embed your corpus once and point the retriever at the new vectors. The rest of the pipeline is untouched. Against that near-trivial cost sits a Recall@1 jump from 39.2% to 62.8% on the benchmark, plus an 8-point lift in downstream answer accuracy on filings. Few upgrades in a RAG stack offer that ratio of payoff to effort. The reason teams skip it is not cost. It is that the embedding model is invisible, a default chosen once and never revisited, while attention goes to the model and the prompt. The lesson is to treat the embedding choice as a first-class decision on financial text, and to measure it the way you would measure any other component on the critical path of a wrong answer.
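To make the shape of the swap concrete, here is a minimal sketch of an index where the embedding model is just a constructor argument. Everything in it is an assumption for illustration: `VectorIndex`, `make_toy_embedder`, and the two toy bag-of-words embedders stand in for a real vector store and real models. The point is only that the search code does not change when the embedder does.

```python
import math
from typing import Callable, List, Sequence, Tuple

Embedder = Callable[[str], Sequence[float]]


def _normalize(vec: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class VectorIndex:
    """Tiny in-memory cosine index; the embedder is injected, not hard-coded."""

    def __init__(self, embed: Embedder, corpus: Sequence[str]) -> None:
        self.embed = embed
        self.corpus = list(corpus)
        # The one-time cost of a swap: re-embed every passage with the new model.
        self.vectors = [_normalize(embed(p)) for p in self.corpus]

    def search(self, query: str, k: int = 1) -> List[Tuple[int, float]]:
        q = _normalize(self.embed(query))
        scored = [(i, sum(a * b for a, b in zip(q, v))) for i, v in enumerate(self.vectors)]
        scored.sort(key=lambda pair: pair[1], reverse=True)  # stable, so ties keep corpus order
        return scored[:k]


def make_toy_embedder(vocab: Sequence[str]) -> Embedder:
    """Hypothetical stand-in for a real model: counts of vocabulary words."""
    def embed(text: str) -> List[float]:
        tokens = text.lower().split()
        return [float(tokens.count(word)) for word in vocab]
    return embed


corpus = ["revenue fiscal 2022", "revenue fiscal 2023"]
year_blind = make_toy_embedder(["revenue", "fiscal"])                  # ignores the year
year_aware = make_toy_embedder(["revenue", "fiscal", "2022", "2023"])  # encodes the year

old_index = VectorIndex(year_blind, corpus)
new_index = VectorIndex(year_aware, corpus)  # the swap: same corpus, same search code

query = "revenue fiscal 2023"
print(old_index.search(query)[0][0], new_index.search(query)[0][0])
```

In the toy setup the year-blind embedder cannot separate the two passages, so the tie falls to whichever comes first in the corpus. The year-aware one ranks the 2023 passage first.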

How I would use it

I would make it one of the first things to test on any financial retrieval system, because the cost is low and the leverage is high. Swapping an embedding model is a small change: re-embed the corpus, rebuild the index, and leave the rest of the pipeline alone. Measure Recall@1 on a set of your own questions before and after. If a finance-tuned model moves it the way it moved the benchmark, you have bought a large accuracy gain for almost no engineering.
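The before-and-after measurement can be a few lines. This is a sketch under stated assumptions: `recall_at_k` is my own helper, a retriever is any function that returns ranked passage ids for a question, and the questions, ids, and canned rankings below are made up to stand in for your two retrievers.

```python
from typing import Callable, List, Sequence

Retriever = Callable[[str], List[int]]  # question -> passage ids, best first


def recall_at_k(retrieve: Retriever, questions: Sequence[str],
                gold_ids: Sequence[int], k: int = 1) -> float:
    """Fraction of questions whose gold passage appears in the top k results."""
    if not questions or len(questions) != len(gold_ids):
        raise ValueError("need one gold passage id per question")
    hits = sum(1 for q, gold in zip(questions, gold_ids) if gold in retrieve(q)[:k])
    return hits / len(questions)


# Made-up evaluation set: each question has exactly one correct passage id.
questions = [
    "AcmeCo revenue fiscal 2023",
    "AcmeCo capex guidance 2024",
    "BetaCorp revenue fiscal 2023",
]
gold_ids = [11, 27, 42]

# Canned rankings standing in for the old and new retrievers.
canned_before = {questions[0]: [10, 11], questions[1]: [27, 3], questions[2]: [11, 42]}
canned_after = {questions[0]: [11, 10], questions[1]: [27, 3], questions[2]: [42, 11]}

before = recall_at_k(lambda q: canned_before[q], questions, gold_ids, k=1)
after = recall_at_k(lambda q: canned_after[q], questions, gold_ids, k=1)
print(f"Recall@1 before: {before:.3f}, after: {after:.3f}")
```

Keep the question set fixed between runs so the only thing that changes is the embedding model.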

The honest caveat is that a published embedding model is trained on someone else’s finance corpus, which may not match yours. The gains on your documents could be larger or smaller than the paper’s. That is the reason to measure rather than assume. The direction is clear, though: for financial text, a domain-tuned embedding model is the cheap, high-leverage upgrade most retrieval systems are leaving on the table.

The embedding model is the silent ceiling on financial retrieval. A finance-tuned one lifts Recall@1 from 39.2% to 62.8% on the benchmark, about 60% more first-rank hits than the generic model, for the price of a model swap and one re-embedding pass.
