RAGBench: component metrics that give a RAG answer an audit trail
A RAG system that returns one number and no reasoning is a model-risk problem waiting to happen. RAGBench is the first benchmark I have seen that takes that problem head-on, with 100,000 examples across five industry domains, finance and legal among them. Its contribution is not another accuracy score; it is a way to ask where an answer came from. For a desk that has to defend an LLM-generated figure, that traceability is the whole point.
RAGBench scores a generated answer on four component metrics, grouped as TRACe. Context relevance asks how much of the retrieved text was actually on topic. Context utilization asks how much of it the model used. Adherence asks whether the answer is grounded in the retrieved text or invented. Completeness asks whether the answer used all the relevant evidence available.
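To make the four definitions concrete, here is a minimal sketch of how they could be computed once each retrieved sentence has been labelled as relevant and/or used. It rests on assumptions of mine, not the benchmark's code: the function and field names are hypothetical, sentences are counted rather than weighted by length, and adherence is reduced to "no unsupported claims".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceScores:
    relevance: float     # share of retrieved sentences that were on topic
    utilization: float   # share of retrieved sentences the answer drew on
    completeness: float  # share of the relevant sentences the answer drew on
    adherence: bool      # True when every claim is supported by the context


def trace_scores(n_context, relevant, utilized, unsupported_claims):
    """Illustrative TRACe-style scores from sentence-level labels.

    n_context: number of retrieved sentences.
    relevant / utilized: sets of sentence indices (hypothetical label format).
    unsupported_claims: count of answer claims with no support in the context.
    """
    if n_context <= 0:
        raise ValueError("n_context must be positive")
    relevant, utilized = set(relevant), set(utilized)
    relevance = len(relevant) / n_context
    utilization = len(utilized) / n_context
    # Completeness only looks at evidence that was relevant to begin with.
    completeness = len(relevant & utilized) / len(relevant) if relevant else 0.0
    return TraceScores(relevance, utilization, completeness, unsupported_claims == 0)


# Toy case: 10 retrieved sentences, 4 relevant, 3 used, 2 of those relevant.
example = trace_scores(10, relevant={0, 1, 2, 3}, utilized={1, 2, 7},
                       unsupported_claims=0)
```

In this toy case the answer is grounded but used only half of the relevant evidence, which a single accuracy figure would not show.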
Why component metrics beat a single grade
A pass-fail score tells you the answer was wrong. It does not tell you whether the retriever missed the evidence, the model ignored evidence it had, or the model invented something the evidence never supported. Those are three different failures with three different fixes. Low relevance points at the retriever. Low utilization points at the prompt or the reader ignoring what it was given. Low adherence points at a hallucination. You cannot debug or defend a pipeline that collapses those into one number. You certainly cannot tell a risk committee which part of the system to trust.
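The mapping from low score to suspect component can be sketched as a small triage function. The threshold and the component labels are illustrative assumptions of mine; a real cut-off would need to be calibrated on labelled data.

```python
def diagnose(relevance, utilization, adherent, low=0.3):
    """Return the pipeline components a set of scores points at.

    `low` is a hypothetical threshold, not a value from RAGBench.
    """
    findings = []
    if relevance < low:
        findings.append("retriever")          # retrieved text was mostly off topic
    if utilization < low:
        findings.append("prompt_or_reader")   # evidence was there but went unused
    if not adherent:
        findings.append("hallucination")      # answer not grounded in the context
    return findings


# Toy cases: one retrieval failure, one generation failure, one clean answer.
retrieval_failure = diagnose(relevance=0.1, utilization=0.5, adherent=True)
generation_failure = diagnose(relevance=0.6, utilization=0.1, adherent=False)
clean = diagnose(relevance=0.6, utilization=0.6, adherent=True)
```

Returning a list rather than a single label is deliberate: one answer can fail in more than one place.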
The finance coverage is what makes this concrete for a desk. RAGBench draws its financial examples from FinQA and TAT-QA, which are numerical-reasoning sets over financial documents and tables, exactly the kind of task where a confident wrong number does real damage. Scoring a financial RAG system on these four axes tells you whether a wrong answer came from bad retrieval or bad reasoning, which is the first question any model-risk review will ask.
The cheap classifier beats the expensive judge
The second finding is the practical one. A small finetuned model, DeBERTa, detects hallucinations far better than a zero-shot LLM judge, with adherence AUROC in the 0.80 to 0.87 range against 0.51 to 0.65 for GPT-3.5. An AUROC of 0.5 is a coin flip, so at the bottom of its range the LLM judge is barely better than guessing on some domains. The purpose-built classifier, trained on the task, is genuinely useful.
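For readers who do not work with AUROC daily, a minimal sketch of what the number measures: the probability that a randomly chosen hallucinated answer gets a higher detector score than a randomly chosen grounded one. The labels and scores below are toy values I made up for illustration, not RAGBench results.

```python
def auroc(labels, scores):
    """Pairwise AUROC: P(score of a positive > score of a negative), ties count half.

    labels: 1 for a hallucinated answer, 0 for a grounded one.
    scores: detector output, higher meaning "more likely hallucinated".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: three hallucinated answers followed by three grounded ones.
labels = [1, 1, 1, 0, 0, 0]
useful_detector = auroc(labels, [0.9, 0.8, 0.4, 0.5, 0.2, 0.1])  # 8 of 9 pairs ranked correctly
blind_detector = auroc(labels, [0.5] * 6)                         # same score for everything
```

A detector that cannot separate the two classes lands at 0.5, which is why the low end of the LLM judge's range is so uncomfortable.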
That is a familiar lesson for a quant. A focused model fitted to one well-defined task usually outperforms a large general one asked to judge itself, and it costs a fraction as much to run. Using a large general-purpose model as its own zero-shot grader is both expensive and, on these results, weak. A small, dedicated adherence classifier is cheap enough to run on every single generation, which is the only way an audit trail actually covers the whole pipeline rather than a sampled few.
How I would use it
I would use it as the evaluation layer under any RAG system that touches a research claim. Score adherence on every generated answer with a cheap finetuned classifier, and gate the low-adherence ones for human review before they reach a memo. Track context relevance over time to know whether a retrieval problem or a generation problem is creeping in. Keep the per-answer scores, because they are the evidence a model-risk review will ask for.
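A minimal sketch of that workflow, assuming the adherence classifier has already produced a score between 0 and 1 for each answer (higher meaning better grounded). The threshold, window size, record fields, and sample values are all hypothetical.

```python
from collections import deque
from statistics import mean

ADHERENCE_THRESHOLD = 0.5  # hypothetical cut-off; calibrate on labelled answers


def gate(answer_id, adherence_score, audit_log, threshold=ADHERENCE_THRESHOLD):
    """Log the per-answer score and return True if it needs human review."""
    flagged = adherence_score < threshold
    audit_log.append(
        {"answer_id": answer_id, "adherence": adherence_score, "flagged": flagged}
    )
    return flagged


class RelevanceMonitor:
    """Rolling mean of context relevance, to spot retrieval drift over time."""

    def __init__(self, window=100):
        self._scores = deque(maxlen=window)

    def add(self, relevance):
        self._scores.append(relevance)

    def rolling_mean(self):
        return mean(self._scores) if self._scores else None


# Toy run: (answer id, adherence score, context relevance), invented values.
audit_log = []
monitor = RelevanceMonitor(window=3)
for answer_id, adherence, relevance in [
    ("a1", 0.92, 0.6), ("a2", 0.31, 0.5), ("a3", 0.77, 0.4), ("a4", 0.88, 0.3),
]:
    gate(answer_id, adherence, audit_log)
    monitor.add(relevance)

flagged_ids = [row["answer_id"] for row in audit_log if row["flagged"]]
```

Every answer gets a log row whether or not it is flagged, so the audit trail covers the whole pipeline and not only the failures.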
The audit trail this produces is exactly what that review wants to see: a measured, per-answer score that says how grounded each output was and why. That is a stronger position than a single accuracy number, because it lets you point at the failing component and fix it, rather than relabel the whole system as untrustworthy.
Score a RAG answer on relevance, utilization, adherence, and completeness. A wrong figure then traces to the step that produced it, before anyone trusts it.