// Insight

Uncertainty heads: the confidence score an LLM never gave you

May 17, 2025 · 7 min read

uncertainty, hallucination, calibration, validation

The thing a language model never hands you is the thing a risk function most needs: a trustworthy number for how sure it is. The model emits a fluent answer with no honest signal of whether the answer is grounded or invented, and on a regulated desk that gap is where the trouble starts. This paper supplies the missing primitive. A small supervised head, bolted onto a frozen model and reading its internal attention, detects hallucinations far better than the usual confidence proxies, and does so without retraining the model or changing a word of its output.

What the head is

The design is deliberately modest. The modesty is the point. You leave the base model frozen. You do not fine-tune it, you do not change its outputs, you do not pay to retrain anything large. Instead you attach a lightweight auxiliary head, the UQ head, that reads features from the model’s attention maps while it generates and produces a confidence score for each claim the model makes. The head is supervised: it is trained on examples labeled as grounded or hallucinated, learning the internal signature of a model that is making something up.
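To make the shape concrete, here is a minimal sketch rather than the authors' architecture. It swaps the paper's head for a plain logistic-regression probe, and it assumes each claim has already been reduced to a short vector of attention-derived features. The two feature columns, the toy values, and the names `train_probe` and `hallucination_score` are all invented for illustration. The base model does not appear in the code at all, which is the point: only the probe's weights are trained.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(features, labels, lr=0.5, epochs=500):
    """Fit a logistic-regression probe by batch gradient descent.

    features: one vector per claim (hypothetical attention-derived features).
    labels:   1 if the claim was labeled hallucinated, 0 if grounded.
    """
    n, dim = len(features), len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def hallucination_score(w, b, x):
    """Probe's estimated probability that a claim is hallucinated."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy data, invented for illustration. Assumed columns:
# [share of attention on the source context, normalized attention entropy]
features = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.7], [0.1, 0.9]]
labels = [0, 0, 1, 1]  # 1 = hallucinated

w, b = train_probe(features, labels)
score_suspect = hallucination_score(w, b, [0.15, 0.8])
score_grounded = hallucination_score(w, b, [0.85, 0.15])
```

A linear probe is the simplest stand-in I could pick. The paper's head is more capable than this, but the contract is the same: features in, one score per claim out, base model untouched.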

That last part is what separates this from the standard approaches. The common ways to estimate an LLM’s confidence are unsupervised and read the output: the probability the model assigned to its tokens, the perplexity of the generation, the entropy across the vocabulary. They are cheap and weak, because a model can be fluently, high-probability wrong. The UQ head instead looks inside, at how attention is distributed while the claim is formed, on the premise that a fabrication leaves a different internal trace than a grounded statement. The results say that premise holds.
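For reference, the output-side proxies are all a few lines of arithmetic over the per-token distributions, which is exactly why teams reach for them first. The sketch below assumes you already have those distributions as plain lists of probabilities; `sequence_baselines` is a name I made up, not an API from the paper or any library.

```python
import math

def sequence_baselines(token_dists, chosen_ids):
    """Unsupervised confidence proxies computed from output probabilities.

    token_dists: one probability distribution over the vocabulary per
                 generated token (each a list summing to 1).
    chosen_ids:  index of the token actually emitted at each step.
    """
    chosen_p = [dist[i] for dist, i in zip(token_dists, chosen_ids)]
    mean_logp = sum(math.log(p) for p in chosen_p) / len(chosen_p)
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0) for dist in token_dists
    ]
    return {
        "mean_token_prob": sum(chosen_p) / len(chosen_p),
        "perplexity": math.exp(-mean_logp),
        "mean_entropy": sum(entropies) / len(entropies),
    }

# Two steps over a two-token vocabulary, each a coin flip.
baselines = sequence_baselines([[0.5, 0.5], [0.5, 0.5]], [0, 1])
```

None of these numbers looks at anything but the output distribution. A model that is confidently wrong produces low perplexity and low entropy, and these proxies report it as safe.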

The result

Claim-level hallucination detection, Mistral-7B (PR-AUC)

On the in-domain Biographies set, the trained UQ head reaches 0.66 PR-AUC, 16 points above the best unsupervised signal and well above raw probability, entropy, or perplexity. Reading the model's attention beats reading its output probabilities.

The gap is not marginal. On the in-domain set, the UQ head reaches 0.66 PR-AUC against 0.50 for the best unsupervised method, a 16-point lead, and nearly doubles the 0.36 scored by raw perplexity. The output-probability proxies that most teams reach for first cluster in the low 0.40s. The lesson is blunt: the model’s own output probabilities are a poor confidence signal, while a small head trained to read its attention is a much better one. The signal that a model is hallucinating was there all along, in the internal computation. A cheap supervised probe extracts it where the output never revealed it.
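If PR-AUC is unfamiliar, it helps to see how little is inside it. The sketch below uses average precision, one common estimator of the area under the precision-recall curve. I am assuming that is close to what the paper reports, not asserting it, and the scores and labels are toy values. Ties between scores are not treated specially.

```python
def average_precision(scores, labels):
    """Average precision: mean of precision at each true positive,
    walking the claims from highest to lowest hallucination score.

    scores: detector score per claim (higher = more likely hallucinated).
    labels: 1 if the claim really is hallucinated, 0 otherwise.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, ap = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / rank  # precision at this recall step
    return ap / total_pos

# Toy example: hallucinations ranked 1st and 3rd out of four claims.
ap_toy = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
ap_perfect = average_precision([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

One reading aid: a detector that scores at random lands near the share of positives in the data, so a PR-AUC only means something relative to how common hallucinations are in the evaluation set.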

The head also travels. The authors pre-train heads for several model families, Mistral, Llama, and Gemma 2, and report that the detection generalizes across languages as well. That portability matters for the practical case, because it means the head is a reusable component rather than a one-off tuned to a single deployment. The cross-language numbers make the point concretely.

UHead detection holds across languages (PR-AUC)

UHead PR-AUC for a Gemma 2 head detecting hallucinations in biographies across four languages: English, Russian, Chinese, German. The head is trained with English supervision, yet the signal transfers, and Russian at 0.58 and Chinese at 0.54 actually score above in-domain English at 0.46. A confidence probe that survives a change of language is one you can reuse across a multilingual book of business instead of rebuilding it per market.

The gate it buys you

A confidence score is only useful if it changes a decision. The decision it enables is the one a regulated desk needs most: abstain or escalate.

A confidence gate for an LLM-derived figure

The head bolts onto the model without retraining it or touching its output. It adds the one thing the base model never gave you: a per-claim confidence you can threshold into an abstain-or-escalate decision.

Walk the loop. The model produces an answer. The head scores each claim in it. A threshold on that score sorts the claims into two piles. The confident ones flow through to the report. The doubtful ones route to a human. The threshold is the policy knob: set it conservatively for a client-facing note where a wrong number is costly, set it looser for an internal scratchpad where speed matters more. Either way, every figure arrives with a calibrated flag rather than as a bare assertion. The desk spends its scarce human review on the claims most likely to be wrong.
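The loop is small enough to write down. This is a sketch under my own assumptions: `ScoredClaim`, `route_claims`, the example claims, and both threshold values are hypothetical, and the confidence here is taken to be the head's estimate that a claim is grounded.

```python
from dataclasses import dataclass

@dataclass
class ScoredClaim:
    text: str
    confidence: float  # assumed: head's estimate that the claim is grounded

def route_claims(claims, threshold):
    """Split claims into those that pass to the report and those
    that escalate to human review."""
    passed = [c for c in claims if c.confidence >= threshold]
    escalated = [c for c in claims if c.confidence < threshold]
    return passed, escalated

claims = [
    ScoredClaim("Revenue grew 12% year on year", 0.95),
    ScoredClaim("Net debt fell to 1.4x EBITDA", 0.72),
    ScoredClaim("The covenant was waived in Q3", 0.41),
]

# Illustrative thresholds only: strict for a client note, looser internally.
client_passed, client_escalated = route_claims(claims, threshold=0.90)
scratch_passed, scratch_escalated = route_claims(claims, threshold=0.60)
```

Nothing about the model or the head changes between the two calls. The policy lives entirely in one number, which is what makes it reviewable.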

This is exactly the architecture an LLM-derived figure needs before it reaches a risk report. The pattern recurs across the work I trust: an extraction pipeline gates a number on a verification result. A sell-side analyst model is dangerous precisely because its confident narrative carries no marker of where the numbers underneath are shaky. The UQ head is the general-purpose version of that gate. Where you cannot recompute the answer in code, you can at least attach a confidence and route on it, which converts an unmarked stream of maybe-wrong outputs into a triaged one.

The honest limits

Two cautions keep this proportionate. The first is that the head is supervised. It needs labeled examples of grounded and hallucinated output to train on, and the quality of the gate is the quality of that labeling. For a domain where you can assemble such labels, this is buildable; financial extraction with a verifiable ground truth is a good candidate. For a domain where you cannot say cleanly what counts as a hallucination, it is harder. The head can only be as calibrated as the supervision behind it.

The second is calibration drift, which the numbers state plainly. The 0.66 PR-AUC is in-domain. Out of domain the head still beats its baselines, scoring 0.47 on one transfer set and 0.40 on another, so the advantage is real and useful. Those scores are also a long way below 0.66. A head trained on one kind of text degrades on another, which means the threshold you calibrated on yesterday’s data is not guaranteed to hold on tomorrow’s. The discipline this demands is the same one any model-risk function already knows: monitor the calibration in production, recalibrate on fresh labeled data, and never treat a confidence score as a fixed property of the model.
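One way to put that discipline into practice, sketched under assumptions of my own: periodically label a fresh sample of production claims, measure how often the claims that passed the gate were in fact wrong, and flag the gate when that rate exceeds the tolerance you set at calibration time. The function names, the sample, and the 10% tolerance are illustrative, not taken from the paper.

```python
def passed_error_rate(confidences, is_grounded, threshold):
    """Share of claims that passed the gate but were labeled not grounded.

    Returns None when no claim in the sample passed the threshold.
    """
    passed = [g for c, g in zip(confidences, is_grounded) if c >= threshold]
    if not passed:
        return None
    return 1.0 - sum(passed) / len(passed)

def needs_recalibration(confidences, is_grounded, threshold, max_error_rate):
    """True when the fresh sample shows more errors slipping through
    the gate than the tolerance allows."""
    rate = passed_error_rate(confidences, is_grounded, threshold)
    return rate is not None and rate > max_error_rate

# Fresh labeled sample (toy values): 1 = grounded, 0 = hallucinated.
fresh_conf = [0.95, 0.92, 0.91, 0.50]
fresh_truth = [1, 0, 1, 0]

rate = passed_error_rate(fresh_conf, fresh_truth, threshold=0.90)
drifted = needs_recalibration(fresh_conf, fresh_truth, 0.90, max_error_rate=0.10)
```

A real monitor would want a sample large enough for the rate to be stable and would track it over time. The sketch only shows the comparison that triggers the recalibration.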

And the obvious caveat: a detector is not a fixer. The head tells you a claim is probably wrong. It does not make the claim right. Its value is entirely in triage, in deciding which outputs to trust and which to send to a person. A 0.66 PR-AUC means the triage is good rather than perfect. You are buying a much better filter, not a guarantee.

The verdict

For model-risk governance, this is the kind of component I want in the stack: small, inspectable, model-agnostic, and aimed at the exact decision a regulated workflow turns on. It does not promise to stop the model hallucinating. It promises something more useful in practice, which is a calibrated signal of when it probably has, sitting between the model and the report where a human can act on it. The output probabilities most teams rely on for that signal are, the numbers show, close to useless for it. A trained head reading the model’s attention is the upgrade, cheap enough that the only real cost is assembling the labels to train it. On a desk that has to defend every figure it ships, that is a trade worth making.

An LLM’s output probabilities are a weak confidence signal. A small supervised head reading its attention reaches 0.66 PR-AUC at spotting hallucinations, far above perplexity or max-probability, and bolts on without retraining the model. It buys the abstain-or-escalate gate a regulated desk needs, as long as you monitor it for calibration drift.
