// Insight

GPT as a sell-side analyst: where it beats the human, and where it folds

February 21, 2025 · 6 min read

LLM-evaluation · equity-research · forecasting

The headline writes itself every quarter: AI is coming for the sell-side analyst. The research is more interesting than the headline. An LLM reads an earnings release much the way a human analyst does, but its numerical reasoning is inconsistent. The most useful finding is that you can tell, from observable features, when to trust its forecast and when not to. The move is not to replace the analyst. It is to know exactly where the model is calibrated and to gate it where it is not.

What did the study test?

The setup is direct. After a company reports earnings, can an LLM generate a forecast of the kind a sell-side analyst produces, and how does it compare to the humans? Earnings forecasting is a good test because it is concrete, scored against reality, and exactly the judgment-plus-numbers task analysts are paid for. The study goes further than a score, though. It looks at how the model arrives at its forecast, which is where the genuinely useful findings sit.

What it gets right, and where it folds

Two findings cut against the easy story. The first: the model’s narrative attention is consistent and human-like. When it reads a release, it focuses on the same earnings-relevant information a trained human analyst would, which is reassuring and a little surprising.

The second finding complicates the “LLMs are bad at math” reflex. Its numerical reasoning is inconsistent rather than uniformly weak. It varies substantially across contexts. On some it reasons through the numbers competently. On others it stumbles, in places you would not always predict. A flat claim that the model cannot do numbers is wrong. The truth is messier and more operationally important. It can do the numbers sometimes. The failures are scattered.

There is a third finding that punctures a comfortable assumption. Focusing on the narrative does not always produce a better forecast. Reading the story well and getting the number right are not the same skill. The model can do the first without the second, which is precisely the trap a naive deployment falls into: the output reads fluent and well-reasoned while the number underneath it is wrong.

Why narrative-strong but number-shaky?

The pattern is not an accident of this one study. A language model is built to predict text. It is extraordinary at the linguistic structure of analysis: the framing, the emphasis, the way a careful reader weights a guidance change against a one-time charge. Multi-step quantitative reasoning is a different operation. It is the one these models do least reliably, as a hard benchmark like FrontierMath shows starkly. So the split the study finds is the expected one once you take the architecture seriously: strong on the narrative it was trained to model, uneven on the arithmetic it was not. The value is in knowing that the two travel separately. A confident narrative is no guarantee the number is sound.

Picture the failure in practice. A company reports a quarter where revenue beat but margins compressed on a one-time cost. The model reads that narrative correctly. It flags the margin pressure, weighs the one-time nature, frames the outlook sensibly. Then it computes the implied next-quarter EPS and the arithmetic is subtly off, because a multi-step calculation tripped somewhere the fluent narrative gave no warning of. A reader who trusted the confident write-up would never suspect the number. That is the exact shape of the risk: the part that reads most authoritative is not the part most likely to be right.

The useful part: a reliability diagnostic

This is where the paper earns its place on a desk. Rather than a verdict on whether LLMs can replace analysts, it offers a diagnostic. Forecast accuracy tracks observable processing features, specifically the model’s narrative focus, the quality of its numerical reasoning, and its self-assessed confidence. Read those features and you can separate a forecast you should trust from one you should not, case by case, instead of trusting or dismissing the model wholesale.

When to trust the model's forecast

The paper's diagnostic: accuracy tracks observable processing features. Reading them tells a trustworthy forecast from a doubtful one, case by case, rather than treating the model as uniformly good or bad.

That framing is exactly what a model-risk function wants. A model reliable in detectable conditions and shaky in others is a tool to be gated, not replaced. You run it where the diagnostic says it is calibrated. You route the rest to a human or a check. The capability and the control arrive together, which is rare and valuable.

The self-assessed-confidence feature deserves a flag of its own, because it is the most actionable and the most dangerous. A model that knows when it is unsure is a gift: you gate the low-confidence cases automatically. But self-assessed confidence is itself a model output. A model can be confidently wrong, which is the failure mode that does the most damage. The diagnostic works because confidence is read alongside the other features, narrative focus and numerical-reasoning quality, rather than trusted alone. Use the confidence signal, but never as the only gate.
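A minimal sketch of a gate that reads confidence alongside the other two features rather than alone. Everything here is my own illustration, not the paper's method: the `ForecastFeatures` fields, the assumption that each feature has been scored on a 0 to 1 scale, and the threshold values are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ForecastFeatures:
    """Hypothetical per-forecast diagnostic scores, each assumed scaled to [0, 1]."""
    narrative_focus: float     # how much the model attended to earnings-relevant content
    numerical_quality: float   # graded quality of its numerical reasoning
    self_confidence: float     # the model's own stated confidence


@dataclass(frozen=True)
class Thresholds:
    """Placeholder cut-offs (assumed values); calibrate on your own forecasts."""
    narrative_focus: float = 0.6
    numerical_quality: float = 0.7
    self_confidence: float = 0.7


def gate(features: ForecastFeatures,
         thresholds: Thresholds = Thresholds()) -> Tuple[str, List[str]]:
    """Return ("accept" or "review", reasons).

    A forecast is accepted only if every feature clears its bar, so a
    confidently wrong forecast with weak numerical reasoning is still routed
    to a human.
    """
    reasons: List[str] = []
    if features.narrative_focus < thresholds.narrative_focus:
        reasons.append("weak narrative focus")
    if features.numerical_quality < thresholds.numerical_quality:
        reasons.append("weak numerical reasoning")
    if features.self_confidence < thresholds.self_confidence:
        reasons.append("low self-assessed confidence")
    return ("review" if reasons else "accept", reasons)


# High confidence does not rescue shaky numerical reasoning.
decision, reasons = gate(ForecastFeatures(0.9, 0.4, 0.95))
print(decision, reasons)
```

The design choice that matters is the all-must-pass rule. An averaged score would let a very confident model paper over a weak numerical-reasoning score, which is exactly the failure the paragraph above warns about.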

How I would use it

Pair the model with the analyst, gated by the diagnostic. Use the LLM for what it does reliably: reading the narrative, surfacing the earnings-relevant points, drafting the qualitative take. Treat its numbers as a hypothesis to check rather than an output to trust, because its numerical reasoning is exactly the kind that fails in scattered, hard-to-anticipate ways. Build the diagnostic features into the pipeline. The system can then flag its own low-confidence, weak-numerical-reasoning cases for human review automatically, which turns the study’s insight into an operational control rather than a caveat in a footnote.
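One cheap way to treat the number as a hypothesis is to recompute it. The sketch below assumes you prompt the model to report the inputs behind its EPS figure (net income and diluted share count), then redo the division yourself and flag any disagreement. The function names, the input fields, and the 1% tolerance are hypothetical choices of mine, not anything the study prescribes.

```python
import math
from typing import Tuple


def recompute_eps(net_income: float, diluted_shares: float) -> float:
    """EPS implied by the model's own stated inputs (same units for both)."""
    if diluted_shares <= 0:
        raise ValueError("diluted_shares must be positive")
    return net_income / diluted_shares


def check_eps(stated_eps: float,
              net_income: float,
              diluted_shares: float,
              rel_tol: float = 0.01) -> Tuple[bool, float]:
    """Compare the model's stated EPS with the EPS its inputs imply.

    Returns (is_consistent, implied_eps). rel_tol is an assumed tolerance
    to absorb rounding in the write-up; tune it to your reporting precision.
    """
    implied = recompute_eps(net_income, diluted_shares)
    return math.isclose(stated_eps, implied, rel_tol=rel_tol), implied


# Illustrative numbers only: 412.0 (millions) of net income on 200.0 (millions) shares.
ok, implied = check_eps(stated_eps=2.16, net_income=412.0, diluted_shares=200.0)
if not ok:
    print(f"Flag for review: stated 2.16 vs implied {implied:.2f}")
```

This catches only internal inconsistency, where the final figure does not follow from the model's own inputs. It says nothing about whether the inputs themselves are right, so it complements the analyst's check rather than replacing it.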

On a desk that blends discretionary and systematic work, this is the natural fit. The analyst keeps the judgment and owns the number. The model accelerates the reading and drafts the narrative, at a scale no human team could match across a coverage universe. The result is an analyst with a fast, well-understood assistant and a clear line for where the assistant’s word is good. That is a more valuable arrangement than either the breathless replacement story or the dismissive one. It is the one the evidence actually supports.

One caution on the diagnostic itself: it was derived on the study’s data. You should re-derive it on yours. The relationship between processing features and accuracy may differ for your sectors, your question types, your model version. Treat the framework as the right idea, that observable features predict reliability, while treating the specific thresholds as something to calibrate on your own forecasts before you let them gate anything that matters. The framework is portable. The thresholds are local, and earning them on your own data is worth the effort.
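Here is a rough sketch of what earning a threshold on your own data could look like for a single feature. It assumes you have a history of past forecasts, each with a feature score and a realized absolute percentage error, and it picks the lowest cut-off whose accepted forecasts meet an error target. The function name, the data shape, and the minimum sample size are hypothetical; a real calibration would also validate the cut-off on held-out forecasts.

```python
from typing import List, Optional, Tuple


def calibrate_threshold(history: List[Tuple[float, float]],
                        max_mean_error: float,
                        min_accepted: int = 30) -> Optional[float]:
    """Pick the lowest score cut-off whose accepted forecasts are accurate enough.

    history: (feature_score, absolute_pct_error) pairs from your own forecasts.
    Returns None if no cut-off meets the target with at least min_accepted cases.
    """
    for cutoff in sorted({score for score, _ in history}):
        errors = [err for score, err in history if score >= cutoff]
        if len(errors) < min_accepted:
            break  # higher cut-offs accept even fewer forecasts
        if sum(errors) / len(errors) <= max_mean_error:
            return cutoff
    return None


# Toy history (made-up values) where higher scores go with smaller errors.
toy_history = [(0.2, 0.30), (0.4, 0.20), (0.6, 0.05), (0.8, 0.03)]
print(calibrate_threshold(toy_history, max_mean_error=0.05, min_accepted=2))
```

Returning `None` when nothing qualifies is deliberate. If your own data cannot support a cut-off, the honest answer is that the feature does not gate anything yet.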

An LLM reads an earnings release like an analyst but reasons through the numbers unevenly. The win is the diagnostic: trust its forecast where narrative focus, numerical reasoning, and confidence line up, and gate it where they do not.
