
Native Sparse Attention: cheaper long context, trained in from the start

February 18, 2025 · 7 min read

long-context · attention · efficiency

Attention cost grows quadratically with sequence length, which is why long context is expensive. Double the sequence and you quadruple the cost, so feeding a model a whole 10-K, a full earnings call, or a long order-book history runs straight into a wall. Native Sparse Attention (NSA), from DeepSeek, attacks that wall by being sparse from the start. NSA matches full attention on quality while decoding up to 11.6 times faster at 64k context, because the sparsity is trained in rather than bolted on.
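The doubling claim is easy to check with a back-of-the-envelope count. The sketch below counts only the two matrix products that involve every query-key pair; the function name and the head dimension of 128 are illustrative assumptions, not figures from the paper.

```python
def attention_score_flops(seq_len: int, head_dim: int) -> int:
    """Rough FLOP count for one head of full attention (assumed cost model)."""
    # Q @ K^T: seq_len * seq_len dot products of length head_dim,
    # counted as 2 FLOPs per multiply-add.
    qk = 2 * seq_len * seq_len * head_dim
    # weights @ V: the same number of multiply-adds again.
    av = 2 * seq_len * seq_len * head_dim
    return qk + av

base = attention_score_flops(32_768, 128)
doubled = attention_score_flops(65_536, 128)
ratio = doubled / base
print(ratio)  # 4.0: twice the tokens, four times the work
```

The head dimension cancels out of the ratio, which is the point: the quadratic term is driven by sequence length alone.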

Sparse attention is not new. The usual approach takes a model trained with full attention and prunes the connections afterward, which saves compute at inference but tends to cost accuracy, because the model was never trained to work that way. The pruning is a compromise imposed on a model that learned to expect every token. NSA’s claim is that sparsity should be part of training from the beginning. The model then learns to use it rather than merely tolerate it.

It helps to see where NSA sits in years of attempts. Earlier efficient-attention methods like Longformer and BigBird used fixed sparse patterns, attending to a local window plus a few global tokens chosen by a rule rather than learned. They cut cost. They also left accuracy on the table, because a fixed pattern cannot adapt to which tokens actually matter for a given query. NSA makes two advances over that line. The selection is learned rather than hand-designed. And it is trained in from the start rather than applied to a finished model. Those are the two places the older methods compromised. They are exactly the two compromises NSA refuses to make.
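To make "fixed pattern" concrete, here is a minimal sketch of a local-window-plus-global-tokens mask in the spirit of those older methods. It is not the Longformer or BigBird implementation; the function name and the sizes are hypothetical.

```python
import numpy as np

def fixed_sparse_mask(seq_len, window, global_tokens):
    """Boolean mask: True where position i is allowed to attend to position j."""
    i = np.arange(seq_len)
    # Local band: each token sees neighbours within `window` positions.
    mask = np.abs(i[:, None] - i[None, :]) <= window
    g = np.asarray(global_tokens)
    mask[:, g] = True  # every token attends to the global tokens
    mask[g, :] = True  # global tokens attend to every token
    return mask

mask = fixed_sparse_mask(seq_len=16, window=2, global_tokens=[0])
density = mask.mean()  # fraction of query-key pairs actually computed
```

Nothing in the mask depends on the content of the tokens. The same pairs are skipped for every input, which is the rigidity that learned selection removes.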

How does NSA work?

Three branches, blended by a learned gate. For each query, NSA does not read every previous token. It reads three cheaper views instead. The first is a compressed view, where nearby tokens are aggregated into coarse blocks, giving a low-resolution picture of the whole sequence. The second is a selected view, the handful of blocks that matter most for this query, ranked using the compressed attention and read at full resolution. The third is a sliding window over the most recent tokens, for local context. A learned gate decides how much each branch contributes to the answer.

Figure: Native Sparse Attention: three branches, one gate. Instead of attending to every token, NSA reads a compressed global view, the few blocks that matter most, and the recent window, then blends them with a learned gate. The block selection is trained end to end, not chosen by a fixed rule.
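The three branches can be sketched for a single query in a few lines of NumPy. This is a simplified illustration under stated assumptions, not DeepSeek's implementation: mean-pooling stands in for the learned compressor, the gate is a sigmoid over a random linear map rather than trained weights, blocks do not overlap, and the names `nsa_single_query` and `W_gate` are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    """Scaled dot-product attention for one query vector."""
    w = softmax(K @ q / np.sqrt(q.shape[0]))
    return w @ V, w

def nsa_single_query(q, K, V, W_gate, block=4, top_n=2, window=8):
    t, d = K.shape
    n_blocks = t // block  # a ragged tail block is dropped for simplicity
    # Branch 1: compressed view. Mean-pool each block (assumed stand-in
    # for a learned compression function).
    Kc = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    Vc = V[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    out_cmp, block_scores = attend(q, Kc, Vc)
    # Branch 2: selected view. Rank blocks by their compressed attention
    # weight and read the top ones at full resolution.
    top = np.sort(np.argsort(block_scores)[-top_n:])
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in top])
    out_sel, _ = attend(q, K[idx], V[idx])
    # Branch 3: sliding window over the most recent tokens.
    out_win, _ = attend(q, K[-window:], V[-window:])
    # Gate: one weight per branch (sigmoid form is an assumption).
    g = 1.0 / (1.0 + np.exp(-(W_gate @ q)))
    return g[0] * out_cmp + g[1] * out_sel + g[2] * out_win, top, g

rng = np.random.default_rng(0)
t, d = 64, 16
K, V = rng.normal(size=(t, d)), rng.normal(size=(t, d))
q = rng.normal(size=d)
W_gate = rng.normal(size=(3, d))  # would be learned in a real model
out, top, g = nsa_single_query(q, K, V, W_gate, block=4, top_n=2, window=8)
tokens_read = t // 4 + 2 * 4 + 8  # compressed + selected + window entries
```

With these toy sizes the query touches 32 key-value entries instead of 64. The saving grows with sequence length because only the compressed branch scales with it, and it scales at a fraction of the token count.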

The two phrases that carry the contribution are hardware-aligned and natively trainable. Hardware-aligned means the sparse pattern is designed to run efficiently on a GPU, reading contiguous blocks into fast memory rather than scattering reads across the sequence, which is what made many earlier sparse schemes slow in practice despite looking cheap on paper. Natively trainable means the whole mechanism is differentiable and trained end to end. The model learns the selection during pretraining instead of having it imposed afterward, which is the difference that preserves quality.

This is the distinction I have always seen between a fast car and a fast-by-design one. You can take a road car and strip weight, add a wing, swap the brakes. It will get quicker. A car designed light and aerodynamic from the chassis up is in another class, because every part assumes the others. NSA is the second kind. The sparsity is in the foundation, not taped on at the end. That is why it keeps the quality the bolt-on methods give away.

How much faster, and at what cost to quality?

Faster across the board. The quality cost is negative.

Figure: NSA speedup over full attention at 64k context (×)

At 64k context, NSA runs 9.0 times faster on the forward pass, 6.0 times faster on the backward pass, and 11.6 times faster at decoding. Those are the numbers that decide whether long context is affordable at scale. They cover training as well as inference, because the speedup applies to the backward pass too. Quality does not drop. NSA slightly beats full attention, averaging 0.456 against 0.443 across nine general benchmarks, and stays ahead on long-context tasks specifically. A method that is both faster and better is rare enough to be worth double-checking. The result holds across the authors' reported tests, which is what makes the paper notable rather than merely promising.
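The decoding figure is less mysterious than it looks. Decoding is largely bound by how many cached keys and values each new token has to read, so counting those reads gives a rough estimate. The sketch below assumes a compression stride of 16, 16 selected blocks of 64 tokens, and a 512-token window. These are assumed hyperparameters for illustration, not values stated in this article, and the function name is hypothetical.

```python
def nsa_tokens_read(context_len, stride=16, n_selected=16, sel_block=64, window=512):
    """Approximate key-value entries read per decoded token (assumed cost model)."""
    compressed = context_len // stride  # one entry per compressed block
    selected = n_selected * sel_block   # chosen blocks, read at full resolution
    return compressed + selected + window

ctx = 65_536
sparse = nsa_tokens_read(ctx)  # 4096 + 1024 + 512
ratio = ctx / sparse
print(sparse, round(ratio, 1))  # 5632 11.6
```

Under those assumptions the read ratio lands on the reported decoding speedup, which suggests the gain comes from memory traffic avoided rather than from any arithmetic trick.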

Why does this matter for finance?

Long context is the enabling layer for document-heavy finance, and cost has been the thing standing in its way. A 10-K runs to hundreds of pages. An earnings-call transcript is tens of thousands of tokens. An order-book history is effectively unbounded. The standard workaround is to chunk and retrieve, which works but throws away the cross-references a long document depends on, the way a risk factor disclosed on page 40 qualifies a revenue number on page 12. A retrieval system that fetches the page-12 number without the page-40 caveat hands the model half the picture.

Long-context models can read the whole document and keep those links intact. The reason not to do it everywhere has been the quadratic cost. NSA attacks exactly that cost. Cheaper long context means a model can read a full filing, a complete transcript, or a long stretch of market history without the truncation that quietly discards half the signal. For the document-heavy end of finance, where the answer often lives in the relationship between two passages far apart, that is not a minor efficiency. It is the difference between reading the document and sampling it.

A concrete case makes the point. Ask a model whether a company’s revenue guidance is consistent with the risk factors it disclosed. The answer requires holding the guidance section and the risk-factor section in mind at once, often a hundred pages apart. Chunked retrieval tends to fetch one and miss the other, because the two are logically linked rather than textually similar. A similarity search cannot see a logical link. A model that reads the whole filing sees it. Cheaper long-context attention is what makes reading the whole filing routine rather than a budget decision reserved for the most important questions.
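A toy retrieval run shows the failure mode. The sketch uses bag-of-words cosine similarity as a deliberately crude stand-in for an embedding search, and the filing snippets are invented for illustration. Real retrievers are better than this, but the ranking logic is the same: similarity to the query, not relevance to the answer.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity over raw word counts."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0

# Hypothetical chunks from one filing.
chunks = {
    "guidance": "management expects revenue growth of ten percent next year",
    "caveat": "a single customer accounts for most sales and may not renew its contract",
    "boilerplate": "the annual meeting of shareholders will be held next year",
}
query = "what revenue growth does management expect next year"

scores = {name: cosine(query, text) for name, text in chunks.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['guidance', 'boilerplate', 'caveat']
```

The caveat that qualifies the guidance shares no words with the question, so it ranks below even the boilerplate. A model reading the whole filing sees both passages regardless of how they score.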

Where is the catch?

NSA is an architecture rather than a patch. The gains come from training a model with NSA from the start, which means you cannot sprinkle it onto a model you already run. To get the benefit you need a model pretrained this way, which today means waiting for one or training your own, and training your own is a lab-scale undertaking. For most desks the practical path is to adopt long-context models built on efficient attention as they ship, rather than to build the attention mechanism themselves.

The speedups are also a long-context phenomenon. At short context, where full attention is already cheap, there is little to gain, and NSA’s machinery would be overhead. It matters precisely when the sequences are long, which is the regime document-heavy finance lives in. The limitation and the use case line up, which is the most you can ask of an enabling technology: it is most valuable exactly where you need it. For a desk weighing whether to invest in long-context infrastructure, that alignment is the reassuring part, because the payoff lands on exactly the workloads that justify the spend.
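A quick sweep over context lengths shows where the gain switches on. As before, this counts key-value entries read per decoded token, and the hyperparameters (stride 16, 16 selected blocks of 64 tokens, a 512-token window) are assumptions for illustration. Capping the sparse read at the sequence length is a simplification; a real short-context run would also pay the overhead of the extra machinery.

```python
def tokens_read(context_len, stride=16, n_selected=16, sel_block=64, window=512):
    """Entries read per decoded token, capped at what the sequence contains."""
    sparse = context_len // stride + n_selected * sel_block + window
    return min(sparse, context_len)

lengths = (1_024, 2_048, 8_192, 65_536)
ratios = {n: n / tokens_read(n) for n in lengths}
for n in lengths:
    print(f"{n:>6} tokens: {ratios[n]:.1f}x fewer reads than full attention")
```

At 1k tokens the fixed budget for selected blocks and the window already covers the whole sequence, so there is nothing to save. The ratio only pulls away once the sequence is many times larger than that budget.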

NSA makes long-context attention cheap by training the sparsity in rather than bolting it on: up to 11.6 times faster at 64k, with quality that edges full attention. For document-heavy finance, it is the layer that lets a model read the whole filing instead of a sample of it.
