// Insight

Llama 3.1 405B: a frontier model you can run behind your own firewall

August 4, 202410 min read

open-weightsfrontierLlamaon-prem

The headline result is not a benchmark, it is a location. Llama 3.1 405B is the first openly available model that matches the closed frontier. You can run it inside your own firewall. Meta reports MMLU at 87.3, which sits alongside GPT-4-Turbo at 86.5 and Claude 3 Opus at 86.8. On the tasks a research desk cares about, the open weights are now level with the models you used to rent. For anyone who cannot send proprietary data to a vendor API, that changes the build-versus-buy question entirely.

I have spent years watching teams want frontier capability and refuse to ship their signal logic, their positions, or their compliance-sensitive documents to a third party. Until now the answer was a smaller, weaker on-prem model and a quiet compromise on quality. Llama 3.1 405B removes the compromise. The capability you can host has caught the capability you could only call.

What actually shipped?

Meta released three sizes, 8B, 70B, and 405B, all with a 128K context window. The 405B is the one that matters here. It was trained on more than 15 trillion tokens using more than sixteen thousand H100 GPUs, the first Llama model at that scale. The weights are open on Hugging Face. The full recipe is laid out in The Llama 3 Herd of Models, which is itself part of why this matters: the training is documented in detail rather than hidden behind an endpoint.

That openness is worth dwelling on, because it is the difference between a model you use and a model you can actually trust in a regulated setting. You can read how it was trained, inspect what went into it, run it offline, and keep the exact version forever. A closed API gives you none of those. The benchmark parity is the headline. The auditability is what lets the model into workflows a vendor API could never enter.

The benchmark numbers hold up across the board, well beyond MMLU.

Llama 3.1 405B across benchmarks (%)

On MATH the 405B reaches 73.8, behind GPT-4o at 76.6 and ahead of Claude 3.5 Sonnet at 71.1. On HumanEval it posts 89.0, near the top of the field. GSM8K is effectively saturated at 96.8, in a band with the strongest closed models. The pattern is consistent. It holds frontier level wherever you push it.

MMLU: open weights now meet the closed frontier (%)

Why does an open frontier model change build-vs-buy?

Because the constraint on a quant desk was never only capability. The hard part was never the model’s intelligence, it was being allowed to point that intelligence at your own data. A closed API is a dependency with three costs a research lead feels every quarter: your inputs leave the building, your reproducibility is hostage to a model that can change underneath you, and your unit economics are set by someone else’s price list. An open frontier model retires all three at once.

Take them in turn, because each one is a real line in a real review. Data residency is often a hard compliance requirement rather than a preference: positions, signals, and client documents are not allowed to transit a third party, full stop. For those workflows a closed API is simply a non-option. The best on-prem model used to be a long way behind the frontier. That distance has now closed.

Reproducibility is the one quants undervalue until it bites. A backtest that called a vendor API last year may not be reproducible this year, because the model behind the endpoint was updated or retired. Pinned open weights are a fixed artifact: the model that produced a result in March produces the same result in September. That is the property a research process and a model-risk review both require. A moving endpoint cannot offer it by construction.

What self-hosted open weights retire that a closed API cannot

A closed API reverses all three: inputs transit a vendor, the endpoint can change under a backtest, and cost is a per-token tariff you do not set. What the diagram cannot show is the counterweight, the GPU bill to serve a 405B model.

The third cost is economics. A high-volume scoring workload priced per token is at the mercy of a vendor’s price list and rate limits. The same workload on weights you host is priced by your hardware, which you control and amortise. For occasional use the API is cheaper. For sustained, large-scale batch work the open model wins on cost as well as control.

What does on-prem actually buy you?

Three things, in the order a desk values them. First, control of data residency: price-sensitive positions and signals never transit a vendor. Second, control of the model version: a result stays reproducible and auditable for as long as you keep the weights. Third, control of cost at scale: a high-volume scoring workload is priced by your hardware rather than a per-token tariff. The first is often a hard compliance requirement. The second is what lets you defend a model to a risk committee. The third is what makes large batch workloads affordable.

This is the same logic a desk already applies to data and infrastructure. You bring critical capability in-house when the dependency, the reproducibility, or the economics demand it. You rent what is genuinely commoditised. Frontier reasoning has just moved across that line for any team with the hardware to host it. The decision is now the ordinary make-versus-buy calculation a quant platform makes about every other critical component, rather than a forced compromise on quality.

Is the open model actually as good?

Close enough that the gap is no longer the deciding factor, with the usual caveats. Benchmark parity is not the same as task parity. A number like MMLU can be inflated by contamination or fail to capture the messy, domain-specific work a desk actually does. The honest claim is narrower than the headline: on broad reasoning, math, and code, the 405B is in the same band as the strongest closed models, and where it trails it trails by a point or two rather than a tier.

For a research workflow, a point or two of benchmark difference is almost never the constraint. The constraint is whether the model can be deployed at all on your data. An open model that is within a point of the frontier and inside your firewall beats a closed model two points ahead that your compliance team will not approve. Measure both on your own tasks before you decide, because the only benchmark that settles it is the one built from your work.

What does it actually cost to serve?

The honest counterweight to all of this is the hardware. A 405-billion-parameter model in full precision needs on the order of eight high-memory data-center GPUs just to hold the weights, before you account for the key-value cache that a 128K context demands. That is a real capital line. It is why the 405B is not the model you reach for on every call.

In practice a desk runs a tiered setup. The weights are quantised, often to eight or four bits, which cuts the memory footprint by half or more at a small, measurable quality cost. Quantisation costs a little accuracy. On most desk tasks the drop from a careful eight-bit quantisation is within the noise of the work itself, which makes it the default. The 70B carries the high-volume work, where its lower serving cost matters more than the last few points of capability. The 405B is reserved for the jobs that genuinely need frontier reasoning, where the answer is worth the GPU time. The router that decides which model sees a request is doing the same job a cost-aware desk always does: spending the expensive resource only where it pays.

The comparison with a closed API is therefore a fixed capital and operations cost against a variable per-token one. For occasional or bursty use, the API wins on economics, because you pay only for what you call. For sustained, high-volume batch work over data you cannot send out anyway, the owned hardware wins, because the per-document cost falls as you amortise the fixed investment across millions of calls. The break-even depends on your volume. It is a calculation worth doing explicitly rather than assuming the answer. Most desks find the volume that justifies hosting is lower than they expected, once the compliance constraint rules the API out for the sensitive workloads anyway.

The point is that open weights remove the dependency while leaving the engineering. You still have to provision, quantise, serve, and monitor the model. Those are real skills with real headcount behind them. A desk that has run its own data and execution infrastructure already has most of them. A team that has only ever called an API is taking on a new operational burden, which belongs in the build-versus-buy calculation alongside the benefits.

Where is the catch?

The 405B is large. Large is not free. Serving it at low latency takes serious GPU memory. Most desks will run a quantised version or reserve the full model for the jobs that justify it. For high-throughput, latency-sensitive scoring, the 70B is often the right tool, with the 405B held back for the hard reasoning that earns its cost. The open weights remove the vendor dependency. They do not remove the engineering bill for inference.

The licence is permissive but not unconditional. Read it before you build a product on top. An open model is a tool. It is not a guarantee. It will still hallucinate a number into a clean sentence. The verification and abstention layers you wrap around any model belong here too. Hosting the frontier does not change the discipline. It changes who holds the keys.

How would I deploy this on a desk?

As the reasoning engine for anything that cannot leave the building. Run the 70B as the everyday workhorse for extraction, screening, and first-pass analysis. Reserve the 405B for the work where frontier reasoning genuinely pays: factor hypothesis generation, complex document analysis, code review of a research stack. Pin the weights, version them like data, and record which model produced which result so a backtest stays reproducible. Keep the verification gate in front of every output, because an on-prem model hallucinates exactly as readily as a hosted one.

The deeper shift is strategic. For two years the best reasoning was something you could only rent, which meant the most capable research assistants were off-limits to the most compliance-sensitive desks. Those are often the desks with the most valuable, most sensitive data, which is precisely why they could not use the best tools. Llama 3.1 405B is the moment that stops being true.

None of this is a recommendation to rip out a working API integration. For many teams the hosted model is the right call. The open weights are simply an option that did not exist before. The shift is that the option now exists at frontier quality, which means the decision is a real trade-off rather than a forced one. The teams that benefit most are the ones whose data was always too sensitive to send out, which are often the ones with the most to gain from a capable assistant.

The bottom line

Llama 3.1 405B brings frontier reasoning inside the firewall: matching the closed models on the benchmarks that matter, on weights you can pin, host, and audit.

The contribution is not a new high score. The shift is that frontier-level capability is now an artifact you can own, run on your own hardware, and reproduce a year later. For a quant desk, that retires the three standing costs of a closed API at once: data residency, reproducibility, and unit economics. The catch is the inference bill. The discipline around hallucination does not change. Treat the 405B as the in-house reasoning engine for work that cannot leave the building, hold the 70B for throughput, and keep verifying what either one tells you. The frontier just moved to where you can govern it.

Working on AI that needs to ship?

I help funds, fintechs, and data teams take AI from prototype to production.

Get in touch Read the book →