
vLLM V1: the unglamorous economics of serving your own models

March 15, 2025 · 3 min read

inference, serving, MLOps

The interesting AI news is models. The news that decides your bill is serving. vLLM V1, a ground-up rewrite of the most widely used open-source inference engine, delivers up to 1.7x higher throughput than its predecessor. For a shop that self-hosts its models, 1.7x the throughput on the same hardware means roughly 40% less cost per token served. That is not a footnote; it is a line item that moves the build-versus-buy math.
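The step from throughput to cost is simple arithmetic, sketched below. The GPU price and baseline throughput are hypothetical placeholders, not measurements; only the 1.7x figure comes from the text above.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Dollars to serve one million tokens on one GPU at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs, chosen only to make the arithmetic concrete.
GPU_HOURLY_USD = 2.00
BASELINE_TOKENS_PER_SECOND = 1_000.0
SPEEDUP = 1.7  # the "up to" throughput figure quoted above

before = cost_per_million_tokens(GPU_HOURLY_USD, BASELINE_TOKENS_PER_SECOND)
after = cost_per_million_tokens(GPU_HOURLY_USD, BASELINE_TOKENS_PER_SECOND * SPEEDUP)
saving = 1 - after / before  # equals 1 - 1/SPEEDUP, whatever the inputs

print(f"before: ${before:.3f} per 1M tokens")
print(f"after:  ${after:.3f} per 1M tokens")
print(f"saving: {saving:.1%}")
```

The saving depends only on the speedup, not on the price or the baseline: 1.7x the throughput leaves 1/1.7 of the cost per token, a cut of about 41%, and only if the workload actually sees the full gain.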

V1 is a re-architecture rather than a tuning pass. The team rewrote the core: the scheduler, the KV cache manager, the worker, the sampler, the API server. The scheduler is simpler, dropping the old prefill-versus-decode distinction and treating all tokens uniformly. The execution core is isolated in its own process, so CPU overhead from tokenization and request handling stops stealing time from the GPU. The result is a cleaner engine that keeps the expensive hardware busier.
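To make "treating all tokens uniformly" concrete, here is a toy scheduler, assuming a fixed per-step token budget. It is an illustration of the idea, not vLLM's scheduler code; the `Request` class, `schedule_step`, and the budget value are all hypothetical names and numbers.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    num_tokens: int               # prompt tokens plus tokens generated so far
    num_computed_tokens: int = 0  # how many of those are already processed


def schedule_step(requests: list[Request], token_budget: int) -> dict[str, int]:
    """Hand out this step's token budget with no prefill/decode special case.

    Every request simply asks for the tokens it has not computed yet: a fresh
    prompt asks for many, a request in mid-generation asks for one.
    """
    plan: dict[str, int] = {}
    for req in requests:
        if token_budget == 0:
            break
        pending = req.num_tokens - req.num_computed_tokens
        granted = min(pending, token_budget)
        if granted > 0:
            plan[req.request_id] = granted
            token_budget -= granted
    return plan


requests = [
    Request("decode-a", num_tokens=101, num_computed_tokens=100),
    Request("decode-b", num_tokens=51, num_computed_tokens=50),
    Request("new-prompt", num_tokens=3000),
]
plan = schedule_step(requests, token_budget=1024)
print(plan)
```

In this sketch the long new prompt is given whatever budget is left and finishes over later steps, so splitting a prompt across steps needs no separate code path.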

Prefix caching: pay for the shared prompt once

When requests share a prompt prefix, V1 reuses the cached attention state instead of recomputing it. The implementation adds near-zero overhead even at a 0% hit rate, so it is on by default.

Why prefix caching is the quiet win

The change that matters most for a research desk is prefix caching, now on by default. Many requests share a long, identical prefix: the same system prompt, the same documents, the same instructions, with only the final question changing. Without caching, the model recomputes the attention state for that shared prefix on every request. With it, the state is computed once and reused, so each request pays only for its new tokens. The V1 implementation makes the cache near-free even when nothing hits it, which is why it can be enabled by default. For a workload like scoring a coverage universe against a fixed prompt, or answering many questions about the same filing, that removes most of the prompt-processing compute.
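A minimal sketch of the mechanism, assuming the cache works on fixed-size blocks of tokens keyed by a hash of the block and everything before it. The class name, the block size, and the token IDs are hypothetical, and the real engine caches attention state rather than just counting tokens.

```python
BLOCK_SIZE = 4  # tokens per cache block; tiny so the example is easy to follow


class ToyPrefixCache:
    """Remembers which prefix blocks have already been computed."""

    def __init__(self) -> None:
        self._seen: set[int] = set()

    def tokens_to_compute(self, token_ids: list[int]) -> int:
        """Return how many tokens of this request miss the cache, and record them."""
        parent_hash = 0
        reused = 0
        still_hitting = True
        full_blocks = len(token_ids) // BLOCK_SIZE
        for i in range(full_blocks):
            block = tuple(token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])
            # Chaining in the parent hash makes a block's key depend on
            # everything before it, so only genuinely shared prefixes match.
            block_hash = hash((parent_hash, block))
            if still_hitting and block_hash in self._seen:
                reused += BLOCK_SIZE
            else:
                still_hitting = False
                self._seen.add(block_hash)
            parent_hash = block_hash
        return len(token_ids) - reused


shared_prefix = list(range(1000, 1016))  # stands in for a 16-token system prompt
question_a = [1, 2, 3]
question_b = [4, 5, 6, 7, 8]

cache = ToyPrefixCache()
first = cache.tokens_to_compute(shared_prefix + question_a)
second = cache.tokens_to_compute(shared_prefix + question_b)
print(first, second)
```

The first request pays for all 19 of its tokens. The second pays only for its 5 new ones, because the four prefix blocks are already in the cache.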

The rewrite matters for a subtler reason too. As models get faster to run, especially smaller ones on modern GPUs where a forward pass can take only a few milliseconds, the bottleneck shifts from the GPU to everything around it: tokenization, scheduling, streaming the response. The old design let that CPU-side work steal time the GPU could have used. V1’s isolated, multiprocessing core attacks exactly that, which is why the gains show up most on smaller models and high-throughput workloads, the regime a desk running many cheap queries actually lives in. It is the least glamorous kind of progress, a faster scheduler and a smarter cache, the kind that quietly decides a quarterly compute bill.
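A back-of-the-envelope model shows why the gain concentrates on fast models. It assumes a fixed CPU cost per step that either blocks the GPU or runs alongside it; the millisecond figures are hypothetical, and the model is a simplification, not a description of V1's internals.

```python
def gpu_utilization(gpu_ms: float, cpu_ms: float, overlapped: bool) -> float:
    """Fraction of each step the GPU spends computing.

    Serialized: the GPU waits while the CPU work runs, so a step takes gpu + cpu.
    Overlapped: the CPU work runs alongside the GPU, so a step takes max(gpu, cpu).
    """
    step_ms = max(gpu_ms, cpu_ms) if overlapped else gpu_ms + cpu_ms
    return gpu_ms / step_ms


CPU_MS = 3.0  # hypothetical per-step tokenization, scheduling, and streaming cost

for label, gpu_ms in [("small model", 5.0), ("large model", 50.0)]:
    serial = gpu_utilization(gpu_ms, CPU_MS, overlapped=False)
    overlap = gpu_utilization(gpu_ms, CPU_MS, overlapped=True)
    print(f"{label}: speedup from overlapping x{overlap / serial:.2f}")
```

With these made-up numbers, hiding 3 ms of CPU work is worth 1.6x on a 5 ms forward pass and only about 1.06x on a 50 ms one.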

Why this matters

Self-hosting an open model, whether one of the R1 distilled reasoners or a long-context model, only makes economic sense if you can serve it efficiently. The capability is free once the weights are open. The cost is all in the serving, and that cost is what a throughput gain of up to 1.7x attacks directly. A shop weighing whether to host its own models against renting an API is really weighing its serving efficiency against a vendor’s price list. An engine that does more work per GPU tilts that comparison toward in-house. The effect is largest on the high-volume, fixed-prompt workloads where prefix caching does the most work, which is much of what a research desk runs. The model gets the headlines. The inference engine pays the bills. Tools like vLLM V1 are what keep the self-hosting math working as both the models you run and the query volume grow, which on a research desk they always do.

Open weights make the capability free; the serving engine decides whether you can afford it. vLLM V1’s throughput gain of up to 1.7x, with prefix caching on by default, is the kind of unglamorous infrastructure that makes self-hosting pay.
