Kronos: a foundation model for candlesticks, and the scrutiny it invites
Foundation models came for text, then images, then generic time series. Kronos is among the first built for the specific dialect of market data: the candlestick. It is a decoder-only model pre-trained on more than 12 billion K-line records from 45 exchanges, using the same recipe that built the language models: tokenize the data and train a transformer to predict the next token. The idea is to learn the language of markets the way GPT learned the language of text. The zero-shot forecasting numbers are large enough to demand a serious look. They are also large enough to demand the scrutiny a quant brings to anything that claims to forecast returns.
The recipe, and where it comes from
The lineage is explicit and worth stating, because it tells you what Kronos is and is not. The direct ancestor is Chronos, Amazon’s 2024 model that learned the language of generic time series by scaling and quantizing a series into discrete tokens, then training a language model on them with the ordinary next-token objective. Chronos showed the recipe works on time series in general. Kronos asks whether it works on the hardest and most adversarial time series there is, the price and volume of a traded instrument.
The contribution that earns the new name is the tokenizer. Generic time-series foundation models, Chronos included, have historically underperformed on financial candlesticks, because an OHLCV bar is not a single number. It is a small structured object, four prices and a volume, whose internal relationships carry the signal: the close relative to the high, the volume confirming or contradicting the move. A tokenizer that flattens that loses exactly what a trader reads. Kronos uses a specialized tokenizer built to discretize a bar while preserving the joint price-and-volume dynamics, which is the part of the design that matters and the reason it can claim to model markets rather than just a number that happens to come from a market.
There is a subtlety in the 45-exchange corpus worth flagging. Training across many markets can teach genuinely transferable structure, the grammar of how price and volume move that holds in Tokyo as in New York, or it can simply average over regimes until the model is mediocre everywhere and sharp nowhere. Which one you get is an empirical question the headline numbers do not answer. It matters most for a desk trading one market, where a model diluted by forty-four others may underperform a smaller model trained only on the venue you actually trade. Breadth of pre-training is an asset only when the structure it learns transfers, a claim to test rather than assume.
The rest is the familiar machine. A decoder-only transformer is pre-trained autoregressively on the token stream, learning to predict the next candle from the preceding ones, across 45 exchanges and 12 billion bars of history. Once trained, it generates future candles the way a language model generates text, and those generations become forecasts, volatility estimates, or synthetic series.
The results, read carefully
The headline numbers are strong. They need careful reading rather than a flat restatement, because the metric is unforgiving and easy to misunderstand.
On zero-shot price forecasting, Kronos improves RankIC by 93% over the leading time-series foundation model and 87% over the best non-pre-trained baseline. RankIC is a rank correlation between the forecast and what happened. The crucial context is that RankIC on financial returns is tiny for everybody, because returns are mostly unforecastable. A 93% relative improvement over a baseline whose RankIC is near zero is a real and interesting result about modeling, yet still a small absolute number, nowhere near the predictability the percentage might suggest to someone who has not worked with the metric. The honest reading is that Kronos is meaningfully better than the alternatives at a task where everyone is close to the noise floor.
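A short sketch makes the arithmetic visible. It computes a rank correlation in the Spearman sense, which is the usual meaning of RankIC, though the paper's exact evaluation protocol is not reproduced here. The baseline value of 0.020 is a hypothetical number chosen only to show what a 93% relative improvement looks like in absolute terms. It is not a figure from the paper.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / (vx * vy) ** 0.5


def rank_ic(forecast: Sequence[float], realized: Sequence[float]) -> float:
    """Spearman rank correlation between forecasts and realized returns."""
    return pearson(ranks(forecast), ranks(realized))


# HYPOTHETICAL baseline, for illustration only.
baseline_ic = 0.020
improved_ic = baseline_ic * 1.93  # a 93% relative improvement
# improved_ic is about 0.0386: nearly double the baseline, still close to zero.
```

The same 93% applied to a baseline of 0.30 would be a startling claim. Applied to a baseline near the noise floor, it is a modeling result to respect and a trading result still to prove.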
The other results follow the same shape. Volatility prediction comes in at 9% lower mean absolute error, which is a more interpretable and frankly more useful claim, because volatility genuinely is more forecastable than direction and a 9% error reduction there has real applications in risk and sizing. Synthetic data generation improves by 22% on a fidelity measure, which matters for a different reason I will come back to. Across all of it, one word does the heavy lifting: zero-shot. These are benchmark numbers rather than live results.
The scrutiny a market backbone invites
This is where a quant has to earn the salary. A model that claims to forecast markets is the most over-claimed object in the field. The history of the discipline is a graveyard of backtests that worked until money was on them. Every concern we learned to bring to NLP foundation models returns here, sharper, because the target is returns rather than the next word.
Take the gates in turn. The first is leakage, the one a 12-billion-bar pre-training corpus makes acute. A zero-shot claim means the test instrument and period were not in training, and with a corpus that large, spanning 45 exchanges and years of history, establishing that cleanly is genuinely hard. If any overlap exists between what the model saw and what it is tested on, the zero-shot number is contaminated. The contamination flatters the result in precisely the direction that sells the model.
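The simplest version of a leakage audit can be sketched in a few lines. It assumes you can describe both corpora as date windows per instrument. The function names, the symbols, and the data layout are hypothetical, and this is not the paper's procedure.

```python
from datetime import date
from typing import Dict, List, Tuple

Window = Tuple[date, date]  # inclusive (start, end)


def overlaps(a: Window, b: Window) -> bool:
    """True if two inclusive date windows share at least one day."""
    return a[0] <= b[1] and b[0] <= a[1]


def find_leaks(
    train: Dict[str, List[Window]],
    test: Dict[str, List[Window]],
) -> List[Tuple[str, Window, Window]]:
    """List every (symbol, train_window, test_window) that overlaps in time."""
    leaks = []
    for symbol, test_windows in test.items():
        for tw in test_windows:
            for trw in train.get(symbol, []):
                if overlaps(trw, tw):
                    leaks.append((symbol, trw, tw))
    return leaks


# Hypothetical coverage: "AAA" is tested on a period that starts inside training.
train = {"AAA": [(date(2015, 1, 1), date(2022, 12, 31))]}
test = {
    "AAA": [(date(2022, 6, 1), date(2023, 6, 1))],
    "BBB": [(date(2023, 1, 1), date(2023, 6, 1))],
}
leaks = find_leaks(train, test)  # flags "AAA" only
```

This check catches only the obvious case, the same symbol over the same dates. It does not catch a cross-listed instrument under a different ticker, or a highly correlated instrument from the same period, and those are the overlaps that make a clean zero-shot claim hard to establish on a corpus this size.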
The second is regime. Markets are non-stationary in a way that text is not, because the data-generating process itself changes when participants adapt. A model pre-trained on history has, at best, memorized the regimes in that history. A benchmark drawn from the same era is not an out-of-regime test. It is an in-sample test wearing a zero-shot label. The real question, how the model behaves in a regime that has never happened before, is the one no historical benchmark can answer.
The third is cost, where forecast skill meets reality. RankIC is a statistical measure of skill rather than a profit-and-loss statement. A model can have genuine, leakage-free forecast skill and still lose money, because the skill is too small to survive transaction costs and slippage, which is the usual fate of a tiny edge in returns. The paper does not address costs, slippage, or live deployment. That omission is the gap between an interesting research result and a tradable one. This is the same discipline that an LLM-plus-evaluator search demands and that the deflated Sharpe ratio formalizes: a number measured in the lab is a hypothesis about the world, which charges fees.
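The arithmetic behind that gap is short enough to write down. The sketch below charges a cost proportional to turnover. The function name, the 5 basis point cost, and the toy return series are all hypothetical, and a real cost model would also include spread, market impact, and financing.

```python
from typing import Sequence, Tuple


def pnl_after_costs(
    returns: Sequence[float],    # asset return in each period
    positions: Sequence[float],  # position held during each period
    cost_bps: float,             # cost per unit of turnover, in basis points
) -> Tuple[float, float]:
    """Return (gross, net) summed P&L with costs proportional to turnover."""
    cost_rate = cost_bps / 10_000
    gross = 0.0
    costs = 0.0
    prev = 0.0  # start flat
    for r, p in zip(returns, positions):
        gross += p * r
        costs += cost_rate * abs(p - prev)  # pay to move from prev to p
        prev = p
    return gross, gross - costs


# HYPOTHETICAL: a forecaster that is right every period, earning 5 bps each time,
# but flips its position every period.
returns = [0.0005, -0.0005, 0.0005, -0.0005]
positions = [1.0, -1.0, 1.0, -1.0]
gross, net = pnl_after_costs(returns, positions, cost_bps=5)
# gross is about +0.0020, net is about -0.0015
```

In this toy case the forecasts are perfect and the strategy still loses money, because seven units of turnover at 5 basis points cost more than four winning periods earn.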
The fourth gate is the only one that ever settles the question, which is forward performance. A model earns trust by holding up out of sample, on paper, before a desk commits capital, and no quantity of historical benchmark wins substitutes for it. That is not a knock on Kronos specifically. It is the standing rule for anything that claims to predict prices. The rule exists because the alternative has cost the industry more money than any other single mistake.
There is a deeper reason to discount a forecasting foundation model, one with no analogue in text. Language is not adversarial. The next word in a sentence does not change because a model learned to predict it. A market does. If Kronos genuinely forecast returns on a liquid instrument, and if it were deployed at the scale its open release invites, the trades acting on the forecast would move the price toward it and arbitrage the edge away. The forecast would hold true until enough capital believed it, then turn false. This is the efficient-market reflex. The durable uses of a market model are the ones that never depend on out-predicting everyone else. The forecasting headline is precisely the part the market is built to erase.
What it is actually good for
None of that skepticism means the model is worthless, and reading it that way would be its own error. The lesson from NLP is the useful one here. A foundation model’s value is rarely the zero-shot output you get for free. It is the representation underneath, the pre-trained backbone you fine-tune on your own problem and validate with your own rigor. Kronos as a market-data backbone, fine-tuned and tested honestly on a specific desk’s instruments and horizon, is a plausibly useful piece of infrastructure even if its zero-shot forecasting claims do not survive the four gates. That is how NLP foundation models earned their place: not by being trusted out of the box, but by being a strong starting point that teams adapted and checked.
The synthetic-data result is the one I would put to work first, because it sidesteps the forecasting trap entirely. A model that generates realistic market series with 22% better fidelity is useful for stress-testing, for augmenting thin datasets, and for building scenarios that did not happen but could have, and none of those uses requires the model to predict the future. It only requires it to produce plausible pasts, which is a far safer thing to ask of it and a genuinely valuable one. Concretely, a desk can sample thousands of plausible alternative histories for an instrument and measure how a strategy or a risk model behaves across them, a far richer stress test than the handful of real crises the historical record happens to contain. The model’s job there is to be a good simulator of the past rather than a prophet of the future, which is a job market data can actually support.
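The workflow can be sketched without the model itself. Here a plain i.i.d. bootstrap of historical returns stands in for the generator. That is an assumption made only to keep the example self-contained, and it is not how Kronos produces series. The function names and the toy return history are hypothetical.

```python
import random
from typing import List, Sequence


def max_drawdown(returns: Sequence[float]) -> float:
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity = 1.0
    peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def bootstrap_paths(
    history: Sequence[float], n_paths: int, length: int, seed: int = 0
) -> List[List[float]]:
    """STAND-IN generator: resample historical returns with replacement."""
    rng = random.Random(seed)
    return [[rng.choice(history) for _ in range(length)] for _ in range(n_paths)]


def drawdown_quantile(paths: Sequence[Sequence[float]], q: float) -> float:
    """Empirical q-quantile of max drawdown across the simulated paths."""
    dds = sorted(max_drawdown(p) for p in paths)
    idx = min(int(q * len(dds)), len(dds) - 1)
    return dds[idx]


# HYPOTHETICAL daily returns for one instrument.
history = [0.010, -0.020, 0.005, -0.004, 0.012, -0.015, 0.003]
paths = bootstrap_paths(history, n_paths=1000, length=250, seed=42)
tail_drawdown = drawdown_quantile(paths, 0.95)  # a bad-but-plausible year
```

An i.i.d. bootstrap throws away volatility clustering and any dependence between price and volume, which is the structure a learned generator is supposed to preserve. Swapping the stand-in for model-generated paths leaves the rest of the harness unchanged, and the fidelity of those paths is then the thing to validate.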
So treat Kronos the way the field eventually learned to treat every foundation model. The pre-trained backbone is real progress and the zero-shot forecasting numbers are a hypothesis. The work of turning the first into an edge is the same work it has always been: hold out the data honestly, test across regimes, charge realistic costs, and prove it forward before you believe it. The model is a better starting point than the field had before. It is not a shortcut around the part of the job that has always been the job.
Kronos applies the tokenize-then-predict recipe of language models to 12 billion candlesticks, and its zero-shot RankIC gains are real but small in absolute terms, because returns resist forecasting. Treat it as NLP taught us to treat foundation models: a strong pre-trained backbone to fine-tune and validate yourself, not a zero-shot forecaster to trust. The synthetic-data use is the safest place to start.