Kolmogorov-Arnold Networks for time series: a volatility model a risk committee can read

June 15, 2024 · 11 min read

KAN · forecasting · interpretability

Here is the result worth sitting with. It comes straight from a markets problem. On daily market data, a Kolmogorov-Arnold Network forecasts implied volatility about as accurately as an LSTM while using roughly sixty times fewer parameters. It returns that forecast in a form you can read. The paper, Kolmogorov-Arnold Networks for Time Series by Xu, Chen, and Wang, predicts implied volatility from a realized-volatility estimator on ten years of OHLCV data. Its headline is not a leaderboard win. It is parameter efficiency plus interpretability: a model with 193 parameters that holds its own against an 11,671-parameter LSTM, and whose fitted functions can be written down as symbolic expressions and watched for drift.

That combination is unusual enough to take seriously. In sixteen years on a quant desk, much of it on model-risk governance, I have found that the models that clear a committee are almost never the most accurate ones on a held-out sample. They are the ones whose behavior somebody can explain when the position moves against us. A volatility model that is small enough to overfit less, and legible enough to defend, is aimed squarely at that constraint. The catch, which the paper states plainly, is that KANs train about ten times slower than an equivalent MLP. And the evidence here is one task on a handful of stocks. So the right posture is interested but uncommitted.

What did the paper actually test, and what did it find?

The experiment is concrete. The authors take daily open, high, low, close, and volume data from January 2012 to June 2022, compute a realized-volatility estimator, and use a rolling window of 84 lagged steps to forecast volatility 21 days ahead. They compare two KAN variants, T-KAN and MT-KAN, against a tuned LSTM and a tuned MLP on the same task. The accuracy is competitive in every case. The parameter counts are not close.
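
To make the windowing concrete, here is a minimal sketch of how 84 lagged values can be paired with the values that follow. It assumes the target is the next 21 observations, which matches the 21 output nodes of the [84, 5, 21] network but is my reading, not a detail quoted from the paper. The function name `make_windows` and the toy series are hypothetical.

```python
from typing import List, Tuple

def make_windows(series: List[float], n_lags: int = 84, horizon: int = 21
                 ) -> List[Tuple[List[float], List[float]]]:
    """Pair each block of n_lags past values with the next `horizon` values."""
    samples = []
    last_start = len(series) - n_lags - horizon
    for start in range(last_start + 1):
        x = series[start:start + n_lags]                      # model input
        y = series[start + n_lags:start + n_lags + horizon]   # forecast target
        samples.append((x, y))
    return samples

# Toy stand-in for a daily volatility series (hypothetical values).
toy = [0.01 * (i % 7) for i in range(200)]
windows = make_windows(toy)
print(len(windows), len(windows[0][0]), len(windows[0][1]))
```

Consecutive windows overlap heavily, so neighbouring samples are far from independent. That matters later, when the question becomes how to split them for evaluation.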

[Chart: Parameters for the same volatility forecast]

[Chart: Forecast error, MSE ×10⁻⁵ (lower is better)]

Read those two charts together, because that is where the argument lives.

MT-KAN posts the lowest error in the table while using under a fifth of the LSTM’s parameters and a tenth of the MLP’s. T-KAN, with just 193 parameters, lands between the LSTM and the MLP on error. For a quant, parameter parsimony is not an aesthetic preference. Fewer parameters means a smaller hypothesis space, which means less room to fit noise, which is the failure that destroys volatility models out of sample. A method that reaches comparable accuracy with one or two orders of magnitude fewer free parameters is making a real claim about generalization, bigger than any point about model size.

One honest caveat about the evidence before we go further: this is a single task, a single 21-day horizon, on a small set of names, with no out-of-sample regime stress reported. The numbers are real and encouraging. They are not a benchmark sweep. I would not let anyone present them as one.

What actually changes when you move the function to the edge?

A standard multilayer perceptron fixes the nonlinearity and learns the weights. Each neuron applies a fixed activation, ReLU say. Learning means adjusting the linear weights feeding into it. The original KAN paper from Liu and co-authors inverts that arrangement. There are no linear weights at all. Every weight is replaced by a univariate function, parametrized as a spline. Those functions live on the edges. The nodes do nothing but sum. Learning means reshaping the curves on the edges.

T-KAN: a learnable spline on every edge (the paper's [84, 5, 21] net)

The paper's [84, 5, 21] network, drawn the way the KAN papers draw it. Every edge carries its own learnable univariate spline, shown as a curve. Each node only sums its incoming functions. An MLP inverts this: fixed activations on the nodes, scalar weights (straight edges) on the edges.
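
The edge-versus-node distinction is easier to see in code than in words. The sketch below is a toy forward pass for one KAN-style layer, assuming a piecewise-linear curve on each edge as a stand-in for the B-splines used in the KAN papers. The names `piecewise_linear` and `kan_layer` and all the numbers are hypothetical, and nothing here is trained.

```python
from bisect import bisect_right
from typing import List

def piecewise_linear(x: float, knots: List[float], values: List[float]) -> float:
    """Evaluate a piecewise-linear curve; clamp outside the knot range."""
    if x <= knots[0]:
        return values[0]
    if x >= knots[-1]:
        return values[-1]
    i = bisect_right(knots, x) - 1          # knots[i] <= x < knots[i + 1]
    t = (x - knots[i]) / (knots[i + 1] - knots[i])
    return values[i] + t * (values[i + 1] - values[i])

def kan_layer(inputs: List[float], knots: List[float],
              edge_values: List[List[List[float]]]) -> List[float]:
    """edge_values[j][i] holds the curve heights for the edge input i -> node j.
    Each output node only sums its incoming edge functions."""
    return [sum(piecewise_linear(x, knots, edge_values[j][i])
                for i, x in enumerate(inputs))
            for j in range(len(edge_values))]

knots = [-1.0, 0.0, 1.0]
v_shape = [1.0, 0.0, 1.0]    # an |x|-like curve on one edge
ramp = [-1.0, 0.0, 1.0]      # an identity-like curve on another edge
out = kan_layer([0.5, -0.5], knots, [[v_shape, ramp], [ramp, v_shape]])
print(out)  # [0.0, 1.0]
```

The learnable parameters are the curve heights in `edge_values`. There is no weight matrix and no activation applied at the node, which is the inversion described above.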

The theoretical backing is the Kolmogorov-Arnold representation theorem, which says any continuous multivariate function on a bounded domain can be written as a finite composition of continuous univariate functions and addition. KANs are a literal architecture for that statement: stacks of learnable univariate functions, composed. The practical consequence the original authors report is that much smaller KANs match or beat much larger MLPs, with faster neural scaling laws, which is exactly the parameter-efficiency pattern the volatility experiment reproduces.
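
For reference, the standard statement of the theorem, for a continuous function f on the unit cube, is written out below. The symbols follow the usual textbook form rather than the notation of either paper.

```latex
f(x_1, \dots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right),
\qquad f : [0,1]^n \to \mathbb{R} \ \text{continuous}
```

Each inner function \phi_{q,p} and each outer function \Phi_q is a continuous function of one variable. Read as a network, this is a two-layer arrangement with n inputs and 2n + 1 hidden sums. A KAN relaxes that fixed shape to arbitrary widths and depths and learns the univariate functions.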

But the property that matters most here is legibility. Because each edge is a single-variable curve, you can plot it. You can look at the function the model learned for “last month’s realized volatility feeding into the 21-day forecast” and see whether it is monotonic, whether it saturates, whether it has a kink. That is a different epistemic position from staring at the gates of an LSTM. A spline on an edge is an object a human can reason about.

How does this compare to the interpretable models a desk already uses?

This is the comparison that decides whether KANs are worth the trouble, because a desk does not adopt interpretability in the abstract. It adopts it against an incumbent. The honest reference point is the generalized additive model, which a quant reaches for precisely when a linear model is too rigid and a black box too opaque. A GAM fits a separate smooth function of each input and adds them up, which is, structurally, close to what a single KAN layer does: univariate functions on the inputs, summed. If you already trust spline-based additive models for yield-curve fitting or nonlinear factor exposures, a shallow KAN is a familiar object wearing new clothes.

What KANs add over a plain GAM is depth and composition. Stacking layers lets the learned univariate functions feed into further learned univariate functions. The model can then capture interactions a single additive layer cannot, while in principle keeping each piece readable. That is the appeal: additive-model legibility with more expressive reach. The cost is that the legibility degrades with exactly the depth that buys the expressiveness, which is the tension at the center of this whole architecture. The paper’s networks are shallow (two layers, five hidden nodes), which is the regime where the interpretability claim is most credible and the regime its experiments actually test.

My rule of thumb after building these: reach for a GAM or a regularized linear model first, because they are better understood and easier to defend. Move to a shallow KAN only when you need interactions a GAM cannot give you and a model you can still read. If you need maximum accuracy and owe no one an explanation, a gradient-boosted tree usually wins. KANs earn their place in the narrow middle where legibility and interaction both matter, which is a smaller set of problems than the excitement suggests.

What does T-KAN add for an interpretable forecast?

Two things the baselines cannot offer. First, symbolic regression: because each edge is a univariate spline, the authors fit a closed-form mathematical expression to it, turning the learned activation into a human-readable function of its input. You get a forecast and an equation from the same model. Second, concept-drift detection: T-KAN is framed as an ensemble of KANs that evolve over time. Because different fitted structures encode different relationships, a change in the learned functions is itself the signal that the data-generating process has shifted.

One model, two outputs: a forecast and an explanation

The same fitted splines that produce the forecast are fit with symbolic regression to yield a readable expression; comparing that expression across windows surfaces concept drift, rather than waiting for error to accumulate.
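
A minimal sketch of what "fit a closed form to an edge" can look like, assuming a small hand-picked library of candidate functions and a least-squares fit of y ≈ a·g(x) + b for each one. This is an illustration of the idea, not the authors' procedure. The candidate list, the function names, and the sampled curve are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical candidate forms; inputs are assumed non-negative (volatility is).
CANDIDATES: Dict[str, Callable[[float], float]] = {
    "x": lambda x: x,
    "x^2": lambda x: x * x,
    "sqrt(x)": math.sqrt,
    "log(1+x)": math.log1p,
    "exp(x)": math.exp,
}

def fit_affine(g: Callable[[float], float], xs: List[float], ys: List[float]
               ) -> Tuple[float, float, float]:
    """Least-squares fit of y ~ a*g(x) + b; returns (a, b, r_squared)."""
    gs = [g(x) for x in xs]
    n = len(xs)
    g_mean, y_mean = sum(gs) / n, sum(ys) / n
    sgg = sum((v - g_mean) ** 2 for v in gs)
    sgy = sum((v - g_mean) * (y - y_mean) for v, y in zip(gs, ys))
    a = sgy / sgg
    b = y_mean - a * g_mean
    ss_res = sum((y - (a * v + b)) ** 2 for v, y in zip(gs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot

def best_symbolic(xs: List[float], ys: List[float]):
    """Return the candidate with the highest R^2 and its fitted (a, b, r2)."""
    fits = {name: fit_affine(g, xs, ys) for name, g in CANDIDATES.items()}
    name = max(fits, key=lambda k: fits[k][2])
    return name, fits[name]

# Stand-in for a learned edge curve sampled on a grid (hypothetical shape).
xs = [i / 20 for i in range(1, 41)]
ys = [0.3 * math.sqrt(x) + 0.1 for x in xs]
name, (a, b, r2) = best_symbolic(xs, ys)
print(f"{name}: a={a:.3f}, b={b:.3f}, R^2={r2:.4f}")
```

On noiseless input the right form is recovered exactly. A real fitted spline will sit between candidates, and the gap between the best and second-best R² is worth reporting alongside the formula.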

The drift framing is the part I find most useful, because regime change is the problem that quietly destroys volatility models. Most monitoring catches it late, through a degraded error metric, after the model has already been wrong with money on the line. A method that surfaces the change as a change in the learned function, before the loss moves, is attacking the problem at the right layer. MT-KAN extends the idea to many series at once, flattening the histories of all variables into one network ([84×5, 5, 21×5] for five variables) so that each forecast can use the cross-variable structure. That is what earns MT-KAN its lower error. It is also the more fragile claim, because uncovering relationships between variables from finite, noisy financial data is precisely where spurious structure creeps in.
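
The drift monitoring described above can be sketched very simply: sample the same edge function from successive refits on a fixed grid and flag a refit when the curve has moved by more than a tolerance. The distance measure, the threshold, and the three stand-in curves below are hypothetical choices of mine, not the paper's detector.

```python
from typing import Callable, List, Sequence

def curve_gap(f_old: Callable[[float], float], f_new: Callable[[float], float],
              grid: Sequence[float]) -> float:
    """Mean absolute difference between two learned edge functions on a grid."""
    return sum(abs(f_old(x) - f_new(x)) for x in grid) / len(grid)

def drift_flags(curves: List[Callable[[float], float]], grid: Sequence[float],
                threshold: float) -> List[bool]:
    """Flag each refit whose curve moved more than `threshold` from the previous one."""
    return [curve_gap(a, b, grid) > threshold for a, b in zip(curves, curves[1:])]

# Stand-ins for the same edge fitted on three successive windows.
curves = [lambda x: 0.50 * x, lambda x: 0.52 * x, lambda x: 0.90 * x]
grid = [i / 10 for i in range(11)]
flags = drift_flags(curves, grid, threshold=0.05)
print(flags)  # [False, True]
```

The tolerance has to be calibrated against how much the curves move between refits in a stable period; otherwise ordinary refit noise reads as regime change.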

Where does the story get oversold?

Here is where a practitioner has to push back. The interpretability is real at the level of mechanism. The case for deploying it is not yet made, for four specific reasons.

The first is the training cost, which the paper itself flags: a KAN is roughly ten times slower to train than an MLP of the same parameter count, because its diverse per-edge activations do not batch the way a shared activation does. For a model you refit nightly across a large universe, a tenfold training penalty is a real line item. It partly offsets the elegance of the small parameter count.

The second is overfitting, which cuts the other way from the parameter story. A spline is a flexible object. Give it enough knots and it will fit noise beautifully. A readable function that has memorized the sample is worse than a black box, because it is legible and wrong, which is more persuasive than illegible and wrong. The low parameter counts here are reassuring. The right control is out-of-sample stability across regimes, which this paper does not report.

The third is the stability of the symbolic readout. Symbolic regression is notoriously sensitive: small changes in the data or the fitting procedure can hand you a different formula that fits about as well. If the recovered expression is unstable across reasonable resampling, the interpretation is an artifact of one fit rather than a property of the world. Before trusting a single recovered formula, I would want to see how much it moves under bootstrap resampling and across rolling windows.
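
The bootstrap check is cheap to run. The sketch below resamples the points of a noisy curve with replacement, refits a small candidate library each time, and counts which formula wins. It assumes an affine least-squares fit per candidate scored by R². The candidate set, noise level, and seeds are hypothetical.

```python
import math
import random
from collections import Counter

# Hypothetical candidate forms; inputs are assumed non-negative.
CANDIDATES = {"x": lambda x: x, "sqrt(x)": math.sqrt, "log(1+x)": math.log1p}

def r_squared(g, xs, ys):
    """R^2 of the least-squares fit y ~ a*g(x) + b (-inf if degenerate)."""
    gs = [g(x) for x in xs]
    n = len(xs)
    g_mean, y_mean = sum(gs) / n, sum(ys) / n
    sgg = sum((v - g_mean) ** 2 for v in gs)
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    if sgg == 0 or ss_tot == 0:
        return float("-inf")
    a = sum((v - g_mean) * (y - y_mean) for v, y in zip(gs, ys)) / sgg
    b = y_mean - a * g_mean
    ss_res = sum((y - (a * v + b)) ** 2 for v, y in zip(gs, ys))
    return 1.0 - ss_res / ss_tot

def winner(xs, ys):
    """Name of the candidate with the highest R^2 on this sample."""
    return max(CANDIDATES, key=lambda name: r_squared(CANDIDATES[name], xs, ys))

def bootstrap_winners(xs, ys, n_boot=200, seed=0):
    """Count how often each candidate wins across bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    counts = Counter()
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        counts[winner([xs[i] for i in idx], [ys[i] for i in idx])] += 1
    return counts

noise = random.Random(1)
xs = [i / 20 for i in range(1, 41)]
ys = [0.3 * math.sqrt(x) + 0.1 + noise.gauss(0, 0.02) for x in xs]
counts = bootstrap_winners(xs, ys)
print(counts.most_common())
```

If one formula wins nearly every resample, the readout is a candidate for a property of the data. If two formulas trade places, the honest report is both of them, with the split.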

The fourth is the thinness of the evidence. One task, one horizon, a small set of names, no transaction-cost or regime-stress analysis. The numbers are competitive. The idea is sound. A desk still does not allocate capital on a single table. Treat this as a strong hypothesis, still unproven on a desk.

How would I test this on a desk?

Not by reading the table and reaching for the library. The protocol I would insist on is the one we use for any new forecasting method, because the failure modes are identical.

Run a strict walk-forward evaluation, never a single split, because a shuffled split leaks the future into the past in a time series and quietly inflates every metric, and even a single chronological split scores the model on only one stretch of market history. Widen the universe well beyond a handful of names and report performance through at least one volatility regime change, since a volatility model that has not been tested across a regime shift has not been tested. Compare against the honest incumbents on equal footing: a regularized GAM, a tuned LSTM, and gradient-boosted trees, all costed for training and inference. Measure the stability of the recovered splines and symbolic forms across windows, because the interpretability is the actual product here. An unstable interpretation fails even when the point forecast is fine. And deflate any apparent edge for the number of configurations you tried, the way the deflated Sharpe ratio penalizes a backtest for the trials behind it, because a flexible architecture with tunable knots is a multiple-testing machine if you let it be.
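
For the walk-forward piece of that protocol, a minimal split generator looks like the sketch below. It assumes a fixed-length rolling training window and a gap between train and test equal to the 21-day horizon, so that training targets do not overlap the test period. The gap is my choice, not something the paper specifies, and the sizes are hypothetical.

```python
from typing import Iterator, Tuple

def walk_forward_splits(n_obs: int, train_size: int, test_size: int, gap: int = 0
                        ) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, test_idx) index ranges; train always precedes test."""
    start = 0
    while start + train_size + gap + test_size <= n_obs:
        train = range(start, start + train_size)
        test_start = start + train_size + gap
        test = range(test_start, test_start + test_size)
        yield train, test
        start += test_size  # roll forward by one test block

# Roughly ten years of trading days (hypothetical count), one-year test blocks.
folds = list(walk_forward_splits(n_obs=2600, train_size=1000, test_size=250, gap=21))
for train, test in folds:
    print(f"train {train.start}-{train.stop - 1}  test {test.start}-{test.stop - 1}")
```

Every model in the comparison, KAN and incumbents alike, should be refit on each training range and scored only on the matching test range, with the per-fold errors reported rather than a single average.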

If the symbolic readouts come back stable, economically sensible, and competitive out of sample, you have something genuinely useful: a small, legible volatility model that explains itself in a form a risk committee can interrogate. If they come back unstable or only competitive in-sample, you have a more elaborate way to overfit, dressed in the language of interpretability. Both outcomes are informative. Only one is bankable. A single table on six stocks does not yet tell you which you will get.

The bottom line

On a real volatility task, KANs match an LSTM at a fraction of the parameters and return a forecast you can read; the price is roughly ten times slower training and evidence thin enough that you test it, not deploy it.

The contribution is genuine and well aimed at finance. On a real implied-volatility task, KANs match or beat an LSTM and an MLP with one to two orders of magnitude fewer parameters. They produce a forecast you can read as a formula and monitor for drift. That is the rare combination of parsimony and legibility a volatility desk actually wants.

The cautions are equal in size. Training is about ten times slower than an MLP. Splines can overfit. The symbolic readout may not be stable. The evidence is a single task on a few names with no regime stress. Treat T-KAN and MT-KAN as a promising, interpretable, small-data technique to test against your own baselines under a walk-forward protocol before it earns any capital. The error metric here is merely competitive. The real draw is that, for once, the model might be able to tell you what it learned.
