// Insights
Writing
Field notes on production AI and quantitative finance: what actually transfers from research to the desk.
Kronos: a foundation model for candlesticks, and the scrutiny it invites
Kronos applies the language-model recipe to market data: tokenize 12 billion candlesticks, train a decoder to predict the next one, read off forecasts. The zero-shot numbers are large. The quant's job is to ask the questions a benchmark cannot answer, about leakage, regime, and whether forecast skill survives the cost of trading on it.
GEPA: improving an agent by reading its traces, not its gradients
GEPA tunes an LLM system by reflecting on its own execution traces in plain language, proposing prompt edits, and keeping a frontier of what works. It beats a reinforcement-learning baseline by up to 20% with up to 35x fewer rollouts. For a desk with dozens of labeled examples rather than thousands, that sample efficiency is the whole game.
FinDPO: a good idea about sentiment, and a backtest to distrust
FinDPO aligns a financial sentiment model with preference optimization instead of supervised fine-tuning, and the generalization argument is sound. Then it reports a 67% annual return at Sharpe 2.0, which is where a quant should stop nodding and start checking for look-ahead and the gap between a sentiment score and tradable alpha.
Qwen3: one open model with a dial between thinking and throughput
Qwen3 ships eight Apache-2.0 models, but the feature that matters for a desk is the dial: a single model that switches between deep step-by-step reasoning and fast cheap inference. Run heavy chain-of-thought for research and drop to throughput mode for signal scoring, on weights you host yourself.
HiREC: when every 10-K looks the same to your retriever
Standardized filings are full of near-identical boilerplate, and a flat retriever happily cites the wrong company's identical-looking risk-factor paragraph. HiREC fixes it structurally: retrieve documents before passages, then curate the evidence and ask for what is missing. It is more accurate and cheaper at once.
AlphaEvolve: automated discovery, and why the evaluator is the whole game
AlphaEvolve pairs Gemini with an automated evaluator in an evolutionary loop and finds things people missed, including a 4x4 complex-valued matrix-multiplication algorithm that improves on Strassen's 1969 result. For a quant the template is automated strategy discovery, and the lesson is severe: the loop optimizes your evaluator with superhuman efficiency, leaks included.
Navigating the alpha jungle: an LLM that mines factors, and the harness it still needs
A clever framework has an LLM propose symbolic alpha formulas while Monte Carlo Tree Search refines them against backtests. The method is real. The harder half it leaves to the reader is the one a factor platform lives by: deflating for the thousands of formulas tried, and testing whether any of them survive out of sample.
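To make "deflating for the thousands of formulas tried" concrete, here is a minimal sketch using a plain Bonferroni correction. This is my own illustration, not the paper's method, and it assumes each formula's backtest t-statistic is roughly standard normal under the null. The function name and defaults are hypothetical.

```python
from statistics import NormalDist


def bonferroni_t_threshold(n_trials: int, alpha: float = 0.05) -> float:
    """Two-sided |t| a factor must clear after trying n_trials formulas.

    Assumption: t-statistics are approximately standard normal under the
    null, and Bonferroni (alpha / n_trials) is an acceptable, conservative
    correction for the search.
    """
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    # Split the per-test error budget across both tails.
    per_tail = alpha / (2 * n_trials)
    return NormalDist().inv_cdf(1 - per_tail)


if __name__ == "__main__":
    for n in (1, 100, 1000):
        print(n, round(bonferroni_t_threshold(n), 2))
```

One formula needs a |t| near 2. A search over a thousand needs one above 4, which is the gap between a found factor and a mined one.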
Uncertainty heads: the confidence score an LLM never gave you
A small supervised head, bolted onto a frozen LLM and reading its attention, flags likely hallucinations far better than perplexity or output probability. For a regulated desk, that is the missing primitive: a calibrated per-claim confidence you can threshold into abstain-or-escalate before a number reaches a report.
FinSage: compliance-grade filing QA, built from unglamorous parts
FinSage answers compliance questions over messy financial filings and beats the field, and the gains come almost entirely from engineering: chunk metadata, four retrieval paths instead of one, and a reranker preference-tuned to surface the chunks a regulator cares about. The DPO reranker is the idea to steal.
DeepSeek-Prover-V2: when correctness is machine-checkable
Prover-V2 proves theorems in Lean 4 by decomposing them into subgoals and training on a reward the proof checker verifies. The template matters for any quant problem where correctness can be formally checked, not just estimated.
Can large language models trade? A simulated market, and a warning about correlation
Lopez-Lira builds a stock market populated entirely by LLM agents and finds it reproduces the textbook stylized facts of real markets, bubbles included. The result a desk should sit with is the one about correlation: shared models trade alike, and alike is how liquidity disappears.
Does RL really incentivize reasoning? A caution for the backtest
A sober study finds RL makes reasoning models better at the first try without expanding what they can ultimately solve. The quant analogy is exact: do not mistake variance reduction for alpha, in a model or in a trading agent.
smolagents: when the agent's action is code
Hugging Face's smolagents is a minimal agent library whose agents write their actions as executable Python, not JSON tool calls. For quant work, where the action often is code, it is a natural and lightweight scaffold.
vLLM V1: the unglamorous economics of serving your own models
vLLM's V1 re-architecture delivers up to 1.7x higher throughput with a cleaner core and prefix caching on by default. For a shop self-hosting LLMs, serving throughput is the line item that decides whether in-house pays.
Granular metric extraction from filings: traceability and verification beyond summarization
Clients want an agent that reads the 10-K and returns the number. Extraction, not summarization, is the hard part, and benchmarks say models fail it more than half the time. The build guide for doing it with a source on every figure and a verification gate.
FinRL-DeepSeek: an LLM news signal wired into a risk-aware RL agent
A reproducible, open template that turns financial news into an LLM signal and feeds it to a CVaR-aware reinforcement-learning allocator. The hybrid every discretionary-plus-systematic desk sketches, with the code attached.
GPT as a sell-side analyst: where it beats the human, and where it folds
A sober read on the 'AI replaces analysts' headline. An LLM reads an earnings release like an analyst but reasons through the numbers unevenly, and the useful finding is a diagnostic for when to trust its forecast.
Native Sparse Attention: cheaper long context, trained in from the start
NSA makes long-context attention fast by building sparsity into training rather than pruning it in afterward. Up to 11.6x faster at 64k while beating full attention, and why that is the enabling layer for document-heavy finance.
Kimi k1.5: the second proof that RL makes reasoning
Around the same time as DeepSeek-R1, Moonshot's Kimi k1.5 reached o1-level reasoning with reinforcement learning by a different route. Two independent recipes in one month make the technique a method, not a fluke.
s1: buying reasoning with a budget you control
s1 fine-tunes an open 32B model on 1,000 examples and adds budget forcing, a dial that makes it think longer by appending 'Wait'. Why a controllable, auditable inference-compute knob matters to a quant.
DeepSeek-R1: frontier reasoning goes open
R1 matches OpenAI's o1 on hard math and code, ships openly, and distills into small models you can host. Why the distillation result, not the benchmark parity, is what changes build-vs-buy for a quant desk.
Agentic RAG: when retrieval learns to loop
Agentic RAG replaces one-shot retrieve-then-generate with a loop that plans, retrieves, critiques, and iterates. A map of the patterns, and a blueprint for a research assistant that can catch its own bad retrieval.
RAG for financial documents: a field guide
Grounding an LLM in your own filings is hard because retrieval, not the model, is the bottleneck. The proven moves that fix it, each with the evidence attached, and the discipline that makes the result safe to use.
OpenAI o1: paying for intelligence at inference time
o1 shifts the expensive part of reasoning from training to inference, thinking in hidden tokens before it answers. Where deliberate, costly reasoning pays for a research desk, and where it just burns tokens.
LOBDIF: diffusion models reach the order book
LOBDIF applies a diffusion model to the limit order book, denoising the next event's timing and type from noise. A genuine frontier-ML crossover, and a critical look at whether it beats the point processes it wants to replace.
Model Context Protocol: the integration layer finally gets a standard
Anthropic's MCP is an open protocol that lets any model reach any data source or tool through one interface. Why a standard, modeled on LSP, is what a quant platform's integration layer has been missing.
Tülu 3: an open recipe for post-training your own model
Tülu 3 releases the full post-training stack, data, code, recipes, and RLVR, on Llama 3.1. Why a reproducible recipe for training on verifiable rewards matters for a quant's own checkable tasks.
Transformer covariance for ETFs: the right target, the missing evidence
A working paper forecasts semi-covariance with transformers for downside-aware ETF allocation. The idea hits the real weak link in mean-variance. The evidence is one month long, with no costs, no turnover, and no shrinkage baseline.
Why generic embeddings cap your financial RAG
Finance-tuned BAM embeddings hit Recall@1 of 62.8% against 39.2% for the best general model, and lift FinanceBench accuracy by 8%. The retrieval ceiling is the embedding model nobody swaps.
OLMo 2: the open model a risk committee can actually audit
OLMo 2 releases not just weights but the training data, code, checkpoints, and eval harness. Why that full transparency is the ingredient model-risk governance has been missing.
QwQ-32B: frontier-style reasoning you can self-host
QwQ-32B-Preview is a 32B Apache-2.0 reasoning model scoring 90.6% on MATH-500. Why a self-hostable reasoner changes what a compliance-bound quant team can run on its own data.
FrontierMath: the math benchmark that is not saturated, and what that tells a quant
Frontier models score near-perfectly on GSM8K and MATH, yet solve under 2% of FrontierMath's research-level problems. A sober gauge of how far to trust an LLM on a hard derivation.
LightRAG: graph retrieval that updates without a teardown
LightRAG keeps GraphRAG's cross-document reach but updates incrementally and retrieves for a fraction of the cost. Why incremental graph updates fit the constantly-arriving corpora a desk actually has.
OpenAI Swarm: a teaching toy with a lesson worth stealing
Swarm is an experimental, MIT-licensed framework built on two primitives, agents and handoffs. It is not for production. The handoff pattern, though, is the right mental model for a research-agent stack.
Contextual Retrieval: fixing the chunk that forgot where it came from
Anthropic's Contextual Retrieval prepends document-aware context to each chunk before indexing, cutting retrieval failures by up to 67%. Why it targets the exact failure that breaks RAG on filings.
RAG vs long-context: the routing trick that keeps the accuracy and cuts the bill
A Google study finds long-context LLMs beat RAG on accuracy. Its Self-Route hybrid matches long-context quality at 39-65% lower cost by sending only the hard queries to the full context.
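The routing idea can be sketched in a few lines, assuming the model is prompted to reply with a fixed sentinel when the retrieved chunks are not enough. The sentinel string, the function names, and the callables passed in are all hypothetical stand-ins for your own retriever and LLM client, not the study's code.

```python
from typing import Callable, Sequence

# Assumption: the prompt instructs the model to reply with exactly this
# string when the supplied context cannot answer the query.
UNANSWERABLE = "unanswerable"


def self_route(
    query: str,
    retrieve: Callable[[str], Sequence[str]],
    answer_with_context: Callable[[str, str], str],
    full_document: str,
) -> tuple[str, str]:
    """Try cheap RAG first; fall back to the full context only if it declines."""
    chunks = retrieve(query)
    draft = answer_with_context(query, "\n".join(chunks))
    if draft.strip().lower() != UNANSWERABLE:
        return draft, "rag"
    # Hard query: pay for the long-context call.
    return answer_with_context(query, full_document), "long_context"
```

Logging the second element of the return value gives the routing rate, which is the number that determines the bill.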
GraphRAG: the retrieval that answers the question flat RAG cannot
Microsoft's GraphRAG builds a knowledge graph from a corpus and summarizes its communities, winning roughly 70-80% of head-to-head comparisons against naive RAG on whole-corpus questions. Why graph structure surfaces links vector search misses.
The LLM-trading-agent survey: a skeptic's reading of the backtests
A survey of LLM trading agents catalogs 15-30% returns and the shaky evaluations behind them. The median backtest runs 1.3 years, rarely counts costs, and never mentions survivorship bias.
LLMFactor: named factors from news, and the backtest that complicates them
LLMFactor extracts human-readable factors from financial news to predict stock moves. The readable factors are the real contribution; the accuracy is modest and beats baselines only half the time.
Structured Outputs: the unglamorous feature that makes LLM extraction safe to ship
OpenAI's Structured Outputs constrains generation to your JSON Schema with full adherence. Why a guarantee about shape, not a benchmark score, is what turns extraction into a system.
Mistral Large 2: the mid-sized model built for the batch job
At 123B, Mistral Large 2 lands near frontier quality at a fraction of the size. Why that efficiency, not peak capability, is what a cost- and latency-bound document pipeline wants.
FinanceBench: can RAG actually answer questions about a 10-K?
On FinanceBench, GPT-4-Turbo with a vector store answered incorrectly or refused on 81% of filing questions, while the same model given the right pages scored 85%. The bottleneck is retrieval, not the model.
RAGBench: component metrics that give a RAG answer an audit trail
RAGBench scores RAG systems on four explainable axes, and a small finetuned model beats an LLM judge at catching hallucinations. Why component-level metrics are what a regulated desk needs.
Llama 3.1 405B: a frontier model you can run behind your own firewall
Meta's 405B is the first openly available model that matches the closed frontier on knowledge, math, and code. Why that changes the build-vs-buy math for a quant desk that cannot send data to an API.
LongRAG: bigger retrieval units, fewer detached numbers
LongRAG retrieves 4K-token units instead of short passages, easing the retriever and lifting answer recall. Why coarser chunks matter for numbers buried in financial filings.
Qwen2: the open model worth self-hosting for non-English filings
Qwen2 ships five sizes up to 72B with strong math, code, and multilingual scores. Why its Chinese and Asian-language strength makes it a practical engine for a quant desk.
tau-bench: the agent reliability metric a desk cannot ignore
Sierra's tau-bench shows top agents solve a task once and then fail it on a rerun. Why pass^k is the number that decides whether an agent is safe anywhere near money.
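For readers who want the number rather than the name, here is a minimal sketch of pass^k, assuming the combinatorial estimator as I understand it from the tau-bench paper: for a task with c successes in n independent trials, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The function names are my own.

```python
from math import comb
from typing import Iterable, Tuple


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate P(all k reruns succeed) for one task from n observed trials."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    # comb(c, k) is 0 when c < k, so a task with too few wins scores 0.
    return comb(successes, k) / comb(trials, k)


def mean_pass_hat_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (successes, trials) pairs."""
    scores = [pass_hat_k(c, n, k) for c, n in results]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # An agent that succeeds on half of 8 trials looks fine at k=1
    # and much worse as k grows.
    for k in (1, 2, 4):
        print(k, round(pass_hat_k(4, 8, k), 3))
```

The decay with k is the point: an agent at 50% on one try is at about 21% for two clean runs in a row under this estimator.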
Mixture of Agents: when a committee of open models beats one big one
An all-open-source Mixture-of-Agents stack outscored GPT-4o on AlpacaEval 2.0 with no new training. Why that is an ensembling result, what the paper's ablations prove, and where the analogy breaks.
Kolmogorov-Arnold Networks for time series: a volatility model a risk committee can read
On real implied-volatility data, T-KAN matches an LSTM with about sixty times fewer parameters and stays interpretable. The result, the architecture, and where the story gets oversold.