// Insights

Writing

Field notes on production AI and quantitative finance: what actually transfers from research to the desk.

Finance, AI / Engineering, Data Science

Aug 5, 2025 · 10 min read

Kronos: a foundation model for candlesticks, and the scrutiny it invites

Kronos applies the language-model recipe to market data: tokenize 12 billion candlesticks, train a decoder to predict the next one, read off forecasts. The zero-shot numbers are large. The quant's job is to ask the questions a benchmark cannot answer, about leakage, regime, and whether forecast skill survives the cost of trading on it.

foundation-model, candlesticks, market-data

Jul 29, 2025 · 6 min read

GEPA: improving an agent by reading its traces, not its gradients

GEPA tunes an LLM system by reflecting on its own execution traces in plain language, proposing prompt edits, and keeping a frontier of what works. It beats a reinforcement-learning baseline by up to 20% with up to 35x fewer rollouts. For a desk with dozens of labeled examples rather than thousands, that sample efficiency is the whole game.

prompt-optimization, DSPy, evolutionary

Jul 26, 2025 · 4 min read

FinDPO: a good idea about sentiment, and a backtest to distrust

FinDPO aligns a financial sentiment model with preference optimization instead of supervised fine-tuning, and the generalization argument is sound. Then it reports a 67% annual return at Sharpe 2.0, which is where a quant should stop nodding and start checking for look-ahead and the gap between a sentiment score and tradable alpha.

DPO, sentiment, algorithmic-trading

Jun 28, 2025 · 3 min read

Qwen3: one open model with a dial between thinking and throughput

Qwen3 ships eight Apache-2.0 models, but the feature that matters for a desk is the dial: a single model that switches between deep step-by-step reasoning and fast cheap inference. Run heavy chain-of-thought for research and drop to throughput mode for signal scoring, on weights you host yourself.

open-weights, MoE, reasoning

Jun 14, 2025 · 6 min read

HiREC: when every 10-K looks the same to your retriever

Standardized filings are full of near-identical boilerplate, and a flat retriever happily cites the wrong company's identical-looking risk-factor paragraph. HiREC fixes it structurally: retrieve documents before passages, then curate the evidence and ask for what is missing. It is more accurate and cheaper at once.

RAG, SEC-filings, multi-hop

Jun 7, 2025 · 11 min read

AlphaEvolve: automated discovery, and why the evaluator is the whole game

AlphaEvolve pairs Gemini with an automated evaluator in an evolutionary loop and finds things people missed, including a 4x4 matrix-multiplication algorithm better than any since 1969. For a quant the template is automated strategy discovery, and the lesson is severe: the loop optimizes your evaluator with superhuman efficiency, leaks included.

evolutionary-search, code-generation, optimization

May 31, 2025 · 7 min read

Navigating the alpha jungle: an LLM that mines factors, and the harness it still needs

A clever framework has an LLM propose symbolic alpha formulas while Monte Carlo Tree Search refines them against backtests. The method is real. The harder half it leaves to the reader is the one a factor platform lives by: deflating for the thousands of formulas tried, and testing whether any of them survive out of sample.

alpha, factor-mining, MCTS

May 17, 2025 · 7 min read

Uncertainty heads: the confidence score an LLM never gave you

A small supervised head, bolted onto a frozen LLM and reading its attention, flags likely hallucinations far better than perplexity or output probability. For a regulated desk, that is the missing primitive: a calibrated per-claim confidence you can threshold into abstain-or-escalate before a number reaches a report.

uncertainty, hallucination, calibration

May 12, 2025 · 7 min read

FinSage: compliance-grade filing QA, built from unglamorous parts

FinSage answers compliance questions over messy financial filings and beats the field, and the gains come almost entirely from engineering: chunk metadata, four retrieval paths instead of one, and a reranker preference-tuned to surface the chunks a regulator cares about. The DPO reranker is the idea to steal.

RAG, compliance, filings

May 6, 2025 · 6 min read

DeepSeek-Prover-V2: when correctness is machine-checkable

Prover-V2 proves theorems in Lean 4 by decomposing them into subgoals and training on a reward the proof checker verifies. The template matters for any quant problem where correctness can be formally checked, not just estimated.

formal-verification, reasoning, reinforcement-learning

May 5, 2025 · 8 min read

Can large language models trade? A simulated market, and a warning about correlation

Lopez-Lira builds a stock market populated entirely by LLM agents and finds it reproduces the textbook stylized facts of real markets, bubbles included. The result a desk should sit with is the one about correlation: shared models trade alike, and alike is how liquidity disappears.

agent-based-models, market-simulation, systemic-risk

Apr 20, 2025 · 10 min read

Does RL really incentivize reasoning? A caution for the backtest

A sober study finds RL makes reasoning models better at the first try without expanding what they can ultimately solve. The quant analogy is exact: do not mistake variance reduction for alpha, in a model or in a trading agent.

reinforcement-learning, reasoning, evaluation

Mar 22, 2025 · 3 min read

smolagents: when the agent's action is code

Hugging Face's smolagents is a minimal agent library whose agents write their actions as executable Python, not JSON tool calls. For quant work, where the action often is code, it is a natural and lightweight scaffold.

agents, tooling, code-execution

Mar 15, 2025 · 3 min read

vLLM V1: the unglamorous economics of serving your own models

vLLM's V1 re-architecture delivers up to 1.7x higher throughput, with a cleaner core and prefix caching on by default. For a shop self-hosting LLMs, serving throughput is the line item that decides whether in-house pays.

inference, serving, MLOps

Mar 9, 2025 · 10 min read

Granular metric extraction from filings: traceability and verification beyond summarization

Clients want an agent that reads the 10-K and returns the number. Extraction, not summarization, is the hard part, and benchmarks say models fail it more than half the time. The build guide for doing it with a source on every figure and a verification gate.

extraction, financial-documents, verification

Mar 1, 2025 · 6 min read

FinRL-DeepSeek: an LLM news signal wired into a risk-aware RL agent

A reproducible, open template that turns financial news into an LLM signal and feeds it to a CVaR-aware reinforcement-learning allocator. The hybrid every discretionary-plus-systematic desk sketches, with the code attached.

reinforcement-learning, LLM-signals, trading

Feb 21, 2025 · 6 min read

GPT as a sell-side analyst: where it beats the human, and where it folds

A sober read on the 'AI replaces analysts' headline. An LLM reads an earnings release like an analyst but reasons through the numbers unevenly, and the useful finding is a diagnostic for when to trust its forecast.

LLM-evaluation, equity-research, forecasting

Feb 18, 2025 · 7 min read

Native Sparse Attention: cheaper long context, trained in from the start

NSA makes long-context attention fast by building sparsity into training rather than pruning it in afterward. Up to 11.6x faster at 64k while beating full attention, and why that is the enabling layer for document-heavy finance.

long-context, attention, efficiency

Feb 15, 2025 · 3 min read

Kimi k1.5: the second proof that RL makes reasoning

Around the same time as DeepSeek-R1, Moonshot's Kimi k1.5 reached o1-level reasoning with reinforcement learning by a different route. Two independent recipes in one month make the technique a method, not a fluke.

reasoning, reinforcement-learning, long-context

Feb 4, 2025 · 7 min read

s1: buying reasoning with a budget you control

s1 fine-tunes an open 32B model on 1,000 examples and adds budget forcing, a dial that makes it think longer by appending 'Wait'. Why a controllable, auditable inference-compute knob matters to a quant.

test-time-compute, reasoning, open-models

Feb 3, 2025 · 10 min read

DeepSeek-R1: frontier reasoning goes open

R1 matches OpenAI's o1 on hard math and code, ships openly, and distills into small models you can host. Why the distillation result, not the benchmark parity, is what changes build-vs-buy for a quant desk.

reasoning, open-weights, reinforcement-learning

Jan 25, 2025 · 6 min read

Agentic RAG: when retrieval learns to loop

Agentic RAG replaces one-shot retrieve-then-generate with a loop that plans, retrieves, critiques, and iterates. A map of the patterns, and a blueprint for a research assistant that can catch its own bad retrieval.

RAG, agents, retrieval

Jan 6, 2025 · 10 min read

RAG for financial documents: a field guide

Grounding an LLM in your own filings is hard because retrieval, not the model, is the bottleneck. The proven moves that fix it, each with the evidence attached, and the discipline that makes the result safe to use.

RAG, retrieval, financial-documents

Dec 29, 2024 · 6 min read

OpenAI o1: paying for intelligence at inference time

o1 shifts the expensive part of reasoning from training to inference, thinking in hidden tokens before it answers. Where deliberate, costly reasoning pays for a research desk, and where it just burns tokens.

reasoning, test-time-compute, LLMs

Dec 15, 2024 · 7 min read

LOBDIF: diffusion models reach the order book

LOBDIF applies a diffusion model to the limit order book, denoising the next event's timing and type from noise. A genuine frontier-ML crossover, and a critical look at whether it beats the point processes it wants to replace.

microstructure, diffusion-models, order-book

Dec 7, 2024 · 11 min read

Model Context Protocol: the integration layer finally gets a standard

Anthropic's MCP is an open protocol that lets any model reach any data source or tool through one interface. Why a standard, modeled on LSP, is what a quant platform's integration layer has been missing.

MCP, integration, open-standard

Nov 30, 2024 · 3 min read

Tülu 3: an open recipe for post-training your own model

Tülu 3 releases the full post-training stack (data, code, recipes, and RLVR) on Llama 3.1. Why a reproducible recipe for training on verifiable rewards matters for a quant's own checkable tasks.

post-training, RLVR, open-recipe

Nov 25, 2024 · 4 min read

Transformer covariance for ETFs: the right target, the missing evidence

A working paper forecasts semi-covariance with transformers for downside-aware ETF allocation. The idea hits the real weak link in mean-variance. The evidence is one month long, with no costs, no turnover, and no shrinkage baseline.

covariance, transformers, asset-allocation

Nov 21, 2024 · 3 min read

Why generic embeddings cap your financial RAG

Finance-tuned BAM embeddings hit Recall@1 of 62.8% against 39.2% for the best general model, and lift FinanceBench accuracy by 8%. The retrieval ceiling is the embedding model nobody swaps.

embeddings, retrieval, finance-NLP

Nov 17, 2024 · 3 min read

OLMo 2: the open model a risk committee can actually audit

OLMo 2 releases not just weights but the training data, code, checkpoints, and eval harness. Why that full transparency is the ingredient model-risk governance has been missing.

open-science, reproducibility, model-risk

Nov 9, 2024 · 3 min read

QwQ-32B: frontier-style reasoning you can self-host

QwQ-32B-Preview is a 32B Apache-2.0 reasoning model scoring 90.6% on MATH-500. Why a self-hostable reasoner changes what a compliance-bound quant team can run on its own data.

open-weights, reasoning, RLVR

Nov 2, 2024 · 6 min read

FrontierMath: the math benchmark that is not saturated, and what that tells a quant

Frontier models score near-perfect on GSM8K and MATH, yet solve under 2% of FrontierMath's research-level problems. A sober gauge of how far to trust an LLM on a hard derivation.

benchmark, math-reasoning, evaluation

Oct 19, 2024 · 6 min read

LightRAG: graph retrieval that updates without a teardown

LightRAG keeps GraphRAG's cross-document reach but updates incrementally and retrieves for a fraction of the cost. Why incremental graph updates fit the constantly-arriving corpora a desk actually has.

RAG, knowledge-graph, retrieval

Oct 15, 2024 · 10 min read

OpenAI Swarm: a teaching toy with a lesson worth stealing

Swarm is an experimental, MIT-licensed framework built on two primitives, agents and handoffs. It is not for production. The handoff pattern, though, is the right mental model for a research-agent stack.

multi-agent, orchestration, open-source

Oct 7, 2024 · 6 min read

Contextual Retrieval: fixing the chunk that forgot where it came from

Anthropic's Contextual Retrieval prepends document-aware context to each chunk before indexing, cutting retrieval failures by up to 67%. Why it targets the exact failure that breaks RAG on filings.

RAG, retrieval, chunking

Sep 28, 2024 · 6 min read

RAG vs long-context: the routing trick that keeps the accuracy and cuts the bill

A Google study finds long-context LLMs beat RAG on accuracy. Its Self-Route hybrid matches long-context quality at 39-65% lower cost by sending only the hard queries to the full context.

RAG, long-context, cost

Sep 23, 2024 · 6 min read

GraphRAG: the retrieval that answers the question flat RAG cannot

Microsoft's GraphRAG builds a knowledge graph from a corpus and summarizes its communities, winning ~70-80% against naive RAG on whole-corpus questions. Why graph structure surfaces links vector search misses.

RAG, knowledge-graph, open-source

Sep 9, 2024 · 3 min read

The LLM-trading-agent survey: a skeptic's reading of the backtests

A survey of LLM trading agents catalogs 15-30% returns and the shaky evaluations behind them. The median backtest runs 1.3 years, rarely counts costs, and never mentions survivorship bias.

survey, trading, agents

Sep 2, 2024 · 6 min read

LLMFactor: named factors from news, and the backtest that complicates them

LLMFactor extracts human-readable factors from financial news to predict stock moves. The readable factors are the real contribution; the accuracy is modest and beats baselines only half the time.

NLP, factors, news

Aug 27, 2024 · 4 min read

Structured Outputs: the unglamorous feature that makes LLM extraction safe to ship

OpenAI's Structured Outputs constrains generation to your JSON Schema with full adherence. Why a guarantee about shape, not a benchmark score, is what turns extraction into a system.

structured-output, JSON-schema, reliability

Aug 27, 2024 · 3 min read

Mistral Large 2: the mid-sized model built for the batch job

At 123B, Mistral Large 2 lands near frontier quality at a fraction of the size. Why that efficiency, not peak capability, is what a cost- and latency-bound document pipeline wants.

Mistral, open-weights, efficiency

Aug 19, 2024 · 6 min read

FinanceBench: can RAG actually answer questions about a 10-K?

On FinanceBench, GPT-4-Turbo with a vector store got 81% of filing questions wrong or refused, while the same model with the right pages scored 85%. The bottleneck is retrieval, not the model.

RAG, financial-QA, benchmark

Aug 12, 2024 · 3 min read

RAGBench: component metrics that give a RAG answer an audit trail

RAGBench scores RAG systems on four explainable axes, and a small finetuned model beats an LLM judge at catching hallucinations. Why component-level metrics are what a regulated desk needs.

RAG, evaluation, groundedness

Aug 4, 2024 · 10 min read

Llama 3.1 405B: a frontier model you can run behind your own firewall

Meta's 405B is the first openly available model that matches the closed frontier on knowledge, math, and code. Why that changes the build-vs-buy math for a quant desk that cannot send data to an API.

open-weights, frontier, Llama

Jul 30, 2024 · 3 min read

LongRAG: bigger retrieval units, fewer detached numbers

LongRAG retrieves 4K-token units instead of short passages, easing the retriever and lifting answer recall. Why coarser chunks matter for numbers buried in financial filings.

RAG, long-context, retrieval

Jul 13, 2024 · 3 min read

Qwen2: the open model worth self-hosting for non-English filings

Qwen2 ships five sizes up to 72B with strong math, code, and multilingual scores. Why its Chinese and Asian-language strength makes it a practical engine for a quant desk.

open-weights, Qwen, multilingual

Jul 7, 2024 · 6 min read

tau-bench: the agent reliability metric a desk cannot ignore

Sierra's tau-bench shows top agents solve a task once and then fail it on a rerun. Why pass^k is the number that decides whether an agent is safe anywhere near money.

agents, evaluation, reliability

Jun 23, 2024 · 7 min read

Mixture of Agents: when a committee of open models beats one big one

An all-open-source Mixture-of-Agents stack outscored GPT-4o on AlpacaEval 2.0 with no new training. Why that is an ensembling result, what the paper's ablations prove, and where the analogy breaks.

ensembling, agents, open-source

Jun 15, 2024 · 11 min read

Kolmogorov-Arnold Networks for time series: a volatility model a risk committee can read

On real implied-volatility data, T-KAN matches an LSTM with about sixty times fewer parameters and stays interpretable. The result, the architecture, and where the story gets oversold.

KAN, forecasting, interpretability
