Tim Frenzel

// Insights

Writing

Field notes on production AI and quantitative finance: what actually transfers from research to the desk.

Finance · AI / Engineering · Data Science
10 min read

Kronos: a foundation model for candlesticks, and the scrutiny it invites

Kronos applies the language-model recipe to market data: tokenize 12 billion candlesticks, train a decoder to predict the next one, read off forecasts. The zero-shot numbers are large. The quant's job is to ask the questions a benchmark cannot answer, about leakage, regime, and whether forecast skill survives the cost of trading on it.

foundation-model · candlesticks · market-data
6 min read

GEPA: improving an agent by reading its traces, not its gradients

GEPA tunes an LLM system by reflecting on its own execution traces in plain language, proposing prompt edits, and keeping a frontier of what works. It beats a reinforcement-learning baseline by up to 20% with up to 35x fewer rollouts. For a desk with dozens of labeled examples rather than thousands, that sample efficiency is the whole game.

prompt-optimization · DSPy · evolutionary
4 min read

FinDPO: a good idea about sentiment, and a backtest to distrust

FinDPO aligns a financial sentiment model with preference optimization instead of supervised fine-tuning, and the generalization argument is sound. Then it reports a 67% annual return at Sharpe 2.0, which is where a quant should stop nodding and start checking for look-ahead and the gap between a sentiment score and tradable alpha.

DPO · sentiment · algorithmic-trading
3 min read

Qwen3: one open model with a dial between thinking and throughput

Qwen3 ships eight Apache-2.0 models, but the feature that matters for a desk is the dial: a single model that switches between deep step-by-step reasoning and fast cheap inference. Run heavy chain-of-thought for research and drop to throughput mode for signal scoring, on weights you host yourself.

open-weights · MoE · reasoning
6 min read

HiREC: when every 10-K looks the same to your retriever

Standardized filings are full of near-identical boilerplate, and a flat retriever happily cites the wrong company's identical-looking risk-factor paragraph. HiREC fixes it structurally: retrieve documents before passages, then curate the evidence and ask for what is missing. It is more accurate and cheaper at once.

RAG · SEC-filings · multi-hop
11 min read

AlphaEvolve: automated discovery, and why the evaluator is the whole game

AlphaEvolve pairs Gemini with an automated evaluator in an evolutionary loop and finds things people missed, including a 4x4 matrix-multiplication algorithm better than any since 1969. For a quant the template is automated strategy discovery, and the lesson is severe: the loop optimizes your evaluator with superhuman efficiency, leaks included.

evolutionary-search · code-generation · optimization
7 min read

Navigating the alpha jungle: an LLM that mines factors, and the harness it still needs

A clever framework has an LLM propose symbolic alpha formulas while Monte Carlo Tree Search refines them against backtests. The method is real. The harder half it leaves to the reader is the one a factor platform lives by: deflating for the thousands of formulas tried, and testing whether any of them survive out of sample.

alpha · factor-mining · MCTS
7 min read

Uncertainty heads: the confidence score an LLM never gave you

A small supervised head, bolted onto a frozen LLM and reading its attention, flags likely hallucinations far better than perplexity or output probability. For a regulated desk, that is the missing primitive: a calibrated per-claim confidence you can threshold into abstain-or-escalate before a number reaches a report.

uncertainty · hallucination · calibration
7 min read

FinSage: compliance-grade filing QA, built from unglamorous parts

FinSage answers compliance questions over messy financial filings and beats the field, and the gains come almost entirely from engineering: chunk metadata, four retrieval paths instead of one, and a reranker preference-tuned to surface the chunks a regulator cares about. The DPO reranker is the idea to steal.

RAG · compliance · filings
6 min read

DeepSeek-Prover-V2: when correctness is machine-checkable

Prover-V2 proves theorems in Lean 4 by decomposing them into subgoals and training on a reward the proof checker verifies. The template matters for any quant problem where correctness can be formally checked, not just estimated.

formal-verification · reasoning · reinforcement-learning
8 min read

Can large language models trade? A simulated market, and a warning about correlation

Lopez-Lira builds a stock market populated entirely by LLM agents and finds it reproduces the textbook stylized facts of real markets, bubbles included. The result a desk should sit with is the one about correlation: shared models trade alike, and alike is how liquidity disappears.

agent-based-models · market-simulation · systemic-risk
10 min read

Does RL really incentivize reasoning? A caution for the backtest

A sober study finds RL makes reasoning models better at the first try without expanding what they can ultimately solve. The quant analogy is exact: do not mistake variance reduction for alpha, in a model or in a trading agent.

reinforcement-learning · reasoning · evaluation
3 min read

smolagents: when the agent's action is code

Hugging Face's smolagents is a minimal agent library whose agents write their actions as executable Python, not JSON tool calls. For quant work, where the action often is code, it is a natural and lightweight scaffold.

agents · tooling · code-execution
3 min read

vLLM V1: the unglamorous economics of serving your own models

vLLM's V1 re-architecture cuts inference cost up to 1.7x with a cleaner core and prefix caching on by default. For a shop self-hosting LLMs, serving throughput is the line item that decides whether in-house pays.

inference · serving · MLOps
10 min read

Granular metric extraction from filings: traceability and verification beyond summarization

Clients want an agent that reads the 10-K and returns the number. Extraction, not summarization, is the hard part, and benchmarks say models fail it more than half the time. The build guide for doing it with a source on every figure and a verification gate.

extraction · financial-documents · verification
6 min read

FinRL-DeepSeek: an LLM news signal wired into a risk-aware RL agent

A reproducible, open template that turns financial news into an LLM signal and feeds it to a CVaR-aware reinforcement-learning allocator. The hybrid every discretionary-plus-systematic desk sketches, with the code attached.

reinforcement-learning · LLM-signals · trading
6 min read

GPT as a sell-side analyst: where it beats the human, and where it folds

A sober read on the 'AI replaces analysts' headline. An LLM reads an earnings release like an analyst but reasons through the numbers unevenly, and the useful finding is a diagnostic for when to trust its forecast.

LLM-evaluation · equity-research · forecasting
7 min read

Native Sparse Attention: cheaper long context, trained in from the start

NSA makes long-context attention fast by building sparsity into training rather than pruning it in afterward. Up to 11.6x faster at 64k while beating full attention, and why that is the enabling layer for document-heavy finance.

long-context · attention · efficiency
3 min read

Kimi k1.5: the second proof that RL makes reasoning

Around the same time as DeepSeek-R1, Moonshot's Kimi k1.5 reached o1-level reasoning with reinforcement learning by a different route. Two independent recipes in one month make the technique a method, not a fluke.

reasoning · reinforcement-learning · long-context
7 min read

s1: buying reasoning with a budget you control

s1 fine-tunes an open 32B model on 1,000 examples and adds budget forcing, a dial that makes it think longer by appending 'Wait'. Why a controllable, auditable inference-compute knob matters to a quant.

test-time-compute · reasoning · open-models
10 min read

DeepSeek-R1: frontier reasoning goes open

R1 matches OpenAI's o1 on hard math and code, ships openly, and distills into small models you can host. Why the distillation result, not the benchmark parity, is what changes build-vs-buy for a quant desk.

reasoning · open-weights · reinforcement-learning
6 min read

Agentic RAG: when retrieval learns to loop

Agentic RAG replaces one-shot retrieve-then-generate with a loop that plans, retrieves, critiques, and iterates. A map of the patterns, and a blueprint for a research assistant that can catch its own bad retrieval.

RAG · agents · retrieval
10 min read

RAG for financial documents: a field guide

Grounding an LLM in your own filings is hard because retrieval, not the model, is the bottleneck. The proven moves that fix it, each with the evidence attached, and the discipline that makes the result safe to use.

RAG · retrieval · financial-documents
6 min read

OpenAI o1: paying for intelligence at inference time

o1 shifts the expensive part of reasoning from training to inference, thinking in hidden tokens before it answers. Where deliberate, costly reasoning pays for a research desk, and where it just burns tokens.

reasoning · test-time-compute · LLMs
7 min read

LOBDIF: diffusion models reach the order book

LOBDIF applies a diffusion model to the limit order book, denoising the next event's timing and type from noise. A genuine frontier-ML crossover, and a critical look at whether it beats the point processes it wants to replace.

microstructure · diffusion-models · order-book
11 min read

Model Context Protocol: the integration layer finally gets a standard

Anthropic's MCP is an open protocol that lets any model reach any data source or tool through one interface. Why a standard, modeled on LSP, is what a quant platform's integration layer has been missing.

MCP · integration · open-standard
3 min read

Tülu 3: an open recipe for post-training your own model

Tülu 3 releases the full post-training stack, data, code, recipes, and RLVR, on Llama 3.1. Why a reproducible recipe for training on verifiable rewards matters for a quant's own checkable tasks.

post-training · RLVR · open-recipe
4 min read

Transformer covariance for ETFs: the right target, the missing evidence

A working paper forecasts semi-covariance with transformers for downside-aware ETF allocation. The idea hits the real weak link in mean-variance. The evidence is one month long, with no costs, no turnover, and no shrinkage baseline.

covariance · transformers · asset-allocation
3 min read

Why generic embeddings cap your financial RAG

Finance-tuned BAM embeddings hit Recall@1 of 62.8% against 39.2% for the best general model, and lift FinanceBench accuracy by 8%. The retrieval ceiling is the embedding model nobody swaps.

embeddings · retrieval · finance-NLP
3 min read

OLMo 2: the open model a risk committee can actually audit

OLMo 2 releases not just weights but the training data, code, checkpoints, and eval harness. Why that full transparency is the ingredient model-risk governance has been missing.

open-science · reproducibility · model-risk
3 min read

QwQ-32B: frontier-style reasoning you can self-host

QwQ-32B-Preview is a 32B Apache-2.0 reasoning model scoring 90.6% on MATH-500. Why a self-hostable reasoner changes what a compliance-bound quant team can run on its own data.

open-weights · reasoning · RLVR
6 min read

FrontierMath: the math benchmark that is not saturated, and what that tells a quant

Frontier models score near-perfectly on GSM8K and MATH, yet solve under 2% of FrontierMath's research-level problems. A sober gauge of how far to trust an LLM on a hard derivation.

benchmark · math-reasoning · evaluation
6 min read

LightRAG: graph retrieval that updates without a teardown

LightRAG keeps GraphRAG's cross-document reach but updates incrementally and retrieves for a fraction of the cost. Why incremental graph updates fit the constantly-arriving corpora a desk actually has.

RAG · knowledge-graph · retrieval
10 min read

OpenAI Swarm: a teaching toy with a lesson worth stealing

Swarm is an experimental, MIT-licensed framework built on two primitives, agents and handoffs. It is not for production. The handoff pattern, though, is the right mental model for a research-agent stack.

multi-agent · orchestration · open-source
6 min read

Contextual Retrieval: fixing the chunk that forgot where it came from

Anthropic's Contextual Retrieval prepends document-aware context to each chunk before indexing, cutting retrieval failures by up to 67%. Why it targets the exact failure that breaks RAG on filings.

RAG · retrieval · chunking
6 min read

RAG vs long-context: the routing trick that keeps the accuracy and cuts the bill

A Google study finds long-context LLMs beat RAG on accuracy. Its Self-Route hybrid matches long-context quality at 39-65% lower cost by sending only the hard queries to the full context.

RAG · long-context · cost
6 min read

GraphRAG: the retrieval that answers the question flat RAG cannot

Microsoft's GraphRAG builds a knowledge graph from a corpus and summarizes its communities, winning ~70-80% against naive RAG on whole-corpus questions. Why graph structure surfaces links vector search misses.

RAG · knowledge-graph · open-source
3 min read

The LLM-trading-agent survey: a skeptic's reading of the backtests

A survey of LLM trading agents catalogs 15-30% returns and the shaky evaluations behind them. The median backtest runs 1.3 years, rarely counts costs, and never mentions survivorship bias.

survey · trading · agents
6 min read

LLMFactor: named factors from news, and the backtest that complicates them

LLMFactor extracts human-readable factors from financial news to predict stock moves. The readable factors are the real contribution; the accuracy is modest and beats baselines only half the time.

NLP · factors · news
4 min read

Structured Outputs: the unglamorous feature that makes LLM extraction safe to ship

OpenAI's Structured Outputs constrains generation to your JSON Schema with full adherence. Why a guarantee about shape, not a benchmark score, is what turns extraction into a system.

structured-output · JSON-schema · reliability
3 min read

Mistral Large 2: the mid-sized model built for the batch job

At 123B, Mistral Large 2 lands near frontier quality at a fraction of the size. Why that efficiency, not peak capability, is what a cost- and latency-bound document pipeline wants.

Mistral · open-weights · efficiency
6 min read

FinanceBench: can RAG actually answer questions about a 10-K?

On FinanceBench, GPT-4-Turbo with a vector store answered incorrectly or refused on 81% of filing questions, while the same model given the right pages scored 85%. The bottleneck is retrieval, not the model.

RAG · financial-QA · benchmark
3 min read

RAGBench: component metrics that give a RAG answer an audit trail

RAGBench scores RAG systems on four explainable axes, and a small finetuned model beats an LLM judge at catching hallucinations. Why component-level metrics are what a regulated desk needs.

RAG · evaluation · groundedness
10 min read

Llama 3.1 405B: a frontier model you can run behind your own firewall

Meta's 405B is the first openly available model that matches the closed frontier on knowledge, math, and code. Why that changes the build-vs-buy math for a quant desk that cannot send data to an API.

open-weights · frontier · Llama
3 min read

LongRAG: bigger retrieval units, fewer detached numbers

LongRAG retrieves 4K-token units instead of short passages, easing the retriever and lifting answer recall. Why coarser chunks matter for numbers buried in financial filings.

RAG · long-context · retrieval
3 min read

Qwen2: the open model worth self-hosting for non-English filings

Qwen2 ships five sizes up to 72B with strong math, code, and multilingual scores. Why its Chinese and Asian-language strength makes it a practical engine for a quant desk.

open-weights · Qwen · multilingual
6 min read

tau-bench: the agent reliability metric a desk cannot ignore

Sierra's tau-bench shows top agents solve a task once and then fail it on a rerun. Why pass^k is the number that decides whether an agent is safe anywhere near money.

agents · evaluation · reliability
7 min read

Mixture of Agents: when a committee of open models beats one big one

An all-open-source Mixture-of-Agents stack outscored GPT-4o on AlpacaEval 2.0 with no new training. Why that is an ensembling result, what the paper's ablations prove, and where the analogy breaks.

ensembling · agents · open-source
11 min read

Kolmogorov-Arnold Networks for time series: a volatility model a risk committee can read

On real implied-volatility data, T-KAN matches an LSTM with about sixty times fewer parameters and stays interpretable. The result, the architecture, and where the story gets oversold.

KAN · forecasting · interpretability
