// Insight
OpenAI o1: paying for intelligence at inference time
OpenAI’s o1 moves the expensive part of intelligence from training to inference. o1, released in full on December 5, 2024, after a September preview, thinks before it answers, generating a long internal chain of thought. The longer it is allowed to think, the better it does. The headline shift is not a bigger model but a model that spends its compute at the moment you ask rather than only when it was built. For a research desk, that turns reasoning quality into a dial you can pay to turn.
The method is reinforcement learning aimed at reasoning. o1 was trained to produce and refine a chain of thought before committing to an answer, exploring approaches and backtracking the way a careful analyst works on scratch paper. OpenAI reports a clean relationship behind it: accuracy rises with the logarithm of the compute spent thinking. More deliberation buys more correctness, with diminishing returns, which is the shape that makes the compute worth metering in the first place.
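To see why a logarithmic relationship implies diminishing returns, here is a minimal sketch. The curve shape follows the relationship described above, but the coefficients (`base`, `gain_per_doubling`) are invented for illustration and are not OpenAI's numbers.

```python
import math


def modeled_accuracy(compute: float, base: float = 0.20,
                     gain_per_doubling: float = 0.08) -> float:
    """Hypothetical log-linear curve: base + gain * log2(compute), capped at 1.0.

    `compute` is measured in multiples of a baseline thinking budget.
    Both coefficients are illustrative assumptions, not measured values.
    """
    if compute < 1:
        raise ValueError("compute must be >= 1 (multiples of the baseline budget)")
    return min(1.0, base + gain_per_doubling * math.log2(compute))


def marginal_gain_per_unit(compute: float,
                           gain_per_doubling: float = 0.08) -> float:
    """Slope of the uncapped curve: extra accuracy per extra unit of compute."""
    return gain_per_doubling / (compute * math.log(2))


if __name__ == "__main__":
    for c in (1, 2, 4, 8, 16):
        print(f"compute x{c:>2}: accuracy ~ {modeled_accuracy(c):.2f}, "
              f"marginal gain per unit ~ {marginal_gain_per_unit(c):.4f}")
```

Under this assumed curve, every doubling of compute buys the same increment of accuracy, so each additional unit of thinking is worth less than the one before it. That is the property that makes a per-query budget meaningful.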
How big is the gap, really?
On the problems o1 is built for, the gap is a step change, not a refinement. The benchmark that captures it is competition math, where a single forward pass mostly guesses and sustained deliberation can actually work the problem.
GPT-4o solved 13% of the 2024 AIME problems. o1-preview solved 83%. The full o1 pushes higher still. On Codeforces it ranked in the 89th percentile of competitive programmers, and on graduate-level science questions it performed around the level of a PhD student. That gap is a category change, from a model that pattern-matches a plausible answer to one that reasons its way to a correct one. OpenAI made the dial explicit at release: the ChatGPT Pro tier ships a version of o1 that simply spends more compute thinking, for better answers.
Where does deliberate reasoning pay?
On the genuinely hard, multi-step problems, which on a research desk are a minority of the work but a valuable one. Generating and stress-testing a factor hypothesis, working through an unusual derivation, untangling a multi-step argument in a paper, reviewing a subtle piece of research code: these are tasks where a longer chain of thought can catch an error a quick answer would sail past. The common thread is verifiability under decomposition. A problem that can be broken into steps and checked along the way is one where more deliberation compounds into a better answer. A problem that turns on a single fact or a pure judgment call gains nothing from longer thinking beyond delay. The skill is recognizing which of your tasks have that decomposable, checkable structure, because those are the ones where the extra compute converts into correctness rather than cost.
The shift matters for how you budget. Training compute is a fixed cost paid once by the lab. Inference compute is a variable cost you pay every time you ask, which means you can target it. A hard factor-generation task can be given a large thinking budget. A routine one can be given none.
Make it concrete. Suppose you want the model to propose a new factor: an economic hypothesis, the data that would test it, and the obvious ways it could be spurious. A fast model returns a plausible-sounding paragraph in a second. o1, given room to think, can work through whether the hypothesis is already captured by known factors, what confounds it, and how a careful backtest would isolate it. The first is a brainstorm. The second is closer to a junior analyst’s worked memo. That is the shape of task where the leap from 13% to 83% on competition math actually means something for research, because the work rewards the same sustained, checkable reasoning.
It helps to place o1 in a larger arc. For most of the deep-learning era, capability came from scaling training: bigger models, more data, more pretraining compute. o1 is the clearest sign yet that scaling the compute spent at inference is a second axis. The two can be traded against each other, and for a desk the practical consequence is that you no longer choose only a model. You choose how much to spend thinking, per query, as a cost-control lever rather than a fixed property of the model you picked.
Where it just burns tokens
The property that makes o1 powerful makes it wasteful on the wrong task. Most of what a desk asks a model to do is not hard reasoning. Extracting a number from a filing, classifying a headline, summarizing a paragraph: these are single-step jobs a fast, cheap model does well. A long chain of thought on them buys nothing but latency and a larger bill. o1 on a classification task is a grandmaster asked to play tic-tac-toe, slower and no more correct.
The discipline, then, is triage. Before a query reaches the expensive model, something has to decide it is worth the spend, whether a simple classifier, a routing rule, or the analyst’s own judgment. A desk that sends everything to o1 gets a large bill and slower answers for no gain on the easy majority. A desk that sends nothing to it leaves the hard problems under-served. Routing is the same cost-aware move a desk makes with every expensive resource: spend it where it pays, default to cheap everywhere else.
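The triage step can be as simple as a routing rule. The sketch below is one possible shape, not a recommended design: the task labels and the tier names `"fast"` and `"reasoning"` are hypothetical placeholders rather than real API model identifiers.

```python
from dataclasses import dataclass

# Hypothetical task taxonomy; a real desk would define its own labels.
SINGLE_STEP_TASKS = {"extract_number", "classify_headline", "summarize_paragraph"}
MULTI_STEP_TASKS = {"factor_hypothesis", "derivation", "code_review"}


@dataclass(frozen=True)
class Route:
    model: str   # placeholder tier name: "fast" or "reasoning"
    reason: str  # why the query was routed there, kept for later review


def route(task_type: str, analyst_override: bool = False) -> Route:
    """Send a query to the expensive tier only when something says it is worth it."""
    if analyst_override:
        return Route("reasoning", "analyst judged the query worth the spend")
    if task_type in MULTI_STEP_TASKS:
        return Route("reasoning", "multi-step, checkable task")
    if task_type in SINGLE_STEP_TASKS:
        return Route("fast", "single-step task")
    # Unknown work defaults to the cheap tier.
    return Route("fast", "unknown task type, defaulting to cheap")


if __name__ == "__main__":
    for task in ("classify_headline", "factor_hypothesis", "unlabeled_request"):
        print(task, "->", route(task))
```

Defaulting unknown task types to the cheap tier is a deliberate choice here: a misrouted easy query costs a little accuracy at worst, while a misrouted query in the other direction costs money and latency every time.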
The costs you have to price
Three of them. A research lead should name all three before deploying. The first is latency. A long chain of thought takes real time, which rules o1 out of anything interactive or low-latency. The second is the bill. The reasoning tokens are generated, charged, and hidden. You pay for the deliberation without seeing it. The o1 API runs several times the price of GPT-4o per task. The cost is harder to predict because you do not control how long the model chooses to think.
The third is auditability, which a quant feels more than most. The chain of thought is not exposed, so you cannot fully audit how a number was reached. A model whose reasoning you cannot inspect sits awkwardly in a process that has to be explained to a risk committee. That tension is worth naming early rather than discovering it in a review.
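The billing point is easier to reason about with numbers attached. This sketch assumes hidden reasoning tokens are billed at the same rate as visible output tokens, and the prices in the example are placeholders, not a quoted price list. Substitute your provider's current rates.

```python
def estimate_cost_usd(input_tokens: int,
                      visible_output_tokens: int,
                      reasoning_tokens: int,
                      input_price_per_m: float,
                      output_price_per_m: float) -> float:
    """Estimate the cost of one query in USD.

    Assumption: reasoning tokens are charged at the output-token rate
    even though they never appear in the response.
    """
    billed_output = visible_output_tokens + reasoning_tokens
    return (input_tokens * input_price_per_m
            + billed_output * output_price_per_m) / 1_000_000


if __name__ == "__main__":
    # Placeholder prices per million tokens, for illustration only.
    IN_PRICE, OUT_PRICE = 10.0, 40.0
    # The same prompt and visible answer, with different amounts of hidden thinking.
    for reasoning in (0, 2_000, 10_000):
        cost = estimate_cost_usd(2_000, 500, reasoning, IN_PRICE, OUT_PRICE)
        print(f"reasoning tokens {reasoning:>6}: ${cost:.2f}")
```

The prompt and the visible answer are identical across the three runs. Only the hidden thinking changes, and it dominates the bill, which is why the cost per query is hard to predict in advance.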
How I would use it
As a specialist, routed to deliberately. Keep a fast, cheap model as the default for the high-volume, single-step work, which is most of it. Reserve o1 for the genuinely hard problems where deeper reasoning earns its cost, and measure whether it actually earns that cost on your own tasks rather than assuming the benchmark transfers. Put a verification step after it regardless, because a model that reasons longer still hallucinates; it just does so more convincingly. The skill is no longer only which model to call. It is which problems deserve the expensive one.
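Measuring on your own tasks can start small: a handful of labeled cases, both models, and a rule for what an accuracy point is worth. In the sketch below the models are stand-in callables and the threshold `max_cost_per_point` is a hypothetical parameter you would set yourself. Exact-match scoring is also an assumption that suits only tasks with a single checkable answer.

```python
from typing import Callable, Sequence


def evaluate(model: Callable[[str], str],
             cases: Sequence[tuple[str, str]]) -> float:
    """Fraction of labeled cases the model answers exactly right."""
    if not cases:
        raise ValueError("need at least one labeled case")
    correct = sum(1 for prompt, expected in cases
                  if model(prompt).strip() == expected)
    return correct / len(cases)


def worth_routing(fast_acc: float, slow_acc: float,
                  fast_cost: float, slow_cost: float,
                  max_cost_per_point: float) -> bool:
    """True if the extra cost per accuracy point gained is within budget."""
    gain_points = (slow_acc - fast_acc) * 100
    if gain_points <= 0:
        return False  # no accuracy gain, so the extra spend is never justified
    return (slow_cost - fast_cost) / gain_points <= max_cost_per_point


if __name__ == "__main__":
    # Toy labeled cases and stub models standing in for real API calls.
    cases = [("2+2", "4"), ("3*3", "9"), ("10/4", "2.5"), ("7-5", "2")]
    fast_answers = {"2+2": "4", "3*3": "6", "10/4": "2", "7-5": "2"}
    slow_answers = {"2+2": "4", "3*3": "9", "10/4": "2.5", "7-5": "2"}
    fast_acc = evaluate(lambda p: fast_answers[p], cases)
    slow_acc = evaluate(lambda p: slow_answers[p], cases)
    print(fast_acc, slow_acc,
          worth_routing(fast_acc, slow_acc, 0.01, 0.41, max_cost_per_point=0.01))
```

The same labeled cases double as a verification harness: any answer from the expensive model that fails a check is caught before it reaches a report.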
o1 turns reasoning into a metered resource: spend more inference compute, get more accuracy. The craft is routing the expensive deliberation to the few problems that repay it and keeping a cheap model on everything else.