What RL Actually Does to Language Models, at the Token Level

0:00Brooks: One hundred three thousand dollars. Twenty-five dollars. Same model, same benchmark, basically the same accuracy. The first number is what it costs to train a thirty-two-billion-parameter language model with reinforcement learning to be better at math. The second is what it costs to match that result without doing the reinforcement learning at all.

0:22Jessica: That's a four-thousand-fold gap, and it lands in a paper posted to arXiv on May eighth, twenty-twenty-six, called "Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning." We're recording the day after. Quick note before we dig in: this episode is AI-generated. The script came from Anthropic's Claude Opus 4.7. I'm Jessica, and the other voice you're hearing is Brooks. We're both AI voices from ElevenLabs, and the show is produced independently of both companies.

0:53Brooks: And the reason that four-thousand-fold gap matters isn't just that one team found a cheaper recipe. It's that the cheaper recipe works because reinforcement learning, in this setting, was apparently doing something much smaller than anyone realized.

1:09Jessica: So let me set up what people thought was happening. For the past two years, the dominant story for making language models better at math and code has been: take a base model, layer reinforcement learning on top. DeepSeek-R1, OpenAI's o1, Qwen3 — they all use some variant of this pipeline. And the implicit framing has been an AlphaGo framing. RL gave us an agent that played moves no human had played. By analogy, surely RL is teaching language models genuinely new ways to reason — discovering reasoning paths the base model couldn't find on its own.

1:44Brooks: That's the seductive version, and it justifies a lot of compute spend. Tens of thousands of GPU hours, distributed PPO infrastructure, careful KL coefficients, all the machinery. If RL is unlocking new capabilities, that price is reasonable.

1:59Jessica: Right. But there's been a quiet body of evidence undermining that picture. A paper from Yue and colleagues last year showed that when you sample many times from a base model — pass at k, with k cranked up — the base model often already contains the correct solution. RL doesn't expand the set of things the model can do; it just makes it commit to the right answer more reliably. Davis and Recht proved a related result: with binary rewards, the popular RL algorithms mathematically reduce to gradient ascent on the probability of a correct answer. So RL can only profit when the base model already succeeds sometimes. And Wang and colleagues noticed that whatever RL does, it concentrates at high-entropy positions where the model is uncertain.

2:48Brooks: So the field had three pieces of indirect evidence pointing the same direction: RL is reweighting things the base model already does. The authors of this paper are at USC and ARL, led by Ömer Faruk Akgül with Kannan, Neiswanger, and Prasanna, and what they do is ask the obvious question nobody had pinned down. If RL is steering the model toward paths it already knows, can we just look at exactly what it changes? At the token level?

3:17Jessica: And the answer is striking enough that I want to walk through it the way they did, because the argument has this unusually clean shape — observation, then causal test, then a method that drops out at the end.

3:31Brooks: Start with the observation, then.

3:33Jessica: Okay. Take a base model and its RL-trained counterpart. They tested four such pairs across model sizes — Qwen two-point-five at one-point-five and seven billion parameters, Qwen3 at four billion, with both GRPO and PPO as the RL algorithm. Run them both on the same math problems. Compare them token by token: at each position, what word does the base model think is most likely? What word does the RL-trained model think is most likely? The result: ninety-seven to ninety-nine percent of the time, they agree exactly. Same word, same position, no disagreement. The vast majority of what the RL model produces is identical to what the base model would have produced.
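
The token-by-token comparison described here can be sketched in a few lines of Python. This is an illustration only: the function names and the toy probability tables are made up for the example, and a real measurement would score both models on the same generated sequence rather than on hand-written lists.

```python
def argmax(probs):
    """Index of the most likely token in one next-token distribution."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top1_agreement(base_dists, rl_dists):
    """Fraction of positions where both models rank the same token first."""
    if len(base_dists) != len(rl_dists):
        raise ValueError("both models must be scored on the same positions")
    matches = sum(argmax(b) == argmax(r) for b, r in zip(base_dists, rl_dists))
    return matches / len(base_dists)


# Toy data (hypothetical): 4 positions over a 3-token vocabulary.
base = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.4, 0.35, 0.25], [0.05, 0.05, 0.9]]
rl = [[0.8, 0.1, 0.1], [0.1, 0.85, 0.05], [0.3, 0.5, 0.2], [0.05, 0.05, 0.9]]

agreement = top1_agreement(base, rl)
print(agreement)  # 0.75: the models disagree only at the third position
```

On the real model pairs, the figure quoted in the episode is ninety-seven to ninety-nine percent rather than the seventy-five percent of this toy table.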

4:19Brooks: Which already starts to puncture the AlphaGo framing. AlphaGo wasn't agreeing with a human player ninety-eight percent of the time. The strategies were genuinely different.

4:31Jessica: Exactly. And here's where it gets sharper. Take the one-to-three percent of positions where they do disagree. What kind of disagreement is it? You might imagine the RL model has discovered some clever new word, something the base model wouldn't have considered. That's not what happens. At those disagreement positions, the RL model's chosen word is, on average, the base model's second choice. Sometimes third. It's almost never outside the base model's top five candidates. The rate of "shifted" tokens that fall outside the base's top five entirely is essentially zero. Hundredths of a percent.
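
One way to make the "second choice, sometimes third" measurement concrete is to look up, at each disagreement position, where the RL model's top token sits in the base model's own ranking. The sketch below assumes per-position probability lists are available; the helper names and numbers are hypothetical.

```python
def rank_in_base(base_probs, token_id):
    """1-based rank of token_id when base_probs is sorted most to least likely."""
    order = sorted(range(len(base_probs)), key=lambda i: base_probs[i], reverse=True)
    return order.index(token_id) + 1


def shifted_token_ranks(base_dists, rl_dists):
    """Base-model rank of the RL model's top token, at disagreement positions only."""
    ranks = []
    for base_probs, rl_probs in zip(base_dists, rl_dists):
        base_top = max(range(len(base_probs)), key=lambda i: base_probs[i])
        rl_top = max(range(len(rl_probs)), key=lambda i: rl_probs[i])
        if base_top != rl_top:
            ranks.append(rank_in_base(base_probs, rl_top))
    return ranks


# Toy data (hypothetical): position 0 agrees, positions 1 and 2 disagree.
base = [[0.5, 0.3, 0.2], [0.4, 0.35, 0.25], [0.1, 0.2, 0.7]]
rl = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.6, 0.1, 0.3]]

ranks = shifted_token_ranks(base, rl)
outside_top5 = sum(r > 5 for r in ranks) / len(ranks)
print(ranks, outside_top5)  # [2, 3] 0.0
```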

5:11Brooks: So at those handful of moments where they diverge, RL isn't introducing new vocabulary or new ideas. It's just promoting something the base model was already considering.

5:23Jessica: And the third piece of the observation — those disagreement positions aren't randomly scattered. They concentrate at moments of high entropy. Seven to twelve times higher entropy than the typical position in the same sequence. Entropy here is just a measure of how spread out the model's probability distribution is over its next-word choices. Low entropy means the model is ninety-nine percent sure of one word. High entropy means it's split across multiple plausible options.
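
For reference, the entropy in question is the Shannon entropy of the next-token distribution. The sketch below uses natural logarithms, which is an assumption, since the episode does not say which units the paper reports.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


confident = [0.99, 0.005, 0.005]  # the model is nearly sure of one token
split = [0.4, 0.3, 0.3]           # the model is hesitating between options

print(round(entropy(confident), 3))
print(round(entropy(split), 3))
```

The first distribution scores close to zero and the second close to the three-way maximum of ln 3, which is the "hesitation" the hosts go on to describe.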

5:54Brooks: It's the model hesitating.

5:55Jessica: Right. And the disagreements happen exactly at those hesitation points. There's a worked example in the paper that makes this concrete. The problem is: how many numbers from one to thirty are divisible by two, three, or five? The model writes out a multi-step solution — set up the inclusion-exclusion principle, count multiples of each, subtract overlaps, give a final answer. The base model's solution and the RL model's solution are word-for-word identical for ninety-eight percent of the tokens. The places they differ: a couple of structural words like "verify" and "subtract," and the final answer itself, which is twenty-two. Right at the moments of greatest uncertainty in the reasoning chain — does the model double-check its work, does it commit to the right arithmetic, does it land on the right number — that's where the RL model's preference quietly diverges.
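
The arithmetic in that worked example is easy to check. The sketch below counts the numbers directly and then repeats the count with inclusion-exclusion; both give the twenty-two quoted above.

```python
def count_divisible(limit, divisors):
    """How many integers in 1..limit are divisible by at least one divisor."""
    return sum(1 for n in range(1, limit + 1) if any(n % d == 0 for d in divisors))


brute_force = count_divisible(30, (2, 3, 5))

# Inclusion-exclusion: add the single counts, subtract pairwise overlaps,
# then add back the triple overlap.
singles = 30 // 2 + 30 // 3 + 30 // 5    # 15 + 10 + 6 = 31
pairs = 30 // 6 + 30 // 10 + 30 // 15    # 5 + 3 + 2 = 10
triple = 30 // 30                        # 1
inclusion_exclusion = singles - pairs + triple

print(brute_force, inclusion_exclusion)  # 22 22
```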

6:51Brooks: So the picture from the observation alone: RL isn't teaching the model new tricks. It's editing maybe two percent of tokens, all of which were already on the model's shortlist, all concentrated at moments where the model was at a fork in the road.

7:07Jessica: That's the picture from the observation. But here's the move that turns this from suggestive to convincing. Brooks, this is the part you wanted to land carefully.

7:18Brooks: Yeah, because this is where a careful reader would say: hold on. You've shown the two models differ at certain positions. You haven't shown those differences are what matters. Maybe ninety-eight percent of the work is being done somewhere else and you just can't see it because greedy decoding masks subtler probabilistic effects. Maybe the disagreement positions are a symptom, not the cause.

7:42Jessica: Right. So how do you tell?

7:44Brooks: You run a counterfactual. The authors do something I'd call surgical. They take the base model. They run it deterministically. And at every position where the RL model would have disagreed — but only those positions — they reach in and substitute the RL model's preferred token. Then let the base model continue from there. They call this the oracle intervention, because it requires knowing in advance which positions the RL model would have changed. When they do this, the patched base model recovers RL's full accuracy on pass at one. Not partially. Exactly. Whatever benefit RL was providing is fully captured by changing those specific tokens at those specific positions.

8:25Jessica: Which is suggestive but not yet a clean causal claim.

8:29Brooks: Right. And this is where the paper earns its keep. They run a control. Same number of edits. Same positions. But instead of substituting the RL model's preferred token, they substitute a random token sampled from the base model's top twenty alternatives at that position. So you've changed the same number of words, at the same locations, with comparable replacements — just not the *specific* replacements RL would have chosen. The result is that the random control performs the same as the base model, or worse. The accuracy gain disappears entirely.
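
The oracle intervention and its random control can be sketched with toy stand-ins for the two models. Everything here is hypothetical: the "models" are hand-written functions from a prefix to a token distribution, disagreement is checked against the running prefix (one plausible reading of the procedure), and the control samples from the base model's top twenty excluding its first choice, which the episode does not specify.

```python
import random


def greedy(dist):
    """Most likely token in a {token: probability} distribution."""
    return max(dist, key=dist.get)


def decode(base, length, override=None):
    """Greedy-decode from `base`; `override` may swap in a token at a position."""
    prefix, edits = [], 0
    for _ in range(length):
        dist = base(tuple(prefix))
        token = greedy(dist)
        if override is not None:
            replacement = override(tuple(prefix), dist)
            if replacement is not None and replacement != token:
                token, edits = replacement, edits + 1
        prefix.append(token)
    return prefix, edits


def oracle_override(rl):
    """Substitute the RL model's preferred token wherever it disagrees."""
    return lambda prefix, base_dist: greedy(rl(prefix))


def random_override(rl, rng, top_n=20):
    """Control: same positions, but a random base-model alternative instead."""
    def override(prefix, base_dist):
        if greedy(rl(prefix)) == greedy(base_dist):
            return None  # the models agree here, so leave the token alone
        ranked = sorted(base_dist, key=base_dist.get, reverse=True)
        return rng.choice(ranked[1:top_n])
    return override


def base_model(prefix):  # toy base model with a single near-tie at position 2
    if len(prefix) == 2:
        return {"add": 0.40, "subtract": 0.38, "verify": 0.22}
    return {"step": 0.90, "add": 0.05, "subtract": 0.05}


def rl_model(prefix):  # toy RL model: same everywhere except the fork
    if len(prefix) == 2:
        return {"add": 0.20, "subtract": 0.70, "verify": 0.10}
    return {"step": 0.92, "add": 0.04, "subtract": 0.04}


base_seq, base_edits = decode(base_model, 4)
oracle_seq, oracle_edits = decode(base_model, 4, oracle_override(rl_model))
control_seq, control_edits = decode(
    base_model, 4, random_override(rl_model, random.Random(0))
)
print(base_seq, oracle_seq, control_seq)
```

The oracle run and the control run make the same number of edits at the same position; only the choice of replacement token differs, which is the contrast the control is designed to isolate.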

9:04Jessica: That's the line between correlation and causation, right there.

9:08Brooks: It's clean. The story isn't "edits at these positions help" — it's "the right edits at these positions help, and there are very few right edits." The location accounts for some of the structure — these are the consequential forks — but the specific token at the fork is what carries the benefit. And there are vanishingly few such tokens. One to three percent of the sequence.

9:31Jessica: I think this is the moment where the paper's central claim clicks. Reinforcement learning, in this setting, isn't teaching the model to reason. It's making a small number of correct turns at intersections the model was already at.

9:45Brooks: And that reframes a lot. The AlphaGo analogy implied discovery — the agent finding moves the prior didn't have. This is closer to a calibration problem. The pretrained model already has the reasoning capability latent in its distribution; what RL does is commit it to the right branch at the handful of moments where the branches are close in probability.

10:07Jessica: There's a phrase from the paper I want to land here: "RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains." That's the one-sentence version of what this whole token-level analysis is showing.

10:23Brooks: And it sets up the natural follow-up. If the correction is this small and this localized, why are we doing the expensive thing to produce it?

10:31Jessica: Which is where the paper pivots to the constructive half. They've shown the correction is sparse. Now: can you reproduce it without doing RL?

10:40Brooks: They take this in two steps, and I think the first one is worth pausing on, because it's another piece of evidence about how small this correction actually is. The oracle intervention used the RL model itself to identify *where* to change tokens. That's still cheating, in a sense — you needed an RL-trained teacher to find the spots. So they ask: can you find the right positions without the teacher?

11:05Jessica: And the answer is: just use entropy. Take the base model alone. At every position, compute the entropy of its next-token distribution. If that entropy crosses a threshold — they use one-point-two — call that a decision point. Apply the same intervention logic at those positions. They call this entropy-gated, and on the seven-billion GRPO pair it matches the oracle exactly. Touches between one and eight percent of tokens, depending on the model.
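
A minimal sketch of the gating step, assuming the one-point-two threshold is measured in nats (the episode gives the value but not the units). The distributions are made up; only the thresholding logic is the point.

```python
import math

ENTROPY_THRESHOLD = 1.2  # value quoted in the episode; units assumed to be nats


def entropy(probs):
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decision_points(dists, threshold=ENTROPY_THRESHOLD):
    """Indices of positions whose base-model entropy exceeds the threshold."""
    return [i for i, probs in enumerate(dists) if entropy(probs) > threshold]


# Toy base-model distributions (hypothetical) over a 4-token vocabulary.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.40, 0.30, 0.20, 0.10],  # hesitating
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.70, 0.20, 0.05, 0.05],  # fairly confident
]

gated = decision_points(dists)
print(gated, len(gated) / len(dists))  # [1, 2] 0.5
```

Note that no RL model appears anywhere in this computation, which is the point made next in the conversation.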

11:35Brooks: So the *where* of RL's correction is fully predictable from the base model alone. You don't need an RL-trained teacher to know which tokens are the consequential ones. The model's own uncertainty signal tells you.

11:49Jessica: That's the second domino. The first was: RL's correction is sparse and lives at predictable positions. The second is: those positions can be located without RL. And there's a third piece — the parameter side. They train a tiny LoRA adapter — that's a low-rank adapter, the standard parameter-efficient fine-tuning trick where you freeze the original model and train a small set of additional matrices that ride on top — and they use it to mimic the RL model on a hundred problems. The adapter is something like a third of one percent of the model's parameters. And it reproduces RL's accuracy. Even an extreme version — a rank-eight adapter on just the output projection layer of the one-point-five-billion model, around six hundred eighty-eight thousand parameters out of one-point-five billion — gets within a point of full RL on MATH-500.
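
The six hundred eighty-eight thousand figure can be reproduced with a back-of-envelope count. A LoRA adapter of rank r on a weight matrix of shape d_out by d_in adds r times (d_in plus d_out) trainable parameters. The hidden size of 1536 and the 28 layers used below are assumptions about the one-point-five-billion model's architecture, not numbers stated in the episode.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter (matrices A and B)."""
    return rank * (d_in + d_out)


HIDDEN_SIZE = 1536  # assumed width of the attention output projection
NUM_LAYERS = 28     # assumed number of transformer layers
RANK = 8            # rank quoted in the episode

total = NUM_LAYERS * lora_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
fraction = total / 1.5e9

print(total)              # 688128
print(f"{fraction:.4%}")  # roughly 0.046% of the model's parameters
```

Under those assumptions the count lands on 688,128, which matches the figure quoted above.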

12:44Brooks: So the correction isn't just sparse in token space. It's also low-dimensional in parameter space. Whatever RL is doing fits into a tiny patch.

12:53Jessica: And those three pieces of evidence — sparse in tokens, locatable from entropy, expressible in a tiny parameter footprint — that's the full setup for the method.

13:04Brooks: Which they call ReasonMaxxer.

13:06Jessica: Yeah. And the construction is satisfying because every choice is justified by something earlier in the paper. Take a base model. Pick a small set of math problems — fifty of them. For each problem, sample a handful of solution rollouts from the base model itself. No RL teacher anywhere in the pipeline. Then filter the problems: keep only the ones where the base model has mixed success — sometimes gets it right, sometimes wrong.

13:35Brooks: Why filter that way?

13:36Jessica: Because if the base model always succeeds, there's nothing to learn from that problem. And if it always fails, there's no positive example to push toward. They're targeting what they call the edge of competence. The contrastive signal only exists on problems where some rollouts work and some don't.
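
The filter is simple enough to state in code. In this sketch, each problem maps to a list of booleans recording whether each base-model rollout reached the correct answer; the problem identifiers and outcomes are made up.

```python
def edge_of_competence(rollout_results):
    """Keep problems with at least one correct and one incorrect rollout."""
    return {
        problem_id: outcomes
        for problem_id, outcomes in rollout_results.items()
        if any(outcomes) and not all(outcomes)
    }


# Hypothetical rollout outcomes for four problems.
results = {
    "p1": [True, True, True, True],      # always solved: nothing to learn
    "p2": [True, False, False, True],    # mixed: kept
    "p3": [False, False, False, False],  # never solved: no positive example
    "p4": [False, False, True, False],   # mixed: kept
}

kept = sorted(edge_of_competence(results))
print(kept)  # ['p2', 'p4']
```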

13:56Brooks: And the actual loss?

13:58Jessica: At every position in those rollouts, look at the base model's entropy. If it's below threshold, anchor the model with a KL term to its frozen base behavior — leave that position alone. If it's above threshold, apply a contrastive loss: push up the probability of tokens that came from rollouts that ended in correct answers, push down tokens that came from rollouts that ended in incorrect ones. Train a LoRA for one epoch on a single GPU.
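
A per-position sketch of that loss follows. The episode describes the structure (KL anchor below the threshold, contrastive push above it) but not the exact formula, so the specific choices here are assumptions: a KL divergence from base to policy for the anchor, and a sign-flipped log-likelihood for the contrastive term.

```python
import math


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is nonzero wherever p is nonzero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def position_loss(base_probs, policy_probs, token_id, rollout_correct,
                  threshold=1.2, kl_weight=1.0):
    """Loss for one position of one rollout (assumed form, see lead-in)."""
    if entropy(base_probs) <= threshold:
        # Not a fork: keep the adapted model close to the frozen base.
        return kl_weight * kl_divergence(base_probs, policy_probs)
    log_p = math.log(policy_probs[token_id])
    # Fork: raise the token's probability if the rollout ended correctly,
    # lower it otherwise. The push-down term is unbounded as written.
    return -log_p if rollout_correct else log_p


fork = [0.4, 0.3, 0.2, 0.1]          # entropy about 1.28, above threshold
settled = [0.97, 0.01, 0.01, 0.01]   # entropy about 0.17, below threshold

anchor_loss = position_loss(settled, settled, 0, True)
push_up = position_loss(fork, fork, 1, True)
push_down = position_loss(fork, fork, 1, False)
print(anchor_loss, round(push_up, 3), round(push_down, 3))
```

With the policy equal to the base model, the anchor term is zero and the two contrastive terms are equal and opposite, so the only gradient signal comes from the forks.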

14:27Brooks: So at the forks, lean toward the choices that historically led to right answers. At the non-forks, don't change anything.

14:35Jessica: That's the whole method.

14:37Brooks: And the result?

14:38Jessica: Across six math benchmarks — MATH-500, GSM8K, AMC, AIME, Minerva, OlympiadBench — and across model families and sizes, ReasonMaxxer matches or exceeds full RL pipelines on the majority of comparisons. The cost numbers are the part to dwell on.

14:54Brooks: This is where the dollar figures land. Let me take the cleanest comparison, because the paper has a table with maybe a dozen rows and it's easy to drown in. The thirty-two-billion-parameter model. There's a published RL pipeline called Open-Reasoner-Zero that does PPO at scale on this model. The estimated cost for that training run is around one hundred three thousand dollars. It achieves an average benchmark score of forty-three-point-six percent. ReasonMaxxer on the same base model: twenty-five dollars. Forty-four percent average. Slightly higher. About four thousand times cheaper.

15:32Jessica: And it isn't a one-off. At seven billion parameters, against a baseline called SimpleRL-Zoo that uses GRPO on roughly eight thousand math problems and costs in the hundreds to low thousands of dollars — ReasonMaxxer trains on fifty problems for around four dollars and is competitive across the same benchmarks. The data efficiency story is a hundred to a thousand times less training data, on top of the cost reduction.
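
The ratios behind these claims are quick to verify from the figures quoted in the conversation. The eight thousand is the "roughly eight thousand" problems mentioned for SimpleRL-Zoo, so the data ratio is approximate.

```python
import math

rl_cost_usd = 103_000         # estimated Open-Reasoner-Zero cost, as quoted
reasonmaxxer_cost_usd = 25    # ReasonMaxxer cost on the same base model, as quoted

cost_ratio = rl_cost_usd / reasonmaxxer_cost_usd
orders_of_magnitude = math.log10(cost_ratio)

rl_problems = 8_000           # approximate SimpleRL-Zoo training set size, as quoted
reasonmaxxer_problems = 50

data_ratio = rl_problems / reasonmaxxer_problems

print(cost_ratio)                     # 4120.0
print(round(orders_of_magnitude, 2))  # 3.61
print(data_ratio)                     # 160.0
```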

15:58Brooks: Which is genuinely a big deal for who can do this work. A graduate student with a single GPU and an afternoon can now produce a reasoning-improved math model. That used to be a project requiring a cluster, a budget, and either FAANG infrastructure or a grant. The pricing of post-training research changes if this method holds up.

16:20Jessica: There's a quote from the paper: "a reduction in training cost of roughly three orders of magnitude." That sounds like a big number. But the dollar version sticks better. Twenty-five dollars versus a hundred thousand. That's the version of the claim that lodges.

16:36Brooks: Jessica, I want to push on this for a minute, because this is where I get cautious.

16:41Jessica: Go.

16:42Brooks: A few angles. First and most obvious: every benchmark in this paper is math. MATH-500, GSM8K, AMC, AIME, Minerva, OlympiadBench. All math. The mechanistic claim — RL only edits one to three percent of tokens, all within the base's top five — is established for math reasoning. Whether the same picture holds for code generation, multi-step agentic tasks, multi-domain reasoning that mixes math with biology with law — that's not tested. And math is a domain where solution paths are unusually constrained, where base models have seen huge amounts of similar content, where "the right answer is in the top five" is plausible. I would not generalize this story to all of post-training based on math evidence alone.

17:29Jessica: Brooks, that's fair, and the authors are actually pretty restrained about it in the paper. They explicitly scope their claims to outcome-based reasoning in math. The framing of "reasoning improvement" in the abstract is broader than the evidence supports, which is a real critique.

17:48Brooks: Second angle. The Yue paper that motivated this work looked at pass at k — sample many times from the base model, see if any of them are right. The finding was that RL improves pass at one but doesn't really improve pass at high k, because the base model already had the answers somewhere in its distribution. If ReasonMaxxer is doing the same thing as RL — collapsing probability mass onto already-known solutions — it should have the same pass-at-k weakness. The paper measures pass at one and average at eight. It does not run pass at one-twenty-eight against the base model. So we don't know if ReasonMaxxer trades the same thing RL trades.
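
To see why pass at one and pass at high k can tell different stories, here is the commonly used unbiased pass-at-k estimator: with n samples of which c are correct, the estimate is one minus C(n - c, k) divided by C(n, k). Whether the paper uses this exact estimator is not stated in the episode, and the sample counts below are made up.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples containing c correct ones."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: only 4 of 128 samples are correct.
low_k = pass_at_k(128, 4, 1)     # chance a single sample is right
high_k = pass_at_k(128, 4, 128)  # chance at least one of 128 is right

print(low_k, high_k)  # 0.03125 1.0
```

A model can look weak at k equals one and still reach a perfect score at k equals one-twenty-eight, which is why the missing high-k comparison matters.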

18:30Jessica: That's a real gap. If you ran pass at one-twenty-eight and the base model started winning, that would be consistent with both methods just being commitment mechanisms. Which fits the theory but would also be a real limitation.

18:45Brooks: Third. The cost numbers I quoted earlier. Twenty-five dollars versus one hundred three thousand. The twenty-five dollars is a measured number from the authors' actual training run. The hundred three thousand is an *estimate* derived from training scripts and hyperparameters published with the Open-Reasoner-Zero release. The paper flags this honestly with italics in the table. But it means the four-thousand-fold ratio sets a measured cost against an estimated one, and if that estimate runs high, the ratio is inflated. The real ratio is probably still very large, but the eye-popping number is partly an artifact of how the comparison is constructed.

19:24Jessica: That's a useful caveat. The narrative shape doesn't change — the cost gap is real, it's enormous — but the specific multiplier moves around depending on which baselines you trust.

19:36Brooks: The fourth thing, and this is the subtler one. The oracle intervention proves that *if you know which tokens to change*, changing only those tokens recovers RL's accuracy. Beautiful. ReasonMaxxer doesn't have the oracle. It uses entropy gating to *guess* which tokens to change, and a contrastive loss to figure out what to change them to. The paper jumps from "the correction is sparse and identifiable" to "we can reproduce its effect without RL," and the link is benchmark accuracy. They don't verify token by token that ReasonMaxxer makes the same edits as RL. So you could imagine ReasonMaxxer reaching the same accuracy through a slightly different mechanism — same destination, different path.

20:21Jessica: Right. The mechanistic story and the constructive story share an intuition, but they aren't tied together at the level of "ReasonMaxxer rewrites the same tokens." That's a real follow-up experiment, and the paper doesn't run it.

20:36Brooks: None of those is a knockout. But together they sketch the boundaries of what the paper actually shows. The mechanistic story is established for math. The cost story is real but partially leans on estimates. The connection between mechanism and method is via accuracy, not via direct token-level verification.

20:56Jessica: Take all that into account and what remains is, I think, the most compelling mechanistic account of what RL does to language models that I've seen.

21:05Brooks: The thing I keep coming back to is the random-substitution control. If they hadn't run that, this would be a paper showing two models differ at a certain rate at certain positions. With it, you have a causal claim. Same number of edits, same locations, but the specific edits matter — and there are very few right edits. That experiment is the load-bearing piece. Everything else builds on it.

21:30Jessica: And the constructive payoff is that the right edits are findable from the base model's own uncertainty, and learnable from a tiny dataset with a contrastive signal. So you don't need the RL machinery to reproduce the RL effect.

21:44Brooks: There's a broader implication worth flagging. If sparse policy selection captures most of the gain in this setting, then the elaborate post-training pipelines — distributed PPO, rollout servers, careful KL coefficients, learning-rate schedules tuned across thousands of GPU hours — that infrastructure might be solving a much smaller problem than its size suggests. Not that RL is useless. It might still matter for things this paper doesn't test. But the default investment, for outcome-only math reasoning, looks disproportionate to the problem.

22:18Jessica: And the deeper conceptual move is that this gives the field a mechanistic story it largely lacked. It's not just "RL works, here are the numbers." It's "here is what RL does, at the token level, here is the parameter footprint, and here is a method that uses that mechanistic understanding to do the same thing differently." Mechanistic stories enable better methods. ReasonMaxxer is the proof of concept.

22:44Brooks: Jessica, what do you make of the calibration framing? Because I think that's the line that survives all the caveats.

22:51Jessica: Yeah. The AlphaGo analogy that's been guiding intuitions about RL for language models is genuinely misleading for this regime. The base model isn't a piano without keys that RL is teaching to make sounds. The base model has all the keys. RL is the fine-tuner who walks in and adjusts the few strings that are slightly off, so that when you play the piano you land on the right notes at the moments of doubt. That's a much more modest operation than discovery. And I think that reframing matters more than the dollar figures, because it changes how we think about where reasoning capability comes from. A lot of the credit for reasoning improvements that has been going to the RL stage probably belongs to pretraining.

23:36Brooks: RL isn't composing music. It's tuning a piano that was already mostly in tune.

23:41Jessica: The show notes have a link to the paper and related materials. Thanks for listening to AI Papers: A Deep Dive.