0:00Hope: Picture this. A research group at ETH Zurich sits down to reproduce a hot new method for training reasoning models. They build the comparison cleanly — same model, same data, same evaluation — and their plain-vanilla baseline keeps beating the supposedly-weaker baseline reported in the published paper. Not by a fraction of a point. By five points. Six on some benchmarks. They pull the thread, and what comes out the other end is a silent bug in a widely-used training framework that has been quietly invalidating an entire wave of published comparisons for over a year.
0:41Tyler: The paper was posted to arXiv at the end of April, and we're recording about a week later. What you're hearing is AI-generated: Hope and I are AI voices from ElevenLabs, and the script comes from Anthropic's Claude Opus 4.7. Neither company is involved in producing the show. The paper is "SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning," from Limozin, Durech, Hoefler, Schlag, and Pyatkin at ETH Zurich and the Allen Institute. And it is, mechanically, a debugging story, but its actual subject is how shared infrastructure can quietly warp the conclusions of a whole subfield.
1:22Hope: Let me set up the fight, because the bug only matters once you see what it was distorting. The standard recipe for teaching a language model to reason — the one DeepSeek-R1 used last year, the one most open-source reasoning models follow — has two stages. First, supervised fine-tuning, where you show the model worked-out solutions and have it imitate them token by token. Then reinforcement learning, where you let the model generate its own attempts, score them against the right answer, and nudge the weights toward attempts that worked. Two stages, in order. Boring. Stable. Effective. Over the past year, a small wave of papers pushed back on that recipe. LUFFY, ReLIFT, SRFT, Prefix-RFT, HPT — methods that said: don't separate the stages, blend them. Mix expert demonstrations with the model's own attempts inside a single training loop. The intuition was reasonable. Pure RL struggles when the model rarely produces a correct answer, and pure imitation just teaches imitation. So mix the signals, and you should get the best of both worlds. Each of these papers reported beating the boring baseline by a meaningful margin. The field was starting to treat mixed-policy methods as the new state of the art.
2:47Tyler: And this is where the ETH group started. They weren't trying to debunk anything. They were trying to reproduce the methods to build on them. But they ran their SFT baseline using one of the standard frameworks, and they ran a separate SFT baseline using a different framework — same data, same model, same hyperparameters. The two should have agreed. They didn't. The first scored about 48 on average across math benchmarks. The second scored about 54. Five-and-a-half points just from switching libraries.
3:23Hope: Right — and it's worth pausing on what that gap means. These are not different methods. This is the same training procedure, on the same data, expected to produce the same model. A five-point gap from infrastructure alone is enormous. The cleaner of the two implementations — the one scoring 54 — was already beating the published mixed-policy methods. Which raised an uncomfortable question: maybe the wins those papers were claiming weren't really wins. So the team went hunting. The first bug they found is in DeepSpeed, the library that handles memory optimization for huge models. To run a 7-billion-parameter model on academic hardware, you need tricks — and one of the standard tricks is gradient accumulation. Your ideal batch is bigger than your GPU memory allows, so you process the batch in chunks, called micro-batches. You sum up the gradients from each chunk. Then, after you've seen all of them, you take a single optimizer step using the accumulated total. Mathematically, this should be identical to processing the full batch at once. It's the standard workaround when you're memory-constrained.
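A minimal sketch of the equivalence Hope describes, assuming a toy one-parameter model with a squared-error loss. The data values and the function names (`grad_sum`, `full_batch_grad`, `accumulated_grad`) are made up for illustration and do not come from the paper or from any framework.

```python
# Toy model: scalar weight w, per-example loss 0.5 * (w * x - y) ** 2.
# All data values below are invented for illustration.

def grad_sum(w, examples):
    """Sum of per-example gradients d/dw of 0.5 * (w * x - y) ** 2."""
    return sum((w * x - y) * x for x, y in examples)

def full_batch_grad(w, batch):
    """Mean gradient over the whole batch, computed in one pass."""
    return grad_sum(w, batch) / len(batch)

def accumulated_grad(w, batch, micro_batch_size):
    """The same mean gradient, accumulated one micro-batch at a time."""
    total = 0.0
    for start in range(0, len(batch), micro_batch_size):
        micro = batch[start:start + micro_batch_size]
        total += grad_sum(w, micro)  # accumulate only, no optimizer step yet
    return total / len(batch)        # normalize once, after the last chunk

batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0),
         (5.0, 9.0), (6.0, 13.0), (7.0, 15.0), (8.0, 15.0)]
w = 0.5

g_full = full_batch_grad(w, batch)
g_accum = accumulated_grad(w, batch, micro_batch_size=2)
print(g_full, g_accum)  # the two values should agree
```

The point of the sketch is only that summing chunk gradients and normalizing once reproduces the full-batch gradient, as long as every chunk's contribution actually reaches the optimizer.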
4:42Tyler: And the catch is that, in this case, it isn't identical.
4:45Hope: It's not identical, in a very specific way. DeepSpeed has a feature called CPU offloading. You keep the optimizer's bookkeeping in regular system RAM instead of GPU memory, because there's just not enough VRAM. To do an update, the gradients have to be copied from the GPU over to the CPU. And the copy step lives inside a chunk of code that was supposed to run after every micro-batch. But there's a misplaced branch in the code structure: the copy only fires when the micro-batch counter equals zero. In other words, only the first micro-batch's gradients ever get sent over. The rest accumulate their gradients on the GPU correctly — and then, when the optimizer reaches over from the CPU, it only sees the first chunk's contribution. The other seven, or fifteen, or thirty-one chunks are silently discarded.
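The failure mode can be imitated in a few lines. This is a hypothetical toy, not DeepSpeed code: plain Python lists stand in for the GPU gradients and for the CPU-side buffer the optimizer reads, and the 1/N scaling of each micro-batch gradient is an assumption about how the average is formed.

```python
# Hypothetical toy, NOT DeepSpeed code. Lists stand in for GPU gradients
# and for the CPU-side buffer that the offloaded optimizer reads.

def cpu_buffer_after_accumulation(micro_grads, copy_only_first):
    """Return the gradient the CPU-side optimizer would end up seeing."""
    n = len(micro_grads)
    cpu_buffer = [0.0] * len(micro_grads[0])
    for step, grad in enumerate(micro_grads):
        scaled = [g / n for g in grad]  # assumed: average over micro-batches
        if copy_only_first and step != 0:
            continue                    # misplaced branch: copy is skipped
        cpu_buffer = [c + s for c, s in zip(cpu_buffer, scaled)]
    return cpu_buffer

def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return sum(v * v for v in vec) ** 0.5

# Four micro-batches, two parameters each; values invented for illustration.
micro_grads = [[1.0, 2.0], [1.5, 1.5], [0.5, 2.5], [1.0, 2.0]]

correct = cpu_buffer_after_accumulation(micro_grads, copy_only_first=False)
buggy = cpu_buffer_after_accumulation(micro_grads, copy_only_first=True)
print(correct, l2_norm(correct))
print(buggy, l2_norm(buggy))
```

Under these assumptions the buggy path raises no error and simply hands the optimizer a smaller gradient, built from one micro-batch instead of four.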
5:43Tyler: Here's the analogy I keep coming back to. You're trying to weigh a load that's too heavy for your scale, so you split it into eight portions and plan to add the readings. The scale reads each portion correctly. But the logbook the accountant looks at only ever records the first reading. You thought you'd weighed the whole load. You actually weighed one-eighth of it — and crucially, you don't get a warning. The numbers downstream just look like the numbers from a smaller batch. Training keeps going. Loss keeps going down. Nothing screams.
6:20Hope: That's exactly right, Tyler — and the symptom on the gradient-norm plot is consistent. The buggy run shows substantially smaller gradient norms throughout training, because there's just less signal flowing into each step. The bug was introduced in DeepSpeed in September of 2024, in a pull request that was meant to be a refactor. It sat in production for over a year. And because DeepSpeed is the engine under three of the most popular SFT libraries — OpenRLHF, Llama-Factory, and TRL — every academic group running SFT through any of those libraries with CPU offloading turned on inherited the same silent failure.
7:03Tyler: Hope, before you move to the second bug — I want to make sure the listener has the geometry of this in their head. Because the asymmetry is what makes the whole story work.
7:14Hope: Please, set it up.
7:16Tyler: The mixed-policy methods — LUFFY, ReLIFT, all of them — were implemented in a different framework called verl. Verl doesn't use DeepSpeed for its optimizer. It uses a different memory-sharding approach, FSDP, which doesn't have this bug. So the new methods were running on healthy infrastructure. But the SFT baselines they compared themselves against were built using OpenRLHF or Llama-Factory or TRL — all of which run on top of DeepSpeed. The new methods were healthy. The baseline was sick. And every published comparison was structured around that asymmetry without anyone realizing it.
7:56Hope: It's like two runners lining up for a race, and only one of them has their shoelaces tied together. The race results say "new runner faster." But the laces were tied inside the shoe, where no one looks.
8:11Tyler: Right. And the wins were real, in the sense that the numbers in the tables were what the runs produced. They just weren't measuring what everyone thought they were measuring.
8:23Hope: Okay — second bug. Smaller in magnitude, but interesting because it's a different class of error. This one lives in OpenRLHF itself, in how it computes the SFT loss when training is distributed across multiple GPUs. Each GPU computes a local average loss across its mini-batch. Then those local averages get averaged together across GPUs. Mean of means. That's the bug.
8:49Tyler: And mean-of-means doesn't equal the true mean.
8:53Hope: Not when the chunks have different sizes. Here's the school analogy. You want the average test score across a whole school. The right way is to add up every student's score and divide by the total number of students. The wrong way is to compute each classroom's average and then average those classroom averages. If classes have different sizes, the small ones get over-weighted. A class of five averaging eighty percent gets the same vote as a class of fifty averaging sixty percent. The honest average across all fifty-five students is closer to sixty-two. The mean-of-means says seventy. It's just wrong. In SFT, the "students" are response tokens — the tokens the model is actually being graded on, after the prompt. And mini-batches contain wildly different numbers of response tokens, because prompts and responses are different lengths. So mean-of-means systematically mis-weights every step of training. The fix is unglamorous: sum the tokens, sum the losses, divide once at the end.
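The school arithmetic is easy to check. A minimal sketch, with function names (`mean_of_means`, `pooled_mean`) chosen for illustration; in the SFT setting each group would be the per-token losses on one GPU rather than the scores in one classroom.

```python
def mean_of_means(groups):
    """Average each group first, then average the group averages."""
    return sum(sum(g) / len(g) for g in groups) / len(groups)

def pooled_mean(groups):
    """Sum every value, count every value, divide once at the end."""
    total = sum(sum(g) for g in groups)
    count = sum(len(g) for g in groups)
    return total / count

# The school example: 5 students averaging 80, 50 students averaging 60.
classes = [[80.0] * 5, [60.0] * 50]
print(mean_of_means(classes))          # 70.0
print(round(pooled_mean(classes), 1))  # 61.8

# With equal group sizes the two aggregations agree.
equal_classes = [[80.0] * 5, [60.0] * 5]
print(mean_of_means(equal_classes), pooled_mean(equal_classes))  # 70.0 70.0
```

The last two lines show why the shortcut is harmless when every group has the same size and wrong as soon as the sizes differ.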
10:05Tyler: And the lineage of this bug is interesting. It came from pretraining code, where it didn't matter. Pretraining packs data so every batch has exactly the same number of active tokens, and mean-of-means happens to equal the true mean. The same code got copy-pasted into SFT codebases — and SFT doesn't pack the same way. So the equivalence quietly broke. There were even two separate disclosures about this same class of bug in mainstream finetuning code in late 2024 — Daniel Han at Unsloth, and the Hugging Face team, both flagged versions of it. This paper's contribution is showing it's still alive in OpenRLHF and Llama-Factory and that it materially shifts the SFT-for-reasoning numbers.
10:56Hope: And now we get to my favorite table in the paper — the staircase. The authors set up a controlled comparison where they isolate each bug's contribution. Start with the buggy OpenRLHF baseline at 48.3 average. Fix only the loss aggregation bug — the mean-of-means — and you get 49.1. Less than a point of improvement. Then start over and fix only the optimizer bug, leave the loss bug in place, and you get 53.4. Five points. Then fix both, and you land at 54.0 — which lines up almost exactly with the independently-implemented verl baseline at 53.8. Four numbers. They tell the whole story. The optimizer bug accounts for nearly the entire gap. The loss bug is real but small. And the patched pipeline matches the clean pipeline, which closes the diagnosis.
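The staircase can be read as simple differences against the buggy baseline. The scores below are the averages quoted in the episode; the dictionary keys are labels invented here for readability.

```python
# Average scores quoted in the episode for the four SFT configurations.
scores = {
    "both bugs": 48.3,
    "loss fix only": 49.1,
    "optimizer fix only": 53.4,
    "both fixes": 54.0,
}
verl_baseline = 53.8  # independently implemented baseline, as quoted

baseline = scores["both bugs"]
gains = {
    name: round(score - baseline, 1)
    for name, score in scores.items()
    if name != "both bugs"
}
verl_gap = round(scores["both fixes"] - verl_baseline, 1)

print(gains)     # {'loss fix only': 0.8, 'optimizer fix only': 5.1, 'both fixes': 5.7}
print(verl_gap)  # 0.2
```

The two single-fix gains add up to 5.9, slightly more than the 5.7 gained by applying both fixes together, so the contributions are close to additive but not exactly so.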
11:53Tyler: That four-number staircase is one of the most informative experimental designs I've seen in a while. It's not flashy, but it ties each number in the headline result to a specific code change — which is the kind of attribution most ML papers gesture at and never actually deliver.
12:14Hope: And then the obvious next step. Take the corrected SFT baseline, run a standard RL stage on top of it, and compare to the published mixed-policy methods on their own turf. On the Qwen Math seven B model, the corrected SFT alone hits 52.2 — already beating LUFFY at 46.3 and ReLIFT at 48.8. Add the RL stage, and you get 57.0. That beats the strongest mixed-policy method, SRFT, by 3.8 points.
12:45Tyler: And then there's Llama, which is a different kind of result entirely.
12:50Hope: The Llama story is wild. On Llama three-point-one eight B, which is a base model with weak math priors, so SFT bootstrapping matters even more, corrected SFT alone reaches 33.9. The best published mixed-policy method on Llama, HPT, scored 21.5. The corrected baseline beat it by twelve points before the RL stage even started. Run the full SFT-then-RL pipeline, and you land at 43.7. LUFFY on Llama: 14.4. ReLIFT on Llama: 15.6. Even against HPT, the strongest of them, the gap is 22.2 points.
13:25Tyler: Twenty-two points is a number that should make anyone sit up. A pipeline bug producing a twenty-two-point swing on a benchmark family is — well, it's almost too clean. The authors give a plausible explanation: Llama doesn't know much math out of the box, so a weak SFT stage cripples it more dramatically than it cripples Qwen, which already has math priors baked in. Mixed-policy methods on Llama are essentially trying to do bootstrapping and refinement at the same time on a model that can't bootstrap, and the demonstration signal is too sparse to lift it. There's a figure in the paper that makes this visceral. The training reward curves for LUFFY and ReLIFT on Llama stay below thirty percent for the entire 500-step run. They never get off the ground. The corrected SFT-then-RL starts at sixty percent and climbs from there.
14:30Hope: That image is the one I'd put on the cover. The mixed-policy methods, after 500 steps of training, still haven't reached the level the standard pipeline starts at.
14:43Tyler: And it's not just better — it's cheaper. The authors run a truncated version of their pipeline, with only fifty RL steps instead of the standard 500. Ten times shorter. That truncated pipeline still beats every mixed-policy method on the in-distribution math benchmarks. And the FLOP count is roughly half what LUFFY uses, less than half what ReLIFT uses. So the boring recipe is faster, cheaper, and stronger.
15:11Hope: There's a nice secondary finding tucked into the paper that I want to flag, because it tells you something about how subtle these effects can be. On Llama, prior works claimed the model couldn't follow the standard Qwen-style system prompt — the long structured one with explicit reasoning instructions. So they used a simplified prompt. The ETH group finds that this was actually an artifact of undertrained SFT. Once SFT is done correctly, Llama follows the full prompt fine. A "model limitation" turned out to be a training pipeline limitation. Same family of error: the infrastructure was hiding something, and the field had built a small architectural workaround for what was really a plumbing issue.
15:59Tyler: Hope, this is where I want to push on the steelman, because the paper makes some strong claims and the listener should know where the edges are. There are several places where a careful reader should hesitate before generalizing too far.
16:15Hope: Go ahead.
16:15Tyler: First, this is one dataset, one benchmark family, two model sizes. All math. All on a specific 46-thousand-example training set. It's possible that mixed-policy methods have advantages that show up at larger scale, on harder problems, or in domains where the SFT bootstrapping story breaks down differently. The authors are explicit about this and don't oversell. Second, and this one is the most important — the mixed-policy methods position themselves as single-stage alternatives to SFT-then-RL. The paper invalidates the published comparisons, which is a real and important result. But it doesn't rule out the possibility that a mixed-policy stage applied on top of a properly trained SFT model could add value. That experiment hasn't been run. The authors say so directly. So the precise claim is "the published comparisons are wrong," not "mixed-policy is fundamentally a bad idea." Third, the authors' own SFT uses tuned hyperparameters. The original baselines, in some cases, used worse ones. There's an example with SRFT where switching its SFT to LUFFY's hyperparameters jumps the baseline by five-and-a-half points and shrinks the apparent SRFT advantage from 7.6 points to 2.1. That's a different story than the bug story — it's "the baselines were sandbagged by suboptimal hyperparameter choice as well as by bugs." Fair point in the paper's favor. But it does mean a skeptic could ask how much of the corrected baseline's strength is just careful tuning that the original baselines simply didn't get. And fourth, the reproductions of mixed-policy methods are single-seed. The authors run their own SFT and SFT-then-RL with three seeds and report standard deviations, which is responsible. The LUFFY and ReLIFT reproductions are not. The gaps are large enough that variance is unlikely to flip the ordering, but a more defensible comparison would multi-seed both sides.
18:23Hope: All of those are fair, Tyler, and I think the authors hold up well under each of them. They acknowledge every one in the limitations section. The frame I'd offer is that this isn't a paper claiming mixed-policy methods don't work — it's a paper claiming the evidence cited in their favor doesn't actually support what it was taken to support. Those are two very different claims, and the authors are careful to make the narrower one.
18:49Tyler: Right. And the narrower claim is more interesting anyway — because the broader implication is what the paper is really after.
18:57Hope: Which is?
18:58Tyler: Two bugs in two widely-shared open-source frameworks were enough to systematically warp the conclusions of at least five published papers in the same subfield. Not because anyone was cheating. Not because of cherry-picking. Not because of bad faith. Because everyone's baselines flowed through the same broken plumbing. The authors put it bluntly: silent bugs in widely-used pipelines were sufficient to systematically deflate baselines across multiple independent studies.
19:27Hope: It's a structural failure mode. The kind that's almost impossible to catch from inside the system. Five different research groups can each "independently replicate" a result, and if they're all running through the same library, their independence is illusory at the level that matters. They're not testing whether the result is real. They're testing whether the library is consistent with itself.
19:51Tyler: There's an analogy I want to offer here, even though it's a little forced. Imagine a neighborhood where every house gets its water from the same main pipe, and the pipe has a slow contaminant leak. Every household has a water-quality gauge, and every gauge reads normal — because every gauge in the neighborhood is calibrated against samples drawn from the same contaminated source. Independent measurement, same reading, false consensus. The only way to detect the problem is to bring water in from a different pipe.
20:22Hope: And in the ML version, the "different pipe" was just a researcher who happened to use a different framework for the baseline than for the new method. That's it. The authors didn't build a new diagnostic tool. They didn't develop new theory. They re-implemented one experiment in a second library and noticed the readings disagreed.
20:44Tyler: Which is the methodological argument the paper is making, even when it's not making it explicitly. Framework diversity is a kind of epistemic insurance. If a subfield concentrates all its empirical work on a single training stack, a single bug in that stack becomes invisible consensus. The fix isn't to demand bug-free libraries — that's not realistic. The fix is to keep enough diversity in the infrastructure that disagreements between implementations show up as disagreements in numbers, which can then be investigated.
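One way to picture the cross-framework check is as a tolerance test on the same recipe run through two stacks. This is a sketch only: `frameworks_disagree` and the 1.0-point tolerance are hypothetical choices made here, and a real threshold would have to reflect seed-to-seed variance. The two scores are the SFT averages quoted earlier in the episode.

```python
def frameworks_disagree(scores, tolerance):
    """Return (flag, spread). The flag is True when the same recipe scores
    more than `tolerance` points apart across frameworks."""
    spread = max(scores.values()) - min(scores.values())
    return spread > tolerance, round(spread, 1)

# Same data, model, and hyperparameters; only the framework differs.
sft_scores = {"OpenRLHF": 48.3, "verl": 53.8}

flag, spread = frameworks_disagree(sft_scores, tolerance=1.0)  # hypothetical tolerance
print(flag, spread)  # True 5.5
```

A flag like this does not say which implementation is wrong, only that the disagreement is large enough to investigate.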
21:18Hope: And the paper closes on the implication that bites hardest. The DeepSpeed CPU-offloading bug doesn't only affect SFT for reasoning. It affects any training run that uses DeepSpeed with CPU offloading and gradient accumulation, which is most memory-constrained academic work. Which means the same silent failure has been operating in other subfields for over a year, on baselines that nobody re-implemented in a different framework. We don't know what those subfields would look like with a healthy baseline. The authors don't claim to know. But the question is now sitting in plain sight.
21:56Tyler: That's the takeaway I want listeners to carry. The narrow finding is that SFT-then-RL beats mixed-policy methods for math reasoning, by margins large enough that the field's conclusions on this specific question should flip. The broader finding is that we should be more nervous about benchmark-driven progress in regimes where everyone's baseline runs through the same library. Not paralyzed — but more nervous than we have been.
22:25Hope: And a small hopeful note, Tyler. The bug was caught. Not because of formal incentives — there's no conference reward for finding a bug in DeepSpeed that affects other people's papers — but because a research group that wanted to build on a method took the time to actually reproduce its baseline cleanly. That's the version of the field doing its job. It's slower. It's less glamorous. It produced this paper.
22:52Tyler: Show notes have a link to the paper and related materials, if this episode caught you and you want to read the actual bug diff.
23:00Hope: This is AI Papers: A Deep Dive. Thanks for listening.