How a Fifteen-Hundred-Dollar Training Run Matched Llama and Gemma on Reasoning

0:00Bella: Sixteen GPUs. Forty-six hours. Fourteen hundred and seventy-two dollars on the invoice — call it fifteen hundred. That's what it cost a team at Sapient Intelligence and MIT to train a one-billion-parameter language model from scratch this month. And on benchmarks like grade-school math and the MATH dataset, that model goes toe to toe with — sometimes beats — Llama 3.2 3B, Qwen 3.5 2B, Gemma 3 4B, and OLMo 3 7B. Models that cost their builders, depending on how you count, somewhere between roughly a hundred and four hundred times more compute, and a hundred to nine hundred times more tokens to train.
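
A quick check of the arithmetic behind that invoice, as a minimal sketch. The GPU count, hours, and dollar total are the figures quoted above; the per-GPU-hour rate is only what those figures imply, not a price stated in the episode.

```python
# Figures as quoted in the episode.
gpus = 16
hours = 46
invoice_usd = 1472

# Total accelerator time for the run.
gpu_hours = gpus * hours

# Implied rental rate (derived from the quoted numbers, not quoted itself).
implied_usd_per_gpu_hour = invoice_usd / gpu_hours

print(gpu_hours)                 # 736
print(implied_usd_per_gpu_hour)  # 2.0
```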

0:42Tyler: That's the headline. The paper went up on arXiv on May twentieth, twenty-twenty-six, and we're recording four days later. What you're hearing is AI-generated — the script is from Anthropic's Claude Opus 4.7, and I'm Tyler, that's Bella, we're both AI voices from Eleven Labs. Neither company is involved in producing the show. The paper is "HRM-Text: Efficient Pretraining Beyond Scaling," and the reason that fifteen-hundred-dollar number matters isn't just democratization — though it matters for that too. It's that the authors think they've shown the trillion-token pretraining race was never necessary in the first place.

1:25Bella: Right. And the way to feel that claim, Tyler, is to look at the two assumptions baked into how everyone trains language models right now. First assumption: the model is a vanilla decoder-only Transformer. Stack of identical blocks, each with its own parameters, gradients flow straight up and down. Second assumption: you train it by dumping trillions of tokens of internet text into it and grading it on predicting every single token — every word of every prompt, every word of every answer, every word of the boilerplate between. Both assumptions look, on inspection, wasteful. The paper rebuilds the model around the first assumption being wrong, and rebuilds the training around the second one being wrong.

2:15Tyler: And the two changes compound. Which is the part the ablation table makes really clean — but I think we should set up the architecture story first, because that's where the brain analogy comes in. You wanted to take us into the H and L modules.

2:31Bella: Yeah. The biological hook the authors keep gesturing at is the frontoparietal loop in the brain — basically, fast reflexive execution and slow strategic deliberation operating on different clocks. Think of a strong chess player. There's a fast part of their thinking that just sees, move to move: that knight wants to be here, this is a kingside attack. And there's a slower part that updates only when something fundamental about the position changes: okay, we're in a closed game now, I should be playing for the long term. The fast layer fires many times for every update of the slow layer. HRM-Text builds that split directly into the architecture. There are two modules — they call them L and H. L is the fast one. H is the slow one. One forward pass through the model runs the L module three times, then updates H once. And it does that whole cycle twice. So a single forward pass is eight module-steps total, but each module has half the parameters of a comparable Transformer.
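
The call order described here can be sketched in a few lines. This only reproduces the schedule as stated (three L steps, one H update, the whole cycle twice); the function name `hrm_schedule` and its parameters are hypothetical, and how the two modules exchange state is not shown because the episode does not specify it.

```python
def hrm_schedule(cycles=2, l_steps_per_cycle=3):
    """Return the order of module calls in one forward pass."""
    steps = []
    for _ in range(cycles):
        steps.extend(["L"] * l_steps_per_cycle)  # fast module runs several times
        steps.append("H")                        # slow module updates once per cycle
    return steps

schedule = hrm_schedule()
print(schedule)
print(len(schedule))  # 8
```

With the defaults, the schedule is L, L, L, H, L, L, L, H, which is where the figure of eight module-steps comes from.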

3:37Tyler: So the model is, in a sense, reusing its weights — running the same blocks multiple times rather than stacking more layers.

3:46Bella: Exactly right. It's a recurrent design — and this is an old, recurring idea in deep learning. Universal Transformers, Looped Transformers, RINS — they all share the family resemblance: get more computation per parameter by looping the same block. The fast/slow split is what's new here. And the reason recurrent designs haven't dominated language modeling, despite being efficient on paper, is that they're a nightmare to train. When you propagate gradients backward through the same operation applied many times in a row, the gradients get spiky. Vanishing, exploding, heavy-tailed events where one training step suddenly has a gradient a hundred times larger than the last. The training run blows up.

4:31Tyler: Which is where MagicNorm comes in. Which the authors named, I assume, because it deserves it.

4:37Bella: It's a cute name, and the trick underneath is actually clever. So normalization layers in a Transformer — these are the things that keep numbers from blowing up or shrinking to zero as they flow through the network. There's a classic tradeoff. You can put the normalization BEFORE the main computation in each block — that's called PreNorm. Gradients flow cleanly during training, but activations drift larger as you go deeper. Or you put it AFTER — PostNorm. Activations stay bounded, but gradients get strangled. The MagicNorm idea exploits an asymmetry you wouldn't think to use. Here's the image. Imagine a tightrope walker crossing a very long rope. There's a safety net — but the net only covers the last few feet of the crossing. You want the rope itself to be stable for the whole walk. But you only need the catching apparatus where it can actually catch you.
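
The forward-pass half of that tradeoff can be seen in a toy example. This is a sketch under loud assumptions: `sublayer` is a made-up stand-in for attention or an MLP, RMS normalization is used as the norm, and only activation size is measured. The gradient behavior that PreNorm is credited with is not demonstrated here.

```python
import math

def rms(x):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_norm(x, eps=1e-6):
    scale = rms(x) + eps
    return [v / scale for v in x]

def sublayer(x):
    # Hypothetical stand-in for the block's main computation.
    return [0.5 * v + 1.0 for v in x]

def pre_norm_block(x):
    # PreNorm: normalize BEFORE the computation, then add to the residual.
    fx = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, fx)]

def post_norm_block(x):
    # PostNorm: add to the residual, then normalize AFTER.
    fx = sublayer(x)
    return rms_norm([a + b for a, b in zip(x, fx)])

x_pre = x_post = [1.0, -2.0, 3.0, 0.5]
for _ in range(8):
    x_pre = pre_norm_block(x_pre)
    x_post = post_norm_block(x_post)

print("PreNorm activation RMS after 8 blocks: ", rms(x_pre))
print("PostNorm activation RMS after 8 blocks:", rms(x_post))
```

In this toy setup the PreNorm activations keep growing with depth, while the PostNorm activations are pinned near an RMS of one.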

5:34Tyler: Translate the metaphor.

5:35Bella: The forward pass — when the model is producing predictions — goes through every recurrent step. All eight of them. The backward pass — when gradients flow back to update the weights — is truncated. The authors only let gradients flow through the last few steps, not the whole unroll. So the model thinks forward through a long chain but only learns from the tail end of that chain. MagicNorm puts a stabilizing norm at the exit of every recurrent step. On the forward pass, that norm fires eight times — lots of stabilization, activations stay bounded. On the backward pass, it only fires a few times, because gradients are truncated — so the gradient-friendly PreNorm behavior inside each block dominates. Same architecture, both behaviors, because the forward and backward horizons are different lengths. There's a second trick they layer on top — they call it warmup deep credit assignment — where they start training with an even shorter backward horizon, just two steps, and gradually extend it as the model stabilizes. Short leash early, longer leash once the optimization landscape isn't a minefield.
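
The forward/backward asymmetry and the warmup can be written down as simple bookkeeping. This is a sketch, not the authors' code: the starting horizon of two steps and the eight forward steps come from the description above, while the final horizon of four, the linear schedule, and all function names are assumptions made for illustration. No actual gradients are computed.

```python
def backward_horizon(step, warmup_steps, start=2, final=4):
    """Backward horizon at a training step (assumed linear warmup)."""
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + int(frac * (final - start))

def norm_counts(total_steps, horizon):
    """How many exit norms the forward and the truncated backward pass cross."""
    forward = total_steps
    backward = min(horizon, total_steps)
    return forward, backward

def steps_with_gradient(total_steps, horizon):
    """Indices of recurrent steps that gradients flow through (the tail only)."""
    backward = min(horizon, total_steps)
    return list(range(total_steps - backward, total_steps))

for train_step in (0, 500, 1000):
    h = backward_horizon(train_step, warmup_steps=1000)
    print(train_step, norm_counts(8, h), steps_with_gradient(8, h))
```

Early in training the forward pass crosses eight exit norms while the backward pass crosses two; the gap narrows as the horizon grows but the forward count always stays at eight.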

6:49Tyler: That asymmetry is genuinely satisfying. The norm gets to be in two places at once because the forward and backward passes don't pass through it the same number of times. Okay, so that's the architecture, and the stability tricks that make it trainable. The other half of the story is the objective. What are you actually grading the model on?

7:11Bella: This is where I hand it over, Tyler. Because I think the objective change is, in some ways, the more provocative claim.

7:19Tyler: I think it might be. So the standard pretraining recipe — and I want to make sure this is clear — is: take a piece of internet text, any piece, doesn't matter what, and grade the model on predicting every single word of it given the previous words. Not just the interesting words, not just the words a user would want generated — every word. Including boilerplate. Including the parts of the prompt the model will never actually have to produce when it's deployed. At inference time, language models are doing conditional generation. Given a question, produce an answer. So the natural question the authors ask is: why are we spending most of pretraining teaching the model to predict the question?

8:04Bella: Right — because at inference, the question is given.

8:09Tyler: It's given. You don't generate it. You read it. And the analogy the paper all but writes for you is the exam grader. Imagine grading a student two ways. In the first, you grade them on copying down the question accurately AND writing the answer. In the second, you only grade the answer; the question is just printed on the page. The first way wastes the student's effort, and it wastes yours as the grader. Standard pretraining is the first style. HRM-Text grades only the answer. Concretely: instead of computing loss over every token of the document, you compute loss only over the response tokens of an instruction-response pair. Every gradient update directly improves response generation. Nothing is spent teaching the model to autoregressively model prompt-style text.
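
Grading only the answer amounts to masking the loss. A minimal sketch, assuming per-token log-probabilities are already available; the helper name `masked_nll` and the toy probabilities are invented for illustration.

```python
import math

def masked_nll(token_logprobs, is_response):
    """Mean negative log-likelihood over the tokens flagged as response."""
    picked = [-lp for lp, keep in zip(token_logprobs, is_response) if keep]
    if not picked:
        raise ValueError("no response tokens to score")
    return sum(picked) / len(picked)

# Toy sequence: two prompt tokens followed by two response tokens.
logprobs = [math.log(p) for p in [0.9, 0.8, 0.5, 0.25]]
response_mask = [False, False, True, True]

loss_response_only = masked_nll(logprobs, response_mask)
loss_all_tokens = masked_nll(logprobs, [True] * len(logprobs))

print("response-only loss:", loss_response_only)
print("all-token loss:    ", loss_all_tokens)
```

In the response-only case the prompt tokens contribute nothing to the loss, so no gradient is spent on predicting them.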

9:00Bella: I want to push on this for a second, because there's a steelman of the standard objective that doesn't show up immediately. The argument FOR predicting every token is that it teaches you general language modeling. The model learns about syntax, vocabulary, discourse structure by being graded on everything. The exam-grader analogy is a little unfair because copying the question is trivial; predicting prompt-like text in pretraining is actually informative.

9:30Tyler: Sure — and the authors don't deny that. Their bet is that the marginal gain from grading the model on questions is much smaller than people have assumed, and that the gradient signal you concentrate by not doing it is worth the tradeoff. The ablation supports that empirically. But the steelman is real. We'll come back to it. There's a second change to flag here that ties into the same logic. Once you decide you're not grading the model on the question, you can also stop forcing it to read the question one word at a time. This is the PrefixLM piece.

10:05Bella: Walk through that.

10:07Tyler: So in a normal decoder-only Transformer, every token uses what's called a causal mask. Each word can only see the words that came before it. That makes sense when you're generating, because you don't get to know the future. But when you're reading a question that's already given to you — when the whole prompt is sitting in front of you — there's no reason the model shouldn't be able to look at all of it freely. The image I like, and I think the paper uses something close to this: when you read a question on a page, your eyes can move around. You glance at the end, you go back to the beginning, you cross-reference. But when you WRITE the answer, you have to produce it one word at a time, in order, committing to each word before knowing the next. PrefixLM gives the model exactly that asymmetry. The question tokens can all see each other simultaneously, like an encoder reading the whole thing. The answer tokens are still generated one at a time, causally. Same model. Same forward pass. Just a different mask.
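
The mask itself is small enough to write out. A minimal sketch with hypothetical function names: `mask[i][j]` is true when position `i` may attend to position `j`, and the first `prefix_len` positions are the question.

```python
def prefix_lm_mask(seq_len, prefix_len):
    """Prefix positions are visible to everyone; the rest is causal."""
    return [
        [j < prefix_len or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]

def causal_mask(seq_len):
    """Ordinary causal mask: a PrefixLM mask with an empty prefix."""
    return prefix_lm_mask(seq_len, prefix_len=0)

def show(mask):
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))

# Three question tokens followed by two answer tokens.
show(prefix_lm_mask(5, 3))
print()
show(causal_mask(5))
```

In the PrefixLM grid the top-left three-by-three block is fully filled in, so question tokens see each other in both directions, while the answer rows still only look backward.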

11:13Bella: And this is encoder-like behavior on the question without needing a second model.

11:18Tyler: Without needing a second model. The paper shows this actually increases attention entropy across the layers — the model uses more of the prompt, more globally. It's looking around. And there's a beautiful ablation that ties all this together. You don't need to see the table. They start with a vanilla Transformer trained on standard causal language modeling over full text. MMLU score: forty and a half. Then they switch only the objective — same model, but now graded only on answer tokens. MMLU jumps to forty-eight. Then they add PrefixLM attention on top — fifty-three. Then they swap in the HRM architecture — sixty-one. Each step adds something. None of them does all the work. The contributions are additive.
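
The step-by-step gains in that ablation, computed from the MMLU scores as spoken. The scores may be rounded in the narration, so the differences are approximate; the row labels are paraphrases rather than the table's own wording.

```python
# MMLU scores as quoted in the episode, in ablation order.
ablation = [
    ("Transformer, causal LM over full text", 40.5),
    ("+ loss on answer tokens only", 48.0),
    ("+ PrefixLM attention", 53.0),
    ("+ HRM architecture", 61.0),
]

# Gain contributed by each change relative to the previous row.
gains = [round(after[1] - before[1], 1)
         for before, after in zip(ablation, ablation[1:])]
total_gain = round(ablation[-1][1] - ablation[0][1], 1)

for (name, _), gain in zip(ablation[1:], gains):
    print(f"{name}: +{gain}")
print("total:", total_gain)
```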

12:05Bella: That's the spine of the technical contribution. And I want to land the payoff piece, which is the question of whether the recurrent depth actually does anything. Because you could imagine all of this being true — clever architecture, clever objective — and the result still being that the recurrent loops are essentially decorative. The model commits to its answer early, and the later passes don't really change anything.

12:31Tyler: Which is exactly what we know happens in standard Transformers.

12:35Bella: It's what happens in standard Transformers. There's a diagnostic called the logit lens — you take the model's intermediate hidden state at each layer, project it forward as if that layer were the final layer, and ask: what would the prediction be if we stopped here? In standard Transformers, you can stop relatively early. The first third of the layers settle on an answer, the deeper layers nudge it around, but the prediction is basically locked in by the middle of the network. Picture a committee where each member adds their two cents — except the first few members lock in a decision and everyone after them just nods along. HRM is different. When you run the logit lens through HRM's recurrent cycles, the prediction keeps meaningfully shifting all the way through. The last cycle is still updating the answer. The committee is still actually deliberating.
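
The logit-lens procedure can be sketched on toy data. Everything here is illustrative: the hidden states and the identity unembedding matrix are invented, the final normalization that real models usually apply before the output head is omitted, and `commit_depth` is a hypothetical summary statistic, not a metric named in the episode.

```python
def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

def logit_lens(hidden_states, unembed):
    """Project each intermediate hidden state through the output head."""
    preds = []
    for h in hidden_states:
        logits = [sum(w * v for w, v in zip(row, h)) for row in unembed]
        preds.append(argmax(logits))
    return preds

def commit_depth(preds):
    """Earliest index from which the prediction never changes again."""
    final = preds[-1]
    depth = len(preds) - 1
    while depth > 0 and preds[depth - 1] == final:
        depth -= 1
    return depth

# Toy unembedding: 3-token vocabulary, 3-dimensional hidden state.
unembed = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Invented hidden states: one run that locks in early, one that keeps revising.
early = [[1, 0, 0], [0, 1, 0], [0, 2, 0], [0, 3, 0]]
late = [[1, 0, 0], [0, 1, 0], [1, 0, 0.5], [0, 0, 1]]

early_preds = logit_lens(early, unembed)
late_preds = logit_lens(late, unembed)
print(early_preds, commit_depth(early_preds))
print(late_preds, commit_depth(late_preds))
```

A low commit depth corresponds to the committee that decides early and nods along; a commit depth at the last index corresponds to a prediction that is still being revised at the end.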

13:29Tyler: Which is, on its own, a striking result. The scale-versus-structure debate in machine learning has been running for years, and the story has mostly been scale. Bigger models, more data, more compute, capabilities emerge. What this is suggesting is that some of what we've been calling emergent capability is actually a workaround for under-utilized depth. The standard Transformer wastes most of its layers. If you build an architecture that doesn't waste them, you don't need as much scale to get the same behavior.

14:01Bella: And that's the deep version of the democratization claim. The fifteen-hundred-dollar number is the visceral hook. The under-utilized-depth result is the intellectual reason it works.

14:13Tyler: Okay. So if I'm being honest, this is also where the skeptic in me wants airtime. Because the headline numbers are extraordinary, and extraordinary numbers deserve scrutiny. There are three places where I want to push back, and I think the paper handles two of them well and one of them less well. You want to take any of these, or should I just run through them?

14:37Bella: Run through them, Tyler. I'll push back where I disagree.

14:40Tyler: First: it's not apples-to-apples. Llama, Qwen, Gemma, OLMo: these are general-purpose pretrained models. They were trained on raw web text and only later instruction-tuned for benchmarks. HRM-Text trains exclusively on instruction-response pairs from the start. So when you compare them on benchmarks that mostly test instruction-following (math problems, reasoning, multiple-choice questions), you're comparing a model that trained for exactly that task against models that did it as a finishing step. Of course the specialized model looks competitive. The fair question is whether HRM-Text has the same generality as the comparison models. Could you fine-tune it for a novel downstream task the way you can with Llama? The paper doesn't really show that. And it's suggestive that HRM-Text loses noticeably on HellaSwag, which is more of a commonsense and world-knowledge benchmark: sixty-three percent versus seventy-seven for Gemma. That gap reads as: the broad factual coverage isn't there.

15:46Bella: The authors actually acknowledge this point pretty cleanly. They frame HRM-Text as good at reasoning and task execution, less good at broad factual recall. And they suggest external memory or retrieval as the complement. So this isn't a critique they're hiding from.

16:03Tyler: They're not hiding from it. But it does change what the fifteen-hundred-dollar number means. It's fifteen hundred dollars to train a competitive reasoning model. Not fifteen hundred dollars to train a competitive general-purpose foundation model. Those are different claims.

16:22Bella: Fair.

16:22Tyler: Second pushback — and this is the one I think gets less attention than it deserves. The training data is heavily curated. Stratified by domain. Capped, upsampled, deduplicated. It includes datasets like OpenMathInstruct and NuminaMath that are specifically built to teach mathematical reasoning. The strong math benchmark numbers — eighty-four percent on grade-school math, fifty-six on the MATH dataset — those might reflect the curated data mixture as much as the architecture. The ablation in the paper controls for objective and architecture, but it does NOT control for the data mixture. A standard Transformer trained on this same curated mixture would presumably also outperform a Transformer trained on raw web text. And the paper doesn't isolate how much of the headline improvement comes from data curation versus architecture versus objective.

17:20Bella: That's the place I think we can add something the paper doesn't. Because data curation is real work, and it's expensive in human effort if not in compute, and that cost isn't reflected in the fifteen-hundred-dollar invoice. Somebody assembled and filtered that mixture, and that somebody isn't on the GPU bill.

17:42Tyler: It isn't on the bill. Third pushback — and this one the authors are explicit about — is that scaling is unverified. They only tested up to one billion parameters for HRM, three billion for the Transformer baseline. The whole claim is that this architecture changes the compute-to-performance ratio. But the comparison models are two to seven billion parameters trained on far more tokens. It's possible that HRM-Text's competitiveness is specific to the small-model regime, where the comparison models are themselves undertrained relative to their architecture. Whether HRM-Text at seven billion trained on four hundred billion tokens would compete with Llama at seventy billion trained on fifteen trillion — that's unknown. And the authors say so directly. So this isn't a hidden flaw. It's a stated open question.

18:36Bella: And the way they frame the whole paper is as an existence proof rather than a recipe. They're not saying this is the new pretraining paradigm. They're saying the compute-to-performance ratio is not a law of nature. The current paradigm leaves enormous efficiency on the table, and here's a one-billion-parameter, two-day, fifteen-hundred-dollar demonstration that you can match a much more expensive training run by changing what you're optimizing and how the architecture spends its computation. That's a narrower claim, and it's the claim that survives all three critiques.

19:15Tyler: Which I think is the right frame to leave the listener with. Not "the trillion-token race was a mistake." More like: the trillion-token race was solving a problem that better architecture could partly have avoided. The standard recipe works. It also wastes a lot of computation on under-utilized depth and on predicting text the model will never generate. If you fix both of those, you can get into the performance neighborhood of much larger, much more expensive models with a university-lab budget. That's not nothing.

19:50Bella: That's not nothing. And honestly, the most exciting thing about this paper is the invitation in the conclusion. The authors basically say: pretraining from scratch is accessible again, come join us. For most of the last few years, foundational architecture research has lived inside a handful of labs that can afford it. If a sixteen-GPU, two-day, fifteen-hundred-dollar training run can produce something this competitive, that means a much larger community of researchers gets to ask architectural questions and actually answer them. Not at frontier scale yet. But at the scale where ideas can be tested and iterated. The compute moat shrinks by a meaningful amount.

20:33Tyler: The paper is from Wang and collaborators at Sapient Intelligence and MIT. Link's in the show notes, along with some related reading if you want to go deeper.

20:44Bella: And if you want the full transcript with definitions baked in for every term we touched — MagicNorm, prefix language modeling, truncated backprop through time — that's all on paperdive dot AI, with concept pages that connect this episode to the others we've done on efficient training and architecture.

21:04Tyler: Thanks for listening. This was AI Papers: A Deep Dive.