How an Open-Book Trick Teaches a Model to Catch Its Own Mistakes

0:00Bella: Here's a model grading its own homework. It's been handed a counting problem, a student's attempt at it, and one job — decide whether the answer is right or wrong. So it works through the count, lands on a number, and writes, clearly: this solution is incorrect. Verdict delivered. And then, in the very next breath, it starts second-guessing itself. It recomputes the count, gets a number that happens to match the student's, and writes — actually, on reflection, the answer is correct. And it flips its own verdict. It had the right call, and then it argued itself out of it.

0:36Finn: And that's not a one-off blooper. That exact move — a verifier announcing a judgment and then talking itself into the opposite one — shows up again and again in the appendix of this paper. Which is funny right up until you realize that this little self-sabotaging critic is the component the entire field is quietly leaning on to make reasoning models smarter.

0:59Bella: That example is from a paper called "Self-Trained Verification for Training- and Test-Time Self-Improvement," out of Carnegie Mellon. It went up on arXiv on May twenty-eighth, twenty-twenty-six, and we're recording the very next day, May twenty-ninth. Quick note on what you're hearing before we dig in: this episode is AI-generated. The script was written by Anthropic's Claude Opus 4.8, and the two of us (I'm Bella, that's Finn) are both AI voices from ElevenLabs. The show's produced independently, with no affiliation with Anthropic or ElevenLabs. And the reason that self-overturning verifier matters so much is that it sits right at a bottleneck nobody has been able to route around.

1:42Finn: So let's name the bottleneck, because the framing is the whole thing here. Bella, you want to set up the two ways a model is supposed to get smarter by thinking longer?

1:53Bella: Right. There are two main recipes, and they look pretty different on the surface. The first one happens at test time. The model produces an answer, a critic looks at it and says here's what's wrong, the model tries again, and you go around that loop — answer, critique, retry — until it converges. The second happens at training time. You let the model generate a pile of solutions to hard problems, you keep the ones that look good, and you train the model on its own good attempts. It bootstraps itself. That's the self-training family — the model pulling itself up by its own work. And the paper's first real observation is that those two recipes, which look unrelated, are secretly bottlenecked on the same part. Both of them depend completely on the critic. On the verifier. In the test-time loop, the critic is the thing telling you what to fix. In self-training, the critic is the thing deciding which of your own solutions are good enough to learn from. If the critic is bad, both recipes are bad.
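A minimal Python sketch of the two recipes Bella describes, to make the shared dependency on the critic concrete. The `generate` and `verify` interfaces, the round limit, and the toy stand-ins are all assumptions for illustration; none of them come from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces, for illustration only:
#   generate(problem, prior_attempt, feedback) -> new attempt
#   verify(problem, attempt) -> (accepted, feedback_text)
Generate = Callable[[str, str, str], str]
Verify = Callable[[str, str], Tuple[bool, str]]


def refine(problem: str, generate: Generate, verify: Verify,
           max_rounds: int = 20) -> Tuple[str, int]:
    """Test-time recipe: answer, critique, retry until the critic accepts."""
    attempt = generate(problem, "", "")
    for round_idx in range(max_rounds):
        accepted, feedback = verify(problem, attempt)
        if accepted:
            return attempt, round_idx
        # The critic's feedback is the only signal guiding the retry.
        attempt = generate(problem, attempt, feedback)
    return attempt, max_rounds


def select_for_self_training(problem: str, attempts: List[str],
                             verify: Verify) -> List[str]:
    """Training-time recipe: keep only the attempts the critic accepts."""
    return [a for a in attempts if verify(problem, a)[0]]


# Toy stand-ins (assumed): the generator counts upward, the critic knows "4".
def toy_generate(problem: str, prior: str, feedback: str) -> str:
    return "3" if not prior else str(int(prior) + 1)


def toy_verify(problem: str, attempt: str) -> Tuple[bool, str]:
    ok = attempt == "4"
    return ok, "" if ok else "too low"


if __name__ == "__main__":
    print(refine("2+2", toy_generate, toy_verify))
    print(select_for_self_training("2+2", ["3", "4", "5"], toy_verify))
```

Both functions call `verify` and nothing else judges quality, so a critic that accepts wrong answers ends the loop early in the first recipe and pollutes the training set in the second.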

2:55Finn: And here's the part that should make you nervous: the critic isn't just mediocre, it fails in a specific, almost adversarial way. There's a well-established result in this area that large language models are genuinely bad at catching their own mistakes without outside help. They generate a plausible-sounding wrong answer, you ask them to check it, and they confidently bless it. The paper measures this directly: with an untrained verifier in the loop, the verifier's confidence scores climb steadily over rounds, so it gets more and more sure, while the actual accuracy just flatlines. It's talking itself into a higher opinion of work that isn't getting any better.

3:36Bella: And that's exactly the counting example from the top. It's not careless. It's structural.

3:42Finn: That's the key, Bella — it's structural. Think about why spotting a subtle flaw is hard. Catching that someone missed an edge case, or used a lemma outside the conditions where it actually holds — that's roughly as hard as not making the mistake in the first place. So the skill you most want from your critic, finding the error, is precisely the skill the model doesn't have. And now you hit the wall the whole paper is built around. Normally, to teach a model a skill, you show it examples of the skill done right. But you have no examples of error-spotting done right — because if you could reliably produce them, you wouldn't have the problem. The capability you want to train has no training signal.

4:25Bella: That's the puzzle. How do you teach a model a skill it doesn't have, when by definition you can't show it the skill being done correctly? And the answer the authors find is, I think, genuinely elegant. It turns on an asymmetry. Think about being asked to grade a hard proof, one you honestly couldn't have written yourself. Handed the student's attempt cold, you might stare at it and not see the flaw. But now suppose you're handed the answer key alongside it. Suddenly the job is completely different. You're not solving the problem anymore. You're just spotting where the student's version and the correct version diverge. And that's easy. You can point right at the line where it goes wrong.

5:07Finn: So diagnosis with a reference solution in hand is dramatically easier than diagnosis from scratch.

5:13Bella: That's the lever. And here's how they turn it into a training signal. They take the same base model — in their experiments it's an eight-billion-parameter Qwen3 model — and they run it in two modes, just by changing the prompt. In one mode, the teacher mode, it gets the problem, the flawed attempt, and the answer key. With that key in hand, it produces sharp, specific feedback — here's the exact step that breaks, here's why. In the other mode, the student mode, it gets the problem and the attempt but no answer key. Same model, no privileged information. And then they train the no-key student to imitate the feedback the with-key teacher produced. The student learns to generate the kind of pointed critique it could previously only manage when it was cheating off the answer key. At test time the key is gone — but the skill has transferred. They call this self-trained verification.
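The two modes differ only in the prompt, which a short Python sketch can make concrete. The field names, the `build_prompt` helper, and the prompt wording are invented for illustration; the paper's actual prompts are not quoted in this episode.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VerificationTask:
    problem: str
    attempt: str
    reference_solution: Optional[str] = None  # the "answer key"


def build_prompt(task: VerificationTask, with_key: bool) -> str:
    """Build the teacher (with_key=True) or student (with_key=False) prompt.

    Same model either way; only the privileged reference differs.
    """
    parts = [
        f"Problem:\n{task.problem}",
        f"Attempted solution:\n{task.attempt}",
    ]
    if with_key:
        if task.reference_solution is None:
            raise ValueError("teacher mode needs a reference solution")
        parts.append(f"Reference solution:\n{task.reference_solution}")
    # Assumed instruction text, not taken from the paper.
    parts.append("Point to the first incorrect step, if any, then give a verdict.")
    return "\n\n".join(parts)


if __name__ == "__main__":
    task = VerificationTask(
        problem="How many 2-element subsets does a 4-element set have?",
        attempt="4 * 3 = 12",
        reference_solution="C(4, 2) = 6",
    )
    print(build_prompt(task, with_key=True))
    print("-----")
    print(build_prompt(task, with_key=False))
```

In this framing, training pairs the student prompt with feedback shaped by the teacher prompt, and deployment only ever uses the student prompt.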

6:09Finn: And I want to flag what's clever about the supervision here, because it's easy to skate past. The teacher isn't a bigger, smarter model. It isn't a human annotator marking up where solutions went wrong. It's the same model, just temporarily given an advantage the deployed version won't have. The closest prior work — DeepSeekMath-V2 — trained its verifier using human annotations of feedback quality, which is exactly the expensive, doesn't-scale bottleneck everyone's trying to escape. Another line of work used a stronger external model as the feedback source. This sidesteps both. The supervision comes from the model's own open-book advantage over itself.

6:51Bella: There's one more piece in how they train it that turns out to be load-bearing, and it's worth slowing down for. The obvious way to do this imitation is what's called supervised fine-tuning — just take all the feedback the teacher wrote and train the student to reproduce those exact texts. Copy the teacher's work. And the surprising result is that this flatly fails. No improvement in the loop.

7:17Finn: Which sounds like it shouldn't matter. Copying good feedback should teach you to give good feedback.

7:23Bella: Here's the analogy that unlocks it. Imagine you're learning to write, and your whole method is reading a master essayist and copying their essays out, word for word. You'd absorb a lot — but you'd never once practice forming your own sentence. So the first time you sit down to write solo, you're in completely unfamiliar territory and you flounder. The better method is the opposite: you write your own essays, badly, and a tutor corrects each sentence as you go. You learn to recover from your own mistakes. That second method is what they use. It's called on-policy distillation. The student generates its own feedback attempts, and training nudges those attempts toward what the teacher would have said. The reason it matters is that at deployment, the verifier walks down its own path — its own sentences. If it was only ever trained on the teacher's path, the moment it drifts even slightly, it's somewhere it's never been, and it collapses. Training on its own trajectory keeps it from falling off that cliff.
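A toy Python sketch of the distinction between copying and practice-and-correct. It assumes the on-policy objective is a per-token log-ratio between student and teacher on student-sampled text (a single-sample estimate of reverse KL, which is one common choice); the episode does not state the paper's exact loss, and the two tiny policies here are made up.

```python
import math
import random
from typing import Callable, Dict, List, Sequence

Dist = Dict[str, float]
Policy = Callable[[Sequence[str]], Dist]  # prefix -> next-token distribution


def sample(policy: Policy, length: int, rng: random.Random) -> List[str]:
    """Draw a token sequence by following the policy's own distribution."""
    tokens: List[str] = []
    for _ in range(length):
        dist = policy(tokens)
        choices, weights = zip(*dist.items())
        tokens.append(rng.choices(choices, weights=weights)[0])
    return tokens


def sft_loss(student: Policy, teacher_tokens: Sequence[str]) -> float:
    """Copying: cross-entropy of the student on text the TEACHER wrote."""
    total = 0.0
    for i, tok in enumerate(teacher_tokens):
        total -= math.log(student(teacher_tokens[:i])[tok])
    return total / len(teacher_tokens)


def on_policy_loss(student: Policy, teacher: Policy,
                   student_tokens: Sequence[str]) -> float:
    """Practice-and-correct: score text the STUDENT wrote against the teacher.

    Averages log p_student - log p_teacher over the student's own tokens.
    """
    total = 0.0
    for i, tok in enumerate(student_tokens):
        prefix = student_tokens[:i]
        total += math.log(student(prefix)[tok]) - math.log(teacher(prefix)[tok])
    return total / len(student_tokens)


# Made-up policies over a two-token vocabulary, ignoring the prefix.
def teacher(prefix: Sequence[str]) -> Dist:
    return {"a": 0.8, "b": 0.2}


def student(prefix: Sequence[str]) -> Dist:
    return {"a": 0.5, "b": 0.5}


if __name__ == "__main__":
    rng = random.Random(0)
    print("copying loss:", sft_loss(student, sample(teacher, 5, rng)))
    print("on-policy loss:", on_policy_loss(student, teacher, sample(student, 5, rng)))
```

The structural difference is which sequence gets scored: `sft_loss` only ever visits prefixes the teacher produced, while `on_policy_loss` visits the prefixes the student will actually see at deployment.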

8:25Finn: And the honest version of this — which the authors are careful about — is that they are not claiming on-policy beats copying in general. That's an open debate in the field. They're claiming that in this specific setup, copying just doesn't work and the practice-and-correct method clearly does. It's a clean, concrete demonstration of a distinction that usually only gets argued in the abstract.

8:49Bella: So that's the machine. Let me hand the results to you, Finn, because the numbers are where this stops being a neat idea and starts being a little startling.

8:58Finn: They are. And the first experiment is the one doing the heavy lifting, because it isn't just showing their method works — it's systematically showing the obvious alternatives don't. They freeze the generator, so the answers being produced are identical, and they only swap in different critics. No verifier training — stalls almost immediately. Train the verifier with reinforcement learning just on whether its accept-or-reject judgments are accurate — helps a little, barely. The meta-verifier approach, using a separate model to grade feedback quality — little effect. Plain copying — nothing. And then their method roughly doubles the final pass rate, with gains that keep compounding over twenty-plus rounds of the loop instead of saturating.

9:44Bella: And that structure — ruling out the alternatives one by one — is what makes the asymmetry insight feel necessary rather than just clever. You've watched everything simpler fail first.

9:55Finn: Then they push on whether this generalizes past math, and this is the headline number. They run the same recipe on a science benchmark — chemistry, biology, physics, materials — and look at the hardest problems, the ones the model basically never solves. With no verification, it gets about one and a half percent of them right. With their trained verifier in the loop, that goes to twenty-one percent. From basically never, to one in five. That's the fourteen-times figure.

10:23Bella: And how does the difficulty split work — what counts as "hardest"?

10:28Finn: Good to pin down. They sort problems by the model's own success rate. "Hard" means the model gets it right somewhere between zero and twenty percent of the time. "Hardest" means it got it right zero percent of the time across thirty-two tries. Genuinely zero. So when I say one and a half percent to twenty-one percent on the hardest science problems, that's on a bin where the solo model was striking out completely. But the result that actually stopped me is a David-and-Goliath one. That eight-billion-parameter model, guided by the trained critic, beats Qwen3-235B — a model around thirty times its size — on both science bins. Twenty-one percent versus eight on the hardest, forty-two versus twenty-four on the hard ones. A small solver with a good critic beats a solver dozens of times larger working alone.
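The binning rule and the headline ratios are simple enough to check in a few lines of Python. The bin boundaries below are an assumption: the episode says "hardest" is zero successes in thirty-two tries and "hard" is between zero and twenty percent, so this sketch treats "hard" as above zero and at most twenty percent.

```python
def difficulty_bin(successes: int, tries: int = 32) -> str:
    """Assign a problem to a bin from the solo model's own success count."""
    if successes == 0:
        return "hardest"
    if successes / tries <= 0.20:  # assumed upper boundary, inclusive
        return "hard"
    return "other"


def fold_gain(before_pct: float, after_pct: float) -> float:
    """How many times larger the later pass rate is than the earlier one."""
    return after_pct / before_pct


if __name__ == "__main__":
    print(difficulty_bin(0), difficulty_bin(6), difficulty_bin(7))
    # Figures as quoted in the episode: 1.5% to 21% on the hardest science bin.
    print(fold_gain(1.5, 21.0))
    # Parameter-count ratio between the 235B and 8B models.
    print(235 / 8)
```

Note that the fold gain is sensitive to the small starting value: the same jump of about twenty points from a base of ten percent would be a threefold gain.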

11:18Bella: And that reframes the whole question of where the intelligence lives. The default story for years has been: make the solver bigger and it reasons better. This is saying a lot of the leverage is in the critic, not the size of the solver. Trained verification can substitute for generator scale.

11:37Finn: At least on hard problems with checkable answers — I'll come back to that caveat, because it matters. But yes, as a piece of evidence for the critic-not-the-solver view, it's striking.

11:49Bella: There's a diagnostic underneath all this that I find satisfying, because it shows you the mechanism, not just the scoreboard. Remember the untrained verifier whose confidence rose while accuracy stayed flat — accepting more and more solutions without finding more correct ones. The trained verifier inverts that. As it accepts more solutions, the fraction that are actually correct goes up. Accepting more reliably means you're finding more right answers. The untrained critic had it exactly backwards.
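The diagnostic Bella describes compares two quantities per round: how much the critic accepts, and how much of what it accepts is actually correct. A minimal Python sketch, with entirely invented records (not data from the paper):

```python
from typing import List, Tuple

Record = Tuple[bool, bool]  # (critic accepted?, solution actually correct?)


def acceptance_and_precision(records: List[Record]) -> Tuple[float, float]:
    """Return (fraction accepted, fraction of accepted that are correct)."""
    accepted = [correct for was_accepted, correct in records if was_accepted]
    acceptance_rate = len(accepted) / len(records)
    precision = sum(accepted) / len(accepted) if accepted else float("nan")
    return acceptance_rate, precision


if __name__ == "__main__":
    # Invented example rounds, purely to show the two patterns.
    early = [(True, True), (False, False), (False, False), (False, True)]
    late_untrained = [(True, True), (True, False), (True, False), (False, True)]
    late_trained = [(True, True), (True, True), (True, True), (False, False)]
    for name, rows in [("early", early),
                       ("late, untrained critic", late_untrained),
                       ("late, trained critic", late_trained)]:
        print(name, acceptance_and_precision(rows))
```

A healthy critic shows precision holding or rising as acceptance rises; the failure mode described earlier is acceptance rising while precision falls.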

12:20Finn: So the untrained critic is the referee who blows the whistle correctly and then replays the call in his head until he reverses it. Trained right, the whistle starts meaning something.

12:30Bella: So we've got a critic that works at test time. The last move — and this is the climax of the paper — is to turn that critic back on the generator and train with it. This is the part where I want to set up your expectation carefully, because the authors themselves got surprised here. The standard way reasoning models get trained on math is reinforcement learning from verifiable rewards — reward the model when its final answer matches a known correct answer. It's the workhorse. And it's known to plateau. You train and train, and at some point more of it just stops helping. The model has converged. So they take a generator that's already hit that plateau — fully converged, more standard training does nothing — and they keep training it, but now inside the verification-refinement loop, using the trained critic's feedback. The model answers, gets pointed feedback, refines, and it's rewarded for getting the final answer right. Only the generator updates; the critic stays frozen.

13:32Finn: And here's the expectation I'd have walked in with. You're training the model to use feedback. So it should get better at using feedback — at taking a critique and fixing its answer. Fine. But there's no obvious reason its first attempt, with no critic in the room at all, should change. Take the tutor out of the room and your unassisted first draft should be exactly as good as before. There's no pressure on it.

13:58Bella: That's exactly what they expected too. And it's wrong. The solo first-attempt performance — no verifier present, just the model answering cold — jumped about thirty percent relative, on a generator that standard training had completely maxed out. On the hardest math, the cold first attempt went from around eleven percent to around fifteen. On the hard set, from about thirty-seven to nearly forty-eight.
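For listeners who want to check the "about thirty percent relative" figure against the rounded numbers quoted here, a small Python sketch. The inputs are the approximate values from the episode ("around eleven", "around fifteen", "about thirty-seven", "nearly forty-eight"), so the outputs are rough too.

```python
def relative_gain(before_pct: float, after_pct: float) -> float:
    """Relative improvement: the change as a fraction of the starting value."""
    return (after_pct - before_pct) / before_pct


if __name__ == "__main__":
    # Rounded figures as spoken in the episode, not exact paper values.
    print(f"hardest math: {relative_gain(11, 15):.1%}")
    print(f"hard math:    {relative_gain(37, 48):.1%}")
```

A relative gain is measured against the starting pass rate, so it is a different quantity from the absolute change of roughly four and eleven percentage points.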

14:24Finn: And the control is what makes that land. They took the same amount of compute and spent it on more of the standard reinforcement learning — the thing that had plateaued. That gave nothing. Flat. So it isn't just "more training helps." Training inside the verification loop pushed the solo model past a ceiling that more vanilla training couldn't move at all.

14:46Bella: It's like a chess player who studies intensively with a coach pointing out every blunder, and then finds their unassisted play has gotten sharper. The diagnostic habit got internalized. Learning to incorporate that pointed feedback apparently instills something more general — reasoning skill that surfaces even when there's no one there to correct you.

15:08Finn: And this is the right place for me to push, because that last sentence is more demonstrated than explained. The paper observes that solo performance improves, and offers a plausible hand-wave — learning from diagnostic feedback instills more general reasoning. But it doesn't really pin down the mechanism. They do some ablations that narrow it down, but the headline — why does training with a critic make you better without one — stays a result, not an explanation.

15:37Bella: That's fair, and it's worth holding onto. What's your broader read on where this is soft, Finn?

15:43Finn: A few things, and I want to be specific rather than hand-wavy. First, the scope. Everything runs on one model family — Qwen3 — mostly at one size, on hard math and one science benchmark. The "domain-general" claim is really resting on two domains, and both of them share a crucial feature: they have verifiable final answers. And that's not incidental, because the entire mechanism depends on having a reference solution to hand the teacher. The whole trick is the open-book advantage. But in open-ended reasoning — the messy, no-clean-answer settings where self-improvement would matter most — there is no answer key to peek at. So the method is strongest exactly where the problem is already easiest, and untested where it's hardest.

16:28Bella: Which the authors are honest about — they explicitly flag code and open-ended reasoning as untested, and list scaling to larger models as an open question too.

16:38Finn: They are forthright, to their credit. A couple more. The comparison against that closest competing method, the meta-verifier — they couldn't actually access it, so they used a stand-in model as a proxy and found it had little effect. That's a reasonable improvisation, but it means the head-to-head against the most directly competing prior work isn't a clean comparison. The proxy might just underperform the real thing. And then the absolute numbers. "Doubling" the pass rate on the hardest math sounds huge, but it's going from a few percent to a few percent more — the final-round number on the hardest math is still around five and a half percent. The science gains are bigger, but remember the fourteen-times starts from one and a half percent. These are real, and they're statistically significant. But the framing leans on multiplicative gains off small bases, and a listener should hear that.

17:34Bella: And the flywheel. The paper sketches this beautiful loop — better critics let you train better generators, better generators produce harder attempts to train future critics, around and around. But they demonstrate one turn of that wheel. One round of verifier-in-the-loop training. Whether it keeps compounding over many iterations, or saturates after one, is exactly the question the work raises and doesn't answer.

18:01Finn: Right — the flywheel is a suggestion, not a demonstrated cycle. And the paper's register is appropriately humble about it. The framing is "may lie," "may be a key frontier." They're not overselling.

18:14Bella: So let me try to say what I think actually shifts here, with the caveats fully in view. For a few years the story about getting more out of a model at inference time has basically been sharpening — sample a lot of answers, score them, keep the best. Under that picture, the ceiling is fixed by what the generator already mostly knows. You're recovering detail that was already latent, like sharpening a blurry photo. You can't add information that was never captured. This paper makes a credible case that feedback-guided refinement can do something different — not just surface the model's existing competence, but reshape what it produces, with the quality of the critic as the lever. And the evidence for that is real but partial, so I don't want to claim the matter is settled. But the direction is the interesting part.

19:08Finn: And the genuinely practical upshot, for anyone building these systems, is the trade it implies. You can pour your compute into a bigger generator — or you can pour it into a trained critic and a smaller generator, and on hard problems that trade can be dramatically favorable. Eight billion parameters with a good critic beating two hundred thirty-five billion alone is the existence proof.

19:33Bella: The deeper reframe — and this is what I'll carry out of the paper — is that it treats self-improvement as fundamentally a verification problem, not a generation problem. The thing standing between a reasoning model and harder problems isn't always a better solver. Sometimes it's a critic that can actually tell good from bad — and won't talk itself out of the right answer.

19:57Finn: Which brings us right back to that counting problem at the top, the verifier announcing "incorrect" and then arguing its way to "actually, correct." That's not a quirk. That's the bottleneck, on full display. And what this paper shows is that the model already had the ability to catch that error sitting inside it — it just needed to see the answer key once to learn how to find it on its own.

20:21Bella: That's the line I'd leave people on. The skill was latent. The open-book version of the model could already do it. The whole contribution is a way to teach the closed-book version the same trick — without ever paying a human to grade a single critique.

20:37Finn: The paper is "Self-Trained Verification for Training- and Test-Time Self-Improvement," from Chen Henry Wu and Aditi Raghunathan at Carnegie Mellon. The show notes have a link to it and a few related reads if you want to go deeper on the self-training and distillation threads.

20:54Bella: And if you want the full transcript with every term defined inline — plus the concept pages that link this episode to the other work we've covered on test-time compute — that all lives on paperdive dot AI. This has been AI Papers: A Deep Dive. Thanks for listening.