Why You Can't Fine-Tune Foresight Into an AI Agent

0:00Cassidy: A team trained a language model to imagine the future before acting — to write out a short forecast of how its plan would unfold, then stamp a confidence number on it. The model learned the format almost perfectly. Essentially a hundred percent of its outputs had the right shape: a tidy look-ahead block, a percentage, the works. And the task performance barely moved at all. Then they actually read what it wrote. The foresight was hollow. One plan was vague, internally contradictory, and the model had slapped a confident hundred percent right on top of it.

0:36Eric: Quick heads up before we dig in — this is an AI-made explainer, both voices included. And that gap — perfect form, empty content — is the whole paper. By the end you'll understand why teaching a model the shape of foresight gives you a confident lie instead of an actual plan, and what it takes to install the real thing.

0:56Cassidy: Which is the part that should bother you, Eric. Fine-tuning is the tool we reach for whenever we want a model to behave a certain way. Show it good examples, reward the good behavior. The unsettling claim here is that for some abilities, that just doesn't work — you get a flawless imitation of the skill and none of the skill.

1:17Eric: And this matters way beyond this one paper, because the next wave of AI agents is going to be judged on whether they can run a long task without wandering down a doomed path for fifteen steps — and on whether they know when they're lost. A confidence number you can actually trust is exactly the reliability signal a deployed system needs. So the question of whether you can train that in, or only fake it, is kind of the whole ballgame.

1:45Cassidy: Let's start with what the agents do now, because that's the thing they're trying to fix. Most language-model agents are reactive. They take a step, see what the environment says, take another step. They've got enormous knowledge but no habit of pausing to play out a plan in their head before committing to it. The paper actually reaches all the way back to a 1943 idea from Kenneth Craik — that a mind carries a small-scale model of reality so it can try actions out in thought before doing them for real. You do this constantly. Should I send this email now or wait? You don't just hit send and find out. You run a quick mental movie: if I send it now, they'll fire back three questions I can't answer yet. That rehearsal is the thing the agents are missing.

2:33Eric: So the obvious move is to just give it to them. Train the agent on examples that include a look-ahead block — imagine the future, write it down, then act. And that's exactly what the authors tried first. The negative result is the setup for everything.

2:50Cassidy: Right, and it failed in this very specific, instructive way. The structure was there — over ninety-nine percent of outputs followed the template. But the content was filler. Phrases like "execute a search query" and "the search results will provide a direct answer." It looked like foresight. It wasn't.

3:09Eric: There's a name for what's going on, and it's the most portable idea in the paper: the format-capability gap. The lens is that post-training — the supervised and reinforcement-learning phase after the big pretraining run — is good at eliciting abilities the model already has, and bad at installing brand-new ones. Think of training someone to fill out a pilot's pre-flight checklist. You can drill them until every box is ticked, every phrase in the right slot, and the document is indistinguishable from a real pilot's. But if you never taught them to fly, the line that says "verify fuel sufficient" is meaningless — they can't actually assess it. They've learned the shape of competence with nothing underneath. That's the model's foresight block: a perfectly formatted checklist filled out by someone who can't fly the plane.

4:03Cassidy: And to make sure this wasn't just a quirk of their own model, they reproduced the same hollow foresight on Llama and on Qwen. Same story across the board — high format adherence, no real planning gain. So the gap is general, not a one-off. The cleanest illustration is what I'd call the clown-theater case. The question is who co-founded some theater. Without the deeper training, the model's forecast says, basically, "the search will immediately succeed and give the complete answer," confidence cranked high, no multi-step reasoning at all. It's glib. It's wrong about how hard its own task is.

4:42Eric: And the contrast, once they fix it, is the tell. The same model, same question, writes an actual plan: identify the missing person, anticipate that the search results might be ambiguous, propose a conditional follow-up if they are — and it gives a calibrated ninety percent instead of a glib hundred. That second one is what foresight is supposed to look like.

5:06Cassidy: So if post-training can't install the capability, where do you install it? That's the pivot — and the answer is you go earlier, and you change what the model is internally, before you ever teach it the format.

5:20Eric: Before the fix, though, I want to nail down what they mean by a world model here, because it's not what the term usually means. In robotics and game-playing AI, a world model is a separate simulator — a bolted-on module that predicts what the environment will do, so the agent can dream inside it and pick good moves. Powerful, but the simulator can drift out of sync with reality, and running it costs you. This paper refuses the separate module entirely. The world model is folded into the same model doing the reasoning, written out in the same stream of words. And it's not a frame-by-frame video predictor either — trying to predict every future observation token would waste the model on syntax and pile up errors. Instead it writes a compressed, abstract forecast: here's the rough roadmap of what's likely to happen, here's why the current state isn't enough yet, and here's my confidence, zero to a hundred, that this path actually solves the task.

6:23Cassidy: And that confidence number is doing something sneaky. Eric, walk through the Q-value framing, because that's the move that makes this more than just "write a plan."

6:34Eric: So in reinforcement learning, there's this quantity called a Q-value. Plain version: it's the agent's own estimate of how good its situation is — if I take this action from here, how well do I expect things to turn out in the end? Normally that estimate lives in a separate network, a private notepad only the agent reads. The trick here is to make the model say that estimate out loud, in the same sentences it's using to reason. A verbalized Q-value. The agent thinks out loud instead of scribbling on a hidden pad — and because the number is spoken as part of the work, you can grade it directly against what actually happens. No separate value network, no separate simulator. Just words the model can be held accountable for.
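
For reference, the textbook quantity behind the Q-value framing can be written as below. The symbols (policy \pi, discount \gamma, per-step reward r_t, horizon T) are standard reinforcement-learning notation and are assumptions here; the episode does not give the paper's own formula.

```latex
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{T} \gamma^{t} r_{t} \,\middle|\, s_{0} = s,\ a_{0} = a \right]
```

In the special case where the only reward is 1 for solving the task at the end and 0 otherwise, with no discounting (\gamma = 1), this reduces to Q^{\pi}(s, a) = \Pr(\text{solved} \mid s, a). Under those assumptions, a zero-to-a-hundred confidence can be read as a spoken estimate of that probability. That is an interpretation, not a statement of the paper's exact definition.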

7:22Cassidy: Okay, but hold on — they just told us fine-tuning on the format produces a confident lie. Writing the number out loud doesn't make it true. What stops this from being the exact same hollow checklist, now with a Q-value sticker on it?

7:37Eric: That's the right objection, and it's the whole reason there are three stages instead of one. You can't just ask for the number. You build the capability, then teach the format, then make the number honest against reality — in that order. Think of it as an apprenticeship. Stage one is the raw experience. They take two hundred billion tokens of agent trajectories — code execution, deep research, math, the works — and at random points inside them they splice in a synthesized world-model block. A strong teacher model is shown the actual verified future of that trajectory and asked to write the agent's internal reasoning about it. Training the model to reconstruct those blocks forces it to absorb how these tasks actually unfold. That's the capability injection — the part post-training can't do.

8:30Cassidy: And there's a clever constraint on those teacher-written blocks, which is what makes them trainable instead of cheating. The teacher knows the future, but it's told: no spoilers. Plan in abstract placeholders. Don't write "we'll find the name Jane Smith," write "expect to find a candidate name." So the model learns the shape of correct anticipation without memorizing the answer. It learns to forecast, not to recall.

8:57Eric: Stage two is short and almost boring by comparison — a small supervised set that teaches the model to package the foresight in clean delimiters before it acts. Imagine, predict the keywords, analyze the gap, state confidence, then go. It's just teaching the apprentice the reporting format, now that they actually have something to report.
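
A minimal sketch of what checking that packaging could look like. The tag name `<world_model>` and the field labels `Roadmap`, `Keywords`, `Gap`, and `Confidence` are hypothetical; the episode only says the foresight is wrapped in clean delimiters, not what they look like.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical delimiters and field labels, chosen only for illustration.
BLOCK_RE = re.compile(r"<world_model>(.*?)</world_model>", re.DOTALL)
FIELD_RE = {
    "roadmap": re.compile(r"^Roadmap:[ \t]*(.+)$", re.MULTILINE),
    "keywords": re.compile(r"^Keywords:[ \t]*(.+)$", re.MULTILINE),
    "gap": re.compile(r"^Gap:[ \t]*(.+)$", re.MULTILINE),
    "confidence": re.compile(
        r"^Confidence:[ \t]*(\d{1,3})[ \t]*%?[ \t]*$", re.MULTILINE
    ),
}


@dataclass
class Foresight:
    roadmap: str
    keywords: list[str]
    gap: str
    confidence: float  # rescaled from 0-100 to the range [0, 1]


def parse_foresight(text: str) -> Optional[Foresight]:
    """Return the parsed look-ahead block, or None if the format is broken."""
    block = BLOCK_RE.search(text)
    if block is None:
        return None
    body = block.group(1)
    found = {name: rx.search(body) for name, rx in FIELD_RE.items()}
    if any(match is None for match in found.values()):
        return None
    percent = int(found["confidence"].group(1))
    if percent > 100:
        return None
    keywords = [
        k.strip() for k in found["keywords"].group(1).split(",") if k.strip()
    ]
    return Foresight(
        roadmap=found["roadmap"].group(1).strip(),
        keywords=keywords,
        gap=found["gap"].group(1).strip(),
        confidence=percent / 100.0,
    )
```

A parser like this only measures format adherence. A block can pass it and still be filler, which is the format-capability gap in miniature.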

9:19Cassidy: And stage three is where it meets a demanding supervisor. This is the reinforcement-learning stage, and it's the densest part of the design — but it's worth it, because it's what turns that confidence number from a rough guess into something that wobbles like real self-doubt, dropping to single digits when the model is genuinely lost. Eric, this is your reward design.

9:44Eric: So there are three reward signals stacked together, and the trick is what each one guards against. First, grounding. The agent's imagined future gets boiled down to milestone keywords, and it only gets the grounding reward if those predicted milestones actually show up in the real execution trace. That's the leash on confident hallucination — you don't get points for a beautiful forecast that never came true.
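
One way to picture the grounding check, as a sketch under stated assumptions: the function name, the case-insensitive substring match, and the choice to return a fraction rather than an all-or-nothing score are all hypothetical. The episode only says the predicted milestones have to show up in the real execution trace.

```python
def grounding_reward(predicted_keywords: list[str], trace: list[str]) -> float:
    """Fraction of predicted milestone keywords found in the execution trace.

    Assumption: a keyword counts as grounded if it appears, ignoring case,
    anywhere in the concatenated trace text.
    """
    cleaned = [kw.strip().lower() for kw in predicted_keywords if kw.strip()]
    if not cleaned:
        return 0.0  # nothing was predicted, so nothing can be grounded
    haystack = " ".join(trace).lower()
    hits = sum(1 for kw in cleaned if kw in haystack)
    return hits / len(cleaned)
```

A forecast that names three milestones and sees two of them in the trace would score two thirds under this sketch, and a forecast whose milestones never appear scores zero however polished it reads.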

10:11Cassidy: So that's the difference between forecasting and confabulating — did the thing you said would happen actually happen.

10:18Eric: Exactly. Second is calibration, and this is the heart of it. The agent writes a confidence, say eighty-five percent. The episode ends, and you get a binary outcome: solved, or not. The reward uses the Brier score, which is the classic tool for grading probabilistic forecasts; it goes back to weather forecasting in 1950. The way to feel it is betting. Your confidence is how much you're willing to bet. Bet the farm at ninety-five percent and lose, you're wiped out: the penalty squares the gap, so a confident miss hurts enormously. Hedge at sixty and lose, you're only mildly stung. So the model can't just blurt out high numbers. Overconfidence is expensive, and, this is the part people forget, underconfidence is too. Call a task twenty percent when you were really ninety-percent sure, then solve it, and that timid number gets dinged too. The target isn't being right. It's being honest about how right you are.
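
The betting intuition can be made concrete with a few lines of arithmetic. This is a sketch of the standard Brier penalty for one binary outcome. Turning it into a reward as `1 - penalty` is an assumption for illustration, since the episode does not say how the paper scales it.

```python
def brier_penalty(confidence: float, solved: bool) -> float:
    """Squared gap between stated confidence (0..1) and the 0/1 outcome."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    outcome = 1.0 if solved else 0.0
    return (confidence - outcome) ** 2


def calibration_reward(confidence: float, solved: bool) -> float:
    """Assumed reward shape: higher is better, 1.0 for a perfect call."""
    return 1.0 - brier_penalty(confidence, solved)


if __name__ == "__main__":
    cases = [
        ("bet the farm at 95% and lose", 0.95, False),
        ("hedge at 60% and lose", 0.60, False),
        ("call 20% and then succeed", 0.20, True),
        ("say 85% and succeed", 0.85, True),
    ]
    for label, confidence, solved in cases:
        print(f"{label}: penalty {brier_penalty(confidence, solved):.4f}")
```

Working the squares by hand, the confident miss costs about 0.90, the hedged miss 0.36, the timid success 0.64, and the well-placed 85 percent only about 0.02.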

11:18Cassidy: Which connects back to why a miscalibrated block is worse than no block at all. A confidence number you can't trust isn't neutral — it's noise injected straight into the agent's decisions. The overconfident weather forecaster who says a hundred percent every single day is technically loud but completely uninformative. You can't act on him.

11:40Eric: And the third reward is just ordinary task success — did the whole episode work — plus a local step-level signal pulled from that now-calibrated confidence, so credit gets spread across the fifteen steps instead of dumped only at the very end. In a long task, end-only reward is brutal: you succeeded, but which step was the good one? The confidence at each step tells you.

12:05Cassidy: And here's the part I think is the quiet payoff of the whole design. Normally, letting the agent's own confidence influence its reward is asking to be gamed: just inflate your confidence, farm the reward. But that same number is independently punished by the Brier term whenever it turns out wrong. So inflating confidence to game the step reward doesn't pay, because lying about confidence is separately penalized. The two rewards lock each other honest.

12:36Eric: Right — the calibration term is what makes the confidence safe to reuse as a credit signal. Pull that piece out and the whole thing becomes hackable. It's a genuinely tidy bit of design.
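
Why the Brier term resists inflation can be checked numerically. The sketch below uses the standard expected Brier penalty for an event with true success probability `true_p`. The function names and the grid search are illustrative choices, not the paper's code.

```python
def expected_brier_penalty(true_p: float, reported: float) -> float:
    """Average penalty for reporting `reported` when success happens with
    probability `true_p`."""
    return true_p * (reported - 1.0) ** 2 + (1.0 - true_p) * reported ** 2


def best_report(true_p: float, steps: int = 100) -> float:
    """Reported confidence on a 0..1 grid with the lowest expected penalty."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda q: expected_brier_penalty(true_p, q))


if __name__ == "__main__":
    true_p = 0.3
    print("honest report:", expected_brier_penalty(true_p, 0.3))
    print("inflated report:", expected_brier_penalty(true_p, 1.0))
    print("best report on grid:", best_report(true_p))
```

For a task the model solves 30 percent of the time, the honest report has an expected penalty of 0.21 and the inflated 100 percent has 0.70, so honesty is the cheapest policy under this term. This shows the Brier term in isolation. Whether the combined reward stays unhackable also depends on how heavily the step-level signal is weighted, which the episode does not specify.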

12:49Cassidy: So here's where it gets real, and this is the evidence I find most convincing — not the benchmark tables, the confidence trajectories. Because if the calibration actually works, the number shouldn't sit flat. It should move as the model's situation changes. And on screen you can watch it do exactly that. Take a math problem the model gets right. Watch the number: it starts around eighty percent, climbs to ninety-five as the approach pans out — then drops to sixty when the model catches its own algebraic error — and recovers to ninety-nine after it verifies the fix. That's not a label. That's the shape of a student who feels fairly sure, hits a "wait, that doesn't check out," sweats, then exhales after double-checking.

13:35Eric: And the one I keep coming back to is the hard geometry case it never solves.

13:40Cassidy: That one's the tell. On a problem it's hopelessly stuck on, the confidence just sits down at five to twenty percent the entire way through. It never talks itself into a false high. The model, in effect, knows it doesn't know — and it says so, loudly, the whole time. In another case the number slides from ninety down to twenty-five as hidden complexity surfaces. The confidence behaves like genuine self-doubt.

14:07Eric: And that's the behavior you actually want in something deployed. Failing loudly at five percent is so much more useful than failing confidently at ninety. A number like that is what lets a system decide when to ask for help, when to switch strategies, when to flag an answer as untrustworthy. But — and I want to plant this flag honestly before we get to the wins — what I just described is a handful of hand-picked cases. They're vivid. They're also selected. Whether that confidence number tracks success across thousands of episodes, the way a real calibration curve would show, is a separate question, and we'll come back to whether they answer it.

14:50Cassidy: Fair. Let's do the numbers, because the headline result is real and also smaller than the story around it. On search — that's seven question-answering datasets — the full model averages fifty-point-six. The best competing baseline, one that predicts the future but skips the confidence number, gets forty-eight-point-seven. Plain post-training is down around forty-six, forty-seven.

15:16Eric: So the prediction the design makes is: if the confidence and the grounding are doing real work, the model should waste fewer steps on doomed paths and beat the version that forecasts without calibrating. And it does — by about two points.

15:31Cassidy: Two points. Consistent, but modest, and I want to be straight about that. On math it's similar — about thirty percent of problems solved on average across thirty tries, versus about twenty-eight for the state-only baseline. Solved at least once in thirty tries, about sixty percent versus fifty-seven.
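
The two math figures are different statistics over the same batch of tries. A minimal sketch of how each could be computed, with hypothetical function names and a made-up toy grid of four problems by five tries (the episode uses thirty tries per problem):

```python
def mean_solve_rate(results: list[list[bool]]) -> float:
    """Per-problem fraction of tries solved, averaged over problems."""
    per_problem = [sum(tries) / len(tries) for tries in results]
    return sum(per_problem) / len(per_problem)


def solved_at_least_once(results: list[list[bool]]) -> float:
    """Fraction of problems with at least one successful try."""
    return sum(1 for tries in results if any(tries)) / len(results)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy = [
        [True, True, True, False, False],
        [False, False, False, False, True],
        [False, False, False, False, False],
        [True, True, True, True, True],
    ]
    print("average solve rate:", mean_solve_rate(toy))
    print("solved at least once:", solved_at_least_once(toy))
```

The second number is always at least as high as the first, since one lucky try is enough to count a problem as solved. That is why the at-least-once figures in the episode sit well above the averages.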

15:51Eric: Where it earns its keep is exactly where the theory says it should: multi-hop reasoning. The questions where you have to chain several lookups together. On one of those multi-hop sets, adding the deeper training lifts the score at the formatting stage from around twenty-one to twenty-six. On another, from thirty-one to thirty-six. And single-hop factual lookup barely moves, which makes total sense. You don't need to simulate the future to answer a one-step question. Foresight only pays when there's a future worth simulating.

16:26Cassidy: There are two more results I don't want to skip. One is an efficiency surprise. You'd assume all this internal simulation makes the model ramble — bigger outputs, slower inference. It adds about eighteen percent to response length on search, and about three percent on math. It learns to write distilled forecasts, not verbose ones. Foresight here is nearly free.

16:51Eric: And the other is the cautionary tale that proves the staging matters. Skip the supervised warm-start — just run reinforcement learning straight on the raw model — and it collapses. Driven by sparse success rewards, it forgets how to use the search tool at all and degenerates into blurting a single-turn guess. Scores around twenty-three, versus about fifty for the full pipeline. You cannot bootstrap structured tool use out of reward alone. The apprenticeship order is load-bearing.

17:23Cassidy: Which is the format-capability gap showing up a second time, really. Reward can sharpen a skill that's there. It can't conjure one that isn't.

17:33Eric: So let me give the honest pushback, because the ideas here are bigger than the evidence, and I think the paper is stronger if we say so. Three things. First, the magnitude. The narrative is a qualitative leap — reactive agents to foresightful agents. The numbers are two points over a simpler baseline that already predicts future states. A skeptic fairly asks whether an elaborate three-stage pipeline justifies the delta over just predicting the next state without all the confidence machinery. Second, and this is the big one — the entire full pipeline runs on a single proprietary two-billion-parameter model you can't inspect or reproduce. They're candid about why: they couldn't run the deep capability-injection stage on Llama or Qwen, because the public checkpoints and the compute weren't there. So the gap — the problem — is demonstrated on three models. The fix is demonstrated on exactly one, and it's the one nobody outside the lab can poke at.

18:38Cassidy: And the third is the one you flagged earlier.

18:41Eric: Right. The calibration evidence — the most compelling part of the whole paper, those confidence trajectories — is anecdotal. There's no aggregate reliability diagram, no held-out calibration metric, no "across ten thousand episodes, here's how well confidence predicted success." The cases are gorgeous. They're also chosen. The vividness outruns the proof. What they've shown is the behavior they're aiming for, beautifully — not that the model reliably has it.
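
For a sense of what the missing aggregate evidence would involve, here is a sketch of a reliability table and an expected calibration error over many episodes. The binning scheme, the function names, and the equal-width bins are conventional choices assumed for illustration. None of it comes from the paper, and no real results are implied.

```python
def reliability_table(
    confidences: list[float], outcomes: list[bool], n_bins: int = 10
) -> list[tuple[float, float, int, float, float]]:
    """Rows of (bin_low, bin_high, count, mean_confidence, success_rate)
    for every non-empty equal-width confidence bin."""
    bins: list[list[tuple[float, float]]] = [[] for _ in range(n_bins)]
    for conf, solved in zip(confidences, outcomes):
        index = min(int(conf * n_bins), n_bins - 1)  # 1.0 goes in the top bin
        bins[index].append((conf, 1.0 if solved else 0.0))
    rows = []
    for index, items in enumerate(bins):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        success = sum(y for _, y in items) / len(items)
        rows.append(
            (index / n_bins, (index + 1) / n_bins, len(items), mean_conf, success)
        )
    return rows


def expected_calibration_error(
    confidences: list[float], outcomes: list[bool], n_bins: int = 10
) -> float:
    """Count-weighted average gap between stated confidence and success rate."""
    total = len(confidences)
    if total == 0:
        return 0.0
    rows = reliability_table(confidences, outcomes, n_bins)
    return sum(count / total * abs(mc - sr) for _, _, count, mc, sr in rows)
```

Plotting `mean_confidence` against `success_rate` for each row gives the reliability diagram. A well-calibrated agent sits near the diagonal, and a number near zero from `expected_calibration_error` would be the measured version of what the trajectories only illustrate.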

19:13Cassidy: I'll concede all three. The pipeline is validated on one model, the gains are incremental, and the calibration story is illustrated rather than measured. What I won't give up is the diagnosis. The format-capability gap is reproduced cleanly on three independent models — that part isn't anecdotal at all. The confident hundred percent on a hollow plan is exactly as damning as it looks.

19:39Eric: That I'll take. The disease is well documented. The cure is a promising design pattern shown at small scale — not a settled result. And honestly, that's the right way to hold it.

19:50Cassidy: There's one more soft spot worth naming, since we're being thorough. Those teacher-written training blocks are generated by a model that's been shown the verified future. The "no spoilers, use placeholders" rule is a prompt instruction, not a guarantee. Nobody measured how often an abstract placeholder still leaks answer-shaped structure. If the teacher's forecasts are quietly hindsight-tinted, the model might be learning a slightly easier task than real forecasting.
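
A crude first measurement of that leak is easy to sketch, assuming each teacher-written block can be paired with the known final answers for its trajectory. The function names and the whitespace-and-case normalization are hypothetical. This only catches verbatim leaks, not the subtler answer-shaped structure raised above.

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def leaks_answer(block: str, gold_answers: list[str]) -> bool:
    """True if any known answer string appears verbatim in the block."""
    haystack = _normalize(block)
    return any(
        _normalize(answer) in haystack for answer in gold_answers if answer.strip()
    )


def leak_rate(pairs: list[tuple[str, list[str]]]) -> float:
    """Fraction of (block, gold_answers) pairs with a verbatim leak."""
    if not pairs:
        return 0.0
    return sum(1 for block, answers in pairs if leaks_answer(block, answers)) / len(
        pairs
    )
```

Using the episode's own example, a block that says "we'll find the name Jane Smith" would be flagged when the answer is Jane Smith, while "expect to find a candidate name" would pass.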

20:20Eric: And the Q-value label is doing some rhetorical lifting too — a true Q-value is a precise expected-return quantity, and this is a heuristic number disciplined into shape after the fact. The authors hedge it themselves, call it a textual analogue. It's an evocative frame, not a literal equivalence. Worth keeping that straight.

20:40Cassidy: So where does this leave us. The method is one model and a couple of benchmark points. But the real result isn't the pipeline at all — it's the principle underneath it. Post-training elicits capabilities; it doesn't install them. If that's true broadly, then a huge amount of "just fine-tune it on the right format" thinking is quietly producing mimicry — confident-looking output with nothing behind it — unless the underlying ability was seeded much earlier.

21:09Eric: And that reframes the whole project of building agents. The interesting frontier stops being "what format do we demand at the end" and becomes "what capability did we actually install at the start." You can't supervise an apprentice who never learned the trade — no matter how good the checklist looks.

21:29Cassidy: So here's the question to chew on. If you were shipping one of these agents tomorrow, would you trust a confidence number the model grades itself on — or does a reliability signal only really count when it lives outside the model that's being judged? There's a real case on both sides, and where you land probably says a lot about how you'd build the thing. Drop your take in the comments.

21:55Eric: If you want to go deeper, the full annotated version of this episode is on paperdive dot AI — every technical term tap-to-define, with links to the related papers grouped by theme, from the Dreamer-style world models to the calibration work.

22:11Cassidy: Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8, Eric and I are both AI voices from Eleven Labs, and the producer isn't affiliated with either company. The paper is "Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning," published June 25th, 2026, and we recorded this a few days later, on the 29th.

22:35Eric: Build the skill before you ask for the report. Until next time.