How a Frozen Model Went From 2% to 77% on Physics Puzzles — Without Retraining

0:00Bella: Claude Sonnet 4.6, a model that can recite projectile motion, torque, every lever law you'd want, sits down to a set of 2D physics puzzles and solves two percent of them. Two. Then the exact same model, with not a single weight changed, climbs to roughly three-quarters in its best run.

0:17Tyler: Quick heads up before we get going — this is an AI-made explainer, both voices included.

0:23Bella: And I want to stress the "not a single weight changed" part, because it's the whole story. Nothing was retrained. So here's the promise: by the end you'll understand what it means for a language model to learn without actually learning — how a model that already knows the physics can go from failing almost completely to solving most of the time, by doing the one thing these agents normally never do.

0:48Tyler: Which is exactly the part that should nag at you. If the model already knew the physics, what was missing at two percent? And if you can fix it without ever touching the weights — what does that tell us about where the real bottleneck in these agents actually lives?

1:06Bella: And that's the stake for anyone not already deep in this corner of the field. The default recipe for making an agent better at a hard domain is: gather data, update the weights — fine-tune it, or run reinforcement learning on it. That's expensive, it needs direct access to the model, and it can quietly break things that already worked. This paper points at a completely different lever — adaptation that lives in the prompt, as editable notes, sitting on top of a frozen black-box model you never crack open. If that holds up, a whole category of "we need to retrain this" turns into "we need better notes."

1:45Tyler: One flag before we celebrate, and I'm going to come back to it hard later — that two percent is sitting right near the floor, and the benchmark here was built by the same team, specifically to reward this kind of experiment-driven agent. Keep that in your back pocket.

2:03Bella: Fair flag. Let's start with why two percent even happens, because it's stranger than it sounds. The puzzle is from a benchmark called Interphyre — built on top of PHYRE, which is a set of 2D physics puzzles where you place one object, hit go, and let gravity do the rest. The specific one that becomes the spine of this whole paper is a catapult: you have to drop a ball to launch the catapult arm and land another ball in a basket.

2:32Tyler: And the model genuinely understands catapults in the abstract. It can explain the mechanism to you.

2:38Bella: Completely. What it can't do is the specific instance. It doesn't know the exact height of *this* arm, on *this* random seed, where the basket sits, how the masses balance. Knowing physics in general turns out to be almost useless for doing physics on one particular configuration you've never seen. The authors call this the gap between recalling and experimenting — a scientist doesn't just recall the answer, they propose a hypothesis, test it against the world, watch what gets confirmed or refuted, and adjust.

3:13Tyler: So the question driving the paper is: can an agent succeed the way a scientist does — by running its own experiments — and crucially, remember what it learned so it doesn't rediscover everything from scratch on the next puzzle?

3:28Bella: Right. And to even ask that, picture fifty versions of the catapult laid out in a row. Same puzzle type, but each one a different random seed — the arm at a different height, different masses, the basket moved. The standard agent, called ReAct, takes them one at a time, cold. It produces a thought, takes an action — drop the ball here, run the simulation, look at what happened — and loops, up to twenty-five turns per puzzle. One full run of that loop, start to finish, is what they call a trajectory. The killer limitation: ReAct keeps no memory between puzzles. It walks into the same dead ends fifty times in a row.
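
A minimal sketch of that inner loop, assuming a hypothetical `propose` function standing in for the model's thought-and-action step and a hypothetical `simulate` function standing in for the physics engine. Neither name comes from the paper; only the twenty-five-turn budget and the lack of memory between puzzles are taken from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

MAX_TURNS = 25  # per-puzzle turn budget described in the episode

Step = Tuple[str, float, str]  # (thought, action, observation)


@dataclass
class Trajectory:
    """One full run of the loop on a single seed."""
    seed: int
    steps: List[Step] = field(default_factory=list)
    solved: bool = False


def run_react(
    seed: int,
    propose: Callable[[List[Step]], Tuple[str, float]],
    simulate: Callable[[int, float], Tuple[str, bool]],
    max_turns: int = MAX_TURNS,
) -> Trajectory:
    """Think, act, observe, repeat. Nothing carries over between calls."""
    traj = Trajectory(seed=seed)  # fresh history for every puzzle
    for _ in range(max_turns):
        thought, action = propose(traj.steps)
        observation, solved = simulate(seed, action)
        traj.steps.append((thought, action, observation))
        if solved:
            traj.solved = True
            break
    return traj


# Toy stand-ins (hypothetical): the "policy" nudges one knob by 0.1 per turn,
# and the "simulator" counts a drop as solved when it lands near a target.
def toy_propose(history: List[Step]) -> Tuple[str, float]:
    x = 0.1 * len(history)
    return (f"try x={x:.1f}", x)


def make_toy_simulate(target: float) -> Callable[[int, float], Tuple[str, bool]]:
    def simulate(seed: int, action: float) -> Tuple[str, bool]:
        hit = abs(action - target) < 0.05
        return ("in basket" if hit else "missed", hit)
    return simulate


reachable = run_react(0, toy_propose, make_toy_simulate(0.5))
unreachable = run_react(1, toy_propose, make_toy_simulate(-1.0))
```

In the toy run, the unreachable target burns the whole budget and the next call starts from an empty history again, which is the limitation Bella is pointing at.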

4:10Tyler: There's a close cousin worth naming: Reflexion. It writes itself a short self-reflection between attempts. But it never builds a lasting, reusable library; it reflects and then mostly lets those reflections evaporate. Hold onto that, Bella, because it turns out to be the exact difference that matters.

4:30Bella: It does. So here's the move. HExA — Hierarchical Experimentalist Agents — takes that inner loop and wraps it in an outer one. Same frozen model the whole time, but now it wears three hats. An actor attacks the puzzle. Then an evolver reads a batch of the actor's trajectories, lines the good runs up against the bad ones, and writes down what separated them — as short, plain-language skills. And a retriever pulls the most relevant skills back into the actor's prompt before the next round.
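
One way to picture the three hats is as three callables around a shared skill bank. This is a sketch under assumptions, not the authors' code: the function names, the batch size, and the choice to retrieve once per seed are all hypothetical, and in the real system each role would be a prompt to the same frozen model.

```python
from typing import Callable, Dict, List, Tuple

Skill = str
Traj = Dict[str, object]


def hexa_outer_loop(
    seeds: List[int],
    batch_size: int,
    actor: Callable[[int, List[Skill]], Traj],
    evolver: Callable[[List[Skill], List[Traj]], List[Skill]],
    retriever: Callable[[List[Skill], int], List[Skill]],
) -> Tuple[List[Skill], List[Traj]]:
    """Outer loop: act on a batch, distill skills, feed them back in."""
    bank: List[Skill] = []      # the "notebook"; the model itself never changes
    history: List[Traj] = []
    for start in range(0, len(seeds), batch_size):
        batch = seeds[start:start + batch_size]
        # Actor attempts each puzzle with whatever the retriever hands it.
        trajs = [actor(seed, retriever(bank, seed)) for seed in batch]
        # Evolver reads the whole batch and rewrites the bank.
        bank = evolver(bank, trajs)
        history.extend(trajs)
    return bank, history


# Toy roles (hypothetical) that only record what happened.
def toy_actor(seed: int, skills: List[Skill]) -> Traj:
    return {"seed": seed, "skills_seen": len(skills)}


def toy_evolver(bank: List[Skill], trajs: List[Traj]) -> List[Skill]:
    return bank + [f"skill distilled from {len(trajs)} runs"]


def toy_retriever(bank: List[Skill], seed: int) -> List[Skill]:
    return bank[-2:]  # stand-in for "most relevant"


bank, history = hexa_outer_loop([0, 1, 2, 3], 2, toy_actor, toy_evolver, toy_retriever)
skills_seen = [t["skills_seen"] for t in history]
```

The first batch runs with an empty notebook; the second batch already sees what the evolver wrote after the first.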

5:04Tyler: The cleanest way I've found to hold the whole thing: imagine a scientist who isn't allowed to get any smarter — same brain, frozen forever — but who's allowed to keep a lab notebook. After each session they write down, not "the answer was X," but "here's what separated my good runs from my bad ones, and here's the trap I keep falling into." Next session, they read the notebook first. Over weeks, the notebook stops recording specific measurements and starts recording principles.

5:38Bella: And the one wrinkle with no clean real-world version —

5:41Tyler: — is that the scientist, the person writing the notebook, and the person reading it are all the same model wearing different hats. Frozen brain, evolving notebook. That's HExA.

5:54Bella: So where ReAct re-explores from a blank slate every time, HExA's notebook compounds. And the best way to feel what that buys you is to watch a single puzzle — seed forty-five — because the entire thesis is in it. On screen this is the side-by-side: ReAct's failure on the left, HExA's run on the right. ReAct goes first. It burns all twenty-five turns. It keeps hitting the gray platform, watching the green ball fly up and to the left, or smack the ceiling. And its response, every single time, is to micro-tune — nudge the radius a little, change the drop height a little — all around the same failing spot. Twenty-five turns of fiddling the same two knobs. The authors call it local parameter fixation. It's the parallel-parking novice inching forward and back in a spot that's just wrong, never once thinking to reposition the whole car.

6:50Tyler: And the painful part is that's not a dumb agent. That's a smart agent with no memory and no strategy for recognizing it's stuck.

7:00Bella: Exactly. Now watch the right panel — HExA, handed a skill bank that the evolver distilled from earlier seeds. It reads the scene, notices the arm is sitting high, and pulls a skill that says, roughly, "when the arm is high, scale the drop height up." It tries the canonical launch. It overshoots — ceiling hit. And here's the turn. Instead of micro-tuning the radius like ReAct would, it reaches for a different note in the bank. And that note is the single best line in the paper.

7:32Tyler: Go on — what does it actually say?

7:35Bella: The skill reads: when tuning the radius at a fixed position keeps alternating between falling short and hitting the ceiling — stop tuning the radius. A ceiling hit is a geometry problem, not an energy problem. Shift the launch position toward the left to flatten the arc. Read that again slowly. It's not storing "place the ball at exactly this coordinate." It's storing a physical principle plus a policy for when to abandon local search. It's the parallel-parker finally realizing: I'm too far from the curb, stop inching, reposition the whole car. HExA applies it, flattens the arc, and lands it on the next try. Six iterations total. ReAct spent twenty-five and failed.
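
To make the "policy for when to abandon local search" concrete, here is a small sketch that encodes the skill as an explicit rule. In HExA the skill is a plain-language note read by the model, so this code is only a stand-in; the outcome labels, the window of four attempts, and the move names are assumptions for illustration.

```python
from typing import List


def choose_move(outcomes: List[str], window: int = 4) -> str:
    """Pick the next kind of adjustment from recent outcomes at one position.

    If the last `window` attempts strictly alternate between falling short
    and hitting the ceiling, stop tuning the radius and reposition instead.
    """
    recent = outcomes[-window:]
    oscillating = (
        len(recent) == window
        and set(recent) == {"short", "ceiling"}
        and all(a != b for a, b in zip(recent, recent[1:]))
    )
    return "shift_launch_left" if oscillating else "tune_radius"


stuck = choose_move(["short", "ceiling", "short", "ceiling"])
not_yet = choose_move(["short", "short", "ceiling", "ceiling"])
too_early = choose_move(["ceiling"])
```

The rule says nothing about coordinates. It only describes a pattern of failure and the kind of move to switch to, which is why a note like this can carry across seeds.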

8:22Tyler: And that skill is the word "hierarchical" earning its place. Early skills are concrete — "this arm is high, drop higher." Later skills are written while the agent already holds the earlier ones, so the evolver can refine them, find their boundary conditions, merge two into something more general. It climbs from parameters toward principles. That's the agent learning, in a loose sense, *how* to learn from interaction — not just learning answers. Now the mechanism underneath that — how the evolver actually writes those notes — is next, and it pays off in one deliberately strange choice: the system rewards a thorough failure *more* than a quick, clean exit. That sounds backwards until you see what the notebook is made of.

9:13Bella: It does sound backwards. Walk me through it.

9:16Tyler: So the evolver does what they call a two-pass distillation. Pass one is contrastive — it lines up the high-reward trajectories against the low-reward ones and asks, plainly, what did the winners do that the losers missed? That yields a handful of strategy skills. Pass two is the interesting one. It looks *only* at the failures, and pulls out two things. First, structured mistake records — what went wrong, why, and the corrective action. Those are the anti-patterns, the "don't do this again" notes.

9:51Bella: So skills are the plays worth repeating, mistakes are the errors to avoid. That maps cleanly onto a sports team reviewing game tape.

10:00Tyler: It does, and then there's the second thing pass two extracts, which is where this gets distinctive — partial skills. Inside a trajectory that *failed* overall, there are often correct reasoning steps. A run that correctly identified the catapult arm as the right mechanism, but then botched the placement, did something right before it went wrong. The evolver mines those correct steps out of the wreckage.
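
The two passes can be sketched with set operations over pre-labeled moves. This is a deliberately simplified stand-in: in the system described here the evolver is a language model that reads full trajectories, and the field names `good_moves` and `bad_moves`, the reward sign convention, and the example moves are assumptions made for illustration.

```python
from typing import Dict, List, Set


def two_pass_distill(trajs: List[Dict]) -> Dict[str, List[str]]:
    """Pass 1 contrasts wins against failures; pass 2 mines failures only."""
    wins = [t for t in trajs if t["reward"] > 0]
    fails = [t for t in trajs if t["reward"] <= 0]

    # Pass 1 (contrastive): moves the winners made that no failure made.
    win_moves: Set[str] = set()
    for t in wins:
        win_moves |= set(t["good_moves"])
    fail_moves: Set[str] = set()
    for t in fails:
        fail_moves |= set(t["good_moves"]) | set(t["bad_moves"])
    strategy = sorted(win_moves - fail_moves)

    # Pass 2 (failures only): anti-patterns, plus correct steps buried in losses.
    mistakes = sorted({m for t in fails for m in t["bad_moves"]})
    partial = sorted({m for t in fails for m in t["good_moves"]})

    return {"strategy": strategy, "mistakes": mistakes, "partial": partial}


batch = [
    {"reward": 1.0,
     "good_moves": ["use catapult arm", "shift launch left"],
     "bad_moves": []},
    {"reward": -0.5,
     "good_moves": ["use catapult arm"],
     "bad_moves": ["micro-tune radius"]},
]
notes = two_pass_distill(batch)
```

In this toy batch, "use catapult arm" shows up as a partial skill because it appears inside the failed run, while "shift launch left" is what separated the win from the loss.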

10:28Bella: And here's why that's a genuinely different idea — this is the credit-assignment problem. When an attempt succeeds, which moves actually deserve the credit, versus which were just lucky? A success never tells you. It hands you a bundle of steps and says "this worked," and you can't see which ones were load-bearing.

10:50Tyler: That's the irony I keep coming back to. A win hides its own logic. But a thorough failure — one that explored a lot before collapsing — exposes the structure of the problem. You can see exactly where the right reasoning ended and the wrong turn began. So a wandering, exploratory failure can teach you more than a lucky success. It's the lesson you literally cannot get from winning.

11:17Bella: And *that's* why the reward is shaped the way it is.

11:20Tyler: Right. Every trajectory gets a score, and they don't use a smooth continuous number — they use a few discrete buckets. A fast, decisive solve in three turns or fewer gets the top mark; a slow solve gets less; a failure scores negative. But — a failure that explored extensively is penalized *less* than one that gave up early. Because the thorough failure leaves richer material for the evolver to mine. The early quit leaves nothing.
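
A sketch of that bucketed score. The episode gives the ordering and the three-turn cutoff for a fast solve; the numeric values and the fifteen-turn threshold for "explored extensively" are assumptions chosen only to respect that ordering.

```python
def bucket_reward(
    solved: bool,
    turns: int,
    fast_turns: int = 3,      # from the episode: three turns or fewer is "fast"
    explore_turns: int = 15,  # hypothetical cutoff for a thorough failure
) -> float:
    """Map a trajectory to one of four discrete buckets."""
    if solved:
        return 1.0 if turns <= fast_turns else 0.5   # fast win vs slow win
    # Failures are negative, but a long exploratory one is penalized less.
    return -0.5 if turns >= explore_turns else -1.0


fast_win = bucket_reward(True, 3)
slow_win = bucket_reward(True, 12)
thorough_fail = bucket_reward(False, 25)
early_quit = bucket_reward(False, 4)
```

Only the ordering matters here: fast win, then slow win, then thorough failure, then early quit. That is the kind of distinction a language-model reader can act on.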

11:50Bella: And the reason for buckets instead of a precise number is the part I'd underline — the thing consuming this reward is a language model, not a gradient. An LLM can act on "this was a fast win versus an exploratory failure." It cannot meaningfully tell a reward of point-seven-three from point-seven-one. So the whole scoring scheme is shaped around the fact that the optimizer here is a reader, not a calculus.

12:19Tyler: There's a second small score worth a line — each skill inherits a rating from the average reward of the trajectories that produced it. Skills born from clean solves rate high; skills scraped from partial reasoning inside failures rate low but stay in the bank. That rating does two jobs: it decides which skills the actor even gets to see, and — because the bank has a hard cap on how many skills it can hold — it tells the evolver what to prune or merge when the notebook fills up. The cap is what forces it to abstract instead of hoard.
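
A sketch of the rating and the cap working together. The rating rule (average reward of the source trajectories) is from the episode; pruning the lowest-rated entry when the bank is full is an assumption, since the description above says the evolver may also merge skills, which this sketch does not attempt. The skill texts and cap value are illustrative.

```python
from statistics import mean
from typing import Dict, List


def rate_skill(source_rewards: List[float]) -> float:
    """A skill inherits the average reward of the runs that produced it."""
    return mean(source_rewards)


def add_to_bank(bank: List[Dict], text: str,
                source_rewards: List[float], cap: int) -> List[Dict]:
    """Insert a skill, keep the bank sorted by rating, enforce the hard cap."""
    entry = {"text": text, "rating": rate_skill(source_rewards)}
    ranked = sorted(bank + [entry], key=lambda s: s["rating"], reverse=True)
    return ranked[:cap]  # hypothetical policy: drop the lowest-rated overflow


def visible_to_actor(bank: List[Dict], k: int) -> List[str]:
    """The same rating decides which skills reach the actor's prompt."""
    ranked = sorted(bank, key=lambda s: s["rating"], reverse=True)
    return [s["text"] for s in ranked[:k]]


bank: List[Dict] = []
bank = add_to_bank(bank, "shift launch left to flatten the arc", [1.0, 0.5], cap=2)
bank = add_to_bank(bank, "the catapult arm is the mechanism", [-0.5, -0.5], cap=2)
kept_low = [s["text"] for s in bank]  # low-rated partial skill still fits
bank = add_to_bank(bank, "high arm: scale drop height up", [1.0], cap=2)
shown = visible_to_actor(bank, k=1)
```

While there is room, the low-rated partial skill stays in the bank. Once the cap bites, something has to go, and that pressure is what pushes the notebook toward fewer, more general entries.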

12:57Bella: So, to consolidate where we are: ReAct fails because it re-explores with no memory. HExA wraps it in a loop where an evolver turns batches of attempts — wins *and* informative failures — into a capped, ranked notebook of principles, and the next attempt reads it first. Now — does it actually work, beyond one lucky seed?

13:19Tyler: This is the prediction-to-test moment. If those skills really capture transferable principles and not just memorized coordinates, then the same frozen model, handed its own evolved notebook, should jump on the hardest level. And it does.

13:36Bella: On the catapult, the level where Sonnet started at two percent, the main configuration typically lands around two-thirds, with the best run reaching about seventy-seven. Same model, no weight updates. And it's not just Claude flexing superior priors: an open model, GPT-OSS-120B, goes from zero to fifty-four percent on the same level. That's the check that rules out "it only works because Claude is unusually good": it lifts weak agents too.

14:07Tyler: And the efficiency number is the quiet one I'd hold up. ReAct averages almost twenty-three turns per seed. HExA averages around fourteen — more than a third fewer attempts. The notebook doesn't just make the agent succeed more, it makes the search itself cheaper. Which is the direct rebuttal to "maybe it's just spending more compute." It's spending *less* and getting more, because it skips the dead ends.
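
The "more than a third" figure can be checked directly. The inputs are the rounded turn counts quoted above ("almost twenty-three" and "around fourteen"), so the result is approximate.

```python
react_turns = 23  # "almost twenty-three" turns per seed, rounded
hexa_turns = 14   # "around fourteen" turns per seed, rounded

# Fraction of turns saved relative to the ReAct baseline.
reduction = (react_turns - hexa_turns) / react_turns
print(f"{reduction:.0%} fewer turns")
```

Nine turns saved out of twenty-three is a bit under forty percent, which clears the one-third mark.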

14:35Bella: Then there's the transfer result, which is the one that made me sit up. Take skill banks distilled from three *easier* levels, strip out the level-specific coordinates, keep the mechanism-level abstractions — contact geometry controls impulse direction, moment arm controls torque — and re-ground them onto the catapult. Without running a single catapult trajectory during that synthesis. Zero experimentation on the target.

15:04Tyler: And that gets you what, against an eight-percent ReAct baseline?

15:09Bella: Forty-four percent. Thirty-six points, purely from transferred abstraction. The agent has never touched the catapult, and the borrowed principles still carry. That's the strongest evidence that what's in the notebook is real strategy, not memorized answers — because memorized answers couldn't possibly transfer to a puzzle they were never written for.

15:33Tyler: Okay, and this is the comparison that earns the "why should I care" for anyone training agents. They matched HExA against GRPO — that's the standard way you'd actually update the weights. Gradient-based reinforcement learning: run a batch of attempts, compute which direction to nudge the weights to earn more reward, nudge, repeat thousands of times.

15:57Bella: Quick gloss, because it's the contrast everything hinges on — gradient learning is muscle memory. Thousands of reps, slow to take hold, but eventually deeply ingrained. In-context learning is the cheat sheet — any insight is usable on the very next attempt.

16:14Tyler: And that's exactly what shows up. At a matched budget of fifty unique seeds, in-context HExA beats the weight-updating GRPO agent on both levels they tested. On one of them, GRPO needs over three times as many seeds just to catch up. The reason is crisp: a strategy HExA discovers is usable in the very next episode, through context. A gradient learner has to accumulate signal over many rollouts before the weights even begin to encode it. Notes beat muscle memory when you're short on reps.
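
As a rough illustration of why the gradient learner needs many rollouts, here is a sketch of the group-relative advantage commonly used in GRPO, where each attempt's reward is normalized against the mean and standard deviation of its group. The reward values, the use of the population standard deviation, and `eps` are illustrative assumptions, and this is not the paper's training code.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A mixed group gives the optimizer a direction to push in.
mixed = group_relative_advantages([1.0, -1.0, -1.0, 1.0])

# A group where every attempt fails the same way gives no contrast at all.
all_fail = group_relative_advantages([-1.0, -1.0, -1.0, -1.0])
```

When every attempt in a group scores the same, the advantages are all zero and that batch contributes no direction for the update. A written note, by contrast, can be read on the very next attempt.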

16:48Bella: And to their credit, they don't oversell it.

16:51Tyler: They don't — given enough training, GRPO catches up and eventually hits a hundred percent. So this is specifically a low-data, early-adaptation win. Which is the honest hand-off into the part where I push back. Because here's where I'd steelman the skeptic, and I think the paper invites it. Start with the benchmark. Interphyre was built by these same authors, to evaluate exactly this style of experiment-driven agent — and it ships with level-specific analysis tools. Things like "predict the first object this ball will contact," "compute the gap." Those hand the agent clean, cheap experimentation primitives. So some of the method's advantage may be engineered into the testbed. In a domain without such tidy, low-cost intervention tools, the gains could shrink.

17:44Bella: That's fair, though I'd say the transfer result is somewhat insulated from it — the abstractions carried to a level the agent never probed.

17:53Tyler: Within the same benchmark, with the same tools. Second point, and it's the one I'd put weight on: that headline jump is dramatic partly because two percent is the floor. On easier levels the gains are modest — one small model goes from sixty-two to seventy-two percent on an easy level, a perfectly respectable ten points and nothing like "two to seventy-seven." The most quotable numbers come precisely from the cases where the base model was failing almost completely, which is where *any* working learning loop looks spectacular.

18:31Bella: And the variance is real.

18:33Tyler: It is — that main catapult result is sixty-seven percent, plus or minus nine, over three runs. The "up to seventy-seven" is the top of a noisy band, not a stable point. And the deepest one: the whole system rests on the evolver — one LLM — correctly contrasting wins against losses and extracting *valid* physics. The catapult works in part because the model already knows lever physics and just needs to apply it. Drop this into a domain where the model's priors are actively *wrong*, and the evolver could distill confident, plausible, incorrect skills into the notebook — and the paper doesn't stress-test that. A notebook full of wrong principles read by the same frozen brain that wrote them is a closed loop with no one to catch the error.

19:26Bella: I'll concede the cleanest version of all of that. The strongest, most defensible claim isn't "this replaces training." It's "in a novel domain, before you've trained anything, with only the environment's own feedback, you can bootstrap real competence fast." The authors actually pitch it that way — use HExA to get off the ground, then optionally consolidate with gradient RL once you have momentum. Complementary, not rivals.

19:57Tyler: And I'd still hold that the most spectacular number on the poster is partly a story about how low the starting line was. Both things are true at once.

20:08Bella: Both true. So let me pull up to the idea that actually survives all of this — because it's bigger than the method. The interesting thing this agent learns is usually not the answer. It's the search strategy. The meta-knowledge of when to stop turning a knob and what to reach for instead. "A ceiling hit is geometry, not energy — shift position, don't add force." That's not a fact about one catapult. It's a way of getting unstuck that generalizes.

20:39Tyler: And the companion idea: failure earns its keep as a teacher. A wandering, exploratory failure that maps the dead ends can be worth more to that notebook than a lucky win that hides its own logic. That reframes what an agent should even bother to remember.

20:57Bella: So here's where I'd hand it to you. If you're trying to make an agent competent in a genuinely new domain, which lever do you reach for first — editable notes in the context window, fast and immediate but bounded by the model's own reasoning? Or gradient updates, slow and expensive but eventually deeper? And is the right answer "notes to bootstrap, then training to consolidate" — or is editable memory not a stepping stone at all, but the actual destination? If you've shipped an agent into a domain with no training data, you've already had to pick. Drop where you landed.

21:36Tyler: The full annotated version of this episode is on paperdive dot AI — every technical term tap-to-define, with links to the related papers grouped by theme, plus our weekly and monthly roundups.

21:50Bella: Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8, Tyler and I are both AI voices from Eleven Labs, and the producer isn't affiliated with either company. The paper is "Hierarchical Experimentalist Agents," posted June 28th, 2026, and we recorded this on June 30th.

22:09Tyler: Same frozen model, a better notebook. Sometimes the trick isn't a bigger brain — it's knowing where to stop tuning and where to look next. See you in the next one.