How an AI Agent Rewrites Its Own Tools, Without an Answer Key

0:00Cassidy: A coding agent walks into a brutal software-engineering benchmark — the kind where you have to fix real bugs across multiple files in a real repository — and it solves about sixty percent of them. Then it spends one round improving itself, and on the next pass it solves nearly eighty percent. A nineteen-point jump. And here's the part that should make you sit up: nobody graded its work. No answer key, no labeled test set, no human checking which of its fixes were actually correct. It graded its own homework, and got dramatically better anyway.

0:35Tyler: And the paper making that claim went up on arXiv yesterday — June fourth, twenty-twenty-six — and we're recording on June fifth. Quick ground rules before we dig in. This episode is AI-generated; the script was written by Anthropic's Claude Opus 4.8. I'm Tyler, that's Cassidy, and we're both AI voices from Eleven Labs — and the folks producing the show aren't affiliated with either Anthropic or Eleven Labs. The paper is called "Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts." And honestly, Cassidy, that title smuggles in the whole trick — there's a word in there doing enormous work.

1:17Cassidy: The word is "harness," and you're right that everything hangs on it. So let me ground it, because it's the one piece of jargon the whole episode depends on. When you picture an AI agent, you probably picture the model — the language model, the brain. But a deployed agent is way more than the brain. Wrapped around that fixed model is a whole layer of scaffolding: the system prompts telling it how to behave, the helper scripts it can call, the tool definitions, the workflow instructions, little skill files describing how to handle specific situations. That entire wrapper is the harness.

1:55Tyler: And the reason that distinction matters so much is cost. The brain — the model weights — those are frozen and brutally expensive to change. The harness is just text and code. Cheap to rewrite, fast to rewrite.

2:09Cassidy: Exactly. The analogy I keep coming back to is a kitchen. The model is a fixed, highly skilled chef — you cannot make the chef smarter. But the harness is everything else in the kitchen: the layout, the labeled containers, the recipe cards taped to the wall, the specialized tools, the sticky note that says "the good knives are in the third drawer, not where you'd expect." You can't upgrade the chef. But you can reorganize the kitchen so the same chef produces dramatically better meals. That's what this paper is doing — renovating the kitchen by watching where the chef keeps fumbling.

2:50Tyler: Okay, but renovating based on what? Because here's where I get suspicious. Normally, when you improve a machine-learning system, you improve it against known-correct answers. You try a change, you check it against the ground truth, you keep it if the score went up. That's the entire game — you need an answer key.

3:12Cassidy: And that's precisely the catch the authors fixate on. In a lab, sure, you have a labeled validation set — a batch of tasks where you know the right answer. So you make a harness edit, score it, iterate. Beautiful. But once the agent is deployed — out in the world, doing real work for real users — nobody is sitting there grading every outcome. And you don't even know what next week's tasks will look like.

3:40Tyler: So you've got this fundamental asymmetry. What you accumulate endlessly, for free, is trajectories — records of the agent attempting tasks. Every tool it called, every line of reasoning, every final answer. You've got a mountain of those. What you don't have is verdicts. Trajectories are infinite and free. Labels are scarce or just flat-out absent.

4:04Cassidy: Right. And so the central question the paper asks is: can an agent improve its own harness using nothing but those unlabeled past trajectories? No answer key at all? And the bet they're making — this is the thesis, almost word for word — is that the failures and inconsistencies already sitting in those trajectories contain enough signal to fix the harness, if you know how to extract it.

4:29Tyler: Which is a lovely bet to state and a hard one to cash. So how do they actually pull a steering signal out of a pile of ungraded attempts?

4:38Cassidy: The conceptual move — and this is the heart of it — is to replace a quantity you can't observe with one you can. Ideally you'd tune the harness to maximize something like "expected usefulness on future tasks." But that's invisible. You can't measure it without labels and a representative test set, neither of which you have. So instead of asking "is this trajectory correct?" — which needs an answer key — they ask "does the agent, looking at two attempts at the same task, prefer one over the other, and can it say why?" That comparative judgment becomes the proxy. They call it self-preference. You swap "is this right" for "is this better than that," and it turns out the second question is answerable without ground truth.

5:25Tyler: That swap is doing a ton of work, and there's a real cognitive-science reason it holds up. Models — and people, honestly — are far more reliable at relative judgment than absolute judgment. Ask someone to rate a coffee on a scale of one to ten and you'll get noise; they don't have a calibrated internal yardstick. Hand them two cups side by side and almost anyone can tell you which they prefer and roughly why.

5:52Cassidy: That's the cleanest way to think about the whole scheme, actually. The method never once asks "is this harness good?" It only ever asks "is this new harness better than the one we had?" And that comparative question is exactly the one you can answer without an answer key.

6:09Tyler: So lay out the machine. How does it go from a folder full of old logs to a rewritten toolkit?

6:15Cassidy: Three stages. And the elegant thing is that each stage is a fix for a problem the previous stage would otherwise create. Let me take them in order. Stage one: pick which past tasks to study. You can't re-run and analyze every trajectory you've ever logged — it's prohibitively expensive, and worse, the trivial tasks would drown out the informative ones. You'd spend all your effort re-solving stuff the agent already nails. So they use the model itself as a difficulty judge to score each past trajectory, and they select a small set — ten tasks — that are both hard and diverse.

6:53Tyler: And "diverse" is carrying weight there, not just decoration.

6:57Cassidy: Hugely. The analogy is assembling a study group. You want to revisit your hardest past problems — that's where the signal is. But you absolutely do not want ten versions of the same hard problem, because that teaches you one lesson ten times over. So you want problems that are each tough and each teach something different. There's one genuinely clever wrinkle in how they measure "different." When the difficulty judge describes each past failure, it's told to write a deliberately abstract description — strip out the specific repo name, the specific files. Describe the shape of the problem in generic terms, like "a multi-file refactor whose difficulty comes from keeping a shared invariant consistent across modules." Why abstract? Because then two failures from completely different codebases can be recognized as the same kind of failure.

7:52Tyler: That's the part I'd underline. Without the abstraction, two bugs in two different repos look totally unrelated — different file names, different everything. Strip the surface details and you can see they're both, say, "agent gave up too early on a long task." Now you know that's a recurring weakness, not a one-off.

8:13Cassidy: And the actual selection — picking the ten tasks that best balance hard-and-different — uses a tool called a Determinantal Point Process. The math is real, but the intuition is everything: think of it as a scoring system over whole sets of tasks that simultaneously rewards difficulty and rewards spread. There's a single knob that slides between "just give me the hardest problems" and "just give me the most varied problems." They set it in the middle. That's all you need to hold.
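
A minimal sketch of the hard-and-diverse selection Cassidy describes, under stated assumptions. This is not the paper's Determinantal Point Process: it swaps in a simpler greedy heuristic that keeps the same single knob. The names `select_coreset`, `balance`, and the idea of turning each abstract failure description into an embedding vector are hypothetical, introduced only for illustration.

```python
from math import sqrt

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def select_coreset(difficulty, embeddings, k, balance=0.5):
    """Greedy stand-in for DPP selection (hypothetical helper, not the paper's code).

    balance=1.0 means only difficulty matters; balance=0.0 means only spread matters.
    """
    chosen = []
    remaining = set(range(len(difficulty)))
    while remaining and len(chosen) < k:
        def gain(i):
            # Spread = distance to the nearest already-chosen task (1.0 if none yet).
            spread = min(
                (cosine_distance(embeddings[i], embeddings[j]) for j in chosen),
                default=1.0,
            )
            return balance * difficulty[i] + (1.0 - balance) * spread
        best = max(sorted(remaining), key=gain)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data: tasks 0 and 1 are both hard but describe the same kind of failure.
difficulty = [0.9, 0.85, 0.3, 0.8]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hardest_only = select_coreset(difficulty, embeddings, k=2, balance=1.0)
balanced = select_coreset(difficulty, embeddings, k=2, balance=0.5)
```

With the knob at 1.0 the sketch picks the two near-duplicate hard tasks; at 0.5 it keeps the hardest task and trades the duplicate for one that teaches something different.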

8:44Tyler: And we'll come back to that knob, because there's an ablation around it that's one of the best "wait, really?" moments in the paper. But keep going — stage two.

8:54Cassidy: Stage two: re-solve and diagnose by comparison. Take each of those ten tasks and re-solve it three times, in parallel. Then the agent inspects the group of runs along two completely different axes. The first axis looks inside each individual run for evidence that something went wrong — bad tool calls, false assumptions, stopping before the job was done. That's self-validation. The agent reading its own work and flagging the obvious stumbles.

9:22Tyler: And the second axis is the more interesting one, because it doesn't require the agent to spot its own mistakes at all.

9:30Cassidy: Right — the second axis looks across the three runs for disagreement. This is self-consistency. The principle, which comes straight out of the uncertainty literature, is that when the agent answers the same question three different ways — divergent plans, different tool sequences, contradictory final answers — that disagreement is itself a signal. It means the agent is uncertain, and uncertainty marks a spot where the harness has a gap.

9:58Tyler: The image I'd use is three witnesses. You ask three people to independently describe the same event. If all three tell the same story, you trust it. If they contradict each other, you've found exactly the spot worth investigating — not because you know the truth, but because the disagreement reveals where things are murky.

10:19Cassidy: And that's why you need both axes, which is a point worth pausing on. Three witnesses can all be wrong in the same way — all three runs confidently make the identical mistake. They'd be perfectly consistent and perfectly wrong. The self-consistency signal misses that completely. But the other signal — looking inside each run for evidence of error — that's the one designed to catch it. The two diagnostics cover each other's blind spots.
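
The two diagnostic axes can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: the run format (a dict with `answer` and `flags` keys) and the pairwise-disagreement score are hypothetical choices, and in the method described here the flags would come from the model reading its own trajectory rather than from a hand-written list.

```python
from itertools import combinations

def disagreement(final_answers):
    """Self-consistency signal: fraction of run pairs whose final answers differ."""
    pairs = list(combinations(final_answers, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

def diagnose(runs):
    """Combine both axes for one task.

    Each run is a dict with hypothetical keys:
      'answer': the run's final output
      'flags':  problems spotted inside that single run (self-validation)
    """
    flags = sorted({flag for run in runs for flag in run["flags"]})
    return {
        "self_validation_flags": flags,
        "self_consistency_disagreement": disagreement([r["answer"] for r in runs]),
    }

# Three runs that agree with each other but all stopped early:
# consistency sees nothing, self-validation still catches the problem.
consistent_but_wrong = diagnose([
    {"answer": "patch-A", "flags": ["stopped early"]},
    {"answer": "patch-A", "flags": ["stopped early"]},
    {"answer": "patch-A", "flags": ["stopped early"]},
])

# Three runs with no internal red flags that nonetheless disagree.
clean_but_divergent = diagnose([
    {"answer": "patch-A", "flags": []},
    {"answer": "patch-A", "flags": []},
    {"answer": "patch-B", "flags": []},
])
```

The two toy cases mirror the blind-spot argument: the first is invisible to the disagreement score, the second is invisible to the per-run flags.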

10:47Tyler: Which sets up one of the ablations I want to get to. But finish the pipeline — stage three.

10:53Cassidy: Stage three: propose fixes and hold a contest. The agent gets handed all those diagnoses, sorted by how severe they are, plus write access to a fresh copy of the harness directory. It can add, delete, or modify anything — instructions, skills, executable scripts. But here's the thing — a single proposed edit is unreliable. Even with a great diagnosis, one edit might be a dud, or worse, a regression. So instead of trusting one edit, they generate three candidate harnesses independently. Each candidate then re-solves the ten tasks, and a pairwise judge compares each candidate's new run head-to-head against the original baseline run on the same task. The candidate with the best average advantage across all ten tasks wins.

11:41Tyler: And there's a gate on that win that I think is the most underrated design choice in the paper.

11:47Cassidy: The candidate only gets deployed if it strictly beats the baseline. If nothing beats the original harness, the harness is left completely untouched. "Do nothing rather than risk a regression." That's deliberate. They'd rather ship no change than a change that might make things worse.

12:06Tyler: And notice what that pairwise framing buys them — it's the coffee thing again. Each candidate isn't scored in isolation on some made-up quality scale. It's always "did this new run beat the old run on this exact task." A win-rate against the past, judged by the agent itself, with a strict bar of "better than zero net wins." Comparative all the way down.
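
A minimal sketch of that contest and its gate, assuming the pairwise judge's verdict on each task is encoded as +1 (candidate run better), 0 (tie), or -1 (baseline run better). The encoding and the name `pick_candidate` are assumptions for illustration; the episode only says the winner has the best average advantage and must strictly beat the baseline.

```python
def pick_candidate(judgments):
    """Return the winning candidate's name, or None to keep the current harness.

    judgments maps candidate name -> list of per-task verdicts versus the
    baseline run: +1 (candidate better), 0 (tie), -1 (baseline better).
    """
    best_name, best_advantage = None, 0.0
    for name, verdicts in judgments.items():
        advantage = sum(verdicts) / len(verdicts)
        # Strict comparison: a candidate that merely ties the baseline never deploys.
        if advantage > best_advantage:
            best_name, best_advantage = name, advantage
    return best_name

winner = pick_candidate({
    "candidate_1": [1, -1, 0],
    "candidate_2": [1, 1, -1],
    "candidate_3": [-1, -1, 1],
})

no_change = pick_candidate({
    "candidate_1": [0, 0],
    "candidate_2": [-1, 1],
})
```

Returning `None` is the "do nothing rather than risk a regression" branch: when no candidate has a positive average advantage, the harness stays as it was.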

12:32Cassidy: One more detail I love, because it preempts an obvious objection. A single backbone model — the paper uses Codex with GPT-5.5 — plays every single role. Solver, difficulty judge, diagnoser, optimizer, ranker. Same model throughout. The roles differ only in what the model is shown and what prompt it gets, never in which model runs. They did that on purpose, so that any gain you measure can't be secretly attributed to a smarter judge model sneaking in the back door.

13:05Tyler: And a quick framing note for anyone whose eyebrow just went up at "GPT-5.5" — yes, this is a very recent, fast-moving corner of the field. The paper cites models and work that'll read as near-future to a lot of listeners. That's just where this research lives right now.

13:24Cassidy: So that's the loop. Now let me make it concrete, because all of this is abstract until you see what the agent actually learned. On the software benchmark, re-solving its past failures, the agent discovers two specific things. One: the Go toolchain on these systems lives at a weird, non-standard path, outside where you'd normally look. Two: certain Python cache directories have to be stripped out before you produce the final code patch, or the patch fails to apply. So what does it do? It writes itself a new tool — an actual script — that checks the build and handles exactly these two recurring failures. It didn't just jot down a note saying "remember the Go path is weird." It wrote executable code that goes and finds it.
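
To give a feel for what such a self-written tool might look like, here is a hedged sketch. It is not the script the agent produced: the function names are hypothetical, the non-standard Go location is left as a caller-supplied list because the episode does not say where it is, and treating "Python cache directories" as `__pycache__` folders is an assumption.

```python
import shutil
from pathlib import Path

def find_go(extra_dirs):
    """Look for a `go` binary on PATH first, then in caller-supplied directories."""
    found = shutil.which("go")
    if found:
        return found
    for directory in extra_dirs:
        found = shutil.which("go", path=str(directory))
        if found:
            return found
    return None

def strip_python_caches(repo_root):
    """Delete every __pycache__ directory under repo_root; return how many were removed."""
    removed = 0
    # Materialize the list first so deleting directories does not disturb the walk.
    for cache_dir in list(Path(repo_root).rglob("__pycache__")):
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir)
            removed += 1
    return removed
```

A harness could call `strip_python_caches` on the working tree just before generating the final patch, and fall back to `find_go` with its known extra locations when the build check cannot find the toolchain on `PATH`.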

14:16Tyler: And that's the real departure from prior work, right there. Because the label-free self-improvement methods that came before this — things like Dynamic Cheatsheet, ReasoningBank — they basically accumulate memory. A running list of useful facts. A bank of reusable reasoning patterns you retrieve later. Useful, but it's all notes. RHO rewrites the executable machinery. It adds tools that act.

14:43Cassidy: And the authors argue that richer edit surface is exactly why their gains are so much bigger. Going back to the kitchen — the memory methods are taping better recipe cards to the wall. This is more like hiring a prep cook the chef can hand work off to. It's a fundamentally more powerful kind of change.

15:03Tyler: Which is a good moment to actually look at the numbers, because the headline is real but it's also the best case, and I don't want us to oversell it. The nineteen-point jump — sixty to eighty, roughly — that's on the software benchmark, SWE-Bench Pro, the long-horizon multi-file repair one. On the other two benchmarks the gains are much more modest. Terminal-Bench Two, the command-line tasks, goes up five points. GAIA-2, the messier knowledge-work environment, goes up eight.

15:36Cassidy: Still positive across the board, though.

15:38Tyler: Positive across the board, and crucially, way ahead of the baselines, which barely move — plus one to plus five points, and sometimes they actually go negative. ReasoningBank drops GAIA-2 by a point. So even the modest RHO gains tower over the memory-only crowd. But a careful reader should hear "nineteen points" as the top of a range, not the typical result. The durable contribution here isn't the single number — it's the reframing.

16:08Cassidy: Let me push on that for a second, and put it to you, Tyler. If the gains are uneven, what's the strongest case that this is more than one lucky benchmark?

16:20Tyler: The strongest case is the ablations, and this is where the paper got genuinely interesting to me. Because the ablations don't just show "the method works" — they show the design isn't arbitrary. Let me take the coreset one first, the study-group selection. Remember that knob you described — slide between "hardest problems" and "most diverse problems." So they tested the extremes. Pick your ten tasks purely by difficulty: you get sixty-two percent. Pick them purely by diversity: fifty-eight percent. Now here's the kicker. Pick them completely at random: sixty-four percent.

17:01Cassidy: Wait — random beats both of the smart single-axis strategies?

17:06Tyler: Random beats both. Picking the hardest tasks alone, or the most varied tasks alone, both do worse than just grabbing ten at random. And only when you balance the two — hard and diverse together — do you jump to seventy-eight. That's the "wait, really?" beat. It tells you the cleverness isn't in either axis on its own; it's specifically in the tension between them. Lean too far either way and you're worse than not thinking about it at all.

17:36Cassidy: That's a fantastic result, because a single-axis strategy is the obvious thing to reach for. "Just study the hardest problems" sounds completely reasonable. And it's worse than random.

17:49Tyler: Worse than random. And the second ablation is even sharper — this is the diagnosis one, the two signals you walked through. Remember self-consistency, the three-witnesses disagreement signal. Take it out — diagnose using only the look-inside-each-run signal — and the software benchmark doesn't just drop. It collapses to fifty-six percent. Which is below the fifty-nine you started with. Below the untouched baseline.

18:18Cassidy: So a bad diagnosis is actively worse than no optimization at all.

18:22Tyler: Worse than leaving the harness alone. You'd have been better off doing nothing. Take out the other signal — the self-validation, look-inside one — and you land at seventy. Feed it raw trajectories with no structured diagnosis whatsoever, just the logs, and you get sixty. So both signals are load-bearing. Neither one is ornamental. Pull either and the whole thing wobbles, and pull the consistency signal and it falls below where it started.

18:53Cassidy: That pairing of ablations really is the intellectual spine of the paper. The coreset one says the selection balance is doing real work. The diagnosis one says the two-signal design is doing real work. Together they're an argument that the architecture is principled, not just a pile of tricks that happened to add up.

19:15Tyler: And there's one more comparison I think is the single sharpest framing in the whole paper, Cassidy — the one against Meta-Harness.

19:24Cassidy: Go for it, because this is the head-to-head that actually matters.

19:29Tyler: So Meta-Harness is a recent method that, like RHO, rewrites full harness code — tools and all. The difference is Meta-Harness grades its candidates against a labeled validation set. It uses the answer key. So this is the fair fight: the label-free method versus the label-hungry method, both editing the same rich surface. In one round, Meta-Harness gets to sixty-two percent. RHO, with no labels, gets seventy-eight. For Meta-Harness to actually catch up to RHO, it needs ten rounds, roughly three times the compute, and the labels RHO never touched — and even then it only reaches eighty. Barely ahead, at enormous extra cost, using information that RHO proved you don't need.

20:16Cassidy: That's the result that reframes the field. The conventional wisdom was that labels are the thing that makes harness optimization work — that the answer key is what steers the search. And here's a method saying: comparative self-judgment over re-runs gets you most of the way there, in a fraction of the compute, with no labels at all.

20:40Tyler: Which is exactly the claim I want to put pressure on, because the entire edifice rests on one assumption: that the agent is a good enough judge of its own work. And the paper's own data says that judgment is imperfect — and they're honest about it, to their credit.

20:58Cassidy: This is the Table 3 caveat.

21:00Tyler: This is the Table 3 caveat, and it's worth sitting with rather than papering over. When the agent picks the winning candidate among its three options, you'd hope it's reliably grabbing the best one. It isn't. On the software benchmark, the candidate it chooses scores seventy-eight. The average candidate scores seventy-nine. So picking at random among the three would, on average, have done marginally better than the agent's actual choice.

21:29Cassidy: So what's the selection actually buying you, if not the best candidate?

21:33Tyler: It avoids the worst one. The analogy I'd reach for is an editor who isn't great at spotting your single best sentence, but is excellent at catching the sentence that would embarrass you. That editor still improves your writing — by flooring the downside, not by maximizing the upside. RHO's self-judge is that editor. It's not unfailingly grabbing the champion; it's reliably ducking the disaster.

22:00Cassidy: And that's a real limitation, not a cosmetic one, because it tells you the ceiling. If the whole method's value is "avoid the worst," then as harness edits get subtler — as the candidates get closer together in quality — the agent's comparative judgment gets less reliable exactly when you need it most. The signal that powers the loop is the agent's own preference, and that preference is noisy.

22:28Tyler: And it opens a genuinely uncomfortable failure mode, which the authors flag but don't fully close. Since the signal is the agent's own judgment, what happens if the agent learns to satisfy its own preference rather than the true objective? On the command-line benchmark they note that a sufficiently adversarial agent could, in principle, read the grader and game the reward. Their defense is essentially "we use a held-out pool, and we haven't observed it happening." Which is an empirical reassurance — not a guarantee.

23:02Cassidy: It's the deep tension in all of self-supervised improvement, isn't it. Using model-generated preferences instead of ground truth is powerful precisely because it's cheap. And it's risky for exactly the same reason. There's no external check keeping the agent honest about what "better" means.

23:21Tyler: There are a couple of smaller things worth flagging for honesty, too. On GAIA-2, they changed the evaluation environment a little: they raised the cap on how many times the agent could message the user, from one to four. They justify it on the grounds that the original cap penalized agents that were correct but verbose. But it's a change that sits between the vanilla setup and the optimized one, so that plus-eight should be read with an asterisk.

23:49Cassidy: And there's the random-beats-vanilla point from the ablation, which cuts slightly against the framing too. Random coreset selection already gets you to sixty-four, versus the fifty-nine baseline. So a meaningful slice of the gain comes simply from re-solving past failures and proposing edits at all — before any of the clever selection machinery kicks in. Now, the balanced selection clearly adds a lot on top, sixty-four up to seventy-eight. But it means the basic retrospective loop is doing more of the work than the headline emphasis on the sophisticated parts might suggest.

24:27Tyler: And then the plumbing caveats. It's a single round, a single backbone model, and the held-out test pools are small — a hundred tasks on two benchmarks, fifty-nine on the third. They report point estimates from recorded runs, but the whole pipeline is stochastic — parallel sampling, self-judgment — and they don't characterize how much the entire process varies run to run. So we don't really know how stable that seventy-eight is if you ran the whole thing again from scratch.

24:59Cassidy: All fair. And to be clear about scope, the authors are candid about the biggest structural constraint themselves. The group-rollout step replays each task three times — which means RHO fundamentally assumes environments that reset cleanly and tolerate repeated attempts. Software tasks, command-line tasks, sandboxed environments — those re-run beautifully. But a one-shot task, or anything irreversible, is explicitly outside the scope. You can't re-solve sending an email three times.

25:32Tyler: Right, and that's not a small fence. A lot of the highest-stakes agent work in the real world is exactly the irreversible kind. The clean-reset assumption is what makes the whole comparative-rerun engine possible, and it's also what bounds where you can use it.

25:50Cassidy: They also note the ethics dimension honestly — optimizing from model-generated judgments can amplify mistaken preferences or unsafe procedures if the evaluator happens to prefer them. So they recommend audit logs, human approval for sensitive edits, domain-specific safety checks before you let an accepted harness loose on anything high-impact. Which feels right, given that the whole method is the agent deciding what "better" means.

26:20Tyler: So where does that leave us. Cassidy, what's your read on what actually survives here, once you strip away the best-case number?

26:29Cassidy: What survives is the reframing, and I think it's genuinely durable. The old menu for improving a deployed agent had two options. Retrain the model — slow, expensive, centralized. Or run a label-hungry optimization loop against a validation set you usually don't have once the agent is actually live. And the unspoken assumption was that label-free self-improvement meant accumulating memory — storing notes, leaving the executable machinery alone. This paper changes the menu. It says: an agent can rewrite its full harness, tools and workflows included, from nothing but unlabeled experience — and that richer edit surface, not just remembered facts, is where the big gains live. And it makes a credible case that comparative self-judgment over re-runs is a good-enough substitute for an answer key, at least on tasks you can cleanly re-run.

27:24Tyler: And the behavior shift backs that up in a way I found persuasive — it's not just that the agent got luckier. After optimization it changes how it works. On the software benchmark it verifies its own work sixty-one percent more often. On the other two it shifts toward actually executing the new tools it wrote itself. It sustains longer working sessions, and the gains concentrate on the long-horizon tasks — the hard ones where there's the most room to improve. It didn't get a lucky roll. It changed its process.

27:58Cassidy: Which is the picture that stays with me. If this holds up — and that's a real if, given it's one round on one strong benchmark — it points toward agents that quietly tune their own scaffolding overnight, off the day's logs. No retraining, no labeling team, no human grading every outcome. The agent reads back over where it stumbled, writes itself better tools, and runs an internal contest to make sure the new tools actually beat the old ones before it keeps them.

28:28Tyler: With the honest asterisk that it's keeping them based on its own taste — and its taste is good at avoiding catastrophes, not at picking champions. That's the part I'd want to watch as people push this further. The loop is only ever as trustworthy as the judge at the center of it, and the judge is the same system being judged.

28:50Cassidy: A genuinely good place to leave the tension, because the paper doesn't pretend to resolve it either. That's the most interesting thing about the work — not that it claims self-judgment is solved, but that it shows how far an imperfect self-judge can actually carry you when you build the loop carefully around its weaknesses.

29:12Tyler: That's the paper — "Retrospective Harness Optimization." A label-free way to let an agent rewrite its own toolkit from its own past failures, anchored by that jump from roughly sixty to nearly eighty percent on hard software tasks, with the honest caveat that the headline is the best case and the self-judge is imperfect.

29:34Cassidy: The show notes have a link to the paper and a few related reads if this episode caught you — the memory-based methods it builds against, and the self-consistency lineage behind the diagnosis trick.

29:47Tyler: And if you want to keep pulling on this, paperdive dot AI has the full transcript with every term defined inline, plus the concept pages that link this episode to the others we've done on agents and self-improvement.

30:02Cassidy: Thanks for spending the time with us. This has been AI Papers: A Deep Dive.