When the Model Is Fine and the Plumbing Is Broken: Fixing Agents at the Interface

0:00Eric: Here's a number that should be a little embarrassing for the field. A four-billion-parameter language model — Qwen3-point-five-4B — scores seventy-four percent on HMMT February, which is a real, hard math olympiad. The same model, asked to wander around a virtual house, find an apple, and microwave it — a benchmark called ALFWorld — scores forty-three percent. Olympiad math, fine. Heat up the apple, half the time it can't.

0:28Cassidy: And the natural reaction is, well, the model just isn't smart enough for embodied tasks. The paper we're digging into today says no, that's not it at all — the model is fine, what's broken is everything between the model and the world. The paper is "Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents," from Tianshi Xu, Huifeng Wen, and Meng Li at Peking University. It went up on arXiv on May twenty-first, twenty-twenty-six, and we're recording on May twenty-second, twenty-twenty-six. What you're hearing is AI-generated — the script is from Anthropic's Claude Opus 4.7, and Eric and I are both AI voices from Eleven Labs. Neither company is involved in producing the show. And the reason the apple gap matters is that it points to where the authors think agent intelligence actually lives — and it isn't where most of the field has been looking.

1:28Eric: Right. Let's make that gap concrete, because it's the whole motivating puzzle. When the model fails at the microwave-an-apple task, what does the failure actually look like? It's not "the model didn't know what a microwave is." It's things like — the model writes the words "I'll now take the apple" as plain prose, when the runtime needed a structured tool call, so nothing executes. Or it calls a search function with a budget constraint baked into the search string itself, which the search function doesn't parse. Or — and this one is the killer — the same invalid command, four times in a row, until it runs out of steps. None of those are reasoning failures. The model knows what it's trying to do. The interface between the model and the world is where it falls apart.

2:20Cassidy: This is the conceptual reframe the paper opens with. We talk casually about "an LLM agent" as if it's just the model. But operationally, an agent is three things. There's the model. There's the environment — the database, the household simulator, whatever. And there's a layer of plumbing between them. The plumbing tells the model what tools exist, parses what the model emits, validates it, executes it, and feeds results back. That plumbing layer is what the authors call the harness. Most of the agent improvement work in the last couple of years has gone into the model — fine-tuning, reinforcement learning, distillation — on the assumption that capability is what's missing. The paper's claim is that for a huge class of environments, capability isn't missing. The interface is the bottleneck. And if that's true, you should be working on the interface.

3:18Eric: Before they propose anything, they do something I really like — they do the empirical homework. They run a baseline agent on training tasks, watch it fail, and they sit down and classify every failure by hand. Failures sort into four categories. One: the model wrote something the environment can't execute — wrong format, prose instead of a tool call. Two: the action is executable but it violates the tool's rules — say, you tried to cancel a flight that already departed, and the cancellation tool only works on flights that haven't taken off. Three: each individual action is fine but the agent is stuck — looping, repeating, going nowhere. Four: an actual reasoning error, where the model genuinely doesn't know what to do.

4:12Cassidy: And there's a subtle methodological move in how they classify, which is — they check in priority order. Action realization first, then contract, then trajectory, then reasoning. Because the symptoms cascade. An agent that writes prose instead of a tool call will eventually start looping when nothing executes. If you classify that as "trajectory degeneration," you've completely missed the bug. The loop is the symptom, the malformed output is the cause.
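
The priority-order check described here can be sketched in a few lines. This is a minimal illustration, not the authors' classification procedure (which the episode says was done by hand); the `Step` record, its field names, and the `repeat_threshold` value are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class FailureMode(Enum):
    ACTION_REALIZATION = 1       # output could not be executed at all
    CONTRACT_VIOLATION = 2       # executable, but broke a tool rule
    TRAJECTORY_DEGENERATION = 3  # valid steps, but looping without progress
    REASONING_ERROR = 4          # nothing above explains the failure


@dataclass
class Step:
    # Hypothetical per-step record; field names are illustrative only.
    action: str
    parsed_ok: bool    # did the runtime recognise a tool call?
    contract_ok: bool  # did the call satisfy the tool's rules?


def classify(steps: List[Step], repeat_threshold: int = 3) -> FailureMode:
    """Classify a failed trajectory, checking causes before symptoms."""
    if any(not s.parsed_ok for s in steps):
        return FailureMode.ACTION_REALIZATION
    if any(not s.contract_ok for s in steps):
        return FailureMode.CONTRACT_VIOLATION
    # Only now look for the same action repeated back to back.
    run = 1
    for prev, cur in zip(steps, steps[1:]):
        run = run + 1 if cur.action == prev.action else 1
        if run >= repeat_threshold:
            return FailureMode.TRAJECTORY_DEGENERATION
    return FailureMode.REASONING_ERROR
```

With this ordering, a trajectory that repeats the same unparseable prose four times is labelled as an action-realization failure rather than a loop, which is the point of checking in priority order.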

4:45Eric: And the headline finding from the diagnostic is — different environments are dominated by different failure modes. ALFWorld is mostly category three, trajectory degeneration. The agent keeps looping. The OS tasks — running shell commands — are heavy on category two, contract violations. There's no single root cause. Which is why the paper doesn't propose a single fix.

5:13Cassidy: The architectural choice they make from there is what I'd call structural discipline. Each failure mode maps to a specific moment in the agent's lifecycle — before it acts, while it's choosing, when its output meets the world, and after the action has landed. So they design a harness with four layers, each one catching one failure mode at the earliest point it can be caught.

5:41Eric: Let's pull out the two most viscerally graspable. The Action Realization Layer — think of it like the form-validation layer on a website. You fill out a contact form, you forget the area code on the phone number, and a well-designed form catches it before submission. Fixes it for you, or says "did you mean...", or just blocks the submit. The action realization layer does this for the model's tool calls. The model emits something. Before that something hits the environment, the harness inspects it: is this a valid tool call? Is the syntax right? Are the arguments well-formed? If yes, pass through. If it's close but malformed, canonicalize — fix it. If it's broken, block it and tell the model what went wrong, in structured terms — rather than letting the action fail downstream and produce some opaque error the model has to puzzle through.
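
The pass-through, canonicalize, or block behavior can be sketched as follows. This is an assumption-laden illustration rather than the paper's code: the JSON call format, the `TOOLS` schema, the `Verdict` type, and the feedback strings are all hypothetical.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical tool schema: tool name -> required argument names.
TOOLS = {"take": ["object", "source"], "heat": ["object", "appliance"]}


@dataclass
class Verdict:
    status: str                      # "pass", "canonicalized", or "blocked"
    call: Optional[dict] = None      # the tool call to execute, if any
    feedback: Optional[str] = None   # structured message for the model


def realize(raw: str) -> Verdict:
    """Inspect a model output before it reaches the environment."""
    text = raw.strip()
    status = "pass"
    # Canonicalize: pull a JSON object out of surrounding prose.
    if not (text.startswith("{") and text.endswith("}")):
        match = re.search(r"\{.*\}", text, flags=re.DOTALL)
        if match is None:
            return Verdict("blocked", feedback="no_tool_call: emit a JSON object with 'tool' and 'args'")
        text, status = match.group(0), "canonicalized"
    try:
        call = json.loads(text)
    except json.JSONDecodeError as err:
        return Verdict("blocked", feedback=f"malformed_json: {err.msg}")
    tool, args = call.get("tool"), call.get("args")
    if not isinstance(tool, str) or tool not in TOOLS:
        return Verdict("blocked", feedback=f"unknown_tool: expected one of {sorted(TOOLS)}")
    if not isinstance(args, dict):
        return Verdict("blocked", feedback="bad_args: 'args' must be a JSON object")
    missing = [name for name in TOOLS[tool] if name not in args]
    if missing:
        return Verdict("blocked", feedback=f"missing_args: {tool} requires {missing}")
    return Verdict(status, call=call)
```

A plain-prose output such as "I'll now take the apple" comes back as `blocked` with a short reason the model can act on, instead of silently executing nothing.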

6:38Cassidy: The second one is the Trajectory Regulation Layer, which is the GPS that notices you're driving in circles. A bad GPS will keep telling you to turn left every thirty seconds while you go around the same block four times. A good one says — hey, something is wrong with this route, let's reconsider — and forces a recalculation. The trajectory regulation layer watches the agent's recent actions, notices when it's repeating itself or burning steps without progress, and injects a recovery prompt. This is the layer that, when you remove it from ALFWorld, performance collapses — the ALFWorld score drops by eighty-six percent relative to the full harness. Almost back to baseline. The household tasks are loop-prone, and that layer is what was keeping the agent from drowning in its own repetitions.
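
A loop detector of the kind described can be sketched like this. The window size, the repeat threshold, the `made_progress` signal, and the wording of the recovery prompt are assumptions for illustration; the episode does not give the paper's actual rules.

```python
from collections import deque
from typing import Optional


class TrajectoryRegulator:
    """Watches recent actions and flags repetition without progress."""

    def __init__(self, window: int = 6, max_repeats: int = 3):
        self.recent = deque(maxlen=window)  # sliding window of stalled actions
        self.max_repeats = max_repeats

    def observe(self, action: str, made_progress: bool) -> Optional[str]:
        """Record one step; return a recovery prompt if the agent looks stuck."""
        if made_progress:
            # Progress resets the window, so earlier repeats are forgiven.
            self.recent.clear()
            return None
        self.recent.append(action)
        if self.recent.count(action) >= self.max_repeats:
            self.recent.clear()  # start fresh after intervening
            return (
                f"You have tried '{action}' {self.max_repeats} times without progress. "
                "Choose a different action or re-read the task."
            )
        return None
```

The returned string would be injected into the next model turn. Unlike a prompt rewrite, this depends on state kept across turns, which is why it lives in the harness.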

7:33Eric: And there are two more layers we'll just gesture at. An Environment Contract Layer that rewrites tool descriptions to spell out the edge cases upfront — the cancellation-only-works-on-undeparted-flights kind of clarification. And a Procedural Skill Layer that retrieves relevant notes distilled from training trajectories when the agent starts a new task. Both are closer in spirit to careful prompt engineering. The action and trajectory layers are where the harness is doing things prompt engineering can't do — intercepting outputs, monitoring state across turns.
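
The contract-clarification idea can be sketched as a function that appends known edge cases to tool descriptions before the model sees them. The tool names, the spec layout, and the `CONTRACT_NOTES` table are hypothetical; only the two example constraints come from the episode.

```python
from copy import deepcopy
from typing import Dict, List

# Hypothetical edge-case notes, keyed by tool name.
CONTRACT_NOTES = {
    "cancel_flight": ["Only works for flights that have not yet departed."],
    "search_flights": ["Takes only origin, destination, and date."],
}


def clarify_tools(tools: List[dict], notes: Dict[str, List[str]]) -> List[dict]:
    """Return copies of the tool specs with edge cases appended to each description."""
    clarified = []
    for tool in tools:
        spec = deepcopy(tool)  # never mutate the caller's specs
        extra = notes.get(spec["name"], [])
        if extra:
            spec["description"] = spec["description"].rstrip() + " Constraints: " + " ".join(extra)
        clarified.append(spec)
    return clarified
```

Tools without notes pass through unchanged, so the layer only adds text where a failure pattern has actually been observed.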

8:11Cassidy: Now here is where the paper gets interesting in a different way. They don't hand-write the harness. They evolve it. They take a small base model — Qwen3-4B — run it on training tasks, collect every failed trajectory, and hand those failures to a coding agent. Codex. The coding agent reads the failure transcripts, identifies patterns, and writes code that gets inserted into the appropriate layer of the harness. Then they re-run, collect new failures, generate new patches. Repeat until performance stops improving — which, in their setup, happens within about five rounds.

8:49Eric: And one detail I want to flag, Cassidy, because it matters for what comes next. The evolution is constrained by the four-layer structure. The coding agent isn't free to write arbitrary code or restructure the whole system — it's told, these failures look like action-realization issues, propose a patch for the action-realization layer. That constraint matters. It prevents the kind of unbounded drift you'd get if you let a coding agent rewrite anything it wanted. The four-layer taxonomy is doing real load-bearing work — it's the scaffolding that keeps the evolution process from going sideways.
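
The constrained evolution loop can be sketched as below. Everything here is a stand-in: `run_tasks`, `diagnose`, and `propose_patch` are hypothetical callables (the last one playing the role of the coding agent), patches are represented as plain strings, and the stopping rule simply follows the "repeat until performance stops improving" description with a cap of five rounds.

```python
from typing import Callable, Dict, List, Tuple

LAYERS = ("environment_contract", "procedural_skill",
          "action_realization", "trajectory_regulation")

Harness = Dict[str, List[str]]  # layer name -> list of patches


def evolve(
    harness: Harness,
    run_tasks: Callable[[Harness], Tuple[float, List[str]]],
    diagnose: Callable[[str], str],
    propose_patch: Callable[[str, List[str]], str],
    max_rounds: int = 5,
) -> Tuple[Harness, float]:
    """Patch the harness layer by layer until the training score stops improving."""
    best_score, failures = run_tasks(harness)
    for _ in range(max_rounds):
        if not failures:
            break
        # Group failures by the layer they are diagnosed as belonging to.
        by_layer: Dict[str, List[str]] = {}
        for failure in failures:
            layer = diagnose(failure)
            if layer not in LAYERS:
                raise ValueError(f"patches must target a known layer, got {layer!r}")
            by_layer.setdefault(layer, []).append(failure)
        # Build a candidate harness; the original is left untouched.
        candidate = {layer: list(harness.get(layer, [])) for layer in LAYERS}
        for layer, layer_failures in by_layer.items():
            candidate[layer].append(propose_patch(layer, layer_failures))
        score, new_failures = run_tasks(candidate)
        if score <= best_score:
            break  # no improvement: keep the previous harness
        harness, best_score, failures = candidate, score, new_failures
    return harness, best_score
```

The `LAYERS` check is where the constraint bites: a patch that does not name one of the four layers is rejected rather than applied somewhere arbitrary.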

9:28Cassidy: And evaluation is held out throughout: they never touch the evaluation tasks during evolution. The harness is built using one small model's failures on a training set, and it isn't run against any other model or task set during construction.

9:43Eric: Which sets up the moment in the paper that I think is the real climax. They freeze that harness. They then take it and apply it, unchanged, to seventeen other models. Models from seven billion to seventy billion parameters. Instruction-tuned models, reasoning-specialized models, tool-use-specialized models. Across seven environments. They run the agent with and without the harness.

10:08Cassidy: And the harness improves a hundred and sixteen out of a hundred and twenty-six model-environment combinations. Ninety-two percent. Average relative improvement of almost ninety percent — call it close to a doubling on average.

10:23Eric: The headline single-pair example — Llama-3.1-8B on ALFWorld goes from five-and-a-half percent to eighty-two-point-six percent. The harness was never tuned for Llama. It was tuned on a different model family's failures, on a small four-billion-parameter version, and dropped onto an eight-billion Llama, and the score went up by a factor of fifteen.
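
The figures quoted in the last few turns can be checked with a little arithmetic. The numbers are the ones stated in the episode; reading 126 pairs as 18 models times 7 environments (the base model plus the 17 transfer models) is an inference, not something the episode states outright.

```python
# Figures quoted in the episode.
improved_pairs = 116
total_pairs = 126
environments = 7

# 126 pairs over 7 environments implies 18 models per environment,
# consistent with the base model plus the 17 transfer models (an inference).
models = total_pairs // environments

share_improved = improved_pairs / total_pairs

before, after = 5.5, 82.6                   # Llama-3.1-8B on ALFWorld, in percent
fold_change = after / before                # how many times larger the score is
relative_gain = (after - before) / before   # improvement as a fraction of the start

print(f"models per environment: {models}")
print(f"share improved: {share_improved:.1%}")
print(f"fold change: {fold_change:.1f}x")
print(f"relative gain: {relative_gain:.0%}")
```

"A factor of fifteen" and a relative gain of roughly fourteen hundred percent are the same result stated two ways, since the relative gain is the fold change minus one. The average relative improvement across all pairs is a separate statistic and cannot be derived from this single pair.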

10:46Cassidy: And one detail worth mentioning on how these gains are measured. Some of the benchmarks here — tau-bench and tau2-bench in particular — report scores under what's called pass-cubed, where a task only counts as solved if the agent succeeds on three independent runs. Others, like AgentBench, report pass-at-one, a single attempt. So the headline numbers are a mix. But for the pass-cubed benchmarks, you're not measuring whether the model can stumble into the answer once. You're measuring whether the agent does the task reliably, three times in a row.
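
The difference between the two metrics can be sketched as follows, using the definition given in the episode: a task counts under pass-cubed only if all three runs succeed. The function names and the toy results are made up for illustration, and the benchmarks' official implementations may estimate the metric differently.

```python
from typing import Dict, List


def pass_at_1(runs: Dict[str, List[bool]]) -> float:
    """Fraction of tasks solved on the first attempt."""
    return sum(results[0] for results in runs.values()) / len(runs)


def pass_hat_k(runs: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks solved on every one of the first k independent runs."""
    for task, results in runs.items():
        if len(results) < k:
            raise ValueError(f"task {task!r} has fewer than {k} runs")
    return sum(all(results[:k]) for results in runs.values()) / len(runs)


# Toy results: task id -> outcome of three independent runs.
toy_runs = {
    "a": [True, True, True],
    "b": [True, False, True],
    "c": [False, True, True],
    "d": [True, True, True],
}
```

On these toy results `pass_at_1` gives 0.75 while `pass_hat_k` with `k=3` gives 0.5: task "b" succeeds on the first try but not reliably, so only the stricter metric penalizes it.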

11:21Eric: Which, on the benchmarks where it applies, is the bar that actually matters if you're putting an agent into production.

11:29Cassidy: That's the moment where the conceptual claim earns its keep. If the harness had captured something model-specific — quirks of how Qwen tokenizes, idioms of Qwen's outputs — it shouldn't transfer. The fact that it does transfer, across model families, across scales, suggests strongly that what the harness learned isn't about the model at all. It learned things about the environment. The fact that the cancellation tool only works on undeparted flights. The fact that ALFWorld actions follow specific syntactic patterns. The fact that certain action sequences indicate the agent is stuck. Those are facts about the world the agent is operating in, not facts about any particular model.

12:14Eric: Here's the cleanest analogy I have for that, Cassidy. Think of a really well-designed cockpit. It flags conflicting inputs, refuses impossible commands, sounds alarms when the pilot is ignoring something important. That cockpit helps a rookie pilot dramatically. It also helps a veteran pilot. It wasn't designed around any particular pilot's brain — it was designed around the plane and the rules of flight. That's what made it transfer. The harness is the cockpit. The environment is the plane.

12:47Cassidy: The operational upshot, before we get to the rest of the empirical work, is a real shift in deployment economics. Right now, the playbook for getting good agent behavior in a domain is — collect data, fine-tune a model, deploy. If your organization upgrades from Llama-3.1 to Llama-3.3, you redo the work. If two teams use different model families, you do it twice. The harness flips that. You adapt the wrapper once per environment. Every model that respects the same protocol can use it. The cost goes from being proportional to models times environments, to being proportional just to environments. For organizations putting agents into production, that is a meaningfully different bill.
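
The scaling argument is simple enough to write down. This sketch counts adaptation jobs only and treats each job as equal cost, which is a simplifying assumption; the function name is hypothetical.

```python
from typing import Tuple


def adaptation_jobs(models: int, environments: int) -> Tuple[int, int]:
    """Return (jobs when adapting each model per environment, jobs when adapting one harness per environment)."""
    per_model_and_env = models * environments  # fine-tune every model for every environment
    per_env_only = environments                # one harness per environment, shared by all models
    return per_model_and_env, per_env_only


# Using the evaluation's scale as an example: 18 models, 7 environments.
fine_tune_jobs, harness_jobs = adaptation_jobs(18, 7)
```

At that scale the count is 126 adaptation jobs against 7, and adding a nineteenth model adds 7 jobs under the first scheme and none under the second.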

13:32Eric: There's a comparison in the paper that drives this home in a different way — one I keep coming back to. There's a model family called xLAM. It's a series of models that Salesforce post-trained specifically for tool-use agent tasks. They took Qwen2.5-32B-Instruct and fine-tuned it extensively for the kind of conversational tool use that tau-bench measures — tau-bench being one of the standard customer-service-style benchmarks, airline booking and retail and so on. The whole point of xLAM is — we trained for this, so we should win.

14:10Cassidy: And on the in-domain benchmark, does xLAM beat the base Qwen?

14:15Eric: It does. xLAM beats base Qwen on tau-bench. Training helped. But now take that base Qwen, the one xLAM started from, and wrap it in Life-Harness, which is what the paper calls its harness. No retraining. The base model plus the harness beats the specifically-trained xLAM by seven-and-a-half points on the same in-domain benchmark.

14:36Cassidy: The base model wearing a good interface beats the model whose weights were specifically trained for the task.

14:44Eric: And then there's the kicker. xLAM, on benchmarks outside its training distribution, actually underperforms its own base model — by six to twenty-seven percentage points depending on the benchmark. The specialized training hurt generalization. The harness does the opposite. It helps the base model on the in-domain benchmark and also helps it on the out-of-domain ones. There's no displacement cost. It's a little like the tutored-student-versus-the-student-with-a-good-calculator picture — except the student with the calculator wins the calculus exam and the algebra exam, because nothing got crowded out.

15:28Cassidy: And there's a comparison to prompt-only methods worth flagging briefly. There are existing approaches that try to improve agents by automatically rewriting the prompt the model sees — GEPA is the most-cited example. Those methods give modest gains. Life-Harness adds well over a hundred percent more relative improvement on top of what prompt-only optimization gets you. Which is the cleanest evidence that the action and trajectory layers — the parts you can't reach by rewriting prompts — are doing real work.

16:03Eric: Now, Cassidy, I think this is where we should be careful, because the paper is doing real rhetorical work with these comparisons and the steelman pushback matters.

16:15Cassidy: It does. I want to voice it honestly because the paper deserves a careful read.

16:20Eric: A few things. First — the benchmarks. Every benchmark in the evaluation suite, tau-bench, tau2-bench, AgentBench, ALFWorld, the OS tasks, the SQL tasks — every one of them is deterministic and rule-governed. Tools have fixed schemas. Errors are reproducible. Feedback is structured. The harness method is best at exploiting exactly that property. So the eighty-eight-point-five percent average relative improvement is averaged over precisely the benchmarks where structure-at-the-interface is most extractable. We don't see what happens when the environment is noisier or more open-ended.

17:02Cassidy: And the authors are explicit about this. They call out, in the limitations, that fully open-ended agent tasks — browsing arbitrary websites with varied goals, that kind of thing — probably won't yield to this approach in the same way. In those environments, you can't define a stable runtime interface, you can't reproduce failures cleanly, and you can't evolve a harness that generalizes. So the scope is bounded. This is a method for the structured part of the agentic landscape — database manipulation, business workflows, OS tasks. Not for the wild-west part.

17:41Eric: Second concern. The harness, when you look at what's actually in it, is doing a lot of task-specific work. The appendix has the full inventory. It's things like — "remind the model that flight search takes only origin, destination, and date" — or specific SQL syntax repairs. Real, useful, reusable across models — but environment-specific, often very environment-specific. So the "model-agnostic" framing is true; the "general-purpose mechanism" framing would be wrong. The harness isn't a generic mechanism. It's a hand-tuned — well, Codex-tuned — crib sheet for each environment. The honest defense is the one the paper actually makes: adaptation happens per environment, and the claim is about reusability across models within an environment. As long as that's the claim, it holds.

18:36Cassidy: Third — and this is the one I think is genuinely under-explored — the Codex-in-the-loop evolution process. The harness emerges from a coding agent inspecting trajectories and writing patches. Different Codex versions, different inspection prompts, different random seeds — these could plausibly produce different harnesses. The paper gives the prompts and the final inventories, but it doesn't really test how robust the evolution process is to its own hyperparameters. If you ran it ten times, would you get ten similar harnesses, or ten different ones? That's not answered.

19:16Eric: And one more, smaller but real. The leave-one-layer-out ablation, the one that shows ALFWorld collapses without the trajectory layer and the Airline domain collapses without the action layer, is only tested on Qwen3-4B, the model the harness was evolved from. Whether each layer is carrying its weight on the transferred models is unverified. Some layers might be doing important work for Llama and be dead weight for the reasoning-specialized models. We don't know.

19:47Cassidy: All of which is to say — the paper's central claim, that you can adapt the interface instead of the model and the result transfers across models within an environment, is well-supported. The bigger claims that hover around it — that this fundamentally reframes where intelligence lives in agentic systems, that training is the wrong paradigm — those are gestured at, and they're interesting, but they're stronger than the evidence strictly warrants.

20:16Eric: Though, Cassidy, I'd want to give one inch back. Even with all those caveats, the fact that a harness evolved from one small model's failures helps a model more than ten times its size, on tasks it never saw, is the kind of result that suggests something real about where the structure is. The paper isn't proving that training is wrong. It's showing that for a meaningful class of environments, structure we've been pushing into model weights actually lives in the environment, and we've been making ourselves do extra work by not noticing.

20:49Cassidy: That's a fair gloss. And one more thing about why this matters operationally. The pattern of the last few years has been: smarter models will fix agent reliability. GPT-4-class reasoning will translate into GPT-4-class tool use. And it sort of didn't. Agents kept failing in mechanical, surprisingly dumb ways even as the underlying models got more capable. This paper is one entry in a broader shift toward taking the system around the model seriously — the prompts, the scaffolds, the verifiers, the harnesses — as a first-class object of design. Not as a workaround we'll someday train away, but as where some of the real engineering actually lives.

21:31Eric: And the specific contribution here, beyond the conceptual reframe, is the lifecycle-structured taxonomy. Other recent harness-optimization work mostly lets a coding agent loose on the harness code and has it search freely. This paper constrains the search by saying: there are four moments in the lifecycle, every patch goes into one of them. That constraint is what makes the harness coherent enough to transfer rather than degenerating into a tangle of overlapping fixes.

22:01Cassidy: One line from the paper I want to land before we close. "An LLM agent is not just an LLM." That sentence is doing more work than it looks like. It's saying — when you build something that interacts with the world, the part that interacts is as much the system as the part that thinks. And if the part that interacts is broken, no amount of making the part that thinks smarter will save you.

22:27Eric: And it suggests where some of the next several years of agent work will probably go. Not exclusively into bigger and better base models. Also into the unglamorous engineering of the interface layer. Form validators. Loop detectors. Contract clarifiers. The stuff that, in a normal software system, you'd consider obvious. Apparently agents need it too.

22:51Cassidy: The show notes have the paper and some related reading on harness optimization and agent scaffolding — worth a pull if this episode caught you. And if you want the full transcript with the jargon defined inline, plus the concept pages tying this episode to the other agent-systems work we've covered, that's all on paperdive dot AI.

23:13Eric: Thanks for listening to AI Papers: A Deep Dive.