Training a Tiny Model to Run the Plumbing Between an Agent and the World

0:00Juniper: Picture a brilliant consultant locked in a room. She's one of the sharpest problem-solvers alive, but she can only talk to you one way — notes slipped under the door. You write down the problem, slide it in, she slides back what to try next, you go do it, and then you tell her what happened. Over and over, for hours. Here's the part nobody thinks about. Somebody is standing at that door deciding which notes actually reach her, and what happens to the notes she sends back out. If that person hands her every scrap of paper from the last six hours, she's buried. If they quietly drop something she'll need later, she's stuck.

0:44Tyler: And in the AI version of this, who is that person?

0:48Juniper: Right now? Hand-written code. A pile of rules some engineer tuned by hand. The paper we're digging into today asks a genuinely odd question: what if you trained a tiny AI to be the person at the door instead? It went up on arXiv on June eleventh, twenty-twenty-six, and we're recording two days later. It's called "HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness," and what you're hearing is an AI-generated show. The script was written by Anthropic's Claude Opus 4.8. I'm Juniper, and Tyler and I are both AI voices from ElevenLabs, produced by a team with no affiliation to Anthropic or ElevenLabs. And that person at the door has a real name in this field. It's called the harness.

1:38Tyler: The harness — also called the scaffold. And it's worth being really precise about what it is, because people conflate it with two other things. When you run an LLM as an agent — say, fixing a bug in a real codebase — the model itself only does one tiny thing. Text in, text out, one turn at a time. Everything else is plain software wrapped around it. Code that formats the task, decides which slice of the growing history to feed back in, parses the model's output into a command, runs it, catches the result, loops.
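
The loop Tyler describes can be sketched in a few lines. This is a minimal illustration, not the paper's code: `run_agent`, `select_history`, the `TASK:`/`ACTION:`/`RESULT:` prompt layout, and the `DONE` stop signal are all hypothetical names chosen for the example, and the model and environment are toy stand-ins.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (action, observation)


def run_agent(
    task: str,
    model: Callable[[str], str],
    execute: Callable[[str], str],
    select_history: Callable[[List[Turn]], List[Turn]],
    max_turns: int = 10,
) -> List[Turn]:
    """Hypothetical harness loop: the model only maps prompt text to output text."""
    history: List[Turn] = []
    for _ in range(max_turns):
        # Format the task plus whichever slice of the history is selected.
        lines = [f"TASK: {task}"]
        for action, observation in select_history(history):
            lines.append(f"ACTION: {action}")
            lines.append(f"RESULT: {observation}")
        prompt = "\n".join(lines)

        # Text in, text out: one model call per turn.
        output = model(prompt).strip()

        # Parse the output into a command; "DONE" is an assumed stop signal.
        if output == "DONE":
            break

        # Run the command, catch the result, and loop.
        history.append((output, execute(output)))
    return history


# Toy stand-ins so the loop runs without a real model or shell.
def toy_model(prompt: str) -> str:
    return "DONE" if "RESULT: ok" in prompt else "run tests"


def toy_execute(command: str) -> str:
    return "ok"


def last_three(history: List[Turn]) -> List[Turn]:
    return history[-3:]


trace = run_agent("fix the bug", toy_model, toy_execute, last_three)
```

Everything except the `model(prompt)` call is ordinary software, which is the point of the distinction: the weights are never touched, only the text that goes in and the handling of the text that comes out.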

2:14Juniper: So is this just prompt engineering with extra steps?

2:18Tyler: No — and that's the distinction that matters. The harness builds the prompts, but it isn't the prompt. And it definitely isn't fine-tuning, because the model's weights never change. It's the plumbing between the model and the world. Everyone who builds agents knows this plumbing makes or breaks a product. Anthropic has whole engineering posts about it. But it's almost always hand-written — heuristics, summarization tricks, retry logic, all tuned by hand for each model.

2:50Juniper: And that's the crack the authors drive a wedge into. Their argument is almost embarrassingly simple. Over the last decade, machine learning got better at basically everything by replacing hand-crafted pieces with learned ones. Hand-designed features got replaced by learned features. Hand-tuned rules got replaced by trained components. So why, they ask, is the harness still hand-engineered when literally everything else got better by being learned?

3:20Tyler: There's a framing I keep coming back to here. Everyone in agent-land argues about two things — is the model smart enough, and is the task too hard. The engine and the road.

3:33Juniper: And what they're naming is a third axis. Not the engine, not the road — the transmission. The thing that decides how the engine's power actually reaches the wheels. A great engine with a bad transmission stalls and wastes fuel. The same engine with a good one is suddenly faster and more efficient at the same time. That's the whole bet of this paper — that the interface between the agent and the world is a real, separate component you can optimize on its own.

4:04Tyler: And the punchy version of the result is exactly that double win. They take a frozen model — never retrained, never even touched at the weights level — and just by changing what flows in and out, they raise the success rate and cut the token cost by up to ninety percent. Same engine. Better transmission.

4:25Juniper: So let's make the learned harness concrete, because it has two jobs, and they map perfectly onto our consultant at the door. One job is incoming — what notes reach her. The other is outgoing — what happens to the notes she sends back. The paper calls them the observation projection and the action projection. I'll take the incoming side first. The incoming side is basically a chief of staff. A good chief of staff doesn't dump every email and meeting transcript on the CEO's desk. They decide: this one lands verbatim, this one gets condensed to a one-paragraph brief, this one gets filed away. And they keep a standing one-page memo at the top of the stack — current priorities, open issues, what's still broken.

5:15Tyler: So when you say "filed away" — the harness is deleting old turns to save space?

5:21Juniper: That's the natural assumption, and it's the one thing I want to head off, because the design hinges on it not being true. Nothing gets deleted. The full raw history is always preserved as the authoritative record. What the harness produces is a view — what the consultant sees this turn — laid over a transcript that's still completely intact underneath.

5:46Tyler: So it's noise-cancelling headphones, not earplugs.

5:50Juniper: That's exactly it. Earplugs destroy the sound. Noise-cancelling just changes what you hear while the full signal keeps playing, and you can dial it turn by turn. That matters because it guards against the two classic ways context compression blows up — the summary hallucinates something that was never there, or you irreversibly throw away a detail you needed forty turns later. If the real record is always sitting underneath, neither failure is fatal.

6:21Tyler: And the actual decision it makes per turn isn't a yes-or-no, right? It's three-way.

6:27Juniper: Three-way, and that's the clever bit. Each past turn gets one of three calls — pass it through exactly as it was, compress it into a short summary that preserves the exact values, or drop it from the view entirely. And why three options instead of just "keep or summarize"? Because turns aren't all the same shape. A test output or an error traceback loses critical detail the moment you paraphrase it — that has to pass verbatim. A long directory listing might have one useful fact buried in pure noise — compress it. And some banner output, or a retry that got superseded, carries nothing — drop it.
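
A rough sketch of the three-way call as a view over an untouched history. The names `Decision`, `Turn`, and `project_view` are hypothetical, the decisions and summaries are supplied by hand here rather than by a trained controller, and defaulting to pass-through for unlabeled turns is an assumption of this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Decision(Enum):
    PASS = "pass"          # show the turn verbatim
    COMPRESS = "compress"  # show a short summary instead
    DROP = "drop"          # leave the turn out of this view


@dataclass(frozen=True)
class Turn:
    index: int
    text: str


def project_view(
    history: List[Turn],
    decisions: Dict[int, Decision],
    summaries: Dict[int, str],
) -> List[str]:
    """Build what the agent sees this turn; the history list is never modified."""
    view: List[str] = []
    for turn in history:
        # Assumption: a turn with no decision passes through verbatim.
        decision = decisions.get(turn.index, Decision.PASS)
        if decision is Decision.PASS:
            view.append(turn.text)
        elif decision is Decision.COMPRESS:
            view.append(f"[turn {turn.index} summary] {summaries[turn.index]}")
        # Decision.DROP: omitted from the view only.
    return view


history = [
    Turn(0, "$ pytest\nFAILED tests/test_parse.py::test_empty - KeyError: 'id'"),
    Turn(1, "$ ls -R\nsrc/\nsrc/parse.py\nsrc/util.py\ndocs/\ndocs/index.md"),
    Turn(2, "$ tool --version\nWelcome to tool 1.0"),
]
decisions = {1: Decision.COMPRESS, 2: Decision.DROP}
summaries = {1: "parser code is in src/parse.py"}
view = project_view(history, decisions, summaries)
```

The test output survives word for word, the listing shrinks to one fact, the banner disappears from the view, and `history` still holds all three turns.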

7:08Tyler: Plus the standing memo.

7:09Juniper: Plus the standing memo — they call it the active-state index. Instead of forcing the agent to reconstruct "where am I, what's broken, what have I already ruled out" from sixty turns of raw logs every single time, the harness just writes that state down and pins it at the top. There's a beautiful case study for this. A django bug fix — django's a popular Python web framework — that ran sixty-seven turns. By turn sixty-seven the raw context was enormous. The projection kept the original task and the recent turns, squashed six middle turns of exploration into three bullet points, dropped two dead-end turns, and pinned the live blocker at the top: "test command failed, the test runner isn't in the current directory." That one line is what the agent actually needed, and it was drowning at the bottom of the log.
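
One way to picture the pinned memo is as a short block rendered ahead of the projected turns. The layout below is an assumption for illustration: `render_prompt` and the `ACTIVE STATE` / `TASK` / `HISTORY VIEW` headings are hypothetical, and the view entries are placeholders rather than the actual django trajectory.

```python
from typing import List


def render_prompt(task: str, active_state: List[str], view: List[str]) -> str:
    """Pin the active-state memo above the task and the projected turns."""
    memo = "\n".join(f"- {item}" for item in active_state)
    turns = "\n\n".join(view)
    return f"ACTIVE STATE\n{memo}\n\nTASK\n{task}\n\nHISTORY VIEW\n{turns}"


prompt = render_prompt(
    task="Fix the reported bug and make the tests pass.",
    active_state=[
        "test command failed: the test runner isn't in the current directory",
    ],
    view=[
        "[summary] three bullet points standing in for six exploration turns",
        "most recent turn, shown verbatim",
    ],
)
```

The agent reads the live blocker first instead of having to dig it out of the bottom of a long log.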

8:05Tyler: Okay, that's the incoming side. Let me take the door going the other way, because honestly this is the half I find more elegant. The outgoing side is a bouncer. The agent proposes an action — run this command — and before it reaches the environment, the harness gets to look at it and either let it through or bounce it back.

8:27Juniper: And a bouncer's whole job is saying no.

8:29Tyler: That's what you'd expect, and it's exactly the trap. The authors built in this hard rule that I think is the single best design idea in the paper. This bouncer can only turn an action away if it can point to the security tape. It has to quote a specific line from the actual trajectory as evidence — "you tried this exact thing twenty turns ago and it failed, here's the footage." No quote, no rejection. If it can't produce the evidence, it must let the action through.

8:59Juniper: No quote, no veto.

9:01Tyler: No quote, no veto. And it's not even really a bouncer in the end, because it doesn't just block you. When it rejects an action it hands back a structured note — here's my concern, here's the evidence, here's a concrete suggestion for what to do instead. It redirects rather than blocks. There's a case study with xarray — that's a scientific data library in Python. The agent had correctly found the buggy code. But then it spent more than ten turns running simulations of the logic in throwaway Python scripts, instead of just testing the actual modified code. The harness rejected the next simulation and said, in effect, "run the real test against the modified code." And the agent's very next action finally produced a real signal about whether the fix worked.
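
The "no quote, no veto" rule can be sketched as a check applied to whatever the controller proposes. This is an illustration under assumptions: the episode does not say how the rule is enforced, so the verbatim-substring check, the `Verdict` fields, and `enforce_quote_rule` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Verdict:
    accept: bool
    concern: str = ""
    evidence: str = ""    # quote that must appear in the trajectory
    suggestion: str = ""  # concrete alternative offered to the agent


def enforce_quote_rule(proposed: Verdict, trajectory: List[str]) -> Verdict:
    """Turn a rejection into a pass unless its evidence is quoted from the trajectory."""
    if proposed.accept:
        return proposed
    quote = proposed.evidence.strip()
    # Assumed check: the quote must be a verbatim substring of some trajectory entry.
    grounded = bool(quote) and any(quote in entry for entry in trajectory)
    if not grounded:
        return Verdict(accept=True)  # no quote, no veto
    return proposed


trajectory = ["$ python simulate_fix.py", "simulated result looks right"]

grounded = enforce_quote_rule(
    Verdict(
        accept=False,
        concern="another simulation instead of a real test",
        evidence="$ python simulate_fix.py",
        suggestion="run the real test against the modified code",
    ),
    trajectory,
)
ungrounded = enforce_quote_rule(
    Verdict(accept=False, concern="looks risky", evidence="this failed before"),
    trajectory,
)
```

A grounded rejection comes back with its concern, evidence, and suggestion intact, so it redirects rather than blocks. An ungrounded one is let through.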

9:49Juniper: What I love about the quote rule is what it's defending against. Without it, you'd get a gatekeeper that just nags — that hallucinates reasons to reject things because rejecting feels productive.

10:02Tyler: And that instinct is exactly backwards, which they prove cleanly. The deployed system has this baked-in philosophy, and I'll paraphrase the actual prompt: the cost of a false reject — a wasted turn, lost momentum — is higher than the cost of letting a questionable command through. Because a bad action at least produces an informative failure. You learn something. A wrongful rejection just burns a turn and breaks the agent's flow.

10:29Juniper: It's the overzealous spam filter problem.

10:32Tyler: It's precisely the spam filter problem. Everyone's lived it. A filter tuned too aggressively starts eating real email, and one lost important message hurts far more than ten spams getting through. Default to pass. Restraint is the hard part of building a gatekeeper.

10:49Juniper: So both of these behaviors, the chief of staff and the bouncer, live in one model. And this is where the numbers start to get a little absurd. It's a single model, eight-tenths of a billion parameters, doing both jobs. The two tasks differ only in the instructions and the output format. And eight-tenths of a billion is tiny. Frontier-scale models run to hundreds of billions of parameters.

11:13Tyler: How does something that small learn to make these calls at all?

11:17Juniper: Instruction tuning, distilled from bigger models. And the intuition for why it works is the important part. A small model can't fix a bug end-to-end — that's open-ended reasoning. But "is this turn worth keeping verbatim?" or "is this action a repeat of something that already failed?" — those are narrow, well-defined judgment calls. And small models can be genuinely excellent at narrow judgment calls if you show them a few thousand clean examples. They ran a preliminary check where the fine-tuned eight-tenths-of-a-billion model performed about as well as using a vanilla thirty-five-billion-parameter model as the harness — at roughly two percent of the per-token cost.

12:00Tyler: Which is the whole economic argument in one line. You spend pennies of tiny-model compute to avoid dollars of frontier-model context bloat.

12:10Juniper: And here's the detail that I think is the soul of the paper. When a model's entire skill comes from imitating examples, the examples are the behavior. So the data curation isn't a footnote — it's the actual engineering. They started with about forty thousand raw candidate examples, generated by running agents with prompted harness interventions, and then filtered them brutally. Only traces from successful task completions survived. A rejection example was only kept if the rejection was grounded in real evidence, the agent's correction afterward actually worked, and the intervention genuinely saved steps. Final dataset — about five thousand four hundred examples.
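
The filtering criteria Juniper lists amount to a conjunction of checks. The sketch below assumes each candidate already carries boolean labels for those checks; the `Candidate` fields and `keep` function are hypothetical names, and how the paper actually computes each label is not covered in the episode.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    task_succeeded: bool             # trace came from a successful task completion
    is_rejection: bool               # example shows the harness rejecting an action
    evidence_grounded: bool = False  # rejection quoted real evidence
    correction_worked: bool = False  # agent's follow-up action actually worked
    saved_steps: bool = False        # intervention genuinely saved steps


def keep(c: Candidate) -> bool:
    """Apply the stated criteria: success always, plus three checks for rejections."""
    if not c.task_succeeded:
        return False
    if c.is_rejection:
        return c.evidence_grounded and c.correction_worked and c.saved_steps
    return True


raw: List[Candidate] = [
    Candidate(task_succeeded=True, is_rejection=False),
    Candidate(task_succeeded=False, is_rejection=False),
    Candidate(True, True, evidence_grounded=True, correction_worked=True, saved_steps=True),
    Candidate(True, True, evidence_grounded=True, correction_worked=True, saved_steps=False),
    Candidate(True, True, evidence_grounded=False, correction_worked=True, saved_steps=True),
]
dataset = [c for c in raw if keep(c)]
```

Only two of the five toy candidates survive, which mirrors the shape of the real funnel from roughly forty thousand raw examples down to about five thousand four hundred.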

12:54Tyler: And there's a horror story buried in that filtering that I think is the most instructive experiment in the whole paper.

13:01Juniper: Tell it, because this is where the spam-filter analogy pays off.

13:06Tyler: So they rebuilt the training pipeline once with a different source model and a sloppier recipe — they called it "turn-agnostic." They kept all the correct trajectories, including ones where the harness intervention actually added turns. Just noise where the meddling didn't help. And the controller they trained on that diet learned to reject thirty-seven percent of all actions. Trigger-happy. And it dragged success below having no harness at all — from about fifty-seven percent down to forty-eight.

13:38Juniper: Worse than nothing.

13:40Tyler: Worse than the empty room. And the fix wasn't a new architecture. They switched to a "turn-saving" recipe — only keep examples where the intervention demonstrably saved tokens or turns — and the rejection rate roughly halved and the result flipped to a gain. Train your filter on examples of unhelpful meddling, you get a filter that meddles constantly. The behavior comes from the diet, not the design.

14:07Juniper: So let's get to whether this actually pays off, because the framing promises a lot. Tyler, you've spent the most time in the results tables — what's the cleanest number?

14:18Tyler: The cleanest number in the paper, and it's not even close, is what they call the gained-tasks contrast. Take the tasks where the harness succeeds and the bare baseline fails. On those, the agent with the harness gets to the answer in eighteen turns. The baseline, on the same tasks, wanders for fifty-two before failing. So the harness run takes about a third of the turns. And it uses eleven percent of the token budget, an eighty-nine percent cut.

14:45Juniper: Say what that means, though, because it's a real reframe.

14:50Tyler: It means the baseline isn't failing because the model can't solve the task. The exact same frozen model solves it fine with a better interface. It was failing because it was drowning in its own unproductive exploration. Fifty-two turns of stale errors and abandoned hypotheses, and the signal it needed was buried somewhere in the middle. That recasts a whole category of what we call "capability failures" as interface failures. The model wasn't dumber. It was suffocating.

15:22Juniper: And that's the third axis made visible. Same engine, two different transmissions — one stalls, one doesn't.

15:29Tyler: The transfer numbers are the other headline, and here you have to understand the setup or it sounds routine. The training data came only from SWE-bench — real GitHub bug reports where the agent has to produce a fix that passes the project's actual tests — and using only one open-source model as the agent. So every result on the other benchmark, Terminal-Bench, which is hard command-line tasks, is an out-of-domain test. And every result on a commercial model is out-of-family — the harness never saw that model in training.

16:04Juniper: This is the part I'd flag for anyone half-listening — it's not ordinary cross-model evaluation, where you train and test on the same kind of thing. They trained the interface on one open model's logs and then bolted it onto GPT, onto Claude, onto DeepSeek, untouched.

16:22Tyler: And the most dramatic transfer was on a small, wasteful GPT model. Success went up from eighteen percent to twenty-two and a half, a twenty-five percent relative gain, while tokens dropped from about nine-point-eight million to under one million. Roughly a ninety percent cut, from a controller that had never seen a single GPT trajectory. It can transfer like that because it only ever manipulates the conversation, never the brain. And you can't retrain GPT or Claude anyway. They're API products. An improvement that lives entirely in the text flowing in and out is one almost anyone can actually use.
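
The quoted figures can be checked with a few lines of arithmetic. The helper names are hypothetical. Because the token count is only given as "under one million", the cut computed from exactly one million is a lower bound.

```python
def relative_gain(before: float, after: float) -> float:
    """Fractional improvement relative to the starting value."""
    return (after - before) / before


def reduction(before: float, after: float) -> float:
    """Fraction of the original quantity that was removed."""
    return 1 - after / before


# Success rate on the small GPT model: 18% to 22.5%.
success_gain = relative_gain(18.0, 22.5)

# Tokens: about 9.8 million down to under 1 million (1.0e6 used as the ceiling).
token_cut_lower_bound = reduction(9.8e6, 1.0e6)

# Gained-tasks contrast mentioned earlier: 18 turns versus 52.
turn_ratio = 18 / 52

print(f"relative success gain: {success_gain:.0%}")
print(f"token cut: at least {token_cut_lower_bound:.1%}")
print(f"turns used versus baseline: {turn_ratio:.0%}")
```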

17:06Juniper: There's one more result I find quietly delightful, which is what the thing learned to compress without being told. They broke it down by category. Pure reasoning turns — the model thinking out loud — it compressed about thirty-nine percent of the time. File navigation, reading directories, around a quarter. But test execution output? It compressed that only three percent of the time.

17:34Tyler: Nobody told it tests were sacred.

17:36Juniper: Nobody told it. It figured out on its own that test output is the decisive verification signal — the thing later turns actually depend on — so it almost never touches it. And the categories it compresses most are the same ones it most often distills up into that pinned memo. It's not discarding. It's filing.

17:59Tyler: Okay. I want to push on the framing now, because the abstract is sunnier than the tables.

18:06Juniper: Go ahead — this is the part you've been itching to get to.

18:10Tyler: The headline phrase is "matches or surpasses" the hand-built harnesses. But look at SWE-bench Verified — the in-domain benchmark, the one the controller actually trained on. With the open-source generator, HarnessBridge scores slightly below the official reference scaffold. Sixty-point-two versus sixty-one-point-six. The success-rate wins basically all live on the other benchmark, the out-of-domain one.

18:39Juniper: So the honest one-liner is —

18:41Tyler: Same success, far cheaper, sometimes better. The token savings are robust and real across the board. The accuracy gains are situational. And "matches or surpasses" is doing a little quiet lifting over a result that, in-domain, is matches-to-slightly-down.

19:00Juniper: That's fair, and to their credit, the numbers are all right there in the table. What about the single-run issue?

19:08Tyler: They concede it themselves, and it deserves weight. Every result is a single run, because the evaluations are expensive. But agent benchmarks are high-variance. Some of the smaller deltas — a model going from sixty-four to sixty-five — are well inside plausible run-to-run noise. Now, the big effects almost certainly survive. A ninety percent token cut, eighteen turns versus fifty-two — you don't get those from noise. But the small success-rate bumps? I wouldn't bank on them.

19:42Juniper: And there's the token-accounting wrinkle.

19:45Tyler: Yeah, the ninety percent number counts the generator's input tokens. But the harness itself isn't free — it processes roughly three times as many tokens as the generator does. Their defense is a compute-weighted argument: the harness is about forty-four times cheaper per token, so the net overhead is around seven percent, and the system is still cheaper end to end. I buy that. But I'd want the headline tables to report total system cost, not generator-only cost. Under full accounting the savings shrink some, and that math is tucked in an appendix.
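
The compute-weighted argument reduces to one division. The sketch below rests on stated assumptions: the "three times as many tokens" is taken relative to the generator's tokens when running with the harness, the per-token price ratio is the only cost weight, and the ten-percent generator figure is borrowed from the best-case token cut quoted earlier purely for illustration. The function names are hypothetical.

```python
def harness_overhead(token_multiple: float, price_ratio: float) -> float:
    """Harness cost as a fraction of generator cost.

    token_multiple: harness tokens per generator token (about 3 in the episode).
    price_ratio: how many times cheaper the harness is per token (about 44).
    """
    return token_multiple / price_ratio


def system_cost_vs_baseline(generator_ratio: float, overhead: float) -> float:
    """Total cost (generator plus harness) relative to the no-harness baseline."""
    return generator_ratio * (1 + overhead)


overhead = harness_overhead(3, 44)

# Illustrative only: pair the overhead with a generator that uses 10% of baseline tokens.
total = system_cost_vs_baseline(0.10, overhead)

print(f"harness overhead: {overhead:.1%} of generator cost")
print(f"system cost versus baseline: {total:.1%}")
```

Under these assumptions the overhead is a little under seven percent, and a ninety percent generator-only cut becomes a cut of roughly eighty-nine percent once the harness is counted. That is smaller, as Tyler says, but the system remains much cheaper end to end.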

20:25Juniper: Is there a piece of this you genuinely can't put to rest?

20:30Tyler: There is, and it's the one the paper half-admits itself. The gains track baseline wastefulness. The improvement is huge when the baseline is sloppy and token-heavy, and it nearly vanishes when the baseline is already lean — on one efficient model they got the same success rate and only an eight percent token cut. Which makes me wonder how much of this method is a genuine new axis of optimization, and how much of it is correcting for harnesses and models that haven't gotten efficient yet. As frontier models get better at managing their own long contexts, does the headroom for this just quietly shrink?

21:08Juniper: I think that's the real open question, and I don't think the paper closes it — and I'm not sure it can with coding agents and single runs. But there's a counter-view worth holding alongside it, and I'll mark this as my own read rather than something they prove. There may always be value in a cheap outer loop the expensive inner model can't run on itself. You can't notice you're drowning in your own context from inside it. The thing at the door has a vantage point the consultant in the room structurally does not.

21:41Tyler: That's a fair steelman. I'll grant the vantage-point argument is real. I'm still not convinced the size of the prize survives the next generation of models. The mechanism is sound — the magnitude is the part I'd hold loosely.

21:55Juniper: And that's an honest place to leave it. What I don't want to lose, though, is the conceptual gift here, separate from any single number. They've named a third thing. For years the conversation has been: better models, better environments. This says there's an interface in between, it's independently optimizable, and a surprising amount of the artisanal harness craft that teams currently hand-tune per model can be learned from a few thousand examples and carried across models — closer to a reusable part than a bespoke installation.

22:28Tyler: And the scope caveat is right there in the paper, to be clear — this is coding agents, single runs, one generation of models. They argue the mechanism is domain-agnostic since it operates on generic tool-use trajectories, and they expect it to transfer to web navigation and computer use. But they flag, plainly, that that's unvalidated. The transferable-interface idea is promising and young.

22:54Juniper: Promising and young is the right note. If it holds up, the thing I'll remember is that reframe of failure — that the same frozen model fails in fifty-two wandering turns under one interface and succeeds in eighteen under another. The bottleneck wasn't the brain in the room. It was the slot under the door.

23:14Tyler: And that the hardest part of building the gatekeeper was teaching it when to do nothing.

23:20Juniper: The paper's linked in the show notes, along with some related reading if you want to go deeper on agent harnesses and context engineering.

23:29Tyler: And if you want the full transcript with every term tappable for a definition, plus the concept pages that link this over to other episodes we've done, that all lives on paperdive dot AI.

23:41Juniper: This has been AI Papers: A Deep Dive. Thanks for spending the time with us.