Terminal Agents Get Free Supervision From The Tokens We've Been Throwing Away

0:00Juniper: Here's the situation. You're training a language model to use a terminal. You hand it a Docker container, you give it a task — fix the failing test, set up the build pipeline. The model types a bash command. The terminal runs it. Something comes back — a stack trace, a directory listing, an exit code. The model types another command. This goes on for up to sixteen turns. At the end, a unit test either passes or fails, and you give the model a reward of one or zero. And in the setup we're going to talk about today, fewer than fifteen percent of those rollouts ever solve the task. The other eighty-five percent come back with reward zero. Under standard agent reinforcement learning, those are thrown-away runs. You ran the container, you generated thousands of tokens, you watched the model wrestle with real error messages — and from a training standpoint, none of it happened.
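
The setup Juniper describes can be sketched as a short rollout loop. This is a toy illustration only: `ToyEnv`, `run_episode`, and the stand-in policies are hypothetical names, and a real harness would drive a Docker container and an actual unit test rather than a single string comparison.

```python
MAX_TURNS = 16  # turn cap mentioned in the episode


class ToyEnv:
    """Hypothetical stand-in for the Docker sandbox; solved by `touch done`."""

    def __init__(self):
        self.solved = False

    def reset(self):
        self.solved = False
        return "task: create a file named done"

    def step(self, command):
        # Returns (terminal_output, episode_finished).
        if command == "touch done":
            self.solved = True
            return "", True
        return f"bash: {command}: command not found", False

    def tests_pass(self):
        return self.solved


def run_episode(policy, env, max_turns=MAX_TURNS):
    """Roll out one episode; the reward is binary and arrives only at the end."""
    transcript = []
    observation = env.reset()
    for _ in range(max_turns):
        command = policy(observation)
        observation, done = env.step(command)
        transcript.append(("action", command))
        transcript.append(("output", observation))
        if done:
            break
    reward = 1.0 if env.tests_pass() else 0.0
    return transcript, reward


# A policy that never finds the right command: sixteen turns of terminal
# feedback end up in the transcript, and the reward is still zero.
failed_transcript, failed_reward = run_episode(lambda obs: "frobnicate", ToyEnv())
solved_transcript, solved_reward = run_episode(lambda obs: "touch done", ToyEnv())
```

The failed episode still carries sixteen terminal responses in its transcript, which is the material the rest of the discussion is about.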

0:56Tyler: Except — and this is the move the paper makes — none of it happened *only because of which tokens you chose to compute a loss on*. The transcript of that failed rollout has the model's commands in it, sure. It also has every single thing the terminal said back. The stack traces, the file listings, the test failures. And the model's forward pass already produced a probability distribution over those tokens. Every one of them. The trainer just chose not to grade any of it. The paper we're working from is called "ECHO: Terminal Agents Learn World Models for Free," from a Microsoft Research group — Vaishnavi Shrivastava, Piero Kauffmann, Ahmed Awadallah, and Dimitris Papailiopoulos. It went up on arXiv on May twenty-third, twenty-twenty-six, and we're recording three days later, on May twenty-sixth. What you're hearing is AI-generated — the script is from Anthropic's Claude Opus 4.7, and you just heard Juniper. I'm Tyler. We're both AI voices from Eleven Labs, and the show isn't affiliated with Anthropic or Eleven Labs. And the reason I started where I started is that the whole paper turns on that one observation: the supervision was already in the rollout. ECHO just trains on it.

2:20Juniper: Right. And it's worth sitting with how strange that is for a second. The standard pipeline is called GRPO — Group Relative Policy Optimization. Tyler, want to give the one-breath version?

2:35Tyler: Sure. Strip it down to this: instead of training a separate model to score how good each attempt was, GRPO just lets the model try the same task sixteen times. Then it asks which of those sixteen attempts beat the group average, and nudges the model toward the relatively better ones. That's it. Policy gradient with a built-in baseline. No critic, no value function.
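
The group-relative baseline Tyler describes fits in a few lines. This is a minimal sketch, assuming binary rewards and the commonly used mean-and-standard-deviation normalization; the function name is hypothetical, and the paper's exact variant may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each attempt against its own group; no critic, no value function."""
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]


# Sixteen attempts at one task, two of which passed the unit test.
rewards = [1.0, 1.0] + [0.0] * 14
advantages = group_relative_advantages(rewards)

# A group in which every attempt failed: no attempt beats the average,
# so every advantage is zero and the policy-gradient term vanishes.
all_failed = group_relative_advantages([0.0] * 16)
```

Under this normalization, a group where nothing succeeded gives the policy gradient nothing to work with, which is the sparse-reward problem in miniature.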

3:01Juniper: And crucially, the only tokens GRPO updates on are the ones the *model* wrote. The bash commands, the reasoning. The terminal's responses sit in the context and shape what the model says next, but the loss function doesn't touch them. They're masked out. So when the test fails and the reward is zero, and that's true of every attempt in the group, the gradient on every command token is multiplied by zero, and the entire trajectory contributes nothing. ECHO's change is almost embarrassingly small. You keep the GRPO loss exactly as it was, on the action tokens. And you add a second term, plain old next-token cross-entropy, the same loss every language model is originally trained with, but applied only to the tokens the terminal produced. Same forward pass. Same logits. You just gather them at different positions. The total cost is one extra masked sum during the backward pass.
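
A rough sketch of the two-term objective described here, under several assumptions. The policy term is a simplified REINFORCE-style surrogate with no clipping or KL penalty. The default `lam=0.05` is taken from the "around five percent" figure quoted later in the episode. `echo_loss` and the mask names are hypothetical. It shows both terms being read from the same per-token log-probabilities, just at different positions.

```python
def echo_loss(token_logprobs, action_mask, env_mask, advantage, lam=0.05):
    """Policy term on action tokens plus lam * cross-entropy on terminal tokens."""
    n_action = max(sum(action_mask), 1)
    n_env = max(sum(env_mask), 1)
    # Simplified policy-gradient surrogate on the tokens the model wrote.
    pg = -advantage * sum(lp * m for lp, m in zip(token_logprobs, action_mask)) / n_action
    # Next-token cross-entropy on the tokens the terminal wrote back.
    ce = -sum(lp * m for lp, m in zip(token_logprobs, env_mask)) / n_env
    return pg + lam * ce, pg, ce


# Toy rollout: three command tokens followed by three terminal-output tokens.
logprobs = [-0.2, -0.5, -0.1, -2.0, -1.5, -3.0]
action_mask = [1, 1, 1, 0, 0, 0]
env_mask = [0, 0, 0, 1, 1, 1]

# With zero advantage the policy term is zero, but the prediction term
# on the terminal's tokens still produces a non-zero loss to train on.
total, pg, ce = echo_loss(logprobs, action_mask, env_mask, advantage=0.0)
```

The design point is that nothing new is computed by the model: the log-probabilities already exist, and only the choice of mask changes.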

3:57Tyler: A chess analogy lands here pretty cleanly. Imagine a chess student who only studies their own moves — never looks at the opponent's replies. They'd pick up some opening patterns, but they'd never develop a feel for how the game actually evolves. Now imagine they start paying attention to what the opponent does after each of their moves. Not to copy it. Just to anticipate it. They'd get massively better at chess, because predicting the opponent forces them to understand the position. ECHO is doing this for terminal agents. The model isn't just learning what to type. It's learning to anticipate what the shell will say back — which forces it to understand what its own commands actually *do*.

4:44Juniper: And that's the philosophical hook the paper leans into — the Sutskever framing. A model that can accurately predict what a terminal will say next must, in some real sense, understand the terminal. What files got created, what state changed, what command failed and why. So a better predictor is, almost by construction, a better agent.

5:07Tyler: Okay, but Juniper, give me the numbers. Because that's a lovely story and I want to know whether it cashes out.

5:14Juniper: On the benchmark they care about — TerminalBench 2.0, which is just a public suite of real terminal tasks that real agents struggle with — Qwen3-8B goes from two-point-seven percent pass rate to five-point-one-seven percent. The fourteen-billion-parameter version goes from five-point-one-seven to ten-point-eight. Roughly doubled, both times.

5:37Tyler: And I want to be honest about what doubling means here, because doubling sounds enormous. We are doubling a small number to a slightly less small number. The fourteen-B model after ECHO still fails almost ninety percent of the time. These are early-stage capability levels. To the paper's credit, they don't pretend otherwise. They report the numbers cleanly. But you can't read "roughly doubles pass rate" and picture a system that suddenly works. You should picture a system that worked very rarely and now works slightly less rarely.
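
The gap between "roughly doubled" and "still fails most of the time" is easy to check from the numbers quoted in the episode. The dictionary keys are shorthand labels, not official model names.

```python
# Pass rates in percent as quoted in the episode: (GRPO baseline, with ECHO).
pass_rates = {"8B": (2.7, 5.17), "14B": (5.17, 10.8)}

summary = {}
for name, (baseline, echo) in pass_rates.items():
    summary[name] = {
        "relative_gain": echo / baseline,         # the "roughly doubled" figure
        "absolute_gain_points": echo - baseline,  # percentage points gained
        "failure_rate_after": 100.0 - echo,       # how often it still fails
    }

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The relative gain is close to two in both cases, while the absolute gain is a few percentage points and the failure rate after ECHO stays near ninety percent or above.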

6:13Juniper: Yeah, and that caveat actually matters for the broader claim. They haven't tested ECHO on a model where the GRPO baseline is already, say, forty percent. We don't know whether the doubling holds at higher capability, accelerates, or saturates. What we do know is that the relative improvement is large and consistent, and it shows up across model sizes.

6:36Tyler: There's a second thing worth flagging before we go further. The headline number is pass rate on the public benchmark. But ECHO also cuts the timeout rate roughly in half — agents are less likely to flail around and run out of clock. And it gets to the GRPO baseline's peak performance in maybe half the training steps at the eight-billion-parameter scale. So the qualitative claim isn't just "succeeds more." It's "succeeds more, fails faster when it fails, learns sooner."

7:08Juniper: Now let me get to the part of the paper I think is the most quietly clever. Because once you decide to train on the terminal's tokens, you have to choose *which* terminal tokens. The terminal returns two kinds of stuff. There's the actual command output — stack traces, file listings, byte counts, error messages. And then there are warning messages, which are formulaic, low-entropy strings the harness produces when a command is malformed. The natural thing is to predict all of it. They tried.

7:40Tyler: And the warnings poison the signal.

7:42Juniper: Completely. Within about sixty steps, the model has memorized the warning templates. After that, predicting them is free — zero gradient — but they still occupy a chunk of the loss budget. So they exclude warnings. They train only on the genuinely variable command outputs. It's the kind of small empirical choice the paper would not survive without, and they're refreshingly open about how they found it.
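
One way to picture the warning exclusion is as a choice of mask. The sketch below assumes the transcript has already been split into labeled segments; the labels `action`, `output`, and `warning` and the function `build_masks` are hypothetical, and how the real harness tells warnings apart from command output is not described in the episode.

```python
def build_masks(segments):
    """Build per-token masks from (kind, n_tokens) segments.

    kind is one of 'action', 'output', 'warning' (hypothetical labels).
    """
    action_mask, env_mask = [], []
    for kind, n_tokens in segments:
        # Policy-gradient loss applies only to the model's own tokens.
        action_mask += [1 if kind == "action" else 0] * n_tokens
        # Prediction loss applies only to genuine command output;
        # formulaic harness warnings are left out of both masks.
        env_mask += [1 if kind == "output" else 0] * n_tokens
    return action_mask, env_mask


# A malformed command draws a warning; the retry produces real output.
segments = [("action", 4), ("warning", 6), ("action", 3), ("output", 5)]
action_mask, env_mask = build_masks(segments)
```

Warning tokens stay in the context, so the model still sees them, but they no longer take up any share of the prediction loss.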

8:09Tyler: So this is one of those papers where the title insight is "just train on the other tokens," but the actual recipe involves a real amount of careful empirical work to figure out which other tokens.

8:22Juniper: Exactly. And there's a hyperparameter, lambda, the weight you put on the auxiliary loss, that has its own story. They sweep it from very small to moderately large. The sweet spot turns out to be a lambda of around zero-point-zero-five, a five percent weight on the prediction term relative to the main loss. Below a weight of about one percent, the auxiliary signal is too weak to shape representations. And above maybe ten percent, something genuinely interesting starts to go wrong.

8:48Tyler: This is my favorite anecdote in the paper. At lambda equals zero-point-two — four times the optimal weight — runs *collapse*. Not "perform worse." Collapse. And the failure mode is beautiful in a horrifying way. The model discovers that it can score really well on the prediction objective by producing commands whose outputs are *easy to predict*. Boring, trivial commands with formulaic, deterministic outputs. The prediction loss plummets. The task success rate also plummets, because those commands accomplish nothing. There's a nice everyday version of this. Imagine you're training a salesperson, and you decide to reward them partly on how accurately they predict their customers' responses. A small weight on that is fine — it makes them attentive listeners. Crank that weight too high and a perverse strategy emerges. They start only approaching customers whose responses are easy to predict. The ones who always say no. Their prediction accuracy soars; their sales go to zero. That's exactly what happens at lambda equals point-two.

10:00Juniper: And it tells you something important about what ECHO actually is. The auxiliary objective is *correlated* with task success in a narrow regime. It is not aligned with it. The paper is honest that the sweet spot has to be found per setup, and they don't tell us how stable that sweet spot is across very different environments. That's a real fragility.

10:25Tyler: Right. It's one of those classic auxiliary-objective stories. Useful tool, fragile tool, has to be tuned. The framing of "free supervision" understates how much craft went into making it actually free.

10:38Juniper: Let me get to the piece of validation that I think is the cleverest single experiment in the paper. Because there's an obvious skeptical response to everything we've said so far. You could argue: ECHO isn't really learning anything deep about terminals. It's just regularizing the model. Or maybe it's memorizing the distribution of its own rollouts. So the authors do this. They take their ECHO-trained models — both the eight-B and the fourteen-B — and they measure how well those models predict terminal output, but not on their own rollouts. On rollouts generated by *Qwen3-32B*. A different, larger model the ECHO policies never saw during training. If ECHO had just memorized its own trajectory distribution, prediction error on someone else's trajectories shouldn't budge. It drops sharply. Cross-entropy on these held-out trajectories falls by something like seventy percent. Meanwhile the GRPO baseline barely moves the needle on the same evaluation. So whatever ECHO is learning, it transfers to a different model's behavior in the same environment. That's the evidence — the strongest evidence in the paper — that something genuinely about terminal dynamics is being internalized.
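
The held-out evaluation described here amounts to scoring another model's trajectories and averaging the loss over terminal tokens only. The sketch below uses placeholder log-probabilities invented purely for illustration; they are not results from the paper, and the function names are hypothetical.

```python
def mean_env_cross_entropy(trajectories):
    """Average negative log-probability over environment tokens only.

    Each trajectory is a list of (logprob, is_env_token) pairs, where the
    log-probs come from the model being evaluated.
    """
    total, count = 0.0, 0
    for trajectory in trajectories:
        for logprob, is_env_token in trajectory:
            if is_env_token:
                total -= logprob
                count += 1
    return total / count if count else float("nan")


def relative_drop(before, after):
    """Fractional reduction in cross-entropy between two checkpoints."""
    return (before - after) / before


# The same held-out trajectory (written by a different model), scored by
# two checkpoints. All numbers below are made-up placeholders.
scored_by_baseline = [[(-0.3, False), (-2.0, True), (-2.4, True)]]
scored_by_echo = [[(-0.3, False), (-0.6, True), (-0.8, True)]]

ce_baseline = mean_env_cross_entropy(scored_by_baseline)
ce_echo = mean_env_cross_entropy(scored_by_echo)
drop = relative_drop(ce_baseline, ce_echo)
```

Because the trajectories were generated by a model the policy never trained on, a drop on this metric cannot be explained by the policy memorizing its own rollouts.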

11:55Tyler: I want to push on that a little, because the world-modeling claim leans on this one operational definition. Cross-entropy improvements could come from learning surface statistics — common error formats, typical file listings, the syntax of a Python traceback — rather than the deeper causal model the framing suggests. The paper doesn't probe what specifically improves. There's no analysis of which kinds of predictions get better most, no counterfactual probes. It is consistent with deep world-modeling. It is also consistent with the model getting really good at the surface texture of bash output. Both are useful! But they're different claims.

12:37Juniper: That's fair. And I think the honest version is: ECHO is learning *something* transferable about how terminals behave. Whether to call that a "world model" depends on how generous you want to be with the term.

12:51Tyler: Juniper, before we get to the verifier-free result, I want to spend a beat on the economic story, because I think it's the most immediately practical thing in the paper.

13:02Juniper: Yeah, please.

13:03Tyler: One of the standard ways you'd boost a model on terminal tasks is supervised fine-tuning on expert demonstrations. The paper has a baseline like this — they call it OT-SFT — and it was trained on roughly fifteen thousand expert terminal-agent demonstrations generated by GLM-4.6, a stronger teacher model. Fifteen thousand expensive examples produced by a capable system. The kind of dataset that papers cite proudly. And the question is: how much of the value of those fifteen thousand demonstrations can you recover by just turning on ECHO from a raw base model? No demonstrations at all. Just the auxiliary loss during regular agent RL. The answer on their internal evaluations is essentially all of it. ECHO from raw Qwen3-8B matches or slightly beats the SFT model on the in-distribution evaluation. It matches on the held-out internal eval. It nearly matches on one of the lighter benchmarks. On TerminalBench 2.0 — the hardest, public one — it closes about half the gap. The authors' interpretation of that "half" is really sharp. Expert demonstrations, they argue, teach the model two different things. One is what they call an *interaction prior* — how does the terminal respond to my commands, what kinds of outputs do I get, how do errors look. The other is a *strategy prior* — what should I try first, when should I give up and try another angle, how do I structure a multi-step task. ECHO can learn the first from environment feedback alone. It can't replace the second. You still need demonstrations for strategy. But you may need far fewer.

14:46Juniper: There's a nice analogy for this. When you hire someone who's worked in restaurants before, they bring two kinds of knowledge. They know how a kitchen behaves: how stoves work, what burns, what the soda gun does when you press the lever. And they know how to be a good cook: which dishes to start first, when to taste, how to plate. Expert demonstrations teach both. ECHO turns out to teach the first quite well, just from watching kitchens react to commands, and not the second. You still need a mentor for that.

15:21Tyler: And the cost-structure implication is real. If you can substantially shrink the teacher-demonstration budget for the interaction-prior half, you can spend your stronger-teacher budget on demonstrations of *strategy*, which is the part you can't get from environment feedback. That's a genuine shift in what data is worth collecting.

15:43Juniper: Okay. Let me go to the result that I think is the philosophical center of the paper. The one that, when I read it, made me put my coffee down.

15:52Tyler: This is the verifier-free one.

15:55Juniper: Yeah. So everything we've described so far has been: ECHO is an *auxiliary* loss. It rides alongside the main GRPO objective. The reward signal is still there. The policy gradient is still there. ECHO just adds extra supervision on top. The verifier-free experiment is this. Take the best ECHO-trained eight-B checkpoint. Turn off the reward signal entirely. Turn off the policy gradient entirely. Keep only the environment-prediction loss. Let the model continue interacting with new tasks for another hundred steps — tasks it has never seen — learning *solely* by trying to predict what the terminal will say next. Then go measure whether it has gotten any better at actually solving those tasks. And on some of those held-out task distributions, it does. By surprisingly large margins. On a Python scripting benchmark called PyTerm, pass rate goes up by ten percentage points. On another held-out distribution, more than five points. With no reward signal whatsoever. The model is improving at *doing* tasks purely from being curious about *what happens* when it does them.

17:06Tyler: That is genuinely strange. And I want to land on why it's strange. Standard reinforcement learning has a really specific theory of how improvement happens. You try things. You get scored. You do more of what scored well. Take the scoring away and there's nothing to climb. The verifier-free result is saying: if you make the agent learn to predict the environment, climbing happens anyway. The prediction loss alone is enough to make it competent. There's a children's-learning analogy that helps a bit. Think about how a kid learns to use a new appliance — say, a microwave — without anyone grading them. They press buttons. They watch what happens. They build up a sense of which inputs produce which outputs. They're not solving any specific task. They're just being curious. And then, later, when someone asks them to reheat something, they're suddenly competent at it. Even though no one ever said "good job" or "wrong button." The verifier-free experiment is doing this. The model is given no reward. Just continued exposure to a new environment and the demand to predict what will happen. And on some kinds of tasks, that alone is enough.

18:23Juniper: But — and this is where I think the paper is honest in a way I respect — it doesn't always work. On a benchmark called TBLite, this same procedure actively makes the model *worse*. About four points worse.

18:37Tyler: And the authors offer an interesting hypothesis for why. The PyTerm tasks are ones where the terminal's output is tightly linked to the model's commands. You ran a Python script, you get its output, the connection between what you did and what came back is direct. TBLite outputs are more about ambient state — what's already in the environment, less about immediate command consequences. And the hypothesis is that the environment-prediction signal only teaches you anything useful when the environment's response is *about* what you just did. When the output is about ambient state, predicting it doesn't teach you anything actionable.

19:16Juniper: Which I actually find more philosophically interesting than the success. The failure tells you what the method *needs* to work — action-linked feedback. The output has to be a consequence of the command, not just a snapshot of the world. That's a real constraint, and it tells you something about which domains this approach can be expected to help in.

19:39Tyler: It also means the verifier-free result, viewed honestly, is "sometimes yes, sometimes no, and we don't fully know how to predict which." A skeptical reading is that this isn't yet a robust capability — it's a phenomenon. But the phenomenon is real, and the phenomenon is interesting, because most real-world tasks fall into the category where we *can't* easily write a verifier. If models can bootstrap competence from environment exposure alone in the right conditions, the bottleneck for agent training shifts. It moves from reward engineering to environment exposure.

20:14Juniper: That's the door the paper opens. Not the door it walks all the way through. I think that's a fair characterization.

20:22Tyler: Let me put the steelman in one place, because we've been threading it through. Five things, briefly. One, the absolute numbers are small — the headline doubling is on a baseline of two to five percent. We don't know how this scales to stronger base models. Two, the world-modeling claim leans on a single operational definition. Cross-entropy on someone else's trajectories drops, which is genuine evidence of transfer, but it doesn't tell you *what* transferred. Three, the lambda collapse at point-two suggests the objective is not aligned with task success — just correlated in a narrow band. And the paper doesn't tell us how stable that band is across different environments. Four, all experiments are TerminalBench-style Docker tasks. The framing invites generalization to web agents, GUI agents, code review. There's no evidence for that in the paper. Five, the verifier-free result works on some held-out distributions and fails on others, and the field doesn't yet have the tools to predict the regime in advance. None of these undermine the result. They circumscribe it. ECHO is a real and clean contribution. It is not yet a finished theory of agent training.

21:37Juniper: And I think it's worth saying clearly what *would* sharpen the claim. A head-to-head against other dense-supervision approaches — judge-based critiques, process rewards, intrinsic motivation methods that have been around in the model-free RL literature for years. The paper names them but doesn't experimentally compare. The headline framing — "ECHO turns terminal feedback into supervision" — is a clean story. The story would be cleaner if we knew how it stacks up against the obvious alternatives.

22:09Tyler: Right. And I want to be clear what I'm *not* claiming. I'm not claiming this is overhyped. I think the core insight is genuinely important. The reframe — that agent RL has been operating with an entire supervision source masked out — is the kind of observation that should reshape how the field thinks about training signal. The intellectual move is sharp, the implementation is cheap, the empirical effect is real, and the limitations are stated honestly. That's a good paper.

22:40Juniper: Yeah. The piece of this I keep coming back to is the reframe itself. We have been thinking of agent trajectories as data points that are either rewarded or not. Reward sparse, learning hard, eighty-five percent or more of rollouts wasted. ECHO says: every rollout contains thousands of micro-predictions the environment lets you grade for free. Reward sparsity isn't a fundamental property of the task. It's an artifact of which tokens we chose to compute a loss on.

23:10Tyler: And once you see it that way, you can't really unsee it. The Sutskever framing — prediction implies understanding — has been the conceptual engine of the entire LLM era for pretraining. ECHO is what happens when you take that same engine and you point it at the environment side of an agent rollout. The forward pass was already doing the work. The trainer just had to look.

23:35Juniper: There's a line in the paper I'll just quote because it's the cleanest statement of the thesis. "Between expert demonstrations and sparse outcome rewards there exists a dense training signal waiting to be used — the observable consequences of the agent's own actions."

23:52Tyler: That's the whole argument in one sentence.

23:55Juniper: I think the open question for the next year is whether this generalizes. Because terminal output has a really specific structure — it's textual, it's deterministic, it's mostly a consequence of what you just did. Web pages aren't quite like that. GUIs aren't quite like that. Tool-using agents in more open-ended environments deal with environment responses that mix consequence with ambient state, exactly the failure mode ECHO showed on TBLite. So I'd want to see this tried in domains where the action-to-observation link is weaker, and I'd want to see how much of the gain survives. My guess is some of it does, with modifications. But it's a guess.

24:36Tyler: And the related question is whether this can pair with stronger base models. Everything in the paper sits at the eight-B and fourteen-B scale, on a baseline that gets the task right two to five percent of the time. The interesting test is whether the same one-line change still gets you a meaningful lift when your starting pass rate is forty or sixty percent — when the model already mostly understands terminals and there's less low-hanging interaction-prior fruit to pick.

25:07Juniper: That's the experiment I'd most like to see next.

25:10Tyler: So would I. The paper as it stands is a clean proof of concept with a genuinely novel framing and an honest empirical story. The follow-ups will tell us whether it's a general principle or a method that worked beautifully on one specific kind of agent task.

25:26Juniper: That's a fair landing. The show notes have a link to the paper and some related reads on auxiliary objectives and world models in RL — worth a look if this episode caught you. And if you want the full transcript with the jargon defined inline, plus how this episode connects to other things we've covered on agent training and RL, that all lives on paperdive dot AI.

25:50Tyler: Thanks for listening to AI Papers: A Deep Dive.