0:00Juniper: An AI agent is trying to buy a coffee grinder on Amazon. A specific one — it has to fit a countertop that's six-point-three inches square, and the seller rating has to be at least four-and-a-half stars. Reasonable little shopping task. Now, this agent has been around the block. It's done hundreds of tasks before, and after each one it jotted down a lesson for itself — a growing notebook of advice. And here's the strange part. When the researchers handed it that whole notebook of hard-won experience, it did worse than an agent with no notebook at all. It failed. And one of the notes that tripped it up was a rule about verifying Spotify playback.
0:40Finn: A Spotify rule. On an Amazon purchase. That's the kind of detail that sounds like a bug, until you realize it's the entire thesis of the paper.
0:50Juniper: That's the door we're walking through. The paper is called "Not All Skills Help: Measuring and Repairing Agent Knowledge." It went up on arXiv on June thirteenth, twenty-twenty-six, and we're recording three days later, on June sixteenth. Quick note on what you're hearing before we dig in: this episode is AI-generated. The script was written by Anthropic's Claude Opus 4.8, and Juniper and Finn (that's us) are AI voices from ElevenLabs. The team producing the show isn't affiliated with either company. And the reason that Spotify rule matters is that it isn't a glitch. The authors argue it's a symptom of something happening quietly inside basically every self-improving agent built the popular way.
1:34Finn: So let's set up the popular way, because the recipe is genuinely elegant and that's part of why nobody questioned it. You've got a language model acting as an agent — not just chatting, but taking actions. Calling APIs, clicking through apps, running code, watching what happens, deciding the next step. And the self-improving version does one extra thing: after each task, it distills what worked into a short natural-language skill. Little rules. "When you're paginating, loop over all the pages." Then it stuffs the relevant ones into its context window next time.
2:08Juniper: And the appeal is that it's free, in the sense that matters. You never touch the model's weights. No fine-tuning, no training run. You're just giving the model a growing notebook. It's interpretable — you can read every skill in plain English — and on benchmarks like AppWorld it's produced double-digit gains that rival actual weight-tuning. So the field fell in love with accumulation. More tasks, more skills, smarter agent.
2:33Finn: The picture I keep coming back to is the new employee with the sticky notes. After every project they scribble a lesson and slap it on their monitor. Six months in, the monitor is wallpapered in stickies. Some are still gold. Some are stale. And a few are now actively misleading for whatever they're working on today.
2:53Juniper: That's exactly it. And the authors put their finger on the assumption nobody was checking. Across all these systems, the same model does the entire skill lifecycle. It generates the skill, it decides whether to keep it, and it decides when to apply it — all by its own judgment, all while staring at a single task at a time. Nothing in that loop ever steps back and asks, empirically, across many tasks: did this skill actually help? There's a line in the paper I think is the whole argument in one breath. Generating a useful skill from a single task requires creativity. Deciding whether that skill actually helps across many tasks requires empirical evidence that no single task can provide.
3:36Finn: Okay, but let me push on this, Juniper, because I think a lot of listeners are forming the same objection I am. If a skill is genuinely bad — if it makes the agent dumber — wouldn't that just show up as a low average score? You run the agent a bunch, the bad skill drags down success, you notice, you delete it. Why isn't the average enough?
3:58Juniper: Because the bad skills don't have a low average. That's the finding that makes this paper worth an episode. Think about a drug that's wonderful for one group of patients and dangerous for another. If you only look at the average effect across everybody, the benefit and the harm cancel out, and the drug looks completely inert. Looks like a sugar pill.
4:20Finn: And the skills are doing that.
4:22Juniper: Pervasively. They call it causal heterogeneity, which is a mouthful, but the idea is simple. A skill helps on some kinds of tasks and actively hurts on others, and the two roughly cancel, so its average effect sits near zero. Take a real one from the paper. A Venmo skill: "validate that note names map unambiguously to contacts before creating requests." Average effect across all tasks — about negative-point-zero-three. Basically nothing. You'd never touch it. But break it down by task type, and on shared-expense reconciliation tasks, where there really are ambiguous names to catch, it's worth plus-point-five-oh. On single-app tasks where there's nothing to disambiguate, it's negative-point-six-seven. It forces the agent to go cross-reference contacts that don't need cross-referencing, and it tanks.
5:19Finn: So the same note is a half-point of help in one room and two-thirds of a point of damage in the next, and on the spreadsheet it reads as harmless filler.
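The cancellation in the Venmo example can be checked with a few lines of arithmetic. This is only a sketch: the two per-type effects are the figures quoted in the episode, but the task-type mix in `weights` is hypothetical, chosen so the result lands near the quoted average of about -0.03. The real library average is taken over more task types than these two.

```python
# Per-task-type effects for the Venmo skill, as quoted in the episode.
effects = {"shared_expense_reconciliation": 0.50, "single_app": -0.67}

# Hypothetical share of each task type in the evaluation mix (assumption).
weights = {"shared_expense_reconciliation": 0.55, "single_app": 0.45}

# The weighted average is what a single summary score would report.
average_effect = sum(effects[t] * weights[t] for t in effects)

# The range is what the average hides.
effect_range = max(effects.values()) - min(effects.values())

print(f"average effect: {average_effect:+.3f}")
print(f"range across task types: {effect_range:.2f}")
```

A near-zero average sitting next to a range above one full point is the signature the hosts describe: the skill reads as filler in aggregate while swinging hard in both directions by task type.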
5:29Juniper: Right. And when they actually measured this across a real skill library, more than nine in ten skills behaved this way. Over ninety percent. They make a heat map — skills down the rows, tasks across the columns, green where a skill helped, red where it hurt. And the punchline is that nearly every single row has both colors in it. There's almost no such thing as a purely good skill or a purely bad one. There are just skills that are good here and bad there.
6:02Finn: Which completely reframes the job. The conventional view is skills are good or bad and the curator sorts them into two piles. This is saying no — almost everything is conditional, and the real question was never "is this skill good," it's "when should this skill be switched on."
6:21Juniper: And once you say it that way, you need a way to measure the "when." Which is where the second big idea comes in, and it's borrowed straight from medicine.
6:32Finn: This is the part I find genuinely clever, so let me take it. How do you find out whether a skill causes better outcomes, as opposed to just being present when things happen to go well? You can't learn that by watching. Healthy people take a lot of vitamins. That doesn't tell you the vitamin did anything. The clean way — the only clean way — is to intervene. Randomize. For each run, flip a coin: is this skill in, or out? Do that many times. Then compare the average success rate when it was in against when it was out. The gap is the skill's causal effect.
7:08Juniper: A randomized controlled trial. For an agent's memory.
7:11Finn: That's the whole move. Treat every skill like a candidate drug, randomize who gets it, measure the outcomes. And what falls out is what they call an attribution matrix — that green-and-red grid. Each cell is the answer to "did including this skill help or hurt on this particular task," estimated honestly through randomization rather than guessed at by the model itself.
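The coin-flip comparison Finn describes is a difference in means between two randomized arms. The sketch below illustrates that idea on simulated data; it is not the paper's code. The success probabilities passed to `simulate_runs` are made up, and both function names are hypothetical.

```python
import random


def difference_in_means(runs):
    """Estimate a skill's effect from (skill_included, succeeded) pairs."""
    with_skill = [ok for included, ok in runs if included]
    without_skill = [ok for included, ok in runs if not included]
    if not with_skill or not without_skill:
        raise ValueError("need at least one run in each arm")
    return sum(with_skill) / len(with_skill) - sum(without_skill) / len(without_skill)


def simulate_runs(n_runs, p_with, p_without, seed=0):
    """Simulated agent runs: flip a fair coin per run to include the skill."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        included = rng.random() < 0.5          # the randomization step
        p_success = p_with if included else p_without
        runs.append((included, rng.random() < p_success))
    return runs


# Made-up ground truth: the skill lifts success from 50% to 70%.
runs = simulate_runs(2000, p_with=0.7, p_without=0.5)
estimate = difference_in_means(runs)
print(f"estimated effect: {estimate:+.3f}")
```

Because inclusion is decided by the coin and not by the agent or the task, the gap between the two arms can be read as the skill's causal effect, up to sampling noise.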
7:35Juniper: Now here's a question I had reading it. Why randomize all the skills at once? The obvious experiment is cleaner — test one skill at a time. Add just the Spotify rule, see what happens. Add just the pagination rule, see what happens. Why not isolate them?
7:51Finn: Because skills aren't independent. A skill's value depends on what else is in the pot with it. Picture a verification rule that catches a certain kind of error. On its own, valuable. But sitting next to a pagination rule that already does that check as a side effect — now it's redundant, pure overhead. So if you test it alone, you get one answer, and if you test it in the real library, you get a different answer. Testing in isolation gives you a number that's just wrong for the context the skill will actually live in.
8:25Juniper: And randomization handles that for free.
8:27Finn: For free, and that's the elegance. Because you're flipping an independent coin for every skill on every single run, each skill's measured effect is automatically averaged over all the different combinations of companions it might find itself next to. It's the same reason you randomize a clinical trial instead of cherry-picking who gets the drug — randomization averages over all the confounders you didn't think to control for. Here the confounders are the other skills, and the design just dissolves them.
8:59Juniper: It's like asking whether salt improves a dish. The honest answer only exists across many recipes, because salt's value depends entirely on what else is in the pot. Test it in one dish and you've learned almost nothing.
9:12Finn: And the cost of building this whole matrix is modest, which surprised me. For a given model it's a hundred and eighty agent runs total: twelve random masks on each of fifteen development tasks. That's the entire price of admission to knowing your library's causal structure.
9:30Juniper: Although, and the paper is honest about this, that matrix has to be rebuilt for every model. The causal structure of a skill library depends on the model reading it, because the model is the thing interpreting the instruction. A skill that's harmless clutter for one model might be genuinely confusing to another. Which is a little humbling. There's no universal "good skill." There's only good-for-this-model, on-this-kind-of-task.
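The budget of twelve random masks on each of fifteen development tasks can be sketched as a small loop. Everything below is illustrative: `run_agent` is a hypothetical stand-in for a real agent run, the skill names and hidden `true_effect` values are invented, and the per-cell estimate is a plain in-versus-out difference in means, which may differ from the paper's estimator.

```python
import random


def run_agent(task, active_skills, rng):
    """Hypothetical stand-in for an agent run; returns 1 on success, 0 on failure."""
    p = 0.5 + sum(task["true_effect"].get(s, 0.0) for s in active_skills)
    return int(rng.random() < min(max(p, 0.0), 1.0))


def build_attribution_matrix(skills, tasks, n_masks=12, seed=0):
    rng = random.Random(seed)
    matrix, n_runs = {}, 0
    for task in tasks:
        records = []
        for _ in range(n_masks):
            # Independent coin flip for every skill on every run.
            mask = {s: rng.random() < 0.5 for s in skills}
            active = [s for s in skills if mask[s]]
            records.append((mask, run_agent(task, active, rng)))
            n_runs += 1
        for s in skills:
            on = [y for m, y in records if m[s]]
            off = [y for m, y in records if not m[s]]
            # A cell stays empty if the skill was never in, or never out.
            cell = sum(on) / len(on) - sum(off) / len(off) if on and off else None
            matrix[(s, task["name"])] = cell
    return matrix, n_runs


skills = ["validate_contacts", "paginate_all", "verify_playback"]  # invented names
tasks = [
    {"name": f"dev_task_{i}",
     "true_effect": {"validate_contacts": 0.3 if i % 2 == 0 else -0.3}}
    for i in range(15)
]
matrix, n_runs = build_attribution_matrix(skills, tasks)
print(n_runs, "runs,", len(matrix), "cells")
```

Each run contributes to every skill's cell for that task at once, which is why the cost is masks times tasks and does not grow with the number of skills.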
9:56Finn: So now we've got the diagnosis and we've got the measurement. We know skills are conditional, and we can measure exactly where each one helps and hurts. The natural next thought is — great, now go delete the bad ones.
10:09Juniper: And that's the trap. You can't delete them, and seeing why is the cleanest way to understand the fix.
10:16Finn: Wait — why not? If I can see in the matrix that the Spotify rule is red on most tasks, I rip it out. Library's cleaner. Done.
10:24Juniper: Because the harm is conditional, not global. Remember that Venmo skill — negative-point-six-seven on single-app tasks, but plus-point-five-oh on the reconciliation tasks. If you delete it globally, you've solved the harm and thrown away a half-point of real help on the tasks where it's exactly the right advice. You'd be ripping a still-useful sticky note off the monitor because it's wrong for today's project. The problem was never the existence of the note. It's that it's visible on the wrong day.
10:59Finn: So the fix has to be conditional too. Don't delete — cover up, per task.
11:04Juniper: Right. And the system does two things. Offline, before any test task arrives, it restructures the library. The most interesting operation there is the split. For a wildly heterogeneous skill — big range, near-zero average — the model rewrites it into two conditional versions. One with a trigger: "use this when you're browsing and comparing multiple items." And one for the opposite case. The pagination rule is the perfect example. It gets split into a browse-and-compare variant that says loop over all the pages, and a targeted-purchase variant that explicitly says do not paginate exhaustively — because if you already know the exact item you want, paging through everything just burns your step budget for nothing.
11:55Finn: And there's an ordering subtlety I liked. Split has to run before they retire anything.
12:01Juniper: It does, and the reason is sharp. A heterogeneous skill has a near-zero average. So if you ran the "retire the useless near-zero skills" step first, you'd kill exactly the conditional skills you most wanted to rescue, because on paper they look inert. You split first to expose the hidden value, then retire the genuinely dead weight, the ones that are near-zero and low-range — flat, no help anywhere. That's the offline cleanup.
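The ordering argument is easy to see in code. The sketch below assumes a simple rule based on each skill's mean effect and range across development tasks; the thresholds `mean_tol` and `range_tol` and the example effect vectors are hypothetical, not values from the paper.

```python
def summarize(effects):
    """Mean and range of one skill's measured effects across tasks."""
    mean = sum(effects) / len(effects)
    spread = max(effects) - min(effects)
    return mean, spread


def plan_action(effects, mean_tol=0.05, range_tol=0.3):
    """Split check runs first, so conditional skills are never mistaken for dead weight."""
    mean, spread = summarize(effects)
    near_zero = abs(mean) <= mean_tol
    if near_zero and spread > range_tol:
        return "split"    # near-zero average but big swings: conditional value
    if near_zero:
        return "retire"   # near-zero average and flat: no help anywhere
    return "keep"


def naive_retire_first(effects, mean_tol=0.05):
    """What goes wrong if retirement looks only at the average."""
    mean, _ = summarize(effects)
    return "retire" if abs(mean) <= mean_tol else "keep"


conditional = [0.5, -0.5, 0.5, -0.5]   # helps on some tasks, hurts on others
flat = [0.02, -0.01, 0.0, 0.01]        # inert everywhere
helpful = [0.2, 0.3, 0.1, 0.25]        # consistently useful

print(plan_action(conditional), plan_action(flat), plan_action(helpful))
print(naive_retire_first(conditional))
```

The naive version retires the conditional skill because its average is zero, which is exactly the failure Juniper describes.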
12:31Finn: But the offline part isn't where most of the magic is.
12:35Juniper: No. The biggest gain by far comes at inference time, on each new task. And here's the wrinkle — you've never seen the test task before, so by definition you have no measured skill effects for it. You can't look it up in the matrix.
12:52Finn: So how do you predict whether a skill will hurt a task you've literally never run?
12:57Juniper: You borrow from the neighbors. The system finds the most similar past tasks — by embedding similarity, the standard "which texts are close in meaning" machinery — and takes a weighted average of how each skill performed on those neighbors, with closer neighbors counting more. The intuition is: this new task looks a lot like these three tasks I've seen, and on those, this particular skill consistently hurt, so it'll probably hurt here too. Mask it out. It's soft retrieval, but over causal evidence instead of topic.
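The neighbor-borrowing step can be sketched as a similarity-weighted average over the nearest development tasks. The weighting used here (cosine similarity, clipped at zero) is an assumption, since the episode only says closer neighbors count more. The default `k=8` matches the neighbor count mentioned later in the episode; the embeddings, skill names, and effect values are invented.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict_effects(new_embedding, dev_tasks, k=8):
    """dev_tasks: list of (embedding, {skill: measured_effect}) pairs."""
    scored = [(cosine(new_embedding, emb), effects) for emb, effects in dev_tasks]
    neighbors = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    totals, weight_sums = {}, {}
    for sim, effects in neighbors:
        weight = max(sim, 0.0)  # assumption: dissimilar tasks get no vote
        for skill, effect in effects.items():
            totals[skill] = totals.get(skill, 0.0) + weight * effect
            weight_sums[skill] = weight_sums.get(skill, 0.0) + weight
    return {s: totals[s] / weight_sums[s] for s in totals if weight_sums[s] > 0}


# Toy 2-D embeddings and invented effects.
dev_tasks = [
    ([1.0, 0.0], {"verify_playback": -0.2, "paginate_all": 0.3}),
    ([0.6, 0.8], {"verify_playback": -0.1, "paginate_all": -0.1}),
    ([0.0, 1.0], {"verify_playback": 0.4, "paginate_all": 0.0}),
]
predicted = predict_effects([1.0, 0.0], dev_tasks)
print(predicted)
```

The third development task is orthogonal to the new one, so its positive effect for `verify_playback` contributes nothing, and the prediction comes out negative.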
13:33Finn: And that "instead of topic" is doing enormous work, because it's the exact thing standard retrieval gets wrong.
13:41Juniper: This is my favorite conceptual point in the paper. Normal retrieval — the stuff under retrieval-augmented generation — picks context by topical relevance. What looks similar comes in. And it quietly assumes relevant equals helpful. This paper's entire stance is that those are two different things. A skill can be dead-on topically and still cause harm.
14:06Finn: The Spotify rule again.
14:07Juniper: The Spotify rule exactly. It's about apps. The Amazon task is about apps. Topically, they're neighbors. So a relevance-based system would happily surface it. But it's the over-eager intern who, when you say you're shopping for a coffee grinder, pipes up with "don't forget to check that the Spotify playlist loaded." Technically in the same world. Pure noise that pulls focus off the size constraint you actually need to enforce. Its measured effect, by the way, was a small but consistent negative: about negative-point-zero-six-seven across the development tasks. Quietly, reliably, making things worse.
14:49Finn: There's one more design choice in the masking that I think is underrated, and it's the asymmetry. It does not treat "might hurt" and "might help" the same way.
14:59Juniper: Say more — this is the parachute thing.
15:02Finn: It's the parachute principle. The cost of mistakes here is lopsided. Forgetting a genuinely critical skill — say the pagination template on a task that absolutely needs it — that's catastrophic. That's skydiving without a parachute. Whereas keeping one mildly unhelpful skill among a hundred is a slightly heavy backpack. Survivable. Diluted. So the system is deliberately conservative. It only removes skills it's fairly confident are harmful, and it keeps anything that might help. And if the filtering ever gets too aggressive on a task, it just falls back to the full library. The whole pipeline is engineered so it can't drop below where you started.
15:43Juniper: On average it masks about five-and-a-half skills per task — roughly five percent of the library — and falls back to everything on about twenty-one percent of tasks. So it's a scalpel, not a chainsaw.
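The asymmetry and the fallback can be sketched as one small function. The specifics are assumptions for illustration: the `harm_threshold`, the rule that "too aggressive" means masking more than `max_masked_fraction` of the library, and the skill names are all hypothetical. The paper's actual criteria may differ.

```python
def choose_skills(library, predicted, harm_threshold=-0.1, max_masked_fraction=0.2):
    """Return (kept, masked). Only confidently harmful skills are removed."""
    # Skills with no prediction default to 0.0 and are therefore kept.
    masked = [s for s in library if predicted.get(s, 0.0) < harm_threshold]
    # Safety net: if filtering looks too aggressive, fall back to the full library.
    if len(masked) > max_masked_fraction * len(library):
        return list(library), []
    kept = [s for s in library if s not in masked]
    return kept, masked


library = [f"skill_{i}" for i in range(20)]

# One clearly harmful skill, one helpful, one slightly negative.
mild = {"skill_3": -0.25, "skill_7": 0.4, "skill_9": -0.02}
kept, masked = choose_skills(library, mild)
print(len(kept), masked)

# Half the library predicted harmful: treated as untrustworthy, so nothing is masked.
aggressive = {s: -0.5 for s in library[:10]}
kept_fallback, masked_fallback = choose_skills(library, aggressive)
print(len(kept_fallback), masked_fallback)
```

The slightly negative `skill_9` survives because it does not clear the harm threshold, which is the "heavy backpack" side of the parachute principle.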
15:56Finn: Which sets up the question that actually decides whether any of this matters: does it work, and where does the improvement come from? And the answer to "where does it come from" is the most quotable thing in the paper.
16:09Juniper: This is your thread — take it.
16:11Finn: They run the cleanest ablation I've seen in a while. Start from a bare agent. Add the operational templates — the hand-written scaffolding — and you gain about six points. Then add all that offline restructuring we just walked through, the splitting and retiring. That buys you... two points.
16:30Juniper: Two. After all that careful library surgery.
16:32Finn: Two. And then you add the per-task masking — the part that decides which skills each individual task actually sees — and you jump seven-and-a-half points. The single biggest lever, by a wide margin, is not which skills are in the library. It's which skills each task is allowed to look at.
16:51Juniper: And that maps onto a kitchen, doesn't it. A great chef doesn't dump the entire pantry on the counter for every dish. Having the right ingredients available matters — but it's necessary, not sufficient. The actual skill is the mise en place: laying out only what this dish needs. The ablation is saying the pantry was never really the bottleneck. The counter was.
17:15Finn: And they prove it's the direction of masking that matters, not just having less stuff in context. They run a reverse-masking control — suppress the most helpful skills instead of the harmful ones. Same amount of context reduction. And performance degrades. So it isn't that a shorter prompt is magically better. It's specifically removing the harmful ones that drives the gain.
17:41Juniper: I love that control, because it closes the obvious skeptical door. You can't wave it off as "oh, they just trimmed the context and any trimming helps."
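The reverse-masking control removes the same number of skills but picks them from the opposite end of the ranking. A minimal sketch, with invented skill names and predicted effects:

```python
def harmful_mask(predicted, n):
    """The n skills with the lowest (most harmful) predicted effect."""
    return sorted(predicted, key=predicted.get)[:n]


def reverse_mask(predicted, n):
    """Control condition: the n skills with the highest predicted effect."""
    return sorted(predicted, key=predicted.get, reverse=True)[:n]


predicted = {"a": -0.3, "b": 0.4, "c": -0.1, "d": 0.2, "e": 0.0}  # invented values

treatment = harmful_mask(predicted, 2)
control = reverse_mask(predicted, 2)
print(treatment, control)
```

Both conditions shorten the prompt by the same number of skills, so any difference in outcome comes from which skills were removed, not from how many.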
17:52Finn: Now the headline numbers. On AppWorld's hardest split, DeepSeek-V3 hits about sixty-nine percent task-goal completion. That's a forty-seven percent relative jump over the bare agent, and it's a new state of the art — beating even the approaches that retrain the model's weights. No weight changes at all here.
18:13Juniper: But honestly the number that stuck with me isn't the win. It's the regression.
18:18Finn: Yeah. This is the emotional core of the whole thing. On that hard split, with GPT-5.1 — and quick aside, the paper's from mid twenty-twenty-six and references models a bit ahead of where most listeners' mental map sits, GPT-5.1, GPT-5.4, that's not a typo — anyway, the upstream skill library lowered GPT-5.1's score. From about fifty-two-and-a-half percent down to just under fifty. A library built to help made the agent strictly worse.
18:48Juniper: More experience, dumber agent. Measured.
18:51Finn: And then the same method we've been describing reversed it and pushed it up to sixty-six-and-a-half. And when you look at where that recovery concentrates — the hardest tier of tasks went from about forty-three percent to seventy-one. The uncurated library had been actively degrading the medium and hard tasks while leaving the easy ones alone. So the value of the fix is greatest exactly where the raw accumulation was doing the most damage.
19:20Juniper: There's a second benchmark too — τ-bench, which is conversational customer service against a simulated human, deliberately a very different shape of task from the API-clicking of AppWorld. And on the leaderboard there, GPT-4.1 climbs from rank fourteen up into the rank eight-to-nine range, passing a handful of stronger-sounding models. Again, zero weight modification. Just better decisions about which lessons to look at.
19:47Finn: But here's where I want to be careful, because τ-bench is also where the cracks show, and the authors are admirably upfront about it. Two of the models — GPT-5.1 and Claude Sonnet 4.5 — got zero gain there. None.
20:00Juniper: And that's not a failure, it's a boundary. Right?
20:03Finn: It's the most honest thing in the paper, I think. Their read is that those models are already strong enough that their internal priors saturate the easy tasks — they don't need the playbook, so curating the playbook buys nothing. Which leads to a genuinely sobering implication: prompt-time skill injection matters most for capable-but-imperfect models doing hard, multi-step work. And it fades as base models get strong enough to not need external scaffolding.
20:33Juniper: Which, if you think about it, is most of the deployed fleet right now. The frontier models are the exception. Most agents in production are exactly the capable-but-imperfect kind this helps.
20:45Finn: For now. The uncomfortable extrapolation is that this is a technique with a shelf life tied to how good base models get. Useful today, possibly less so in a couple of years. The authors basically say that out loud.
20:58Juniper: So let's get into the critique properly, because this is a paper that hands you its own weak points, and the biggest one is real. Finn, you flagged the statistics earlier — this is where it bites.
21:10Finn: This is the one that genuinely nags at me. The central, exciting claim is that individual skills are causally heterogeneous — that more than ninety percent of them help here and hurt there. But by the authors' own admission, in their appendix, the per-skill measurement is underpowered. With only twelve random masks, a single cell in that matrix has a standard deviation around point-two-nine — which is enormous relative to the effects they're reporting. And they say it plainly: their descriptive threshold for calling a skill "heterogeneous" cannot reliably distinguish real heterogeneity from noise at the level of any one skill.
21:49Juniper: So when they tell me a specific skill is plus-point-five here and negative-point-six-seven there —
21:55Finn: You should put a wide error bar around it. The honest version is that the aggregate pattern is well-supported — they back it with several independent consistency checks at the outcome level, and those converge. But any single skill's reported help-hurt split should be read with real caution. The headline "ninety percent of skills are heterogeneous" leans partly on a threshold the authors themselves call underpowered. To detect a true effect of that size reliably you'd want maybe thirty to fifty masks, not twelve. They name it as future work.
22:28Juniper: That's a fair hit. Though I'd say the part that actually drives the gains — the masking — doesn't depend on any single cell being precise. At inference time they average over the eight nearest tasks, and that averaging cuts the effective noise way down, from around point-two-nine to about point-one-oh. The masking decisions are made on the smoothed signal, not the raw noisy cell. So the engine can be robust even if the per-skill story is fuzzy.
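The drop from about 0.29 to about 0.10 matches what the standard error of a mean predicts. The sketch below assumes the eight neighbor estimates are independent and equally weighted, which is a simplification of the similarity-weighted average described in the episode. The `cell_sd_with_masks` helper is hypothetical and assumes per-cell noise scales as one over the square root of the number of masks.

```python
import math

CELL_SD = 0.29     # per-cell standard deviation quoted in the episode (12 masks)
N_NEIGHBORS = 8    # neighbors averaged at inference time, as quoted

# Assumption: independent, equally weighted neighbors, so the noise of
# their mean shrinks by a factor of sqrt(k).
smoothed_sd = CELL_SD / math.sqrt(N_NEIGHBORS)


def cell_sd_with_masks(n_masks, base_sd=CELL_SD, base_masks=12):
    """Assumed 1/sqrt(n) scaling of per-cell noise with the number of masks."""
    return base_sd * math.sqrt(base_masks / n_masks)


print(f"smoothed sd over {N_NEIGHBORS} neighbors: {smoothed_sd:.3f}")
for n in (12, 30, 50):
    print(f"{n} masks -> per-cell sd about {cell_sd_with_masks(n):.3f}")
```

With unequal similarity weights the effective number of neighbors is smaller than eight, so the real reduction would be somewhat less than this idealized figure.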
22:57Finn: That's the right defense, and I half-buy it. But it leads straight into my second worry, which is coverage. That whole inference-time prediction rests on having similar past tasks to borrow from. And the matrix is built on just fifteen development tasks. On the hard AppWorld split, only about a third of test tasks have a decent similarity to their nearest development task — the average similarity is below point-five. So a lot of the time, you're predicting a skill's effect on a brand-new task from evidence that's sparse and frankly kind of distant.
23:32Juniper: And your suspicion is that the conservative fallback is doing more work than they let on.
23:38Finn: That's exactly it. The method falls back to the full library on twenty-one percent of tasks. So a fair question is how much of the robustness comes from genuinely good causal prediction, versus how much comes from the safety net catching the cases where prediction is too thin to trust. I don't think the paper fully separates those two. The method clearly works — I just can't tell you precisely why it works as well as it does in the low-coverage regime.
24:07Juniper: There's a third one I'd add, and it's almost philosophical. The split step — where the model rewrites a conflicted skill into conditional variants and writes the trigger conditions — that uses LLM judgment. Which is the exact thing the paper spends its whole introduction criticizing. The "let the model decide by itself" they set out to replace.
24:29Finn: Right, it sneaks the judge back in through the side door.
24:33Juniper: It does. To their credit, they bound it — every rewrite has to pass a development gate, no regression on the attribution tasks or it's rolled back. And the ablation lets them off the hook somewhat, because splitting only contributed two points while masking contributed seven-and-a-half. So the subjective component is in a minor part of the system. But a fully measurement-driven way to split skills — they admit that's still open.
25:01Finn: And I'd flag one more, lightly. Everything we're describing was tuned on these two benchmarks, and the library was curated on AppWorld and only partially transfers to τ-bench — which is their own explanation for the null results. So the value of this might be tightly coupled to how internally heterogeneous a given benchmark's tasks are. I'd want to see it on a third, messier domain before I fully believe the generality.
25:28Juniper: So where does that leave us? Because I think the core conceptual contribution survives all of this cleanly. The reframe — that generating a skill and verifying a skill are different jobs, that one is creative and one is empirical, and that the field had been wrongly handing both to the same judge — that's just a good idea, and it's right. And importing the logic of randomized trials into an agent's plain-English memory, where nobody had thought to put it, that's a real transplant.
25:59Finn: I'll grant all of that, and I think it's genuinely the most interesting framing I've read on agent memory this year. But the reservation I can't quite set down is the one we started the critique with. The slogan of the paper is "we replaced fragile LLM judgment with hard causal measurement." And the measurement, at the level of any individual skill, is by their own numbers not yet hard. It's suggestive. The aggregate holds. But I'm not sure the headline — that ninety percent of skills are individually heterogeneous — is actually established as cleanly as the rhetoric implies. I believe the pattern. I'm still waiting on the proof for any one skill.
26:42Juniper: And that's a fair place to leave it unresolved, honestly. The shape of the claim is more certain than any single brick in it.
26:50Finn: Which, for a first paper opening up a direction, might be exactly the right amount of certainty.
26:56Juniper: Let me land where I think this actually changes things. For anyone running an agent with a growing playbook, whether that's a customer-service bot, a coding agent, or a computer-use agent, this is something you can layer on top of whatever skill-generation pipeline you already have. It works entirely at the prompt level, with no changes to the model's weights. It needs the library and a small set of development tasks, and that's it. It's complementary, not competitive. You don't throw out your existing setup.
27:26Finn: And the deeper takeaway is the mental model flip. We've spent years assuming agent self-improvement is accumulation — more experience, monotonically better. This says accumulation has a hidden toxicity, the toxicity is invisible to averages, and the cure isn't pruning your notebook. It's reading the room. Knowing which task you're on, and quietly covering up the notes that don't apply right now.
27:52Juniper: The sticky notes were never the problem. It was which ones you could see on the wrong day.
27:58Finn: That's the paper.
27:59Juniper: The paper is "Not All Skills Help: Measuring and Repairing Agent Knowledge." The show notes have a link to it and a few related reads if this one caught you.
28:09Finn: And if you want the full transcript with every term defined inline, plus the concept pages that link this over to the other episodes we've done on agents and memory, that all lives at paperdive dot AI.
28:22Juniper: This has been AI Papers: A Deep Dive. Thanks for spending it with us.