Giving Agents a Notebook Instead of New Weights: How ExpGraph Lets Frozen Models Learn

0:00Cassidy: Here's a scene that should bother you more than it does. An AI agent is dropped into a simulated household and told to clean a mug and put it away. It fumbles. It opens the wrong cabinet, picks up the wrong object, retraces its steps, and after something like twenty actions, it finally succeeds. Genuine little victory. Then an hour later you hand it an almost identical task (clean a different mug, same kitchen) and it starts over from scratch. Every hard-won lesson, gone. The spray bottle is in the same cabinet it was last time, and the agent has no idea.

0:37Finn: And it's not just the household toy case. The same amnesia shows up in multi-app workflows, in math, in coding agents. They solve every single task as if it's the first task they've ever seen. So the obvious question is — why don't we just let them learn? And the answer to that question is where this paper gets interesting.

0:57Cassidy: It is, and let me name it properly before we go further, because there's a production note I owe you. The paper is called "ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents," it went up on arXiv on May twenty-ninth, twenty-twenty-six, and we're recording on June first, twenty-twenty-six. And the show itself is AI-generated: Finn and I are both AI voices from ElevenLabs, the script is written by Anthropic's Claude Opus 4.8, and the team behind the show isn't affiliated with either Anthropic or ElevenLabs.

1:32Finn: Right — and the reason that amnesia is hard to fix cleanly is the part that motivated this whole architecture. So let's start there, because the naive fix is the thing everyone reaches for first, and it turns out to be a trap.

1:47Cassidy: The naive fix being fine-tuning?

1:50Finn: Exactly. You've got all these trajectories — the agent's successes, its failures, its action traces. The instinct is: take that data and train it back into the model's weights. Bake the experience in permanently. And it works, in the narrow sense. But think about what you've actually done. You've welded everything the agent learned to one specific model instance. And here's the structural problem the authors lean on. Language models are getting replaced every few months. A cheaper one ships, a stronger one ships, and the moment you want to swap it in, all that fine-tuning is dead weight. You have to redo it. Your accumulated experience was bolted to a model you now want to throw away.

2:32Cassidy: And it's worse than just redoing the work, right? Because a lot of the best models, you can't fine-tune at all.

2:39Finn: That's the kicker. The most capable models are often closed APIs. You literally cannot touch their weights even if you wanted to. So fine-tuning isn't just expensive and brittle — for a large class of the models you'd actually want to use, it's simply off the table. So the question driving the paper gets very sharp. How can an agent learn from accumulated experience while keeping the model that solves the task completely frozen, and swappable? Treat the executor as a replaceable part — a task-solver you can hot-swap — and put all the learning somewhere outside it.

3:13Cassidy: The analogy I keep coming back to is a coach with a notebook. The coach keeps notes on plays that worked and mistakes to avoid. The athlete running those plays can be swapped out — rookie this season, all-star the next — but the notebook stays, and the coach learns which notes to pull for which opponent. The notebook is the learning. The athlete is frozen and replaceable.

3:36Finn: And that's the whole bet. The bet is that genuinely useful knowledge — strategies, failure modes, "always double-check the answer is a whole number" — isn't really about any one model's quirks. It's procedural. It should survive a model swap. Whether that bet holds is something we should come back to, because it's exactly where the empirical results either land or don't.

3:59Cassidy: Let's build the system, because that's where the actual cleverness is. And there's one idea underneath all of it that I want to plant now, because everything else is in service of it. When you build an external memory and you go to fetch from it, the standard move — the thing every retrieval system does — is to grab the things most similar to your current task. You turn everything into embeddings, these vectors where similar text lands near each other, and you fetch the nearest neighbors. The most similar past experiences.

4:30Finn: Which is the librarian move. You ask a question, the librarian hands you the book whose title best matches your question.

4:38Cassidy: Right, and the paper's central claim is that the librarian is the wrong instinct. The thing nearest in embedding space is the thing that looks like your task. But the experience that would actually help you might use completely different vocabulary. There's a line in the paper I think is the cleanest statement of the whole idea: surface relevance is not the same as experience utility.

5:01Finn: Give me the intuition for how those come apart, because on its face "similar" and "useful" sound like they should mostly agree.

5:09Cassidy: Here's the one that made it click for me. Say you're cooking and your real problem is "my sauce keeps breaking — it keeps splitting into a greasy mess." If you search by similarity, you'll pull up recipes that look like your dish. Chicken curry, vegetable curry, whatever's textually near. But the lesson that actually saves you might live in a custard recipe — something about emulsifying, about temperature — which is nowhere near "curry" in word-space. The librarian would never hand you the custard book. A good mentor would.

5:43Finn: So the whole system is an attempt to be the mentor instead of the librarian.

5:48Cassidy: That's the frame. And they get there with four pieces that only really make sense as a loop. So rather than list them, let me walk you through the journey of a single task, start to finish, because that's how the thing actually runs.

6:02Finn: Start with the memory itself. What's actually stored?

6:06Cassidy: Good — because they don't store the full trajectory. Every time a task finishes, the whole record — the task, the reasoning trace, the final answer, the score — gets handed to a summarizer model. And the summarizer's job is to distill. If the run scored well, it becomes a "skill" — a reasoning pattern, a planning strategy, a heuristic. If the run scored badly, it becomes a "lesson" — a failure mode, an invalid move, a constraint to avoid next time. So you end up with these compact natural-language notes. Skills from your wins, lessons from your losses. Each one becomes a node.
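As a rough illustration of the memory unit Cassidy describes, here is a minimal Python sketch of a node that holds a distilled note and is labeled a skill or a lesson by the run's score. The class name `ExperienceNode`, the `make_node` helper, the 0.5 cutoff, and the example note texts are all assumptions made for illustration. The episode does not give the paper's actual schema or threshold, and the summarizer model call is replaced here by a plain string.

```python
from dataclasses import dataclass


@dataclass
class ExperienceNode:
    kind: str             # "skill" (from a high-scoring run) or "lesson" (from a low-scoring run)
    text: str             # compact natural-language note written by the summarizer
    utility: float = 0.0  # track-record score, updated after the note is retrieved
    uses: int = 0         # how many times the note has been retrieved


def make_node(summary: str, score: float, success_threshold: float = 0.5) -> ExperienceNode:
    """Label a distilled summary as a skill or a lesson based on the run's score.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    kind = "skill" if score >= success_threshold else "lesson"
    return ExperienceNode(kind=kind, text=summary)


# Hypothetical notes standing in for real summarizer output.
win = make_node("Set up the equation template before substituting values.", score=1.0)
loss = make_node("Verify the result is a whole number before answering.", score=0.0)
print(win.kind, loss.kind)
```

The point of the design is that only the short note is stored, not the full trajectory, so each node stays cheap to embed, link, and inject into a prompt.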

6:44Finn: And the nodes are connected — that's the "graph" in ExpGraph.

6:48Cassidy: When a new node gets created, it gets wired to its handful of nearest semantic neighbors — its top five most similar existing notes, above some similarity threshold. So the graph stays sparse, but locally connected. Related experiences are linked. And that connection structure is going to do real work in a second.
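The wiring step can be sketched in a few lines of Python. This is a sketch under stated assumptions: embeddings are plain lists of floats, similarity is cosine, edges are undirected, and the threshold of 0.5 is a placeholder. The episode only says "top five" and "some similarity threshold", so `add_node`, `min_sim`, and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def add_node(graph, embeddings, node_id, vector, k=5, min_sim=0.5):
    """Link a new node to at most k existing nodes whose similarity is >= min_sim.

    graph maps node id -> set of neighbor ids (undirected edges, an assumption).
    embeddings maps node id -> embedding vector.
    Returns the ids the new node was linked to, most similar first.
    """
    scored = [(cosine(vector, v), other) for other, v in embeddings.items()]
    scored = [pair for pair in scored if pair[0] >= min_sim]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    neighbors = [other for _, other in scored[:k]]

    embeddings[node_id] = vector
    graph[node_id] = set(neighbors)
    for other in neighbors:
        graph.setdefault(other, set()).add(node_id)
    return neighbors


# Toy two-dimensional embeddings, for illustration only.
graph = {"curry": set(), "tax_form": set()}
embeddings = {"curry": [1.0, 0.0], "tax_form": [0.0, 1.0]}
print(add_node(graph, embeddings, "stew", [1.0, 0.1]))
```

Capping the neighbor count and requiring a minimum similarity is what keeps the graph sparse while still linking related notes.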

7:08Finn: Okay. Task arrives. What happens?

7:10Cassidy: Three stages. First, semantic seeding — grab the ten most similar nodes to the task. That's the librarian move, and it's fine as a starting point. Second, and this is the clever part, the system diffuses outward from those seeds.

7:24Finn: Diffuses how?

7:25Cassidy: Think of it like degrees of separation at a party. Nearest-neighbor search only ever talks to the people most similar to you. Diffusion lets you reach friends of friends — people you'd never have found directly, reached through a chain of relatedness. The technical engine is personalized PageRank, which is the same random-walk idea behind early web search, just pointed at this memory graph instead of the web.

7:51Finn: Personalized meaning — the walk keeps coming home?

7:54Cassidy: Exactly that. Imagine dropping dye in a few spots in a pond and watching it spread through the connections. There's a dial that controls how strongly the dye keeps getting pulled back toward where you dropped it, versus drifting across the whole pond. Pull it back hard and you stay local — close to the obvious matches, safe and narrow. Loosen it and the dye spreads far, reaching experiences that share no surface vocabulary with your task but are connected to things that do.

8:24Finn: So this is the mechanism that finds the custard recipe.

8:28Cassidy: That's the mechanism that finds the custard recipe. Nearest-neighbor only finds things similar to the task. Diffusion finds things similar to things-similar-to-the-task. That second hop is how you reach a transferable strategy that doesn't look like your problem on the surface.
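To make the spreading-dye picture concrete, here is a small power-iteration sketch of personalized PageRank on a toy graph. It is a generic textbook version, not the paper's implementation. The function name, the `restart` parameter (the "pull back" dial), the iteration count, and the three-node curry/emulsion/custard graph are assumptions chosen to mirror the analogy.

```python
def personalized_pagerank(graph, seeds, restart=0.15, iters=50):
    """Random walk with restart over an undirected graph.

    graph maps node -> set of neighbors; seeds is a non-empty set of nodes in graph.
    restart is the probability of jumping back to a seed at each step:
    high values keep mass near the seeds, low values let it spread.
    """
    nodes = list(graph)
    seed_mass = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(seed_mass)
    for _ in range(iters):
        nxt = {n: restart * seed_mass[n] for n in nodes}
        for n in nodes:
            walk_mass = (1.0 - restart) * rank[n]
            if graph[n]:
                share = walk_mass / len(graph[n])
                for m in graph[n]:
                    nxt[m] += share
            else:
                # A node with no edges hands its mass back to the seeds.
                for s in seeds:
                    nxt[s] += walk_mass / len(seeds)
        rank = nxt
    return rank


# "custard" shares no edge with the seed, only a two-hop path through "emulsion".
graph = {
    "curry": {"emulsion"},
    "emulsion": {"curry", "custard"},
    "custard": {"emulsion"},
    "unrelated": set(),
}
wide = personalized_pagerank(graph, {"curry"}, restart=0.15)
narrow = personalized_pagerank(graph, {"curry"}, restart=0.9)
print(wide["custard"] > narrow["custard"])
```

With a loose restart, the two-hop node picks up a meaningful share of the score. With a tight restart it gets almost none, which is the local-versus-broad trade-off the dial controls.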

8:45Finn: And then the third stage ranks what you've gathered.

8:48Cassidy: Right. Now you've got this diffused pool of candidates, and you rank them — but not just by similarity. You blend similarity with a track-record score. How well has this experience actually paid off in the past? And that scoring uses a trick worth pausing on, because it's textbook but it's exactly right here.

9:08Finn: This is the restaurant thing.

9:10Cassidy: It's the restaurant thing. You rate an experience by its reviews, meaning its proven usefulness, plus an exploration bonus that's largest when the experience hasn't been tried much yet. So a proven note gets trusted, but a brand-new note that's barely been used gets a little handicap bonus that nudges the system to give it a fair shot before writing it off. It's the classic explore-versus-exploit balance from bandit problems. Trust what's worked, but don't get stuck only trusting your early favorites.
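The explore-versus-exploit score can be sketched with an upper-confidence-bound formula in the style of UCB1. The episode does not state the paper's exact formula, so the function name, the constant `c`, and the `+ 1` smoothing (which keeps the bonus finite for a never-used note) are assumptions.

```python
import math


def track_record_score(mean_reward, times_used, total_retrievals, c=1.0):
    """Proven usefulness plus a bonus that shrinks as a note gets used more.

    mean_reward: average payoff observed when this note was retrieved.
    times_used: how often this note has been retrieved so far.
    total_retrievals: how many retrievals have happened across all notes.
    The "+ 1" terms are a smoothing assumption so unused notes get a
    large but finite bonus.
    """
    bonus = c * math.sqrt(math.log(total_retrievals + 1) / (times_used + 1))
    return mean_reward + bonus


fresh = track_record_score(mean_reward=0.5, times_used=0, total_retrievals=100)
veteran = track_record_score(mean_reward=0.5, times_used=50, total_retrievals=100)
print(fresh > veteran)
```

Two notes with the same average payoff are ranked differently: the rarely tried one scores higher until it has had enough chances for its average to be trusted.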

9:40Finn: So at this point we've got the graph, the diffusion, and the utility-aware ranking. But you mentioned dials — something has to decide how wide to spread the dye and how much to trust the track record versus raw similarity. Who sets those?

9:55Cassidy: That's the fourth piece, and it's the part that makes this a learning system rather than a clever heuristic. There's a small separate model — a three-billion-parameter copilot — that looks at the incoming task and sets exactly those two dials. How broadly to explore the graph, and how much to favor proven experiences over similar-looking ones. And the point is that it's learning a strategy. For an unfamiliar task, explore widely — spread the dye far, you don't know what'll help. For a familiar task, exploit — go narrow, pull the one high-value lesson you know works.
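One way to picture the two dials is as a pair of numbers handed to the retrieval code. The sketch below shows only how they might be consumed, not how the copilot chooses them. The `Dials` class, the linear blend, and the candidate scores are assumptions for illustration. The episode says similarity and track record are blended but does not give the formula.

```python
from dataclasses import dataclass


@dataclass
class Dials:
    restart: float         # exploration dial: higher keeps diffusion close to the seeds
    utility_weight: float  # ranking dial: 0 = pure similarity, 1 = pure track record


def rank_candidates(candidates, dials, top_n=3):
    """Order candidates by a linear blend of similarity and track record.

    Each candidate is a dict with "id", "similarity", and "utility" keys.
    The linear blend is an assumed form, not the paper's stated formula.
    """
    def blended(c):
        w = dials.utility_weight
        return (1.0 - w) * c["similarity"] + w * c["utility"]

    return sorted(candidates, key=blended, reverse=True)[:top_n]


# Hypothetical scores: the curry note looks similar, the custard note has paid off before.
pool = [
    {"id": "curry", "similarity": 0.9, "utility": 0.1},
    {"id": "custard", "similarity": 0.3, "utility": 0.9},
]
librarian = Dials(restart=0.9, utility_weight=0.2)
mentor = Dials(restart=0.15, utility_weight=0.8)
print(rank_candidates(pool, librarian)[0]["id"], rank_candidates(pool, mentor)[0]["id"])
```

In this toy pool, a low utility weight ranks the look-alike note first and a high one ranks the proven note first. The `restart` field is carried along because the same per-task decision would also set how far the diffusion spreads.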

10:31Finn: Cassidy, this is the spot where I'd expect the whole thing to quietly fall apart. Because how does this little copilot know whether its choices were any good? Retrieval systems usually just assume that if something was relevant, it helped. What stops this from being the same circular trap?

10:49Cassidy: That is the load-bearing trick of the entire paper, and it's beautifully simple. For every task during training, they run the frozen executor twice. Once with the retrieved experiences injected into its prompt. Once with nothing — no memory at all. And the reward is the difference. How much better did the executor do with the memory than without it.

11:11Finn: It's a placebo arm.

11:13Cassidy: It's exactly a placebo arm. It's a controlled trial. You don't ask "did the patient recover?" You ask "did the drug make the difference?" The question isn't "did we win the task?" The question is "did the memory make us win?"

11:27Finn: And that distinction is doing enormous work, because the obvious alternative — just reward high task scores — would let the copilot take credit any time the executor happened to succeed on its own. Even if the retrieved experiences were useless.

11:43Cassidy: That's the degenerate solution it avoids. The difference term isolates the marginal contribution of the memory. They add a small extra term that rewards high absolute quality too — so the system doesn't learn to love retrievals that merely beat a terrible no-memory baseline while still ending in failure. But the heart of it is that subtraction. With, minus without. And then that reward does double duty. It trains the copilot through reinforcement learning — standard policy optimization, nothing exotic. And it updates the track-record scores of exactly the nodes that got retrieved. So the graph's sense of which experiences are genuinely useful sharpens over time. The new trajectory gets summarized and folded back in as a fresh node. And when the graph hits its capacity, low-value nodes get evicted.
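The with-minus-without reward and its bookkeeping can be sketched as follows. This is a sketch under assumptions: the 0.1 weight on absolute quality, the running-mean update, and evicting by lowest utility are illustrative choices. The episode gives the structure (difference term, small quality term, update only the retrieved nodes, evict low-value nodes at capacity) but not the exact formulas. The reinforcement-learning update of the copilot is not shown.

```python
def memory_reward(score_with, score_without, quality_weight=0.1):
    """Marginal contribution of the memory, plus a small term for absolute quality.

    score_with: executor score with retrieved experiences in the prompt.
    score_without: executor score on the same task with no memory.
    quality_weight is a hypothetical value.
    """
    return (score_with - score_without) + quality_weight * score_with


def update_utilities(stats, retrieved_ids, reward):
    """Fold the reward into the running-mean utility of each retrieved node only."""
    for node_id in retrieved_ids:
        entry = stats[node_id]
        entry["uses"] += 1
        entry["utility"] += (reward - entry["utility"]) / entry["uses"]


def evict_to_capacity(stats, capacity):
    """Drop the lowest-utility nodes until the memory fits its capacity."""
    while len(stats) > capacity:
        worst = min(stats, key=lambda n: stats[n]["utility"])
        del stats[worst]


stats = {
    "custard": {"utility": 0.0, "uses": 0},
    "curry": {"utility": 0.0, "uses": 0},
    "stale": {"utility": -0.2, "uses": 4},
}
# The executor succeeded either way, so the difference term gives the memory no credit.
print(memory_reward(score_with=1.0, score_without=1.0))
update_utilities(stats, ["custard"], memory_reward(score_with=1.0, score_without=0.0))
evict_to_capacity(stats, capacity=2)
print(sorted(stats))
```

When the executor would have succeeded anyway, the difference term is zero and only the small quality term remains, which is what stops the copilot from taking credit for wins the memory did not cause.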

12:35Finn: So the whole thing co-evolves. Every finished task trains the retrieval policy and feeds a new node into the memory and re-scores the old ones.

12:44Cassidy: And the executor — the big model doing the actual work — is never touched. Not once. Everything that learns lives outside it.

12:52Finn: Let me sit on the cost of that placebo arm for one second, because I want to flag it now and come back to it. Running the executor twice on every training task — in these multi-step agentic environments where a single rollout can be dozens of expensive actions — you've just doubled the most expensive part of the loop. Hold that thought. Let's see what they actually get for it.

13:16Cassidy: Fair. Let's do the numbers.

13:18Finn: So they test this on a benchmark suite split two ways. Static, single-turn tasks — question answering, math, code. And agentic, multi-step environments — the household simulator, and a multi-app workflow environment where the agent has to operate across applications. One thing to keep straight: the two settings use different model pairs for the executor, so don't try to compare the raw scores across them. What matters is the lift over the best competing method in each setting.

13:48Cassidy: And to be clear, that comparison is fair in a specific, important way.

13:53Finn: It is — every experience-based baseline gets the same historical trajectories ExpGraph gets. Same data. So any difference reflects how well each method uses experience, not how much it had. That's the right control. Okay, the headline. On static tasks, the lift over the strongest baseline is about twelve percent with the smaller executor, about five percent with the larger one. And in the agentic environments those gains roughly double — about twenty-one percent with the smaller model, about thirteen percent with the larger.

14:26Cassidy: And it's not just landing more accurately — it's getting there in fewer moves, right?

14:31Finn: That's the part I didn't expect. It also cuts the average number of interaction steps — by around thirteen percent in one setting and around twenty-two percent in the other, versus the most step-efficient competitor. So it's both more accurate and more direct. The borrowed experience isn't just nudging the answer, it's saving the agent from wandering.

14:53Cassidy: There's a case study that makes that concrete and I love it. In the math tasks, one baseline retrieves a useful equation template — the right setup — but applies it mechanically and gets stuck on a problem with an inconsistency it never checks for. ExpGraph retrieves that same template plus a lesson: don't answer immediately, verify the result is actually a whole number and satisfies every constraint. Skill plus lesson, together. And it solves it.

15:22Finn: A skill that says "here's the move" and a lesson that says "here's how this move bites you." That's the pairing the whole memory design is built around.

15:31Cassidy: Now notice a pattern in those numbers, Finn. Twelve percent for the small model, five for the large. Twenty-one for the small, thirteen for the large. The weaker the executor, the more it gains.

15:43Finn: Which makes sense, once you say it out loud. A model with weaker built-in planning has more to gain from borrowed strategy. The strong model already knows a lot of what's in the notebook. The weaker one is the one that really needs the cheat sheet.

15:58Cassidy: And that sets up what I think is genuinely the most striking result in the paper.

16:03Finn: This is the one I'd build the whole "why it matters" around. So the copilot — the thing being trained — is three billion parameters. And in the agentic setting, the frozen executor it's improving is thirty-two billion. Roughly a ten-x size gap. A small model, the one you can actually afford to train, is making a big frozen model substantially better — and never touching its weights.

16:27Cassidy: Say what that means for deployment, because it inverts the usual economics.

16:32Finn: It does. The expensive thing in this whole pipeline is the learning machinery — building the graph, training the copilot. And the result says you can do that with a cheap model, then point the finished system at an expensive frozen one. They tested transfer directly: experience learned by a cheap model, handed to an expensive model, with minimal loss. The minor-league team scrimmages cheaply, writes the playbook, and the playbook makes the all-star better without the all-star ever practicing.

17:02Cassidy: And the transfer goes in other directions too?

17:05Finn: This is the underrated part of the paper. They test three transfer directions. Cheap-to-expensive, which works well. Expensive-to-cheap, which is harder but still helps. And — the one I find most telling — experience from a non-reasoning model transfers strongly to a reasoning-capable model.

17:23Cassidy: Why is that the telling one?

17:25Finn: Because if the notebook were full of one model's quirks, it wouldn't survive a jump to a model that thinks completely differently. The fact that it transfers — and transfers up, to a more capable model — suggests the graph is capturing genuine procedural knowledge. Strategies and constraints, not executor-specific tics. That's the whole bet from the top of the episode, and this is the evidence for it.

17:50Cassidy: And there's a consistent finding underneath all three directions, which is that you have to move the graph and the copilot together. Transferring either one alone is worse than transferring the pair.

18:02Finn: Which fits. The graph is the notebook, the copilot is the coach who knows how to read it. A notebook with no coach, or a coach with no notebook — neither is the system.

18:12Cassidy: I want to do the ablation, because this is where I stopped taking the architecture on faith. It's easy to put four components in a table with four checkmarks and assert they all matter. The ablation is them pulling each piece out and showing it actually hurts. The biggest drop comes from what they call "flat experience" — replacing the graph with a plain flat pool ranked by similarity. Same notes, no relational structure, no diffusion. Just a pile you search by nearest-neighbor.

18:42Finn: Back to the librarian.

18:43Cassidy: Back to the librarian. And it's the largest single hit, which is the cleanest possible confirmation that the relational structure is doing real work. The graph isn't decoration. And then the diffusion specifically — turning off the spread-the-dye step — hurts most on the agentic, multi-step tasks. Which is exactly where you'd predict it would, because that's where the useful experiences are structurally related rather than textually similar. The custard-recipe problem is most acute precisely in those environments.

19:16Finn: That's a satisfying result, because the failure mode lines up with the theory. You don't just see "it got worse," you see it got worse in the place the mechanism was designed for.

19:27Cassidy: Right. The architecture predicts where its own absence should hurt most, and that's where it hurts most.

19:33Finn: So let me push on this, because I've been holding a list, and the paper is good enough to deserve a real critique rather than a victory lap.

19:42Cassidy: Go.

19:42Finn: Start with the cost I flagged earlier — the placebo arm. Every training step runs the executor twice. In the agentic environments, where a single rollout is dozens of expensive steps, you've doubled your most costly operation. They report training runs around forty hours, and they don't really foreground how that scales as tasks get longer-horizon. And there's a tension here worth naming: the whole pitch is "model-agnostic, cheap to deploy," but the procedure that gets you there is genuinely expensive to train. Cheap at inference, costly to learn.

20:16Cassidy: It's a fair tension. Though I'd note the deployment claim and the training claim are separable — once trained, you really are just doing retrieval in front of a frozen model.

20:27Finn: Agreed, but a skeptic deploying this in a new domain has to pay the training cost first, and the paper doesn't tell us how that bill grows. Second thing — fixed hyperparameters. The graph is built with a hard-coded similarity threshold, a fixed number of neighbors, and capped at two thousand nodes. The authors concede these are static and just "work well empirically." But there's no sensitivity analysis. And that two-thousand-node cap especially makes me nervous — what happens when task diversity blows past what two thousand notes can hold? We don't know.

21:02Cassidy: The authors actually list that as an acknowledged limitation — they say adaptive construction and pruning might do better and they didn't explore it. To their credit, they're candid about it.

21:14Finn: They are candid, and I'll give them that throughout. Third — and this is more subtle — the headline gains are computed against "the strongest baseline," but the strongest baseline changes from setting to setting. That's a legitimate way to report, but it can make the gains look more uniform than they are. On some individual benchmarks the margin over the second-best method is a point or two.

21:38Cassidy: And there's a dependency I'd add to your list. The summarizer that distills trajectories into skills and lessons — that's itself a model call. On the agentic side it's a fairly capable model doing that summarizing. So how much of the gain rides on having a strong summarizer in the loop isn't really isolated. The notebook is only as good as whoever's writing the notes.

22:02Finn: That's a good one. And the last thing is the authors' own deepest caveat — experience gets injected purely through the prompt. And prompt-level instructions translate inconsistently into model behavior. You can hand the frozen model the perfect lesson and it can just... not act on it reliably. The architecture's ceiling is partly bounded by how faithfully the executor uses what it's given, and ExpGraph by design can't fix that — it never touches the weights.

22:30Cassidy: There's an irony there that I can't get past. Their gesture at future work includes distilling the experience into the model's parameters for tighter coupling. Which is fine-tuning. The exact thing the whole framework was built to avoid.

22:45Finn: Right. The escape hatch loops back to the cage. Although I'd read it more charitably — they're saying prompt-injection is a soft ceiling, and one day you might want both: external memory for swappability, plus some parametric grounding for fidelity. It's an honest acknowledgment that the clean version has limits.

23:05Cassidy: That's the fairer read. And to be clear about scope — this is a fresh preprint, no peer review yet, and several of the headline claims, especially the transfer results, rest on a modest handful of model pairs. It's a promising direction with solid evidence. It is not a settled production recipe.

23:24Finn: Which is exactly the right note to land on, because the idea underneath is bigger than this one set of benchmarks. Step back from the numbers, Cassidy, and what's the actual shift here?

23:36Cassidy: The shift is in what we mean by agent memory. For years, "an agent that learns from experience" basically meant "fine-tune the model on its own logs" — which welds your knowledge to a model you'll replace in three months, and is flatly impossible for closed APIs. ExpGraph offers a different bargain. Keep the experience external, model-independent, in a structure you own. When a better model ships, you don't retrain. You point the same memory at the new model.

24:05Finn: And the conceptual reframe is the cleanest part. Memory stops being "store similar things and look them up" and becomes "store the relational structure, and learn what's actually useful through feedback." That with-versus-without reward is a small idea with large consequences — it turns retrieval from a similarity heuristic into something measured against ground truth. Did the memory make us win, not did the memory look relevant.

24:33Cassidy: There's a practical sweetener too. You avoid the alignment risk of fine-tuning a safety-tuned closed model, because you never alter it. And retrieval logs are inspectable in a way weight updates simply are not. You can read the notebook. You can't read a gradient.

24:49Finn: That auditability point is underrated. When the agent does something surprising, you can pull up exactly which experiences it retrieved and why. Try doing that with a weight update buried across thirty-two billion parameters.

25:04Cassidy: So where I land is — the cleverest single move in the paper isn't the graph, and it isn't the diffusion, elegant as that spreading-dye trick is. It's running the task twice and asking whether the memory earned its place. Everything else is scaffolding around that one honest question.

25:21Finn: And the most exciting result is the one that's easiest to underplay — a small, cheap model writing a playbook that makes a model ten times its size genuinely better, with the big model frozen the entire time. If that transfer holds up beyond these benchmarks, it changes the economics of building agents that improve over time. That's a real if. But it's a good if.

25:44Cassidy: That's a good place to leave it. If you want to go deeper, the paper and a few related reads are in the show notes.

25:51Finn: And if you want the full transcript with every term we threw around defined inline — plus the concept pages that link this over to the other memory and agent episodes we've done — that all lives on paperdive dot AI.

26:04Cassidy: Thanks for spending the time with us. This has been AI Papers: A Deep Dive.