When Your AI Assistant Won't Let Go of Old Facts About You

0:00Maisie: Picture this. You've been chatting with an AI assistant for months. Back in June, you mentioned you bike to work, and you asked it to recommend cycling gear. It dutifully filed that away. Then in November, in some unrelated conversation, you mention you broke your leg playing basketball. A week later you ask, "hey, can you suggest a commute plan for tomorrow?" What does a good assistant do?

0:26Tyler: Anything but recommend the bike.

0:28Maisie: Right. And here's the catch — you never told it to forget the bike. You never said "I can't cycle anymore." The injury implicitly retired the cycling memory through a chain of common-sense reasoning. Broken leg, can't ride, bike commute no longer applies. The link between those two facts is never spoken.

0:48Tyler: It's the kind of inference a halfway-attentive friend does for free. The paper we're digging into argues that LLM-based assistants, even the frontier ones, mostly don't. It was posted to arXiv on May seventh, and we're recording just a couple of days later. What you're hearing is AI-generated. I'm Tyler, Maisie's here with me, we're both AI voices from ElevenLabs, and the script came out of Anthropic's Claude Opus 4.7. Neither company is involved in producing the show.

1:18Maisie: The full title, and it's a good one, is "STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?" It's out of Wuhan University, the Hong Kong University of Science and Technology, and the Chinese University of Hong Kong, lead authors han-SHYAHNG chow and ee-HAHN bye. The bike-and-broken-leg scenario is theirs. They use it as the opening illustration of what they call implicit conflict, and that little example carries almost the entire argument of the paper.

1:49Tyler: So let me put the headline result on the table early, just so we know what we're trying to explain. The benchmark is called STALE. The best frontier model they tested, Gemini three-point-one Pro, scores about fifty-five percent overall. Most other models score in the single digits. And that's on a test where, if I described any single scenario to you over coffee, you'd get every question right.

2:17Maisie: That gap is the whole story. So let's set up why it exists. The authors make a conceptual move at the start of the paper that I think is the durable contribution, even more than the benchmark. They argue everyone has been measuring memory wrong.

2:34Tyler: Meaning what?

2:35Maisie: Meaning the dominant way to evaluate long-term memory in AI assistants is fact recall. Bury something the user said three months ago in a long history, then ask the model to fish it back out. There are big benchmarks for this: LoCoMo, LongMemEval. By that yardstick, models look pretty good and are getting better. The authors say: that's the easy half. The hard half is recognizing when a stored memory is no longer true, even though nothing in the conversation explicitly says so.

3:09Tyler: And that's an inference job, not a retrieval job.

3:12Maisie: Exactly, Tyler. They reframe the whole problem. Memory shouldn't be a transcript cache. It should be more like tracking a ship through fog by listening to occasional radio chatter. You never see the ship directly. You only get sparse, indirect transmissions. Each user utterance is one of those transmissions. The job of memory isn't to write down every transmission verbatim. It's to maintain a running estimate of where the ship actually is, and update it as new chatter arrives. The earlier estimates get retired.

3:48Tyler: Which is a real philosophical departure from how most memory frameworks actually work today. The standard trick — and this is worth slowing down on for a second, Maisie — is retrieval-augmented generation. RAG. When you say something the system wants to keep, it gets stored as a chunk of text plus a kind of numerical fingerprint of its meaning. Later, when you ask a question, the system fingerprints the question and pulls back the chunks whose fingerprints look similar. Those get pasted into the prompt invisibly, and the model answers as if it remembered them.

4:25Maisie: Right. RAG is a librarian who fetches books matching your topic. It's not a friend who knows your situation.

4:32Tyler: And the bike example shows exactly why that distinction matters. If I ask the assistant to plan my commute, the retriever will happily pull up the cycling memory because cycling is about commuting. The new fact about the broken leg — even if it's also stored — might or might not get pulled, and if it does, it sits next to the cycling memory with no hierarchy. Both are just text in the prompt. The model has to do all the adjudication on the fly.
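A minimal sketch of the retrieval step described above, assuming a toy bag-of-words "fingerprint" in place of a learned embedding. The names `embed`, `cosine`, and `retrieve` and the stored sentences are illustrative; they are not taken from the paper or from any memory library.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy fingerprint: word counts (a stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, memories, k=2):
    """Return the k stored chunks most similar to the query."""
    q = embed(query)
    return sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)[:k]


# Hypothetical stored memories, written as plain text chunks.
memories = [
    "June: I bike to work every day, recommend cycling gear for my commute",
    "November: I broke my leg playing basketball",
    "August: my favorite pizza topping is mushrooms",
]

query = "suggest a commute plan to work for tomorrow"
top = retrieve(query, memories, k=2)

# Retrieved chunks are pasted into the prompt as undifferentiated text.
prompt_context = "\n".join(top)
```

In this toy setup the cycling memory ranks first because it shares words with the query, while the injury memory has zero overlap with the query. Nothing in `prompt_context` marks either chunk as newer or more authoritative than the other.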

5:01Maisie: Which brings us to the benchmark itself. The authors define implicit conflict precisely. There's a formal version with two axioms, but the plain-language version is what matters: a later observation makes an earlier belief impossible under common-sense world knowledge, and crucially, no utterance ever explicitly says the old belief is wrong. That's the definition. And they split implicit conflicts into two flavors. Type One — they call it co-referential. Both observations are about the same attribute of the user. "I live in Seattle." Months later: "I just signed a lease and set up utilities in Portland." That's a clean overwrite. The new fact is on the same dimension as the old fact, your city, and it just replaces it. Type Two is the interesting one. Propagated. The new observation updates a different attribute, but the change cascades through a logical dependency to invalidate an earlier belief. Bike-and-broken-leg is the canonical example. The injury is about your health. The cycling memory is about your routine. The connection between them — broken leg means you can't ride a bike — is never said. The model has to make that inference itself.

6:21Tyler: And Type Two is where everything falls apart. We'll get to the numbers. But first let me lay out the three different things they probe, because this is one of the cleanest design choices in the paper. For each scenario, they ask three kinds of question. The first is direct. "Does the user still bike to work?" They call this state resolution. Just — can you tell me whether the old memory is still valid? The second is a trap. They phrase a question that presupposes the old, stale memory. Something like, "Since the user bikes daily, can you recommend a route?" That's premise resistance. The question is built on the assumption that the cycling memory is still current. A robust assistant should push back. Should say, hold on, I don't think you can bike right now. The third is the most ecologically natural. They ask a downstream task that doesn't mention either side of the conflict but requires the updated state to answer well. "Can you plan my commute this week?" Nothing about cycling. Nothing about the leg. Just — be useful. They call that implicit policy adaptation.

7:38Maisie: And these three really are different capabilities. That's the punchline of the empirical section. A model can ace the first one and fail the other two. Knowing a memory is stale, when asked, doesn't mean acting like it's stale when you're not asked.
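One way to picture a single benchmark item is as a record holding the two observations, the conflict type, and the three probe questions. This is a sketch under assumptions: the class and field names are hypothetical and are not the paper's data format. The probe wordings are the ones quoted in the discussion above.

```python
from dataclasses import dataclass
from enum import Enum


class ConflictType(Enum):
    CO_REFERENTIAL = "type_one"  # same attribute, new value overwrites old
    PROPAGATED = "type_two"      # different attribute, invalidates via dependency


@dataclass(frozen=True)
class Scenario:
    old_observation: str
    new_observation: str
    conflict_type: ConflictType
    state_resolution: str    # direct question about the old memory
    premise_resistance: str  # question that presupposes the stale memory
    policy_adaptation: str   # downstream task that mentions neither side


def probes(scenario):
    """Return the three probe questions keyed by the capability they test."""
    return {
        "state_resolution": scenario.state_resolution,
        "premise_resistance": scenario.premise_resistance,
        "policy_adaptation": scenario.policy_adaptation,
    }


bike = Scenario(
    old_observation="I bike to work.",
    new_observation="I broke my leg playing basketball.",
    conflict_type=ConflictType.PROPAGATED,
    state_resolution="Does the user still bike to work?",
    premise_resistance="Since the user bikes daily, can you recommend a route?",
    policy_adaptation="Can you plan my commute this week?",
)
```

Keeping the three probes as separate fields reflects the point made above: each one is scored on its own, so a model can pass one and fail the others on the same scenario.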

7:56Tyler: Which brings me to the single most striking number in the paper. Gemini three-point-one Pro, the best model they tested. On the direct question — is this memory still valid? — it scores ninety-two percent on the easier Type One conflicts. Genuinely good. Then on the same scenarios, when they phrase the question to presuppose the old fact, the score collapses to thirty percent.

8:24Maisie: Same model, same scenario, same memory in the context. It can identify the memory is stale in one breath and then go along with a question that assumes the memory is fresh in the next.

8:37Tyler: And Qwen three-point-five 27B is even more dramatic — seventy-six percent on the direct question, four percent on the presupposing version. Most open-source models score essentially zero on premise resistance. The model knows. It just doesn't push back when you frame the question around the old fact. The user's framing wins.

9:00Maisie: There's a useful intuition for this. It's the assistant who agrees with whatever you assume. Ask "is the cycling plan still good?" — they correctly say no. Ask "since I'm cycling tomorrow, what route should I take?" — they smile and start planning the route. The failure isn't ignorance. It's that the model treats the user's framing as ground truth and doesn't interrogate it against memory.

9:28Tyler: The practical stakes here are not subtle. If you're building a health assistant and the user has described an injury, you absolutely cannot have the model agreeing with a question like "given my morning runs, what protein intake do I need?" That's a real failure mode for a real product.

9:48Maisie: Now Type Two — propagated conflicts — is harder still. Gemini on Type Two state resolution drops to sixty-nine percent, premise resistance to fourteen, downstream policy to fifty-five. Across the board, the propagated cases are dramatically worse than the co-referential ones, because the dependency chain has to be inferred rather than overwritten.

10:12Tyler: Maisie, this is where the diagnostic gets interesting. Because if the failure was just "the model didn't see the new evidence" — if retrieval was the bottleneck — the fix would be obvious. Build a smarter retriever. But the authors run a really sharp piece of analysis that rules that out.

10:34Maisie: Which one are you thinking of?

10:36Tyler: They take LightMem — the strongest of the off-the-shelf memory frameworks they test — and they crack it open. They look at what's actually being pulled into the prompt when the model answers. And here's the punchline. When the question is about a memory that should have been invalidated, the new evidence — the broken leg, the Portland lease — is in the retrieved set seventy-seven point five percent of the time. More than three quarters. The system has the new fact.

11:10Maisie: But it still fails.

11:12Tyler: It still fails. Because in sixty percent of those cases, the old evidence is right there too, sitting next to the new evidence, with no flag distinguishing them. And only about three percent of the old entries get marked as needing an update. So the model is staring at both versions of the user, and it has no principle for deciding which one currently governs. The authors have a phrase for this that I think is the keeper line of the paper. Visibility does not imply authority.

11:47Maisie: That's the diagnosis in five words. The new evidence is visible. It just has no special standing. It's another text chunk among text chunks.

11:57Tyler: Think of a detective with two case files on the same suspect. June file says he was in Seattle. November file shows him signing a lease in Portland. Both files are on the desk. Both get picked up when the detective reaches for "where is he?" The problem isn't access. The problem is that nobody's job is to walk over and stamp the June file ARCHIVED. So when a new question comes in, both files speak with equal authority. And the detective hedges or guesses.
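The three rates in this diagnostic can be computed from per-question retrieval logs. The sketch below assumes a hypothetical log format (`RetrievalLog`) and uses made-up toy logs, not the paper's data. It also assumes the "flagged for update" rate is measured over cases where the old entry was retrieved, which is one reading of the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalLog:
    new_retrieved: bool  # was the invalidating evidence in the retrieved set?
    old_retrieved: bool  # was the stale memory in the retrieved set?
    old_flagged: bool    # did the framework mark the stale memory for update?


def rate(numerator, denominator):
    """Safe ratio that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def diagnose(logs):
    """Summarize visibility of new evidence versus authority over old evidence."""
    with_new = [log for log in logs if log.new_retrieved]
    both = [log for log in with_new if log.old_retrieved]
    with_old = [log for log in logs if log.old_retrieved]
    flagged = [log for log in with_old if log.old_flagged]
    return {
        "new_visible": rate(len(with_new), len(logs)),
        "old_copresent_given_new": rate(len(both), len(with_new)),
        "old_flagged": rate(len(flagged), len(with_old)),
    }


# Illustrative toy logs only; these are not results from the paper.
toy_logs = [
    RetrievalLog(True, True, False),
    RetrievalLog(True, True, False),
    RetrievalLog(True, False, False),
    RetrievalLog(False, True, False),
]
report = diagnose(toy_logs)
```

A high `new_visible` rate combined with a high co-presence rate and a low flagged rate is the pattern summarized as "visibility does not imply authority".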

12:30Maisie: That's also why memory frameworks, as a category, don't really help here. The authors test five of them — Mem0, Zep, LiCoMemory, A-mem, LightMem — and on the same backbone, GPT-4o-mini, plain GPT-4o-mini scores about nine percent. The best framework, LightMem, gets to about eighteen. Some of them score worse than the raw model.

12:54Tyler: Adding a memory module, in some cases, makes things worse. That's a result you really have to sit with if you're building one of these products.

13:04Maisie: It's worse because most of these frameworks are dressed-up retrieval systems. They're better at finding stuff. They're not better at deciding stuff. And the failure mode the paper is testing isn't a finding problem.

13:19Tyler: So the question becomes: what would a fix actually look like? Not in principle, but architecturally? And this is where the paper proposes its prototype.

13:29Maisie: Right. They build a system called CUPMEM. Before we get into how it works, here's the headline. On the same GPT-4o-mini backbone where vanilla scored eight point seven percent, CUPMEM scores sixty-eight. Roughly an eight-x jump. That's the result that anchors the architectural claim. The architecture has three pieces but the central idea is one move. Most memory systems do all the heavy lifting at query time. New question comes in, pull a bunch of stuff that looks relevant, hand it to the model, and hope it sorts things out. CUPMEM flips that. It does the heavy lifting at write time.

14:10Tyler: Walk me through the filing-cabinet version of this.

14:13Maisie: Sure. Two ways to run a filing cabinet. Method A — every document that comes in gets dropped in the drawer. Later, when someone asks a question, you pull all the documents that look relevant and try to figure out, on the spot, which papers are still valid and which are obsolete. That's retrieval-time reconciliation. That's what current memory frameworks do. Method B — every time a new document arrives, before you put it in the cabinet, you walk through the existing files, find the ones it contradicts, and stamp them SUPERSEDED. Then you re-file everything. Now when a question comes in later, you only ever read currently-valid documents. Same cabinet, same documents. Totally different reliability. CUPMEM is method B. When a new conversation comes in, it extracts state-update candidates and runs an LLM-based adjudicator that decides, for each affected old memory: keep it, mark it stale, replace it, or mark it as unknown.
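Method B can be sketched as a store that reconciles at write time. This is a simplified illustration, not CUPMEM itself: the paper describes an LLM-based adjudicator with four outcomes (keep, mark stale, replace, mark unknown), whereas the `adjudicate` function here is a rule-based stand-in that only keeps or marks stale. All class and slot names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    STALE = "stale"


@dataclass
class Memory:
    slot: str   # the user attribute this memory describes, e.g. "home_city"
    value: str
    status: Status = Status.ACTIVE


def adjudicate(old, new):
    """Rule-based stand-in for an LLM adjudicator: same slot, new value => stale."""
    if old.slot == new.slot and old.value != new.value:
        return Status.STALE
    return old.status


class WriteTimeStore:
    def __init__(self):
        self.memories = []

    def write(self, new):
        """Stamp contradicted memories SUPERSEDED before filing the new one."""
        for old in self.memories:
            if old.status is Status.ACTIVE:
                old.status = adjudicate(old, new)
        self.memories.append(new)

    def active(self):
        """Only currently valid memories are ever handed to the reader."""
        return [m for m in self.memories if m.status is Status.ACTIVE]


store = WriteTimeStore()
store.write(Memory("home_city", "Seattle"))
store.write(Memory("diet", "vegetarian"))
store.write(Memory("home_city", "Portland"))
current_values = [m.value for m in store.active()]
```

The stale Seattle entry stays in `store.memories`, so nothing is deleted; it is simply excluded from what a later query can read as current.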

15:21Tyler: So that handles Type One — the co-referential overwrite. Same attribute, new value, old one gets stamped. What about Type Two? The propagated case where the new evidence is about a different attribute?

15:36Maisie: That's the second piece. They call it topology-triggered propagation. The adjudicator doesn't only look at memories on the same dimension as the new evidence. It also follows structural dependencies — a relocation might invalidate commute assumptions, an injury might invalidate exercise routines. The system is told, in essence, here are the dependency chains that exist between user attributes; when you update one, check the linked ones. The third piece is what they call constrained readout. At query time, when the model answers, it doesn't get a raw top-k list of memories. It gets memories tagged with their status — active, stale, unknown — and stale items are blocked from serving as premises. So even if the old cycling memory is still in the database, it can't show up uncritically in the prompt.

16:35Tyler: And the eight-point-seven to sixty-eight jump comes from those three pieces working together. That's a real result.
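The second and third pieces can be sketched together: a dependency map that says which attributes to re-check when another one changes, and a readout that tags each memory with its status and blocks stale ones. This is an illustration under assumptions. The `DEPENDENTS` map, the slot names, and `toy_judge` (a hard-coded stand-in for the LLM adjudicator) are all hypothetical and are not the paper's schema or prompts.

```python
# Hypothetical dependency chains: updating the key triggers a check of the values.
DEPENDENTS = {
    "health": ["commute_mode", "exercise_routine"],
    "home_city": ["commute_mode"],
}


def toy_judge(old_slot, old_value, new_slot, new_value):
    """Hard-coded stand-in for an LLM verdict on a linked memory."""
    if (new_slot == "health" and "broke" in new_value
            and old_slot == "commute_mode" and "bike" in old_value):
        return "stale"
    return "unknown"  # linked but not clearly invalidated: flag as uncertain


def write(store, slot, value, judge=toy_judge):
    """Write a new observation and propagate along dependency chains."""
    for dep in DEPENDENTS.get(slot, []):
        if dep in store and store[dep]["status"] == "active":
            store[dep]["status"] = judge(dep, store[dep]["value"], slot, value)
    # Same-slot write is a plain overwrite (the co-referential case).
    store[slot] = {"value": value, "status": "active"}


def readout(store):
    """Tag memories with status; stale ones are blocked from the prompt."""
    premises = [
        f"[{m['status']}] {slot}: {m['value']}"
        for slot, m in store.items()
        if m["status"] != "stale"
    ]
    blocked = [slot for slot, m in store.items() if m["status"] == "stale"]
    return premises, blocked


store = {}
write(store, "commute_mode", "bike to work")
write(store, "exercise_routine", "morning runs")
write(store, "health", "broke leg playing basketball")
premises, blocked = readout(store)
```

In this toy run the health update never mentions cycling, yet the commute memory ends up blocked because the dependency map told the judge to look at it. The exercise memory is tagged unknown rather than silently trusted.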

16:43Maisie: Okay, Tyler — bring out the caveats list. There's a lot to push on.

16:48Tyler: A few things, yes. First, the obvious one — the eight-x jump is on the benchmark the authors built. Which doesn't make it fake. The benchmark has been carefully designed and validated, and there's no reason to think the scenarios are gimmicky. But it does mean we haven't yet seen CUPMEM tested on organic, messy, multi-update real-user data. The authors are honest about this. Worth being honest about it too.

17:17Maisie: That's fair.

17:18Tyler: Second, there's a clean conceptual gap inside CUPMEM's own results that I find really revealing. On the direct state-resolution question — is this memory stale? — CUPMEM scores around ninety percent across both Type One and Type Two. Excellent. On premise resistance, it does well too. But on implicit policy adaptation — the natural downstream task, "plan my commute" — it only gets to about thirty-two percent on Type One and forty-three percent on Type Two.

17:49Maisie: So the architecture is excellent at saying "this memory is stale" and only middling at translating that recognition into appropriate downstream behavior.

18:00Tyler: Right. Which is the very gap the paper is diagnosing. Recognition versus application. CUPMEM closes a lot of it. It doesn't close all of it. There's still some part of the model that knows the memory is stale, has been told the memory is stale by an explicit tag, and still produces an answer that doesn't quite act on it.

18:22Maisie: That's an honest reading, Tyler. And I'd add a third caveat. CUPMEM depends on a hand-designed schema covering things like health, location, routine, occupation — a two-level structure of state domains and local slots. The schema was built independently of the benchmark, which is good practice. But it's still a closed-world assumption. Real users have attributes that don't map cleanly into pre-defined slots — relationships, hobbies that don't fit neatly, professional contexts that shift. Schema-free generalization is unsolved, and the dramatic gain may shrink under those conditions.

19:01Tyler: There's also a question about the benchmark itself that I think a careful reviewer would press on. Every conflict scenario was generated by an LLM pipeline — one model produces the original observation, another model attacks it to generate a conflicting later observation, a third model judges whether the conflict satisfies the formal criteria, and humans review at the end. That's a thoughtful pipeline. It's also a pipeline that produces conflicts an LLM thinks look like implicit conflicts, not the actual distribution of conflicts that real users generate over months of chat.

19:39Maisie: And related: there's only one conflict per scenario. The distractor sessions are filtered to make sure no other updates happen. Real user histories have many partial updates layered on each other, sometimes contradicting, sometimes drifting. STALE measures a clean, controlled signal. Real life is messier.

19:59Tyler: All of which is fine. A clean signal is exactly what you want for a first benchmark of a new failure mode. It just means we're at the beginning of measuring this, not the end.

20:10Maisie: One more thing worth flagging. The judge that scores the answers is itself an LLM: Gemini three-point-one Flash-Lite. The best-performing contestant in the paper is Gemini three-point-one Pro. They're from the same family. The authors validate the judge against human raters and report ninety-six percent agreement, which is the standard mitigation. But there's at least a question about whether judge and contestant share blind spots that affect scoring.

20:41Tyler: That's an everywhere problem in LLM evaluation right now, not specific to this paper. But it's worth naming.
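For reference, judge-versus-human agreement of the kind mentioned above is usually reported as the fraction of items where the two labels match. The sketch below assumes simple pass/fail labels and uses made-up toy values; the paper's exact validation protocol is not described in this discussion.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of items on which the LLM judge and the human rater agree."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Illustrative toy labels only (1 = answer judged correct, 0 = incorrect).
judge = [1, 1, 0, 1, 0]
human = [1, 1, 0, 0, 0]
toy_agreement = agreement_rate(judge, human)
```

Raw agreement does not reveal whether the disagreements cluster on one model family's outputs, which is the shared-blind-spot concern raised above.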

20:48Maisie: So pulling back. What's the durable insight here? I want to separate that from the specific architecture.

20:55Tyler: For me, the durable insight is the reframe. Memory in LLM agents has been treated as a retrieval problem — find the right document — and the paper makes a strong case it should be treated as an inference problem — maintain a coherent picture of who the user currently is. Those are different design philosophies. They imply different architectures.

21:19Maisie: And the diagnostic finding underneath that — visibility does not imply authority — is a useful lens for evaluating any memory product claim. The next time someone tells you their assistant has long-term memory, the question to ask isn't "does it remember?" It's "does it adjudicate?" Does it have a mechanism for retiring beliefs?

21:41Tyler: There's a stakes argument here too. The longer an assistant's memory grows, the more chances stale beliefs have to silently distort current behavior. The value of solving this problem scales with the very thing the AI assistant industry is selling — that the system gets better the longer you use it. If the memory layer accumulates beliefs but never retires them, that pitch is in trouble.

22:07Maisie: A health assistant that remembers you exercise regularly after you've described an injury. A travel assistant that books based on the city you used to live in. A productivity tool that schedules around a job you've left. These aren't hypothetical edge cases. They're the natural failure surface of a system that treats long-term memory as accumulation rather than revision.

22:30Tyler: And the asymmetry that bothers me most — the user is going to assume the assistant is doing the inference. That's the whole point of telling it things. You don't want to repeat yourself. You want to mention you broke your leg once, in passing, and have the system handle the implications. The bike scenario isn't a corner case. It's exactly what users expect to work.

22:54Maisie: One last beat worth flagging. There's a quieter result buried in the attention analysis the authors run on Qwen models. They show that when models do get the answer right, it correlates with the new session getting more attention than the old one. But cross-session attention — the new session reaching back to look at the old one — is weak. The model isn't really doing internal reconciliation. When it gets things right, it gets them right by accident of attention rather than by inference.

23:25Tyler: Which is consistent with the broader picture. The model has no internal mechanism for belief revision. That's a thin foundation to build a long-term assistant on.

23:35Maisie: So the takeaway is something like — until the field figures out adjudication, memory frameworks are giving you a system that knows more without behaving like it knows more. The recognition is there. The application isn't.

23:49Tyler: And the prototype the authors offer suggests that even fairly modest, heuristic adjudication, meaning write-time decisions about what supersedes what, buys you a lot. It doesn't solve the problem. But it points the architectural effort toward adjudication, not toward better retrieval.

24:06Maisie: That's the place to leave it. The paper came out earlier this month, and this episode was put together on May ninth, twenty twenty-six.

24:14Tyler: Link to the paper's in the show notes, along with some related reading if you want to go deeper. Thanks for listening to AI Papers: A Deep Dive.