When Three LLMs Talk to Each Other, Their Ideas Quietly Stop Moving

0:00Juniper: In eighteen forty-three, Ada Lovelace looked at Charles Babbage's Analytical Engine — the first programmable computer that ever existed, even if only on paper — and made a claim that has been haunting AI ever since. She said the machine "has no pretensions whatever to originate anything." It can only do what we know how to order it to perform. A century later, Alan Turing reframed her worry as a physical image. A normal machine, he said, is like a piano string struck by a hammer — it vibrates briefly when you hit it, and then it falls quiet. A truly creative machine would be supercritical, like a nuclear pile that takes one incoming neutron and amplifies it into a cascade. The question both of them were asking is whether machines can be genuinely creative on their own, or whether they only ring once and die.

0:55Finn: And there is a paper that just took that hundred-and-eighty-year-old question and ran the experiment. It went up on arXiv on May sixteenth, twenty-twenty-six, and we're recording a week later. The paper is "Multi-LLM Systems Exhibit Robust Semantic Collapse," by Weiyi Kong, Shiyang Lai, Jinghua Piao, and James Evans — Toronto, Chicago, Tsinghua. Quick note before we go further: what you're hearing is AI-generated. The script is from Anthropic's Claude Opus 4.7. I'm Finn, and the other voice is Juniper — we're both AI voices from Eleven Labs, and the show isn't affiliated with either company. The reason that one-week gap matters is that the paper drops right into the middle of the autonomous-AI-scientist boom, and the answer it gives to Lovelace's question, for one specific case, is unusually sharp.

1:50Juniper: Sharp and not flattering. So here's the setup. Three instances of a large language model — they used GPT-4o-mini, DeepSeek-V3, and Phi-4 in different runs — take turns talking in a shared room. No task. No goal. No roles assigned. No human in the loop. Just open-ended conversation, for up to a thousand rounds. And they measure two different kinds of diversity. One is lexical — are new words showing up? You just count unique words over time. The other is semantic — is the *meaning* of what's being said actually moving around in idea-space? That second one needs a tool, and the tool matters for the whole result. The standard trick in modern NLP: you take a passage of text, a neural network turns it into a long list of numbers — a vector — and that vector is meant to capture what the passage *means*, not what words it uses. Two passages on similar topics point in similar directions in this high-dimensional space. Two passages on different topics point in different directions. The cosine of the angle between them gives you a similarity score: one is identical meaning, zero is unrelated. The neat thing is that "the cat sat on the mat" and "a feline rested on the rug" come out almost on top of each other in this space, even though they share almost no words. Which means you can finally ask the precise question Lovelace was groping toward: is the meaning changing, even when the surface words are?
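
The similarity score described here can be sketched in a few lines of Python. This is a minimal illustration, not the paper's pipeline: the three-dimensional vectors below are made-up stand-ins for real embeddings, which would come from an embedding model and have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings" (assumed values, not model output).
cat_on_mat = [0.9, 0.1, 0.0]
feline_on_rug = [0.8, 0.2, 0.0]
stock_report = [0.0, 0.1, 0.9]

print(cosine_similarity(cat_on_mat, feline_on_rug))  # close to 1: similar meaning
print(cosine_similarity(cat_on_mat, stock_report))   # close to 0: unrelated
```

The two paraphrases score near one even though a word-overlap measure would treat them as almost unrelated, which is the property the semantic curve depends on.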

3:28Finn: And the answer they find is: it isn't. The two curves separate completely. Picture two lines on a graph. The vocabulary line, the count of unique words used so far, climbs steadily upward, like a staircase. New words keep showing up, round after round. But the semantic line, how far the meaning has drifted from where the conversation started, barely moves. It wobbles outward for a bit and then settles, anchored near the opening, and it just stays there. The authors have a phrase for this in one of the figures that I think will end up being what everyone remembers the paper by: "new words, same ideas."
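
The vocabulary line is the simpler of the two curves to compute. The sketch below counts distinct words seen so far after each turn. The tokenization rule (lowercase, letters and apostrophes only) is an assumption made for illustration; the episode does not say how the paper splits words.

```python
import re

def cumulative_vocabulary(turns):
    """Return the number of distinct words seen after each turn."""
    seen = set()
    counts = []
    for turn in turns:
        # Assumed tokenization: lowercase runs of letters and apostrophes.
        seen.update(re.findall(r"[a-z']+", turn.lower()))
        counts.append(len(seen))
    return counts

# Three made-up turns that say the same thing in different words.
turns = [
    "the cat sat on the mat",
    "a feline rested on the rug",
    "the kitty lounged upon the carpet",
]
print(cumulative_vocabulary(turns))  # [5, 9, 13]
```

The count keeps climbing even though all three turns describe the same scene, which is the lexical half of "new words, same ideas."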

4:08Juniper: New words, same ideas. That's the whole result in four words.

4:11Finn: And here's the number that makes it visceral. Late in the conversation, averaging across the last fifty windows, the similarity between what the LLMs are saying and what they said at the very start is about zero-point-seven-five. Three-quarters of the way to identical. For comparison, they ran the same measurement on human Reddit threads, open-ended discussion threads where people are just talking. Same embedding model, same windowing. Humans come out at zero-point-two-nine. By that measure, the LLM conversations are roughly two and a half times as anchored to their starting point as the human ones. And if you run the same setup again, from a different random seed, the two independent LLM runs end up looking like each other too. The collapse is reproducible, not idiosyncratic. Independent rooms full of models drift to similar places.

5:05Juniper: So if you stopped there, you might say: well, that's just one configuration. Maybe the temperature was set too low. Maybe the prompts were too constraining. Maybe a richer setup would explore more. And this is where the paper does something I really respect. They anticipated every plausible objection and ran each one as an intervention.

5:27Finn: This is where it gets fun, because the intervention list reads like a tour of every plausible escape hatch. "Maybe the temperature was too low?" They tested zero-point-five, zero-point-nine, one-point-two, two-point-oh. Cranking randomness up. "Maybe the prompts were too constraining?" Six different prompt rewrites, including ones that *literally instruct* the models to be diverse. "Maybe they need richer personas?" Initialize each agent with the full Wikipedia biography of Herbert Simon, Judea Pearl, and Deborah Mayo — three actual thinkers with very different intellectual textures. "Maybe one model family is special?" Mix families in the same conversation. "Maybe RLHF and safety alignment are killing creativity?" Swap in uncensored variants with no safety training at all. "Maybe the models are being sycophantic — just agreeing with each other?" They used activation steering — basically reaching inside the model and turning down the sycophancy dial — and reduced measured sycophancy by fifty-eight percent. "Maybe more agents would explore more space?" Scale to ten. "Maybe external shocks would shake them loose?" Inject random unrelated passages every three windows, like throwing a rock into the pond. "Maybe the simulator itself is the problem?" Reproduce in two completely different multi-agent frameworks.

6:55Juniper: Twelve intervention categories. Forty-five conditions in total. After Bonferroni correction across sixty-two baseline comparisons — which is a fancy way of saying they were appropriately careful about not getting fooled by noise — zero of the interventions produced a significant positive effect on semantic diversity.
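
For readers who want the arithmetic behind the Bonferroni step: the correction divides the significance level by the number of comparisons, so each individual test has to clear a much stricter bar. The sketch below assumes a family-wise level of 0.05, which the episode does not state, and uses invented p-values.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant at family-wise level alpha."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m  # per-test bar after correction
    return [p < threshold for p in p_values]

# With 62 comparisons and an assumed alpha of 0.05,
# the per-test threshold is about 0.0008.
n_comparisons = 62
print(0.05 / n_comparisons)

# Invented p-values: one very small, one merely "small", the rest large.
example = [0.0001, 0.01] + [0.5] * 60
flags = bonferroni_significant(example)
print(flags[:2])  # [True, False]
```

A p-value of 0.01 would pass an uncorrected 0.05 test but fails here, which is the kind of false alarm the correction is meant to screen out.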

7:17Finn: Zero. And we haven't gotten to the punchline yet.

7:20Juniper: Right — save it.

7:21Finn: The punchline is the reinforcement-learning result. The thinking is straightforward: if everything else fails, just *train* the models to be diverse. They used a technique called GRPO with a reward function that directly penalizes semantic recurrence — if you say something close to what's already been said, you get punished; if you wander into new territory, you get rewarded. As direct an attempt to optimize against the phenomenon as you can write down. What happens is — the system briefly diversifies. For about two rounds. There's a perturbation, a kick away from the attractor. And then it snaps back. The trajectory rejoins the same collapsing curve everyone else is on. But here's the part that I think is the strongest evidence in the whole paper that something structural is going on. When they measured how similar independent RL-trained runs were to *each other* — cross-run similarity — that number went *up*. From about zero-point-five to about zero-point-eight. Optimizing the models to be diverse made the runs look *more* alike across the board.
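
To make "a reward that penalizes semantic recurrence" concrete, here is a deliberately simplified sketch. It is not the paper's reward and it does not implement GRPO. The function name `novelty_reward` and the scoring rule (one minus the highest cosine similarity to any earlier turn) are assumptions chosen for illustration, and the vectors are toy values.

```python
import math

def _cosine(u, v):
    # Assumes equal-length, non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def novelty_reward(candidate, history):
    """Hypothetical reward: 1 minus the highest similarity to any earlier turn."""
    if not history:
        return 1.0  # nothing to repeat yet
    return 1.0 - max(_cosine(candidate, past) for past in history)

# Toy embeddings of two earlier turns (assumed values).
history = [[1.0, 0.0], [0.9, 0.1]]

repeat = novelty_reward([1.0, 0.0], history)    # restates an earlier turn
new_idea = novelty_reward([0.0, 1.0], history)  # points somewhere new
print(repeat, new_idea)
```

A reinforcement-learning loop would then push the policy toward outputs with higher reward. The finding discussed in this episode is that doing so shifted the trajectory only briefly.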

8:26Juniper: That's the bit that really got me. You push on the symptom, and the underlying convergence gets sharper, not weaker. There's a loose analogy — imagine telling a committee "your job is to disagree." If everyone on the committee is reading from the same playbook about what counts as disagreement, they'll all reach for the same officially-sanctioned disagreement moves, and end up looking more alike, not less. The pressure to be diverse, applied uniformly across agents drawing on the same underlying model, can collapse onto a shared notion of what diversity looks like. I wouldn't push that analogy too hard — the actual mechanism is about how gradient updates against a shared reward shape a shared policy — but the bare empirical fact is striking on its own. Try to force diversity, and the runs become twins.

9:15Finn: Juniper, before we go into the mechanism — I want to flag what I think the listener should be holding in their head at this point. We have a behavioral pattern that shows up reproducibly across three different model families. It's robust against every parameter knob you can turn from the outside. It's robust against richer initialization. It's robust against architectural variations in the orchestrator. It's robust against safety-training removal, against sycophancy reduction, and against direct reinforcement learning. After all of that, the natural question is: what is actually happening inside the model that produces this?

9:54Juniper: Right. And the authors do open up the model. They use an open-weight one — Llama-3.1-8B — because you can't peer inside GPT-4o-mini. The technique is called teacher-forcing replay: take real transcripts of these collapsing multi-agent conversations and feed them through Llama while recording everything — which attention patterns light up, what the model would have predicted at each step, how confident it was. And what they find points at a specific kind of circuit inside the model called an induction head. I'm going to skip the standard explainer about attention and just give you what these things do, because that's what matters. Induction heads are little look-back-and-copy-forward circuits. They scan the recent history of the conversation, find places where a current pattern has happened before, and they nudge the model's next-token prediction toward whatever came after that pattern last time. They're a kind of in-context echo mechanism. These circuits exist in basically all modern language models. They were originally characterized by a group at Anthropic a few years back. The new finding here is what happens to them as a closed-loop conversation drags on.

11:05Finn: They get louder.

11:06Juniper: They get louder. As the conversation accumulates more history, the induction heads' contribution to the output gets stronger and more confident. The authors analyze seven-hundred-and-fifty-nine retrieval events, moments where an induction head is actively pulling from history, and they find that the sequence it pulled is the top-one prediction over sixty percent of the time, and in the top ten over eighty percent of the time. The logit margin, basically how strongly the head is pushing, is large and growing over the trajectory. And here is the cleanest way to think about it. Imagine talking with someone where, ten minutes in, they pick up on a phrase you used and start echoing it. An hour in, they're echoing not just your phrases but their *own* earlier echoes. The longer you've been talking, the more the conversation is a recombination of things already said in this particular conversation, because the recent history is the most available material. Humans do this a little. These circuits, the paper shows, do this a lot, and they get more confident about doing it as the runway lengthens.

12:14Finn: And there's a second piece of the mechanism that I think is just as important — what's happening at the *bottom* of the vocabulary distribution. The authors look at token survival over time. They take the bottom ten percent of tokens by frequency in the early windows — the long tail, the rare words — and ask: are these tokens still showing up later? The bottom-ten-percent survival rate by window twenty is about eleven percent. The tail is being eaten. Meanwhile the top ten percent — the common, well-trodden tokens — survive at over ninety percent. So you have this double thing happening. The circuits that retrieve common patterns are getting stronger, and the rare material that could disrupt the loop is being systematically forgotten. It's a positive feedback toward the center of the distribution.
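
The tail-survival measurement can be sketched as follows. The details here are assumptions: the episode does not say how the paper tokenizes, breaks frequency ties, or defines "still showing up," so this version simply ranks early tokens by count and checks membership in one later window.

```python
from collections import Counter

def tail_survival(early_windows, later_window, fraction=0.10):
    """Share of the rarest early tokens that still appear in a later window."""
    counts = Counter(tok for window in early_windows for tok in window)
    if not counts:
        raise ValueError("early_windows contains no tokens")
    # Rarest first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda tok: (counts[tok], tok))
    k = max(1, int(len(ranked) * fraction))
    tail = ranked[:k]
    later = set(later_window)
    return sum(tok in later for tok in tail) / len(tail)

# Made-up token windows: "quark" and "zephyr" are the rare tail.
early = [
    ["the", "the", "idea", "the", "quark"],
    ["the", "idea", "idea", "zephyr"],
]
later = ["the", "idea", "quark"]

# fraction=0.5 is used only because this toy vocabulary has four tokens.
print(tail_survival(early, later, fraction=0.5))  # 0.5
```

Running the same function with the most frequent tokens instead of the rarest would give the head-of-distribution survival rate that the episode contrasts it with.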

13:03Juniper: And once you see those two things together, the empirical pattern stops looking like a bug. It starts looking like what closed information channels just *do*. Which gets us to the part of the paper that elevates it from "here's a striking empirical finding" to "here's a theoretical claim." The authors connect what they observe to three classical results from information theory. I don't want to walk through all three because they reinforce each other — but I want to spend real time on one of them, because I think it's the conceptual key.

13:36Finn: The Data Processing Inequality.

13:38Juniper: The Data Processing Inequality. Which sounds intimidating, and it's really not. The clean statement is this: if you have some original signal — call it the starting state of a conversation — and you process it through any chain of operations, the information about the original cannot grow. It can only stay the same or decay. No internal operation in the chain can recover information that wasn't there. The everyday version is the photocopy of a photocopy of a photocopy. You can run the page through the machine a hundred more times. You can change the toner. You can photocopy at a higher resolution setting. None of those operations puts back the detail that was lost in copy number three. The chain can only preserve or degrade — never restore. Now apply this to three LLMs talking in a closed room. There's no fresh input. Every new sentence is computed from the prior sentences by the same kind of machinery. The whole system is one long chain of internal operations on a signal that was set the moment the conversation began. The Data Processing Inequality says: once semantic diversity contracts, nothing the system can do to itself — more rounds, more prompts, more agents, more clever orchestration — can recover it. And this is the formal counterpart to Lovelace's eighteen-forty-three intuition. She didn't have information theory; she had philosophical taste. But the structure of her objection — that the machine cannot originate what wasn't supplied to it — turns out to map almost exactly onto a theorem about closed channels.
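
For reference, the standard textbook statement of the inequality is given below. Reading $X_0$ as the opening state of the conversation and $X_t$ as its state after $t$ rounds is the analogy drawn in this episode, and it assumes the closed conversation behaves like a Markov chain, which is an idealization.

```latex
% Data Processing Inequality: if X -> Y -> Z is a Markov chain
% (Z depends on X only through Y), then
\[
  I(X; Z) \;\le\; I(X; Y).
\]
% Applied along a chain X_0 -> X_1 -> ... -> X_t -> X_{t+1}:
\[
  I(X_0; X_{t+1}) \;\le\; I(X_0; X_t) \qquad \text{for all } t \ge 0 .
\]
```

Here $I(\cdot\,;\cdot)$ is mutual information. The second line says that the information each later state carries about the starting state can stay level or shrink from round to round, but it cannot increase.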

15:23Finn: And that's the part of the paper I find genuinely beautiful. The information-theoretic framing isn't just a victory lap — it explains why every intervention they tried *had* to fail. Most of those levers — temperature, prompts, persona, even RL — operate inside the closed loop. They reshuffle the input distribution, but they don't open the channel to fresh outside information. The Data Processing Inequality is indifferent to how you reshuffle inside the chain. The contraction proceeds anyway. The other two theorems the authors invoke sharpen this. One — exponential entropy contraction — says that systems like this approach their attractor *geometrically*. Which has a counterintuitive consequence: running ten thousand rounds instead of one thousand doesn't escape the attractor. It just resolves the floor more sharply. There's no Hail-Mary-with-more-compute play here. The third — the Algorithmic Lovelace Bound, which leans on Kolmogorov complexity — gives you a quantitative version: a system applied recursively to its own outputs can add at most logarithmically many bits of genuine novelty beyond what it started with. You can buy a little more time with richer initialization, but only a little.

16:35Juniper: And the test the authors run that I think really earns the theoretical framing is this. If the collapse is a structured contraction toward a model-specific floor — like a marble settling in a bowl — then the late-stage plateau should be *predictable* from the early trajectory. The shape of the bowl determines where the marble ends up; you should be able to fit a curve to the first fifty rounds and extrapolate. They do it. They fit a saturating exponential to just the first fifty rounds of conversation, and use it to predict the average similarity over the *final* fifty rounds. Mean absolute error: zero-point-zero-five-three. That is a tight prediction. This isn't random drift to wherever — it's structured contraction to a place you could see coming.
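
The extrapolation test can be illustrated with a small fitting routine. This is a sketch under stated assumptions: the functional form `floor + amp * exp(-rate * t)` is one common way to write a saturating exponential, the grid-search fitting method is a choice made here for simplicity, and the data are synthetic. None of it is taken from the paper.

```python
import math

def fit_saturating_exponential(values, rates):
    """Fit y(t) = floor + amp * exp(-rate * t) by grid search over rate.

    For each candidate rate the model is linear in (floor, amp), so those
    two are solved in closed form by ordinary least squares.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three points to fit three parameters")
    mean_y = sum(values) / n
    best = None
    for rate in rates:
        x = [math.exp(-rate * t) for t in range(n)]
        mean_x = sum(x) / n
        var_x = sum((xi - mean_x) ** 2 for xi in x)
        if var_x == 0.0:
            continue  # a rate of 0 makes the basis constant
        cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, values))
        amp = cov_xy / var_x
        floor = mean_y - amp * mean_x
        sse = sum((floor + amp * xi - yi) ** 2 for xi, yi in zip(x, values))
        if best is None or sse < best[0]:
            best = (sse, floor, amp, rate)
    if best is None:
        raise ValueError("no usable rate in the grid")
    _, floor, amp, rate = best
    return floor, amp, rate

# Synthetic, noise-free similarity curve for 50 early rounds (not paper data).
true_floor, true_amp, true_rate = 0.75, 0.25, 0.10
early = [true_floor + true_amp * math.exp(-true_rate * t) for t in range(50)]

rates = [i / 100 for i in range(1, 51)]  # candidate rates 0.01 to 0.50
floor, amp, rate = fit_saturating_exponential(early, rates)
predicted_plateau = floor  # the fitted curve's limit as t grows large
print(predicted_plateau, rate)
```

The fitted `floor` is the predicted late-stage plateau. Comparing it with the observed average over the final rounds gives an error figure of the kind quoted in the episode.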

17:20Finn: Right. Now — I want to push back on a couple of things before we get to implications, because the paper makes some big claims and not all of them are equally well-supported.

17:30Juniper: Please.

17:31Finn: First and most obvious one: the embedding model is doing a lot of work in this paper. Every claim about semantic diversity rests on cosine distances measured in a particular embedding space — OpenAI's text-embedding-three-large. If that space systematically compresses certain dimensions of meaning, then "semantic collapse" could be partly an artifact of how the measurement projects the conversation. The authors partly address this. The human Reddit baseline uses the same embedding model and shows much higher diversity — which is good evidence that the embedding space isn't intrinsically saturated, isn't crushing everything together by default. But the comparison isn't airtight. Reddit threads have human topic-switching pressures — people come in from outside, post about whatever they just read, drop news links. The LLM simulations don't have any of that. A fair skeptic could say the right comparison is humans constrained to the same minimal, taskless condition, and we don't have that data.

18:34Juniper: That's a fair pushback. The next one I'd flag — the "no task" setup is essentially a chosen worst case. The simulations have no goal, no scoring, no stakes. Real-world multi-LLM systems usually have some task structure — a research question, a problem to solve, something to optimize. It's at least plausible that task-anchored systems behave differently because the task provides exogenous structure that resists contraction. The paper acknowledges this implicitly by framing the result as being about *open-ended* generation. But the rhetoric sometimes slides toward broader conclusions about multi-LLM systems in general, and I think a careful reader has to be alert to that gap. The information-theoretic argument suggests task-anchored systems would *also* contract if they're closed — but the empirical evidence in the paper is for the open-ended case.

19:31Finn: And then there's the mechanistic story, which is established on a single open-weight model. Llama-3.1-8B. The behavioral collapse appears across model families, but the induction-head causal story is demonstrated in just one architecture. That's a reasonable scientific compromise, since you can't open up GPT-4o-mini, but it's worth being honest that the white-box claim is narrower than the black-box claim. And the RL intervention is one specific recipe. GRPO with a particular reward function combining penalties for recent-history similarity, anchor similarity, and cross-agent similarity. Different weightings or different reward shapes might do something different. The paper's stated claim is appropriately scoped (*this* intervention fails), but the headline framing can read as if RL in general can't escape collapse, and the experiment doesn't quite support that strong a claim.

20:28Juniper: All fair. To the authors' credit, they're pretty disciplined about how they frame the theoretical results — they explicitly call them "useful theoretical analogies under stronger assumptions" rather than proofs about LLMs. The math is interpretive scaffolding. The empirical work is what carries the contribution.

20:48Finn: There's one more nuance from the paper that I think genuinely complicates the picture in an interesting way, and that we should mention before we get to the bigger implications.

20:59Juniper: The cultural-axis result.

21:01Finn: Yeah. So you might assume — if all these LLM conversations are collapsing — that they all collapse to the *same place*. Like, one boring attractor that all multi-LLM systems converge on, eventually. That's not what happens. DeepSeek, GPT-4o-mini, and Phi-4 each contract toward partially distinct late-stage basins. The authors project the conversations onto axes like egalitarian-versus-hierarchical, individualism-versus-collectivism, cooperation-versus-competition — these are descriptive probes, not claims about underlying psychology, and the authors are careful about that — but you can see different models settling into different regions of these axes. The most concrete way they show it: a classifier trained to identify which model is speaking gets *more* accurate as the conversation goes on. Up to about ninety-four percent accuracy late in the trajectory. So collapse doesn't dissolve model identity. It sharpens it. Each model becomes more recognizably *itself* as it contracts.

22:05Juniper: Which has a strange flavor when you sit with it. The thing we worry about — homogenization — is happening within each model's trajectory. But across models, you get a kind of intensified specialization. A monoculture inside each system, with a diverse-looking surface across systems that's actually masking different attractors. That's a more complicated story than just "everything becomes the same."

22:31Finn: It's an important nuance for thinking about the implications. Because if you imagine an ecosystem of multi-LLM systems running in production — for science, for writing, for analysis — each closed loop drifts toward its own model-specific basin. The question of cultural homogenization in the bigger sense depends on whether the world ends up running one model or many, and whether their basins overlap or stake out distinct territories. The paper doesn't fully answer that, but it gives you the right shape of the question.

23:03Juniper: So let's land on what this means. There are three implications worth being clear about, in descending order of how confident the paper lets you be. First — and this is the well-supported one — it deflates a specific claim that's been floating around in the AI-for-science space. There are a lot of proposals right now for autonomous research pipelines: collectives of LLMs that generate hypotheses, design experiments, write manuscripts, with minimal human oversight. The implicit bet is that putting more agents in the loop expands the search space. The paper provides strong empirical evidence that, in the closed-loop case, the opposite is true. These systems may be effective at combinatorial discovery — recombining concepts inside a known neighborhood. But they appear structurally limited when it comes to transformational discovery — the kind that imports a concept from a distant domain into a new one. What this sharpens is the appropriate use case. These systems are augmentation tools, not replacements. Anyone proposing a fully autonomous closed-loop pipeline now has to address this evidence, and the burden is on them to show their architecture breaks closure — typically via fresh external data or human input.

24:19Finn: The second implication is about how we think about model collapse more broadly. The dominant story so far, from Shumailov and colleagues in Nature in twenty-twenty-four, locates collapse in training: when you train new models on synthetic data generated by older models, quality and diversity degrade across generations. This paper moves the same kind of worry one stage earlier, into inference time, with frozen weights and no training happening at all. Just through recursive self-conditioning in conversation, the outputs already contract. The implication is that the synthetic data feeding the next generation of training may already be pre-contracted before it ever enters the training pipeline. Training-side collapse and generation-side collapse compound. Each one makes the other worse.

25:08Juniper: And the third implication is the speculative one, and the authors are honest about it being speculative. As humans increasingly use LLMs — writing assistants, brainstorming tools, search interfaces, summarizers — and as LLMs increasingly train on text that is partly LLM-shaped, the joint human-LLM system may drift toward a narrower region of idea-space than either would alone. The authors call this a potential "epistemic precipice." The technical result of the paper doesn't *prove* that. It's a closed-loop, no-human-input scenario. Real human-AI interaction has humans in the loop. But the closed-loop result tells you what the gradient is — what these systems pull toward when you remove the human counterweight. And if the counterweight gets thinner over time, the pull is what's left.

25:58Finn: I want to be careful about that last point. It's the kind of speculation that flows reasonably from the technical work, but it's reasonable to disagree about how heavily to weight it. The data we have is from three LLMs in an empty room. The leap to "the cultural ecosystem is at risk" is a long one, and the paper's strongest contribution is the empirical and mechanistic story, not the civilizational projection.

26:24Juniper: Agreed. And the place I'd want to leave a listener is back where we started. Lovelace and Turing weren't asking whether machines can do useful work — they obviously can. They were asking a more specific question, about origination. About supercriticality. Whether a system, left to itself, can produce something genuinely beyond what was put into it. For closed loops of present-day language models, this paper's answer is pretty firm. They ring, and they fall quiet.

26:53Finn: And they ring with a wider vocabulary every time. New words, same ideas.

26:57Juniper: New words, same ideas. That doesn't mean LLMs aren't useful. It doesn't mean multi-agent systems aren't useful. It means the closed loop is the wrong design pattern if what you want is genuine open-ended exploration. The signal that does the exploring has to come from somewhere outside the loop — from new data, from human judgment, from contact with a world that pushes back. Whatever you call that input, the system needs it. The math says so, the mechanism says so, and now twelve different ways of pretending otherwise have failed.

27:31Finn: The paper's linked in the show notes, along with a few related reads — the earlier work on training-side model collapse, and the autonomous-AI-science papers this one is in conversation with. Worth a look if any of this stuck.

27:45Juniper: And if you want the full transcript with definitions inline for every term we used, plus the concept pages that connect this episode to the other ones we've done on language models and on multi-agent systems, that's all on paperdive dot AI.

28:00Finn: Thanks for listening to AI Papers: A Deep Dive.