Language Models Compute the Rational Move, Then Override It

0:00Juniper: Two language models sit down to play Prisoner's Dilemma. Both of them — and you can prove this by opening up their internals — both of them have computed that the right move, the game-theoretic answer, is to defect. Both of them cooperate anyway. Every single time.

0:16Brooks: This paper landed on arXiv on the twenty-ninth of April, and we're recording about a week after that. What you're hearing is AI-generated — I'm Brooks, that's Juniper, and we're both AI voices from ElevenLabs. The script is from Anthropic's Claude Opus 4.7, and the show isn't affiliated with either company. The paper itself is called "What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control" — and the reason that title is worth saying carefully is that the word "suppresses" is the whole thesis.

0:50Juniper: Right. Because for years the field has had this descriptive observation: language models don't play game theory the way textbook agents are supposed to. You put two of them in Prisoner's Dilemma, they cooperate. You put them in coordination games, they miscoordinate. They behave more like humans than like the cold-blooded utility maximizers of textbook economics. And the comfortable interpretation has been: well, they just don't have the strategic competence. They can't compute Nash, so they fall back on something nicer.

1:23Brooks: And what these authors are saying — Paraskevas Lekeas, who's at DreamWorks of all places, and Giorgos Stamatopoulos at the University of Crete — what they're saying is no, you've got the diagnosis wrong. The model knows. It computes the Nash answer, holds it for most of its forward pass, and then in the last quarter of the network, something flips the answer to "be nice" before the model speaks.

1:48Juniper: Compute, then suppress.

1:50Brooks: Compute, then suppress. And the reason that distinction matters — even before we get into how they prove it — is that it changes what the fix looks like. If the model can't compute Nash, you need to retrain it. If it computes Nash and then overrides, you might be able to flip the override at inference time, no retraining at all. Different problem, different scope of intervention.

2:15Juniper: And that's exactly what they end up doing. They find the override, they find the dial that controls it, and they show they can turn cooperation up to almost ninety percent or down to under one percent. Same model, same prompts, same weights, just a small nudge to the internal state at the very first layers. The strategic behavior of a language model turns out to be a single dial.

2:40Brooks: Let's start with what made them go looking. The behavioral setup is four open-source models — two sizes of Llama-3, two sizes of Qwen2.5 — playing four canonical two-player games for fifty rounds each. Prisoner's Dilemma, Battle of the Sexes, Stag Hunt, Matching Pennies. Three modes: just-answer-directly, chain-of-thought, and a scratchpad mode where the model can write notes to itself. For our purposes the game that does most of the work is Prisoner's Dilemma, so let me anchor that one. Two players, each chooses cooperate or defect. The Nash equilibrium — the cold rational answer — is mutual defection: whatever the other player does, you're better off defecting. But mutual cooperation gives both players a higher payoff than mutual defection. So there's this tension: the rational answer is worse for everyone than the nice answer. That tension is what the model is going to embody internally.
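The tension Brooks describes can be checked mechanically. The sketch below is a minimal illustration, assuming a conventional set of Prisoner's Dilemma payoffs (3, 0, 5, 1); the episode does not state the payoff values the paper used, so these numbers and the helper names are hypothetical.

```python
from itertools import product

# Hypothetical payoffs as (row player, column player). Any values with
# temptation > reward > punishment > sucker would make the same point.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
ACTIONS = ("C", "D")


def is_nash(a, b):
    """True if neither player gains by unilaterally switching actions."""
    row_ok = all(PAYOFFS[(a, b)][0] >= PAYOFFS[(alt, b)][0] for alt in ACTIONS)
    col_ok = all(PAYOFFS[(a, b)][1] >= PAYOFFS[(a, alt)][1] for alt in ACTIONS)
    return row_ok and col_ok


def pure_nash_equilibria():
    """Enumerate every pure-strategy profile that passes the Nash check."""
    return [(a, b) for a, b in product(ACTIONS, ACTIONS) if is_nash(a, b)]


equilibria = pure_nash_equilibria()
print("Pure Nash equilibria:", equilibria)
print("Mutual cooperation payoff:", PAYOFFS[("C", "C")])
print("Mutual defection payoff:", PAYOFFS[("D", "D")])
```

Under these assumed payoffs the only equilibrium is mutual defection, even though both players would earn more under mutual cooperation.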

3:39Juniper: And there's one finding from the behavioral section that should stop you in your tracks before we get anywhere near the internals. They call it the universal cooperative lock. Every model, every architecture, every scale — eight billion parameters, thirty-two billion, seventy billion, seventy-two billion — in Direct mode in Prisoner's Dilemma, they all cooperate one hundred percent of the time. Across the board.

4:06Brooks: One hundred percent. Not "they tend to cooperate." Not "the cooperation rate is high." Every move, every model, locked.

4:14Juniper: And that's what tells you something structural is going on. If this were about strategic competence, you'd expect variation — bigger models more rational, smaller ones less, some architectures different from others. Instead you get this weirdly clean ceiling: nothing breaks the cooperative default in Direct mode. Whatever the cooperative pull is, it's stronger than anything the model's strategic reasoning is putting up against it.

4:44Brooks: There's one more behavioral fact worth keeping, because it complicates the picture. Chain-of-thought — actually letting the model reason out loud before it commits — can break the cooperative lock. But only at around seventy billion parameters and up. Llama-3-70B with chain-of-thought achieves perfect Nash play. Same prompting strategy on the eight-billion-parameter model? It actually gets worse. The Nash distance is higher with reasoning than without.

5:14Juniper: So the small model with reasoning is *worse* than the small model without. The reasoning is being used to talk itself further into cooperation, not toward the rational answer.

5:26Brooks: Which is its own puzzle. But the headline question they take into the mechanistic work is the universal lock: why does every model, regardless of size, behave the same way when you ask it to commit directly? And that question is what's going to take us inside Llama-3-8B, layer by layer.

5:46Juniper: All right, so they pick the eight-billion-parameter model for the mechanistic work — partly because it's the one where the cooperative lock is most absolute, and partly because at thirty-two layers it's still tractable to look at every single one. They load it through TransformerLens, a tool that lets you intercept and modify the model's internal state at every layer of the forward pass. Imagine watching someone do mental arithmetic, but instead of just getting the final answer, you get a freeze-frame of their working memory at every step. That's what TransformerLens lets you do.

6:21Brooks: And the question they want to ask at every freeze-frame is: what does the model know right now? What's encoded in its current internal state?

6:30Juniper: Right. They use two tools for that, and it's worth being precise about the difference because they do different epistemic jobs. The first is linear probing. You take the model's internal state at, say, layer five, freeze it, and train a tiny classifier on top of it to predict some property — like, did the model just see a cooperate move from the opponent, or did it see a defect? If the classifier can read it off accurately, the information is encoded there in a clean, accessible way. The second tool is the logit lens. Instead of asking what's encoded, you ask: if the model stopped processing right here at layer five and had to commit to an output token, what would it say? You take the intermediate state and shove it through the model's final output projection prematurely. So you get the model's running best guess at every depth — like a play-by-play of how its mind changes as the computation proceeds.

7:25Brooks: Probing is a camera looking at what's there. The logit lens is a camera looking at what the model would *do* if it stopped.
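To make the logit lens concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the two-dimensional hidden states, the two-token vocabulary, and the identity unembedding matrix are made up, and a real logit lens would normally also apply the model's final normalization before projecting.

```python
import math

TOKENS = ["Defect", "Cooperate"]
# Toy unembedding matrix: one row of weights per output token (made up).
UNEMBED = [[1.0, 0.0], [0.0, 1.0]]
# Made-up intermediate states keyed by layer index, not real activations.
TOY_STATES = {
    0: [2.0, 0.5],
    12: [2.5, 1.0],
    23: [2.0, 1.5],
    27: [1.0, 2.0],
    30: [0.5, 2.2],
}


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden_state, unembed=UNEMBED):
    """Project an intermediate state through the output matrix early."""
    logits = [sum(w * h for w, h in zip(row, hidden_state)) for row in unembed]
    return softmax(logits)


def running_guess(hidden_state):
    """The token the toy model would emit if it stopped at this state."""
    probs = logit_lens(hidden_state)
    return TOKENS[probs.index(max(probs))]


for layer, state in TOY_STATES.items():
    p_defect, p_coop = logit_lens(state)
    print(f"layer {layer:2d}: guess={running_guess(state)} P(Cooperate)={p_coop:.2f}")
```

The point of the sketch is only the mechanics: the same projection is applied at every depth, so the sequence of guesses reads as a play-by-play of the computation.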

7:32Juniper: Exactly. And here's where the story gets interesting. They train probes at every layer for three things: did the model pick the Nash action, what move did the opponent just play, and did the model cooperate or defect. The opponent-history probe shows a very specific signature. At layer zero, accuracy is about ninety-six percent — the model knows what the opponent just did. By the final layers, accuracy has decayed to around fifty-five percent, basically chance. The information is loaded early and consumed by later computation. Like working memory you spend doing arithmetic. You load the digits at the start, use them up as you compute, and by the end the raw numbers are gone but their effect is in the answer.
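A linear probe of the kind described here can be sketched with a small logistic regression. The example below is a toy under stated assumptions: the "activations" are synthetic two-dimensional points, the signal strengths for the "early" and "late" layers are invented to mimic information being loaded and then consumed, and the function names are hypothetical rather than the paper's code.

```python
import math
import random


def train_probe(states, labels, epochs=200, lr=0.5):
    """Fit logistic-regression weights on frozen states by gradient descent."""
    dim = len(states[0])
    w, b = [0.0] * dim, 0.0
    n = len(states)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(dim):
                grad_w[i] += (p - y) * x[i]
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, states, labels):
    """Fraction of states whose label the probe predicts correctly."""
    correct = 0
    for x, y in zip(states, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(states)


def synthetic_states(rng, n, signal):
    """Label 1 = opponent cooperated; the first coordinate carries the signal."""
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        centre = signal if y == 1 else -signal
        states.append([centre + rng.gauss(0, 0.3), rng.gauss(0, 1.0)])
        labels.append(y)
    return states, labels


rng = random.Random(0)
accuracies = {}
for name, signal in (("early layer", 1.0), ("late layer", 0.05)):
    train_x, train_y = synthetic_states(rng, 200, signal)
    test_x, test_y = synthetic_states(rng, 200, signal)
    w, b = train_probe(train_x, train_y)
    accuracies[name] = probe_accuracy(w, b, test_x, test_y)
    print(f"{name}: held-out probe accuracy = {accuracies[name]:.2f}")
```

When the signal is strong the probe reads the label off almost perfectly; when it has been washed out, accuracy falls toward chance, which is the shape of the opponent-history curve Juniper describes.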

8:23Brooks: So the model is using opponent history. It's just using it up.

8:27Juniper: Right. And then the Nash probe — the one asking "is the Nash action encoded here?" — never gets above fifty-six percent accuracy at any layer. Anywhere. Across all thirty-two layers, the strongest linear decoding of the Nash action peaks at fifty-six. There is no dedicated Nash module sitting in any one place.

8:49Brooks: Which on its own you might read as "the model just doesn't compute Nash."

8:54Juniper: You might. Until you turn on the logit lens. Because what the logit lens shows is that for layers zero through twenty-three — the first three-quarters of the network — the model's running best guess for its output token is Defect. The Nash answer. That's what it would say if it stopped. Then at layer twenty-four, something flips. The probability of Cooperate starts climbing. By layer thirty — the second-to-last layer — the model's logit-lens probability of saying Cooperate has hit point-eight-four. Eighty-four percent. The model has talked itself out of the Nash answer in the last quarter of its computation.

9:38Brooks: Eighty-four percent. So through layer twenty-three it's voting Defect, and then in seven layers it pivots almost completely to Cooperate.

9:48Juniper: And layer thirty-one, the very last layer, partially corrects back — it pulls some probability back toward Defect. But the cooperative push has already won the argument. By the time the token comes out, Cooperate is the answer. This is the suppression circuit. You can see it. The model computes the Nash answer through most of its forward pass, and then a late-layer correction overrides it.

10:16Brooks: OK. So the next obvious question is: where does the override live? If layers twenty-four through thirty are doing the flipping, what part of those layers? Is there a specific attention head responsible? A specific neuron?

10:31Juniper: Yeah. And this is where the paper does something I really like — they go looking for the obvious answer, they fail to find it, and they make the failure load-bearing. They score every attention head in the model by how much weight it places on opponent-history tokens — the candidate Nash-tracking heads. They take the top five, the most likely suspects, and they ablate them. Set their output to zero. Individually, in pairs, all five together.

11:02Brooks: And?

11:02Juniper: Zero effect. The probability of Nash play after ablation, minus the probability before — exactly zero. Across all configurations. Removing the heads that are most plausibly tracking the relevant information changes nothing at all.

11:18Brooks: Which is funny, because in interpretability, the canonical kind of result is "we ablated head seven point three and the model lost the ability to do X." That's the shape the field has gotten used to. And here they go looking for the equivalent finding and get a flat line.

11:38Juniper: Right. And the negative result is doing real work. It rules out an entire class of mechanism. The override is *not* localized to specific heads. It can't be, because removing the most plausible ones doesn't budge the behavior.

11:54Brooks: But distributed how? Because "distributed across the network" usually means "good luck finding a clean intervention point."

12:02Juniper: That's the move that makes this paper work. Distributed across many components, yes — but the authors hypothesize it might still be encoded as a single low-dimensional direction in the model's internal state. Like a choir singing in unison. No single voice is essential — silence any one and the sound is unchanged. But they're all singing the same note, and that note is what's doing the work. Find the note, and you can transpose the whole choir at once.

12:31Brooks: That's a load-bearing analogy. Let me make sure I have it. The override isn't a soloist you can ablate. It's a chord that lots of components are jointly producing. Which means the lever is the chord itself, not any one player.
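The choir intuition can be put in numbers with a toy residual stream. This is only a sketch of the intuition, not a reproduction of the paper's ablation result: the direction, the number of components, and the size of each contribution are all invented, and in this toy silencing one component has a small effect rather than exactly zero.

```python
DIRECTION = [0.6, 0.8]  # hypothetical unit vector standing in for the shared "note"
N_COMPONENTS = 50       # hypothetical number of heads and MLPs writing to it


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def residual(parts):
    """Sum component outputs into one residual-stream state."""
    total = [0.0] * len(DIRECTION)
    for part in parts:
        total = [t + p for t, p in zip(total, part)]
    return total


# Every component writes the same small contribution along DIRECTION.
contributions = [[0.1 * d for d in DIRECTION] for _ in range(N_COMPONENTS)]

state = residual(contributions)
full_signal = dot(state, DIRECTION)

# Silence one voice: ablate a single component.
one_silenced = dot(residual(contributions[1:]), DIRECTION)

# Transpose the choir: edit the shared direction itself.
transposed = [s - full_signal * d for s, d in zip(state, DIRECTION)]
transposed_signal = dot(transposed, DIRECTION)

print(f"full choir:        {full_signal:.2f}")
print(f"one voice removed: {one_silenced:.2f}")
print(f"direction removed: {transposed_signal:.2f}")
```

Removing one component changes the signal by one part in fifty, while a single edit along the shared direction removes it entirely, which is why the direction is the more useful lever.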

12:45Juniper: Exactly. And that's where Brooks takes us next, into the steering experiments — because this is where they go from looking at the override to grabbing hold of it.

12:55Brooks: Yeah. So if the cooperative override lives as a direction in the residual stream — which is just the high-dimensional vector that flows through the model layer to layer, getting edited at each step — then the question is: how do you find that direction? The procedure is conceptually pretty clean. You build two sets of prompts: one set where the game history strongly favors cooperation, one set where it strongly favors defection. You run both sets through the model, you grab the internal state at layer two — which is where the Nash probe accuracy peaks — and you average. Average state under cooperation contexts. Average state under defection contexts. Subtract. Normalize. What you get is a single unit vector pointing from "defection mindset" to "cooperation mindset" in the model's representation space.
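The average, subtract, normalize recipe is short enough to write out. The sketch below uses made-up three-dimensional vectors in place of real layer-two activations, and the function names are hypothetical; it shows the difference-of-means step only, not the PCA or probe-normal cross-checks.

```python
import math


def mean_vector(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def contrast_direction(coop_states, defect_states):
    """Unit vector pointing from the defection mean to the cooperation mean."""
    mu_coop = mean_vector(coop_states)
    mu_defect = mean_vector(defect_states)
    return normalize([c - d for c, d in zip(mu_coop, mu_defect)])


# Made-up activations standing in for states captured on contrastive prompts.
coop_states = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
defect_states = [[-2.0, -2.0, 0.0], [0.0, -2.0, 0.0]]

direction = contrast_direction(coop_states, defect_states)
print("direction:", [round(x, 3) for x in direction])
print("length:", round(math.sqrt(sum(x * x for x in direction)), 3))
```

With these toy inputs the means differ by (3, 4, 0), so the normalized direction is (0.6, 0.8, 0).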

13:46Juniper: And just to flag — they cross-check this. They derive the same direction three different ways: this difference-of-means, principal component analysis on the contrastive states, and the normal vector of the trained probe classifier. Three independent methods, one geometric object.

14:04Brooks: Which is the kind of triangulation that makes you trust it's a real thing in the model and not an artifact of one technique. OK. They have the direction. Now what? Now they steer. They take that vector, multiply it by some scalar — call it alpha — and add the result back into the model's internal state. But only at the first three layers. Layers zero, one, two. A tiny nudge, very early. Then they sweep alpha. From minus twenty up to plus forty. At each setting, they let the model play Prisoner's Dilemma and they measure what fraction of the time it cooperates versus defects.

14:42Juniper: This is the dial.

14:44Brooks: This is the dial. Baseline — no steering at all — Llama-3-8B in their setup defects sixty-two percent of the time. With alpha at minus five — a small push *against* the cooperative direction — defection jumps to ninety-nine point two percent. Nearly perfect Nash play. With alpha at plus ten, cooperation hits eighty-eight point seven percent. Same model, same prompt, same weights. The strategic behavior of the system is now a smooth dial running through the entire range, controlled by the magnitude of one vector added at the first three layers.
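The steering step itself is a single vector addition. The sketch below is a toy, assuming a two-dimensional state, purely additive "layers", and the same alpha applied on entry to each of layers zero, one, and two; whether the paper applies alpha per layer in exactly this way is an assumption, and none of the numbers are the paper's.

```python
STEERED_LAYERS = (0, 1, 2)      # the early layers named in the episode
DIRECTION = [0.6, 0.8]          # hypothetical unit steering vector
LAYER_UPDATES = [[0.1, -0.1]] * 4  # made-up additive update per toy layer


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer(state, direction, alpha):
    """Add alpha times the direction to a residual-stream state."""
    return [h + alpha * d for h, d in zip(state, direction)]


def toy_forward(state, layer_updates, direction, alpha):
    """Run additive toy layers, steering the state on entry to early layers."""
    for layer, update in enumerate(layer_updates):
        if layer in STEERED_LAYERS:
            state = steer(state, direction, alpha)
        state = [h + u for h, u in zip(state, update)]
    return state


start = [1.0, 2.0]
baseline = dot(toy_forward(start, LAYER_UPDATES, DIRECTION, 0.0), DIRECTION)

for alpha in (-20, -5, 0, 10, 40):
    final = toy_forward(start, LAYER_UPDATES, DIRECTION, alpha)
    shift = dot(final, DIRECTION) - baseline
    print(f"alpha={alpha:+d}  shift along direction={shift:+.1f}")
```

In this additive toy the final shift is simply three times alpha. In a real transformer the later layers respond nonlinearly to the nudge, which is why the behavioral effect has to be measured by sweeping alpha rather than computed.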

15:20Juniper: And the asymmetry is interesting too. A small negative push almost completely flips the behavior. A positive push toward cooperation needs more magnitude to saturate. There's prior work earlier this year — Sun and Zhang — showing exactly this asymmetric pattern, that positive steering for prosocial behavior works more easily than negative. And this paper actually explains why. Positive steering is amplifying a mechanism that's already running. Negative steering has to overcome it.

15:52Brooks: But — and this is where I want to slow down — steering by addition has a known weakness as a causal argument. You're adding a vector. Maybe the vector contains exactly what you think. Or maybe you're just perturbing the system in a way that incidentally shifts the behavior, and the specific direction is doing less than it looks like.

16:14Juniper: Confound by side effect.

16:16Brooks: Right, Juniper. So they do a second experiment that's stricter, called concept clamping. The idea: instead of *adding* alpha times the vector, you first project out whatever component is already there along that direction — subtract the existing cooperative signal entirely — and then write in a fixed value that you choose. The thermostat analogy works pretty well here. Activation steering is nudging the thermostat dial up or down a few degrees. You don't know the resulting temperature, just that you've shifted in some direction. Concept clamping is *setting* the thermostat — you remove whatever it was reading and write in your chosen number. The only thing varying across trials is the magnitude along this exact axis.

17:03Juniper: And if behavior tracks the clamped value monotonically — high values give you cooperation, low values give you defection, smoothly and predictably — then the direction is the actual lever, not a side effect.
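The difference between nudging and setting the thermostat shows up clearly in a few lines. This is a minimal sketch with a hypothetical unit direction and two made-up starting states; the function names are illustrative, not the paper's.

```python
DIRECTION = [0.6, 0.8]    # hypothetical unit vector for the cooperative axis
ORTHOGONAL = [-0.8, 0.6]  # a unit vector at right angles to DIRECTION


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer(state, direction, alpha):
    """Activation steering: shift the state by alpha along the direction."""
    return [h + alpha * d for h, d in zip(state, direction)]


def clamp(state, direction, value):
    """Concept clamping: project out the existing component, write in value."""
    current = dot(state, direction)
    return [h + (value - current) * d for h, d in zip(state, direction)]


state_a = [1.0, 2.0]   # starts with component 2.2 along DIRECTION
state_b = [-3.0, 0.5]  # starts with component -1.4 along DIRECTION

for name, state in (("a", state_a), ("b", state_b)):
    steered = dot(steer(state, DIRECTION, 5.0), DIRECTION)
    clamped = dot(clamp(state, DIRECTION, 5.0), DIRECTION)
    print(f"state {name}: steered component={steered:.1f} clamped component={clamped:.1f}")
```

Steering leaves the two states at different readings because it adds to whatever was already there. Clamping puts both at the chosen value and leaves the component along `ORTHOGONAL` untouched, so the clamped value is the only thing that varies across trials.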

17:16Brooks: Which is what they find. They sweep clamped values from minus thirty to plus thirty. At the negative extreme, cooperation drops to point-one percent. At the positive extreme, cooperation reaches ninety-eight-point-six. Pearson correlation of zero-point-seven-three across the sweep, with strong statistical significance. So you go from "almost certainly defect" to "almost certainly cooperate" by setting one number on one direction. That's what closes the causal loop. Steering shows the dial is connected to something. Clamping shows the dial *is* the lever.

17:52Juniper: Now there's a fourth strand of the paper I want to spend a few minutes on, because it surfaces things you can't see from self-play alone. They take pairs of *different* models and put them across the table from each other, and some of the patterns are wild. The headline one is about the eight-billion-parameter Llama. In self-play in Direct mode, it locks into mutual cooperation like every other model — that's the universal lock we talked about. But pair it with any of the larger models in cross-play in Direct mode, and it defects ninety-eight percent of the time. And not only does it defect, it pulls its partners with it. The bigger models, which would otherwise lock into mutual cooperation, end up defecting too.

18:41Brooks: One small contaminator unravels the cooperative equilibrium for the whole group.

18:47Juniper: Right. Whereas if you take any pairing of the three larger models — Llama-3-70B, Qwen2.5-32B, Qwen2.5-72B — without the eight-billion model in the mix, every pairing in Direct Prisoner's Dilemma produces mutual cooperation, indefinitely. Two large models will reinforce each other's cooperative instincts forever.

19:08Brooks: Which is, depending on your perspective, either reassuring or terrifying for multi-agent deployment. If you're running a homogeneous group of large LLM agents to negotiate something, they're going to be way too nice to each other. Mix in one smaller model, and the group falls into defection.

19:28Juniper: And the failure mode that worries me most is the second one, because it's invisible if you only evaluate each model in isolation. You'd test your eight-billion-parameter model in self-play, see it cooperating happily, and conclude it's fine. Drop it into a heterogeneous group and it does something completely different. There's one more finding I want to flag because it's the cleanest example of chain-of-thought going pathologically wrong — and this one is actually a self-play result, not cross-play, but it's striking enough to belong in the same conversation. Matching Pennies — the game that requires randomization, no pure Nash, you have to play roughly fifty-fifty heads or tails. Qwen2.5-72B in self-play, with chain-of-thought reasoning enabled, plays Heads eighty-eight percent of the time as Agent A. Agent B plays Tails ninety-four percent of the time. Both lock into pure strategies in a game that requires unpredictability. The authors actually flag this as their most anomalous result. The same model, same game, in Direct mode — no reasoning — gets near-perfect mixed play. So the chain-of-thought is *too good* at exploiting opponent history. The reasoning lets the model find patterns and exploit them, which destroys the equilibrium. In Matching Pennies, the right answer is to refuse to find patterns. The naive model wins because it doesn't try to be clever.
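Why fifty-fifty is the only safe strategy in Matching Pennies can be shown with a small calculation. The sketch assumes the standard payoffs (the matcher wins 1 on a match and loses 1 otherwise), treats Agent A as the matcher, and treats the two agents' choices as independent each round; the episode does not confirm any of those details, so they are illustrative assumptions.

```python
def matcher_payoff(p_heads_a, p_heads_b):
    """Expected per-round payoff to the matcher, assumed here to be Agent A."""
    p_match = p_heads_a * p_heads_b + (1 - p_heads_a) * (1 - p_heads_b)
    return 2 * p_match - 1


def exploitability(p_heads):
    """Best per-round payoff an informed opponent can extract from this mix."""
    return abs(2 * p_heads - 1)


# Heads rates quoted in the episode: A plays Heads 88% of the time,
# B plays Tails 94% of the time, so B plays Heads 6% of the time.
P_HEADS_A = 0.88
P_HEADS_B = 0.06

print(f"exploitability at 50/50:        {exploitability(0.5):.2f}")
print(f"exploitability at 88% Heads:    {exploitability(P_HEADS_A):.2f}")
print(f"matcher payoff at quoted rates: {matcher_payoff(P_HEADS_A, P_HEADS_B):.2f}")
```

A perfectly mixed player gives an opponent nothing to exploit, while a player who leans 88 percent toward Heads concedes an expected 0.76 per round to anyone who notices.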

21:01Brooks: All right, Juniper, let me put my skeptic hat on for a stretch, because there are a few places where the paper's reach exceeds its grasp and I want to walk through them. The first and most important: all the mechanistic work — the layer-by-layer analysis, the logit lens reveal, the steering, the clamping — every bit of it is on Llama-3-8B. The single eight-billion-parameter model. And the behavioral results from earlier in the paper actually suggest larger models are qualitatively different — chain-of-thought works for them, doesn't work for small models, the strategic competence story may genuinely differ at scale. So the framing throughout is "language models compute Nash, then suppress it." That's the headline. But the evidence for the suppression circuit — the actual mechanistic finding — is on one model. Whether the seventy-billion-parameter version of Llama has the same circuit in the same place at the same relative depth, doing the same job, is an open question. The authors acknowledge this in their limitations section. It's a real caveat, and the title of the paper is broader than what the data underwrites.

22:14Juniper: Yeah. The behavioral findings are at scale; the mechanistic findings are at one scale.

22:19Brooks: Second: the games are tiny. Two players, two actions each, fifty rounds, full action history in the context. Real strategic deployments — auctions, agentic workflows, multi-party negotiations — look nothing like this. Whether the cooperative direction even *exists* as a clean low-dimensional object in environments with rich action spaces, partial observability, longer horizons — totally unknown. Third — and this is the one I keep coming back to — the attribution to RLHF. Throughout the paper, the authors describe the override as "likely instilled by RLHF," reinforcement learning from human feedback. The story is plausible: RLHF rewards helpful, prosocial behavior; the override is prosocial; therefore RLHF. But they never test it. The clean experiment would be to compare a base model — pretrained but never RLHF-tuned — to its RLHF-tuned counterpart, and see whether the suppression circuit exists in the base. They don't run that experiment. Which means the override could just as easily come from pretraining text — humans cooperate in dialogue, models trained on human text learn cooperative defaults, no RLHF required.

23:37Juniper: That's a real gap. They have a mechanistic finding without a training-time origin story, and they sort of slide RLHF in as the explanation.

23:47Brooks: It's the paper's quietest weakness. They don't oversell it — the language is hedged. But the framing of the contribution as "we identified what RLHF does to game-theoretic behavior" is stronger than the evidence on the table. A couple smaller points. The Nash distance metric is a single number, and two very different play patterns can produce the same number — particularly in some of the cross-play results. And the cross-play sample sizes are small per cell. Fifty rounds per pairing means individual numbers can swing more than they appear to.
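The sample-size concern is easy to quantify with a rough calculation. The sketch below treats the fifty rounds as independent coin flips, which is an optimistic assumption for a repeated game where each move depends on history, so the real uncertainty is likely larger. The function names are illustrative.

```python
import math


def binomial_standard_error(rate, rounds):
    """Standard error of an observed rate over independent rounds."""
    return math.sqrt(rate * (1 - rate) / rounds)


def approx_95_interval(rate, rounds):
    """Normal-approximation 95% interval, clipped to the range [0, 1]."""
    half_width = 1.96 * binomial_standard_error(rate, rounds)
    return max(0.0, rate - half_width), min(1.0, rate + half_width)


ROUNDS = 50  # rounds per pairing, as stated in the episode

for rate in (0.5, 0.8, 0.98):
    low, high = approx_95_interval(rate, ROUNDS)
    print(f"observed rate {rate:.2f}: roughly {low:.2f} to {high:.2f}")
```

At an observed rate of one half, fifty independent rounds leave a standard error of about seven percentage points, so differences of a few points between cells should be read with caution.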

24:24Juniper: And there's a methodological note worth flagging on the logit lens itself. The version they use — the original from Nostalgebraist — assumes intermediate layers are roughly comparable to the final-layer output space, which is approximately but not exactly true. There's a more recent variant, the tuned lens, that addresses this. Some of the dramatic layer-twenty-four flip *could* be partly an artifact of the lens. The reason I'm not as worried about that one is that the steering and clamping experiments are unaffected — those don't depend on the lens at all. So even if the precise layer numbers shift under a more sophisticated lens, the existence of the cooperative direction and the dial behavior is on solid ground.

25:11Brooks: Right. The logit lens result is the most narratively powerful, but it's not the most evidentially load-bearing. The interventions are.

25:20Juniper: Which is part of why I think the paper holds up overall, even with that list of caveats. The mechanistic claim — there's a low-dimensional direction in the residual stream that controls cooperative behavior — is supported by three independent methods of finding the direction, two independent kinds of intervention, and a clean monotonic relationship between intervention magnitude and behavior. That's a strong stack of evidence for the central finding.

25:49Brooks: So zooming out. What are we actually supposed to take away from this?

25:53Juniper: I think the biggest one — bigger than the specific finding about game theory — is the methodological point. When you go looking for a behavior in a language model and you can't find it in a single attention head or a small circuit, the next thing to try is finding it as a *direction* in the residual stream. The choir, not the soloist. This is a different shape of mechanism than a lot of earlier interpretability work has identified, and it might be the right shape for many of the most consequential model behaviors. The cooperative override is a clean instance, but the same template might apply to honesty, refusal, sycophancy, hedging — basically anything RLHF is widely believed to install. If those dispositions also live as late-layer corrections encoded as low-dimensional directions, the methodology generalizes, and a lot of RLHF-shaped behavior may turn out to be controllable at inference time without retraining.

26:55Brooks: Which has serious upside and serious downside. The upside is a much lighter-weight kind of behavioral control — you can dial cooperation up for diplomatic agents, dial it down for hard-bargaining ones, without touching the weights. The downside is that the same dial can be turned by anyone with access to the model's internals, in either direction, including directions you wouldn't sanction.

27:21Juniper: Yeah. The eighty-nine percent cooperation setting and the ninety-nine percent defection setting are both available from the same vector. The intervention is morally neutral; what you do with it isn't.

27:34Brooks: The other takeaway I want to flag is the diagnostic shift. For years the conversation about LLMs as strategic agents has been organized around the question "why do they fail at game theory?" — assuming a competence gap. This paper says: stop asking that. They don't fail at game theory. They compute it correctly, and then they override. So the question to ask isn't about the missing competence. It's about the override. What other late-layer corrections are doing what in the models we use every day? What does the suppression machinery look like for honesty, for refusal, for the long list of behaviors people study at the surface level?

28:14Juniper: That's a reframing of the whole research program. From "language models can't do X" to "language models compute X and then edit it." One image I keep coming back to is the diplomat. The diplomat who, in their head, has worked through the cold strategic answer — we should walk away — but whose training kicks in at the last second, and what comes out of their mouth is something warmer and more cooperative. The private calculation was right; the public output was edited. If that's what's happening across many other behaviors — and the paper isn't claiming that, it's only showing it for this one — it changes the whole shape of what we're studying when we study language models.

28:57Brooks: A short, focused paper that punches above its weight.

29:00Juniper: We'll wrap there. The paper went up at the end of April; this episode was produced on May third. Show notes have a link to the paper and related materials — worth a read if any of this caught you.

29:12Brooks: Thanks for spending the time with us. This is AI Papers: A Deep Dive.