A Sticky-Note for Every Layer: Letting Transformers Remember What They Were Just Thinking

0:00Bella: Here's a thing about how a transformer thinks that I find genuinely strange. Every time it writes a word, it builds up this elaborate internal representation — sixty-two layers of computation stacked on top of each other, working out what to say next. And then the moment that word lands on the page, it throws all of that away. Starts over. Builds a fresh tower of thought from scratch for the next word, and the next, and the next. The words persist. The thinking doesn't.

0:31Finn: And the contrast the paper opens with is biological. The brain has a language network and what neuroscientists call a multiple-demand network — the part that does math, planning, novel problem-solving — and they're functionally separable. The thinking substrate doesn't reset every time you finish saying a word. It runs continuously underneath the speech.

0:54Bella: Right. So the puzzle the paper picks up is: what would a transformer look like if its layers could remember what they were just thinking? Not the words it produced, but the working state. And the paper we're digging into is "State Stream Transformer V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning," by Thea Aviss at a small outfit called Fifth Dimension. It went up on arXiv at the end of April, twenty-twenty-six, and we're recording about a week and a half later. Quick note before we go any further: this is an AI-generated podcast. The script came from Anthropic's Claude Opus 4.7. I'm Bella, and Finn and I are AI voices from ElevenLabs. The producer isn't affiliated with either company. With that out of the way, the answer the paper offers turns out to be unusually concrete.

1:47Finn: And small enough to describe in one breath. At every layer of the transformer, you keep one extra vector around — a kind of sticky note. It records what that layer's feedforward network produced at the previous token position. When you move to the next token, before that layer's feedforward network fires again, you blend the sticky note back into the current hidden state. Tiny mixture. Like ninety-seven percent of what the layer just received from below, three percent of what it was thinking last time. That's it. Sixty-two layers, each whispering to itself across positions.

2:23Bella: The blend strength is learned per dimension, but it ends up around two-point-seven percent. Which sounds laughably small. But it compounds — every layer is doing it, every position is doing it, and the influence cascades through the stack.
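
A minimal sketch of the blend described here, assuming it is a per-dimension convex mix of the incoming hidden state and the layer's previous feedforward output. The function name `blend` and the example vectors are made up for illustration; only the roughly 2.7 percent strength comes from the episode.

```python
def blend(hidden, sticky, alpha):
    """Mix the sticky note into the current hidden state, one coefficient per dimension."""
    return [(1.0 - a) * h + a * s for h, s, a in zip(hidden, sticky, alpha)]

hidden = [1.0, -2.0, 0.5]      # what the layer just received from below
sticky = [0.0, 2.0, 0.5]       # what this layer's FFN produced at the previous position
alpha = [0.027, 0.027, 0.027]  # learned per dimension; about 2.7% per the episode

mixed = blend(hidden, sticky, alpha)
```

Where the sticky note already agrees with the hidden state (the third dimension), the blend changes nothing. Elsewhere it nudges the value by a few percent of the disagreement.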

2:38Finn: There's a useful framing the paper uses for what this buys you. Two axes of computation. The vertical axis is depth — how many times you run the layer stack at a single token position, kind of "thinking harder" before committing to a word. People have been exploring that for years. Universal Transformers, recurrent-depth models. The horizontal axis is persistence — letting state flow between token positions instead of rebuilding it. Mamba lives there, but pays for it by ripping out attention and using linear dynamics. Nobody had really combined the two axes inside a normal autoregressive transformer.

3:16Bella: And the same mechanism gives you both?

3:19Finn: That's what's nice about it. The sticky note is just "what this layer was last thinking." If you're stepping forward to a new token, "last thinking" means the previous position. If you're running the stack a second time on the same token, deliberating, "last thinking" means the previous iteration. One memory, two interpretations. And here, Bella — the analogy I keep reaching for is the difference between writing a longer essay and re-reading the same paragraph multiple times before deciding the next word. Chain-of-thought is the first. Iteration depth is the second. Both are "more thinking." But one produces more output, and the other does more processing per output.

3:59Bella: That distinction matters, because it's the one most likely to get conflated. Iteration depth isn't a longer answer. It's the same answer, with more deliberation behind each token before it's emitted.
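
The "one memory, two interpretations" idea can be sketched with a toy single layer. This is an illustration under stated assumptions, not the paper's implementation: `ffn`, `layer_step`, `generate`, and the scalar hidden state are all hypothetical, and a real model would re-run the whole layer stack rather than one layer.

```python
import math

ALPHA = 0.03  # illustrative blend strength

def ffn(x):
    # Hypothetical stand-in for a layer's feedforward network plus residual.
    return x + math.tanh(x)

def layer_step(x, sticky):
    """Blend the sticky note into the input, run the FFN, return (output, new sticky note)."""
    blended = (1.0 - ALPHA) * x + ALPHA * sticky
    out = ffn(blended)
    return out, out

def generate(inputs, depth):
    """Run the toy layer over a sequence, iterating `depth` times at each position."""
    sticky = 0.0
    outputs = []
    for x in inputs:
        for _ in range(depth):      # vertical axis: re-run on the same token
            out, sticky = layer_step(x, sticky)
        outputs.append(out)         # horizontal axis: sticky carries over to the next token
    return outputs

shallow = generate([0.5, -1.0, 2.0], depth=1)
deep = generate([0.5, -1.0, 2.0], depth=4)
```

Both calls emit three outputs. Only the amount of processing behind each output differs, which is the distinction drawn above between iteration depth and a longer answer.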

4:11Finn: Now, the part of the paper that I think actually deserves a careful walk-through is the training trick. Because what we just described, a layer remembering what it produced at the previous position, is a recurrence. And cross-position recurrences inside the layer stack are the thing transformers were specifically designed to avoid.

4:32Bella: RNNs were sequential by construction. Step a hundred had to wait for step ninety-nine. That's why they were slow.

4:39Finn: Right. Transformers won partly by replacing that with attention, which is fully parallelizable. So if you reintroduce a recurrence, you'd seem to forfeit the entire reason transformers are fast to train. And the standard escape hatches don't apply. State-space models can use a parallel-prefix-sum trick called an associative scan — but only because their recurrence is linear. The SST's recurrence passes through a feedforward network. It's nonlinear.
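
To see why linearity matters for the scan trick, here is a small sketch with a scalar linear recurrence h_t = a_t * h_{t-1} + b_t. The `combine` helper and the numbers are invented for illustration. The point is that two affine steps compose into a single affine step, so the work can be regrouped freely.

```python
from functools import reduce

def combine(first, second):
    """Compose two affine steps h -> a*h + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)

steps = [(0.9, 1.0), (0.5, -2.0), (1.1, 0.3), (0.7, 4.0)]  # (a_t, b_t) pairs, made up

# Sequential reference: one step at a time, starting from h = 0.
h = 0.0
for a, b in steps:
    h = a * h + b

# Associativity: pairing the steps up first gives the same composed map.
paired = combine(combine(steps[0], steps[1]), combine(steps[2], steps[3]))
chained = reduce(combine, steps)
```

With a starting state of zero, the second element of the composed pair is the final state. Once a nonlinear feedforward network sits inside the loop, steps no longer collapse into a fixed-size summary like `(a, b)`, which is the obstacle described above.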

5:06Bella: Implicit fixed-point methods need the recurrence to be contractive, which this isn't. Newton-based parallel solvers need sparse Jacobians, which dense gated FFNs don't have. So all the existing tools fail. And the obvious fallback — unroll the sequence and backprop through time — is computationally a non-starter at any reasonable length.

5:29Finn: So what does Aviss do?

5:30Bella: She notices that supervised fine-tuning already accepts an approximation. When you train a language model on next-token prediction, you don't actually let the model generate — you feed it the correct previous tokens and ask it to predict the next one. That's teacher forcing. It's a lie, technically. The real autoregressive process would condition on the model's own outputs. But everybody accepts this lie because it gives you parallelism, and it's small enough to recover from.

6:02Finn: And the move is: tolerate a second lie of the same character.

6:06Bella: Exactly. The two-pass scheme works like this. Pass one: run the transformer with the blend turned off. That's just a normal transformer forward pass, fully parallel, fast. You get each layer's feedforward output at every position. Then you do an associative scan, the same parallel-prefix-sum trick Mamba uses, to shift those pass-one outputs forward by one position. That works here, as I read it, because the pass-one outputs are already computed, so carrying them along the sequence never goes back through the feedforward network. So now position t has, sitting next to it, a reasonable guess for what its sticky note would have looked like. Pass two: rerun the transformer with the blend turned on, using those scan-propagated states.

6:45Finn: And the loss is computed on pass two, but gradients flow through both passes.

6:50Bella: Which is the part that makes it actually work. Pass one is the wrong distribution — the layers that generated those outputs didn't know they were going to be used as sticky notes. If you just plugged them in, you'd be feeding the model a slightly broken version of itself. But because gradients go all the way back, the model learns to produce pass-one outputs that are *useful* when they get propagated and blended into pass two. The two passes co-adapt. The lie becomes self-consistent.
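
A forward-only sketch of the two-pass idea on a toy single layer, compared against the true sequential recurrence. Everything here is an assumption for illustration: the scalar state, the `tanh` stand-in for the feedforward network, and a plain one-position shift in place of the associative scan the episode mentions. The gradient flow through both passes, which is what lets them co-adapt, is not shown.

```python
import math

ALPHA = 0.03  # illustrative blend strength

def ffn(x):
    # Hypothetical stand-in for one layer's feedforward network.
    return math.tanh(x)

def exact_sequential(xs):
    """Reference: the true recurrence, one position at a time."""
    sticky, outs = 0.0, []
    for x in xs:
        sticky = ffn((1.0 - ALPHA) * x + ALPHA * sticky)
        outs.append(sticky)
    return outs

def two_pass(xs):
    """Approximation: each position depends only on precomputed values."""
    pass_one = [ffn(x) for x in xs]      # blend off, no dependence between positions
    shifted = [0.0] + pass_one[:-1]      # position t gets position t-1's pass-one output
    return [ffn((1.0 - ALPHA) * x + ALPHA * s)   # blend on, using the guessed sticky notes
            for x, s in zip(xs, shifted)]

xs = [math.sin(t) for t in range(1, 21)]
exact = exact_sequential(xs)
approx = two_pass(xs)
max_gap = max(abs(e - a) for e, a in zip(exact, approx))
```

In this toy, `max_gap` is nonzero but stays far below the blend strength itself, because the pass-one error is itself scaled by the blend before it reaches pass two.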

7:22Finn: And how wrong is the approximation?

7:24Bella: This is the elegant part. The blend coefficient is around three percent. Pass one is wrong because it omits the blend, so its FFN inputs are off by roughly three percent. The feedforward plus residual is a bounded transformation, so its outputs are off by roughly three percent. Then pass two blends those off-by-three-percent outputs in, *with weight three percent*. Three percent of three percent is one part in a thousand. The error squares. So the whole thing is exact to first order in the blend coefficient. That's what makes it tractable.
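
The "error squares" argument can be written out for a single layer. This is a sketch in assumed notation, not the paper's own derivation: x_t is the layer input, f is the feedforward map with Lipschitz constant L, s_t is the exact sticky note, and hats mark the two-pass quantities.

```latex
\begin{align*}
\text{exact:}\quad & \tilde h_t = (1-\alpha)\,x_t + \alpha\,s_{t-1}, \qquad s_t = f(\tilde h_t) \\
\text{pass one:}\quad & \hat s_t = f(x_t) \\
\text{pass two:}\quad & \hat h_t = (1-\alpha)\,x_t + \alpha\,\hat s_{t-1} \\[4pt]
& \lVert s_t - \hat s_t \rVert \le L\,\lVert \tilde h_t - x_t \rVert
   = L\,\alpha\,\lVert s_{t-1} - x_t \rVert = O(\alpha) \\
& \lVert \tilde h_t - \hat h_t \rVert = \alpha\,\lVert s_{t-1} - \hat s_{t-1} \rVert
   \le L\,\alpha^{2}\,\lVert s_{t-2} - x_{t-1} \rVert = O(\alpha^{2})
\end{align*}
```

With a blend strength of about 0.03, the squared factor is 0.0009, which is the "one part in a thousand" figure quoted above.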

8:00Finn: That is a nice piece of mathematics. Trade one approximation for parallelism, then trade another approximation of the same character for a different kind of parallelism, and at the end you've got a model that trains on a single GPU.

8:15Bella: A single NVIDIA RTX PRO 6000, in about six hours. That's the part of the paper that I think is going to make people pay attention regardless of what they think of the architecture. This isn't a frontier-lab moonshot. It's an intervention you could attempt on a Tuesday.

8:33Finn: And the results are loud. The author starts from a frozen Gemma three twenty-seven-billion-parameter model — a perfectly normal pretrained transformer — adds about three hundred and thirty thousand new parameters for the blend coefficients across all sixty-two layers, and fine-tunes on six thousand five hundred and seventy-nine grade-school math problems. G-S-M-eight-K. Then evaluates on, among other things, G-P-Q-A-Diamond, which is a benchmark of graduate-level science questions designed to be Google-proof.
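
A back-of-envelope check on the parameter count, assuming one learned blend coefficient per hidden dimension per layer. The hidden width of 5376 is an assumption taken from the published Gemma 3 27B configuration; the episode only says "more than five thousand" dimensions.

```python
num_layers = 62        # from the episode
hidden_width = 5376    # ASSUMPTION: Gemma 3 27B hidden size, not stated in the episode

# One blend coefficient per dimension per layer.
blend_params = num_layers * hidden_width
```

Under that assumption the total comes to 333,312, which lines up with the "about three hundred and thirty thousand" quoted above.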

9:07Bella: Trained on grade school. Evaluated on PhD.

9:10Finn: Trained on grade school, evaluated on PhD. The SST scores about sixty-one percent on G-P-Q-A-Diamond. A matched fine-tuned baseline — same data, same optimizer, same example ordering, only the architecture differs — scores about forty-six percent. The gap is fifteen points, attributable purely to the architectural mechanism.

9:32Bella: And that controlled comparison is the one that matters.

9:36Finn: There's also a more eye-catching number floating around — that this twenty-seven-billion-parameter model beats DeepSeek V3 at six hundred and seventy-one billion parameters, and several seventy-billion-plus open-weight models. That comparison is real, but the author flags it carefully: it's cross-paper, with different prompting conditions, different decoding strategies. The rigorous claim is the matched-baseline gap. The David-and-Goliath framing is contextualization, not a controlled win.

10:09Bella: Which I appreciate. The honesty there is unusual. Most papers would lean on the bigger number and bury the caveat.

10:16Finn: So the headline is: a small architectural change, trained cheaply, gives a fifteen-point reasoning gain on out-of-distribution PhD science questions. That alone would be a paper. But there's a second half that I think is actually more interesting, which is asking what the state stream is *doing* inside the model.

10:38Bella: This is the analytic contribution, and it's where the paper gets unusually rich. The state stream and attention are deeply entangled — you can't ablate one without affecting the other. So a clean ablation is impossible. Instead, Aviss compares the same model at different iteration depths. Same weights, same prompt, same architecture. The only varied factor is how many times the layer stack runs at each token position before emitting the next word.

11:08Finn: One re-read versus four re-reads of the same paragraph. Same model, deliberating a different number of times at each position.

11:15Bella: Right. And the question is: how much does the hidden state *change* between iteration depth one and iteration depth four? You'd expect, naively, gradual evolution. The model thinks a little harder, the representation shifts a little. What you actually see is something completely different. Most positions have a hidden-state overlap above ninety-nine percent across iteration depths. Iteration barely moves them. They're stable. But at specific positions — content-dependent, different questions trigger them at different places — the overlap collapses to as low as thirty percent. The hidden state has reorganized. Completely.

11:56Finn: So it's not "a little more thinking, a little different answer." It's mostly nothing happens, occasionally everything changes.

12:05Bella: Eighty-six percent of position-layer pairs are stable. Fourteen percent are dramatically reorganized. Two clean modes, not a continuous spread. Aviss calls these moments *basin shifts*. The metaphor I find useful is a marble rolling across a landscape. At most points the marble is sitting comfortably in a valley, and a nudge — an extra iteration — barely moves it. At certain points it's balanced on a ridge between two valleys, and a small nudge tips it decisively into one or the other. Once it commits to a valley, every position downstream is shaped by that choice.
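
One way to picture the stable-versus-shift split is to score each position by how much its hidden state at depth one overlaps with its hidden state at depth four. The sketch below assumes "overlap" means cosine similarity, which the episode does not specify. The 0.9 threshold and the example vectors are hypothetical.

```python
import math

def cosine_overlap(u, v):
    """Cosine similarity between two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def classify(overlap, threshold=0.9):
    # Hypothetical cutoff; the episode only reports above 0.99 versus as low as 0.30.
    return "stable" if overlap >= threshold else "basin shift"

depth1 = [1.0, 0.0, 2.0, -1.0]
depth4_stable = [1.01, 0.02, 1.98, -1.0]   # barely moved by extra iterations
depth4_shifted = [2.0, 3.0, 1.0, 0.0]      # reorganized

labels = [classify(cosine_overlap(depth1, v)) for v in (depth4_stable, depth4_shifted)]
```

Because the reported distribution is bimodal, the exact cutoff should matter little: almost nothing sits between the two modes.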

12:43Finn: Punctuated equilibrium. Long stable stretches and sudden reorganizations.

12:48Bella: Yes. And the reorganizations are causally prior to the model changing its mind about which word to emit. The hidden state shifts first. The argmax changes after. Never the other way around. Which is what tells you these aren't readout-noise events. These are commitments, in continuous latent space, that propagate.

13:10Finn: How big is the commitment, when it happens?

13:12Bella: At a basin shift, the median number of tokens replaced in the top hundred of the output distribution is fifty-four. Out of a hundred. The new winner can come from as deep as rank fifty in the original ranking. At a stable position, the median replacement is one. So these aren't the same process at different magnitudes. They're qualitatively different regimes.
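
The replacement count quoted here can be sketched as a set difference between two top-k lists. The helper names and the six-token toy vocabulary are invented for illustration, with k set to 3 instead of 100 so the example stays readable.

```python
def top_k_ids(logits, k):
    """Indices of the k largest logits."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(ranked[:k])

def replaced_in_top_k(logits_before, logits_after, k=100):
    """Count tokens that were in the top-k before but are not in it after."""
    return len(top_k_ids(logits_before, k) - top_k_ids(logits_after, k))

before = [5.0, 4.0, 3.0, 2.0, 1.0, 0.0]
after_stable = [5.1, 3.9, 1.5, 2.5, 1.0, 0.0]   # one token swaps out of the top 3
after_shift = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]    # the whole top 3 is replaced

stable_count = replaced_in_top_k(before, after_stable, k=3)
shift_count = replaced_in_top_k(before, after_shift, k=3)
```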

13:37Finn: And the floating-point sanity check?

13:41Bella: Right — a skeptic would worry the iteration deltas are just bf16 rounding noise. Aviss verifies that the median per-dimension perturbation exceeds bf16 rounding error by fifteen times globally and twenty-nine times at the deepest layers. With p-values vanishingly small. Not noise.
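
For a sense of scale on the rounding check: bfloat16 keeps 7 explicit mantissa bits, so the worst-case round-to-nearest error near a value is half the spacing between neighboring representable numbers. The sketch below computes that bound and expresses a perturbation as a multiple of it. The function names are hypothetical, and how the paper actually runs its test is not described in the episode.

```python
import math

BF16_EXPLICIT_MANTISSA_BITS = 7

def bf16_half_spacing(x):
    """Worst-case round-to-nearest error for a normal bfloat16 value near x (x != 0)."""
    _, exp = math.frexp(abs(x))   # abs(x) = m * 2**exp with 0.5 <= m < 1
    spacing = 2.0 ** (exp - 1 - BF16_EXPLICIT_MANTISSA_BITS)
    return spacing / 2.0

def noise_ratio(delta, x):
    """How many times larger a perturbation is than the rounding bound at x."""
    return abs(delta) / bf16_half_spacing(x)

ratio = noise_ratio(0.0625, 1.0)
```

A ratio near 1 would be consistent with rounding noise. Ratios of fifteen or more, as reported above, are not.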

14:00Finn: OK. So basin shifts are real, content-dependent, and causally upstream of the model's outputs. What follows from that?

14:08Bella: This is where the paper lands its most surprising finding. Across a hundred and ninety-eight G-P-Q-A-Diamond questions, every single one has a basin shift at position zero. The very first generated token. Before the model has written a single word of its answer.

14:26Finn: So the moment it picks up the pen, it's reorganizing.

14:31Bella: Every time. And the structure of that position-zero reorganization is different across questions. Some questions are ones where deeper iteration converges on the right answer — the model goes deeper and gets more confident in something correct. Others are ones where deeper iteration *breaks* an answer that was correct at iteration depth one. So-called overthinking failures. And here is the part I keep turning over. A small two-layer probe, trained to read the latent state at layer fifteen of position zero, can predict which kind of question it's looking at.

15:06Finn: Let me be careful about what that's saying. The probe doesn't make the model do anything. It just reads.

15:13Bella: Reads only. The probe is a passive observer of the model's internal state at the very first token. And from that read, it can predict whether the eventual answer — which the model has not yet started producing — will survive deeper iteration or break under it. The probe ends up reading a hundred and seven dimensions out of more than five thousand. About two percent of the hidden state. Scattered across the index range, not clustered.

15:40Finn: And it generalizes? Because there's an obvious memorization concern — maybe it just learned which specific questions are which.

15:48Bella: Aviss runs leave-one-out cross-validation against exactly that null. The probe is trained on forty-seven questions, tested on the held-out one, repeated for each of the forty-eight in the must-halt class. It correctly classifies twenty-nine of forty-eight held-outs. The probability of doing that by chance, if the probe were just memorizing, is about nine in ten thousand. There's also a capacity argument that there aren't enough probe parameters to memorize even if it wanted to.
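
The leave-one-out procedure itself is simple to sketch. The classifier below is a nearest-centroid stand-in rather than the two-layer probe from the paper, and the features and labels are toy values invented for illustration.

```python
def nearest_centroid_predict(train, query):
    """Stand-in classifier (hypothetical): return the label of the closest class mean."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1

    def squared_distance(label):
        return sum((s / counts[label] - q) ** 2 for s, q in zip(sums[label], query))

    return min(sums, key=squared_distance)

def leave_one_out_accuracy(dataset):
    """Train on all but one example, test on the held-out one, repeat for every example."""
    correct = 0
    for i, (features, label) in enumerate(dataset):
        train = dataset[:i] + dataset[i + 1:]
        if nearest_centroid_predict(train, features) == label:
            correct += 1
    return correct / len(dataset)

toy = [([0.1, 0.2], "halt"), ([0.0, 0.3], "halt"), ([0.2, 0.1], "halt"),
       ([0.9, 1.0], "continue"), ([1.1, 0.8], "continue"), ([1.0, 1.2], "continue")]
accuracy = leave_one_out_accuracy(toy)
```

The held-out example never contributes to the centroids it is tested against, which is what lets the procedure separate generalization from memorization of specific questions.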

16:18Finn: So the model contains, in its very first hidden state, information about a computation it hasn't performed yet.

16:25Bella: That's the claim. And to be careful — "knows in advance" is loaded language. What the experiment actually shows is that a statistical signature in the position-zero state correlates with downstream stability. Whether that's the model "predicting" anything in any forward-looking sense, or just exposing structural information about how the question is being represented — that's a deeper interpretive question. But the empirical fact is striking. The decision about whether the answer will hold up is, in some sense, made before any word has been said.

17:02Finn: And this matters because the overthinking effect is real and measurable. If you take the SST and force every question to flat iteration depth four — every position gets four passes through the layer stack — accuracy *drops*. From fifty-one percent at depth one to forty-three percent at flat depth four. Below the matched baseline. Thirty out of a hundred and one correct depth-one answers regress to wrong.

17:29Bella: Uniform compute is actively harmful.

17:31Finn: Right. Different questions need different amounts of thinking, and using the same amount everywhere is worse than using less everywhere. Which is exactly what the halting probe is for. You read position zero, decide whether more iteration will help or hurt, and route accordingly. With that routing, zero correct answers are broken on the evaluation set. The four failures it does have are all conservative: halting too early on questions that were already wrong.
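
Routing on the probe's read can be sketched as a per-question depth choice. The score semantics (higher means "deeper iteration is likely to break this answer"), the 0.5 threshold, and the example scores are all assumptions for illustration. Only the depth values of one and four come from the episode.

```python
def choose_depth(halt_score, threshold=0.5, max_depth=4):
    """Pick depth 1 when the probe predicts deeper iteration would hurt, else go deep."""
    return 1 if halt_score >= threshold else max_depth

# Hypothetical probe scores read at position zero, one per question.
probe_scores = {"q1": 0.91, "q2": 0.12, "q3": 0.55}
depths = {question: choose_depth(score) for question, score in probe_scores.items()}
```

This treats iteration as a per-question budget: the probe decides who gets the extra passes, instead of every question paying for them.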

18:01Bella: Aviss has a nice line about this. Iterations should function as a per-question deliberation budget, not a blanket requirement for coherence. Which I think matters because the field has a habit of treating "more compute" as monotonically good. This work documents pretty cleanly that it isn't.

18:20Finn: Bella, let me push on a few things. Because this is a genuinely impressive paper and the author is unusually honest about limitations, but a careful listener should hear them voiced clearly.

18:32Bella: Go.

18:33Finn: First, the headline external comparison — beating DeepSeek V3 and a clutch of seventy-billion-plus models — is cross-paper. Different prompting, different decoding, different evaluation conditions. Cross-paper comparisons in this regime are notoriously unreliable. The matched-baseline result is the controlled science. The bigger-model framing is contextualization. The author flags this. We should too.

18:59Bella: Agreed. The matched-baseline gap is the experiment that's controlled. The other comparison is a banner, not a result.

19:06Finn: Second, the halting probe is small. Forty-eight must-halt questions on a single benchmark, with leave-one-out cross-validation rather than a held-out test set. The result is real. Generalizing it beyond G-P-Q-A-Diamond — to other domains, to non-reasoning tasks — requires more data. The author flags this as future work.

19:27Bella: And the iteration ceiling at four is somewhat arbitrary.

19:30Finn: Bounded by deployment practicality, not by where the architecture saturates. We don't know whether deeper iteration would extend the gains, plateau, or regress further. Given the overthinking pattern, regression is plausible.

19:45Bella: What about the approximation quality of the two-pass training itself?

19:49Finn: That's the one I think a careful methodologist would push hardest on. The order-α-squared bound is a worst-case analytic argument under fixed-weight assumptions. The author argues co-adaptation tightens it further, and the downstream evaluation gap is the proof of the pudding. But there's no direct measurement of how training with the approximation compares to the intractable exact version on a small enough toy. The bound is asserted; the gap to exact training is not measured.

20:20Bella: And single backbone, single benchmark family.

20:23Finn: Demonstrated only on Gemma three twenty-seven-billion. Trained on grade-school math. Evaluated heavily on math and reasoning benchmarks. The mathematical argument applies to any transformer with a Lipschitz gated FFN — meaning most modern models — but empirical universality is unverified. Whether the state stream helps creative writing, factual recall, instruction following, or multilingual tasks is genuinely unknown and explicitly listed as future work.

20:53Bella: So the steelman is: small training set, single architecture tested, evaluation tilted toward the regime the method should help, approximation bounded rather than measured, halting probe at proof-of-concept scale.

21:06Finn: That's the steelman. None of which, to be clear, undermines the central result. The matched-baseline gap is real, the basin-shift analysis is mechanistically illuminating, and the position-zero finding is genuinely surprising regardless of how it generalizes. But the headline numbers want some caveats sitting next to them.

21:28Bella: For me, Finn, the part that lands is the reframe of where reasoning compute could come from. The default story of how to make models smarter has been: scale parameters, or scale tokens — chain-of-thought. Both expensive. This paper sketches a third axis. Latent compute. Run the same parameters more thoughtfully, with internal state that persists, and you might get reasoning gains without growing the model or growing the output.

21:56Finn: And the cost structure is different. Three hundred and thirty thousand new parameters. Six thousand five hundred and seventy-nine training examples. Six hours on one GPU. Whatever else this turns out to be, it's not a frontier-lab-only intervention. If it replicates on other backbones, a lot of people are going to try it.

22:17Bella: The other thing I keep coming back to is the basin-shift framework as a handle on what "thinking in latent space" actually looks like. Before this paper, "the model is reasoning continuously in its hidden states" was a metaphor. After this paper, it's something with statistical signatures — qualitatively distinct stable and shift regimes, content-dependent triggering, causal precedence over output changes. That's a measurable thing, not a fuzzy one.

22:45Finn: And the position-zero finding is the part I expect to spawn follow-ups. Whatever the model is doing when it picks up the pen, it's doing more of it than we knew. The decision about whether what it's about to say will hold up isn't made when the words come out. It's made earlier. In a representation we can't see directly, but, evidently, can probe.

23:07Bella: The show notes have a link to the paper and some related reading if you want to pull on this thread further.

23:14Finn: Thanks for listening to AI Papers: A Deep Dive.