When Better Fine-Tuning Can't Help: A Geometric Impossibility in LLM Causal Reasoning

0:00Cassidy: Here's a result that should make you stop. A fine-tuned RoBERTa, trained on more than a million examples of a causal reasoning task, gets 35 percent on the hardest version of the test. Random guessing would be 50. The model isn't just plateauing — it's confidently picking the wrong answer.

0:21Eric: And the paper explaining why went up on arXiv on May twenty-sixth, twenty-twenty-six. We're recording two days later. Quick ground rules: this is an AI-generated deep dive. The script is from Anthropic's Claude Opus 4.7. I'm Eric, that's Cassidy, and we're both AI voices from Eleven Labs. The producer has no affiliation with either company. The paper is "Why LLMs Fail at Causal Discovery and How Interventional Agents Escape," from Amartya Roy at IIT Delhi and Robert Bosch, and Sonali Parbhoo at Imperial College London — and the reason that 35-percent number matters is that it isn't a tuning problem. The authors prove it's a geometric impossibility.

1:10Cassidy: Right — and "prove" is the strong word in that sentence. The paper has two halves. There's a negative result: a clean impossibility theorem showing that any LLM trained the way LLMs are normally trained cannot reliably do this task. And there's a constructive escape: a method that takes the same frozen model — no new weights, no fine-tuning — and pushes accuracy from about 27 to about 73 just by changing how you ask the question. Let me set up the puzzle first, because the puzzle is genuinely concrete. Imagine three variables: A, B, C. Two possible stories for why they're correlated. Story one is a chain — A causes B, B causes C. Story two is a fork — there's a common cause B that drives both A and C separately. Now, here's the thing that makes causal discovery hard. Those two stories make exactly the same statistical predictions. A and C are correlated in both. Condition on B, and the correlation vanishes in both. From observation alone — no matter how much data you collect — you cannot distinguish a chain from a fork.

2:27Eric: But you can, if you're willing to act instead of just watch. If you reach in and force A to take some value, which causal inference people write as do(A), then under the chain, C changes. Under the fork, C is untouched. One intervention, one bit of information, and the ambiguity collapses. That's the entire engine of the paper. Observation can't separate them. Intervention can.
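The chain-versus-fork point can be checked exactly on a toy example. The sketch below uses binary variables with made-up probabilities (the numbers and function names are illustrative assumptions, not taken from the paper). It builds a chain and a fork with the same joint distribution, then compares what each predicts for C when A is forced.

```python
from itertools import product

def bern(p_one, x):
    """Probability that a binary variable with P(X=1) = p_one takes value x."""
    return p_one if x == 1 else 1.0 - p_one

# Illustrative parameters (assumptions chosen so both stories share one joint).
P_C_GIVEN_B = {0: 0.2, 1: 0.8}  # used by both stories
P_B_GIVEN_A = {0: 0.1, 1: 0.9}  # chain: A -> B
P_A_GIVEN_B = {0: 0.1, 1: 0.9}  # fork:  B -> A

def chain_joint(a, b, c):
    """Chain A -> B -> C with P(A=1) = 0.5."""
    return bern(0.5, a) * bern(P_B_GIVEN_A[a], b) * bern(P_C_GIVEN_B[b], c)

def fork_joint(a, b, c):
    """Fork A <- B -> C with P(B=1) = 0.5."""
    return bern(0.5, b) * bern(P_A_GIVEN_B[b], a) * bern(P_C_GIVEN_B[b], c)

def chain_c_given_do_a(a):
    """P(C=1 | do(A=a)) in the chain: the forced value still flows through B."""
    return sum(bern(P_B_GIVEN_A[a], b) * P_C_GIVEN_B[b] for b in (0, 1))

def fork_c_given_do_a(a):
    """P(C=1 | do(A=a)) in the fork: forcing A cuts B -> A, so C ignores a."""
    return sum(bern(0.5, b) * P_C_GIVEN_B[b] for b in (0, 1))

# Largest disagreement between the two observational joints over all 8 states.
max_joint_gap = max(
    abs(chain_joint(*state) - fork_joint(*state))
    for state in product((0, 1), repeat=3)
)

print(f"max observational gap: {max_joint_gap:.1e}")
for a in (0, 1):
    print(f"do(A={a}): chain P(C=1)={chain_c_given_do_a(a):.2f}, "
          f"fork P(C=1)={fork_c_given_do_a(a):.2f}")
```

With these parameters the two joints agree up to floating-point rounding, so no amount of observational data separates them. Under do(A), the chain moves P(C=1) between 0.26 and 0.74, while the fork stays at 0.50 either way.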

2:54Cassidy: So the benchmark that started this line of work is called Corr2Cause. Came out of Jin and colleagues in 2023. It hands the LLM an English paragraph describing the correlations and conditional independencies among some variables, plus a candidate causal claim, and asks: yes or no. The paragraph has enough information, in principle, to nail down what's called the Markov equivalence class — the set of graphs that all produce these correlations. But sometimes the right answer requires going beyond that class, and that's where things get interesting.

3:33Eric: GPT-4 scores about 29 on this thing zero-shot. Fine-tuning helps a lot on the training distribution — you can get a fine-tuned LLaMA-7B up to 92 — but rename the variables, change the graph size, and it collapses. The new benchmark in this paper, Extended Corr2Cause, scales it up to 24 variables. That's where the fine-tuned models go below random.

3:57Cassidy: Below random in a specific way. Not "noisy around 50 percent." Confident, systematic, wrong. The model has learned a decision rule on small graphs that anti-correlates with the truth on big ones. So the natural question is: what's actually going on?

4:15Eric: And this is where the paper does something I haven't seen done well before. Most of the LLM-fails-at-X literature is empirical — here's a benchmark, here's a number, here's a clever prompt that helps a little. Roy and Parbhoo go theoretical. They argue that the three dominant ways LLMs get trained — supervised fine-tuning, DPO, and in-context learning — all produce what's called a kernel predictor. Cassidy, this is the hardest concept in the paper. Want to take it?

4:46Cassidy: Yeah. Think of a kernel predictor as a very sophisticated similarity matcher. When you give it a new input, it doesn't reason from first principles. It looks at the training examples it has seen, asks how similar the new input is to each of them, and produces an output that's roughly a weighted average — heavy weight on the examples that look most similar. The key property is smoothness. If two inputs look 99 percent alike to the similarity function, the outputs the model produces have to be close. There's a mathematical ceiling on how much the output can swing as the input swings. That ceiling is what the paper cashes in.
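To make the "similarity matcher" picture concrete, here is a minimal sketch of one simple similarity-weighted predictor, a Nadaraya-Watson smoother with a Gaussian kernel. This is a swapped-in textbook technique chosen for illustration, not the NTK predictor the paper analyzes, and the training data and bandwidth are made up.

```python
import math

def rbf(x, y, bandwidth=1.0):
    """Gaussian similarity between two scalar inputs."""
    return math.exp(-((x - y) ** 2) / (2 * bandwidth ** 2))

def kernel_predict(x, train_x, train_y, bandwidth=1.0):
    """Similarity-weighted average of training labels (Nadaraya-Watson)."""
    weights = [rbf(x, xi, bandwidth) for xi in train_x]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, train_y)) / total

# Hypothetical training set: label flips from 0 to 1 between x=1 and x=2.
TRAIN_X = [0.0, 1.0, 2.0, 3.0]
TRAIN_Y = [0.0, 0.0, 1.0, 1.0]

for x in (1.49, 1.50, 1.51):
    print(f"x={x:.2f} -> prediction {kernel_predict(x, TRAIN_X, TRAIN_Y):.4f}")
```

Two inputs that sit almost on top of each other get almost the same weights over the training set, so their predictions cannot differ by much. That is the smoothness property described above, in one dimension.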

5:27Eric: So why is a giant transformer a kernel predictor? They're explicitly nonlinear. They have hundreds of billions of parameters. That's not what I think of when I hear "similarity matcher."

5:40Cassidy: Right, this is the move. There's a result from theoretical ML over the last several years — the Neural Tangent Kernel literature — that says: when you train a very wide network and the weights barely move from initialization, the network is "lazy," and in that regime it provably behaves like a kernel predictor with a specific kernel determined by the architecture. The paper imports that result and argues that SFT, DPO, and in-context learning all keep the model close enough to that regime that the kernel-predictor characterization holds. The details of how SFT and DPO and ICL each get there are technical, and the paper does the derivations in the appendix. The upshot is: three different-looking training paradigms, same mathematical creature underneath.

6:32Eric: OK. Accept the framing: the trained model behaves like a similarity matcher. Now what?

6:38Cassidy: Now you put it next to the causal hypotheses. Take that chain-versus-fork pair. The premise that describes the correlations — the paragraph the model reads — is identical for both graphs, because both produce the same correlations. The hypothesis text is the part that differs, and that part differs in maybe two or three tokens out of three hundred. At 24 variables, the competing hypotheses share more than 99 percent of their input. So now apply the kernel-predictor ceiling. The maximum gap a similarity-based model can put between two outputs is bounded by two things: how different the inputs look, and how big the model's weights are. If the inputs share 99 percent of their tokens, the gap is tiny — unless you let the weights grow. And the weights growing is exactly what "lazy regime" forbids. That's Theorem 1. The paper does it with a Cauchy-Schwarz argument, but the picture is just: similar inputs, bounded weights, bounded output gap.
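The "similar inputs, bounded weights, bounded output gap" picture can be sketched numerically. The example assumes a predictor of the form f(x) = <w, phi(x)>, for which Cauchy-Schwarz gives |f(x) - f(x')| <= ||w|| * ||phi(x) - phi(x')||. The feature map here is a normalized bag of tokens, a deliberately crude stand-in for the NTK features; the tokens, the weight-norm budget, and the random weights are all illustrative assumptions, and the paper's exact statement of Theorem 1 may differ.

```python
import math
import random

def features(tokens, vocab):
    """L2-normalized bag-of-tokens vector (a stand-in feature map, not the NTK)."""
    vec = [float(tokens.count(t)) for t in vocab]
    length = math.sqrt(sum(v * v for v in vec))
    return [v / length for v in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical inputs: a shared 300-token premise, hypotheses differing in one token.
premise = [f"tok{i}" for i in range(300)]
input_a = premise + ["X1", "causes", "X3", "directly"]
input_b = premise + ["X1", "causes", "X3", "indirectly"]
vocab = sorted(set(input_a) | set(input_b))

phi_a = features(input_a, vocab)
phi_b = features(input_b, vocab)
feature_distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(phi_a, phi_b)))

# Random weight vector rescaled to an assumed norm budget W.
W = 5.0
rng = random.Random(0)
raw = [rng.gauss(0.0, 1.0) for _ in vocab]
raw_norm = math.sqrt(dot(raw, raw))
w = [W * r / raw_norm for r in raw]

gap = abs(dot(w, phi_a) - dot(w, phi_b))
bound = W * feature_distance  # Cauchy-Schwarz ceiling on the output gap

print(f"feature distance: {feature_distance:.4f}")
print(f"output gap: {gap:.4f} <= bound: {bound:.4f}")
```

In this toy setup the feature distance comes out near 0.08, so any weight vector within the norm budget can separate the two outputs by at most W times that amount. The only way to widen the gap is to raise W, which is exactly what a bounded-weight regime rules out.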

7:43Eric: And the scaling goes in the wrong direction. As you add variables, the premise has to describe order-d-squared pairwise relationships. The text describing the correlations balloons. But the structural difference between two near-miss hypotheses stays roughly constant — a handful of tokens. The disagreement gets drowned in a sea of identical text. The cleanest analogy I have for this: two three-hundred-page legal contracts that are identical except for one clause buried on page 147 that flips "must" to "must not." A reader scoring documents by overall similarity rates them as essentially the same document. The disagreement is real and consequential, but it's a vanishing fraction of the text. That's what near-miss causal hypotheses look like to a kernel predictor.
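A rough back-of-the-envelope version of that scaling, as a sketch: assume the premise spends a fixed number of tokens on each of the d(d-1)/2 variable pairs, and that two near-miss hypotheses differ in a fixed handful of tokens. The per-pair and per-hypothesis token counts below are invented for illustration, not measured from the benchmark.

```python
def shared_fraction(d, tokens_per_pair=12, hypothesis_tokens=10, differing_tokens=3):
    """Fraction of input tokens shared by two near-miss hypotheses on d variables.

    All three token counts are illustrative assumptions.
    """
    pairs = d * (d - 1) // 2
    total_tokens = pairs * tokens_per_pair + hypothesis_tokens
    return 1.0 - differing_tokens / total_tokens

for d in (3, 7, 12, 24):
    print(f"d={d:2d}: shared fraction = {shared_fraction(d):.4f}")
```

Under these assumptions the shared fraction climbs steadily with d and passes 99 percent well before 24 variables, while the number of differing tokens never changes.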

8:36Cassidy: And the way to feel why this is structural and not a matter of effort: imagine a witness who identifies suspects by overall resemblance — height, build, clothes. You show them photographs of identical twins in matching outfits and ask: which one robbed the bank? The witness can't reliably answer. They aren't being lazy or stupid. Their tool is the wrong tool for this comparison. The fix isn't a better witness. The fix is fingerprints — a different kind of evidence that actually differs between the twins.

9:10Eric: Which is exactly the move the second half of the paper makes. If the obstruction lives inside the model's representation space — the space where everything looks similar — then move the decision out of that space. Don't ask the model the global question, "which graph is it?" Ask it a sequence of local questions whose answers actually differ between the graphs, and combine those answers in an external loop.

9:39Cassidy: And the local question is exactly the chain-versus-fork distinction we started with. "If we intervene and set X1 to some value, does X3 change?" Under the chain, yes. Under the fork, no. Different graphs give different answers — not 99-percent-similar answers, just yes versus no. The kernel similarity between "yes" and "no" responses doesn't degrade with graph size. The obstruction simply isn't there for this question.

10:09Eric: They call the method A-CBO — Agentic Causal Bayesian Optimization. Mechanically, picture a game show with a panel of suspects. Phase one: ask the LLM to propose a set of candidate graphs that are consistent with the premise. Maybe sixteen of them. Phase two: in each round, figure out which intervention question would split the surviving candidates most evenly — this is done with classical information theory, no LLM needed. Ask the LLM that one question. Take the answer, with majority voting over a few samples for noise robustness, and update the posterior over the candidates using Bayes' rule. Repeat.
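The loop described here can be sketched in a few dozen lines. This is a minimal illustration under loud assumptions, not the authors' implementation: candidates are encoded as integers 0 to 15, each intervention query is a bit position, a candidate "predicts yes" when that bit is set, phase one is replaced by a fixed candidate list, and the LLM is replaced by a simulated noisy oracle. Picking the query whose "yes" mass is closest to one half is used as a simple proxy for maximizing expected information gain.

```python
import random

def predicts_yes(candidate, query):
    """Hypothetical encoding: candidate answers 'yes' to query j iff bit j is set."""
    return (candidate >> query) & 1 == 1

def pick_query(posterior, queries):
    """Choose the query that splits the posterior mass most evenly."""
    def imbalance(q):
        yes_mass = sum(p for c, p in posterior.items() if predicts_yes(c, q))
        return abs(yes_mass - 0.5)
    return min(queries, key=imbalance)

def majority_vote(oracle, query, samples=3):
    """Ask the oracle several times and keep the majority answer."""
    yes_votes = sum(1 for _ in range(samples) if oracle(query))
    return yes_votes * 2 > samples

def bayes_update(posterior, query, answer, error_rate):
    """Reweight each candidate by how well it explains the observed answer."""
    updated = {}
    for c, p in posterior.items():
        agrees = predicts_yes(c, query) == answer
        updated[c] = p * ((1.0 - error_rate) if agrees else error_rate)
    total = sum(updated.values())
    return {c: p / total for c, p in updated.items()}

def run_loop(candidates, queries, oracle, rounds=12, error_rate=0.1):
    """Phase two only: query, vote, update, repeat for a fixed number of rounds."""
    posterior = {c: 1.0 / len(candidates) for c in candidates}
    for _ in range(rounds):
        q = pick_query(posterior, queries)
        answer = majority_vote(oracle, q)
        posterior = bayes_update(posterior, q, answer, error_rate)
    return posterior

def make_noisy_oracle(true_candidate, error_rate, seed=0):
    """Simulated stand-in for the LLM: flips the true answer with some probability."""
    rng = random.Random(seed)
    def oracle(query):
        truth = predicts_yes(true_candidate, query)
        return truth if rng.random() >= error_rate else not truth
    return oracle

posterior = run_loop(list(range(16)), list(range(4)), make_noisy_oracle(11, 0.1))
best = max(posterior, key=posterior.get)
print(f"most probable candidate: {best}, posterior mass: {posterior[best]:.3f}")
```

The design point is that the discrete decision lives in `run_loop`, outside the model. The oracle only ever answers one local yes-or-no question at a time.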

10:53Cassidy: And the convergence is fast. Each round, on average, multiplies the correct candidate's weight by a factor greater than one relative to the wrong ones. So belief concentrates geometrically — each round shrinks the live hypothesis set by a constant factor, with a noisy oracle in the loop. The theorem says you need roughly log-n rounds, where n is the number of candidates. In practice, the posterior collapses onto the correct graph within 8 to 12 rounds across the graph sizes they test.

11:28Eric: That's the empirical headline — the round count stays in that 8 to 12 range whether the underlying graph has 7 variables or 24.

11:37Cassidy: Worth being careful about what the theorem actually says, though. The convergence rate depends on the number of candidate hypotheses n and on how reliable the LLM is as an oracle. It doesn't depend on the near-miss similarity that defeats the kernel predictor — the thing that was poisonous to the global question is simply irrelevant to the local questions. What the theorem does *not* directly say is that n is independent of graph size. In practice, the candidate set is generated by the LLM in phase one, and you'd expect that pool to grow with the graph. The fact that empirical round counts stay flat is a finding, not a corollary.

12:21Eric: Fair. Still, the obstruction doesn't get outsmarted or scaled away. The decision gets relocated to a place where the obstruction doesn't exist, and empirically, the relocation holds up across graph sizes.

12:35Cassidy: Eric, the spoon-and-knife framing in the paper's context captures this well. If you're trying to slice a tomato with a spoon, sharpening the spoon doesn't help — it's the wrong geometry. The fix is to recognize that cutting and scooping are different operations and reach for a knife. A-CBO does the analogous thing. Instead of upgrading the LLM, it changes which operation the LLM performs. The model stops trying to *decide* and starts merely *answering* local questions. The decision happens elsewhere.

13:09Eric: So the obvious question: does this actually work?

13:13Cassidy: The numbers are striking enough that they deserve their own beat. Take a frozen LLaMA-7B. Zero-shot on the original Corr2Cause, it scores about 27. Wrap it in A-CBO, with the same model and the same weights, and it scores about 73. That's a swing of roughly 45 points from the decision architecture alone.

13:31Eric: And on the original benchmark, A-CBO with a high-tier frozen model, GLM-5.1, matches a fine-tuned LLaMA that was trained on roughly 200 thousand in-distribution examples. Zero training versus about 200 thousand examples. Same accuracy.

13:47Cassidy: The Extended benchmark is where the comparison gets brutal. The fine-tuned RoBERTa baseline we opened with — 35 percent on the 21-to-24 variable instances. Random is 50. The DPO version: about 43. Both fine-tuned models are below random. A-CBO with the same high-tier frozen model holds at about 80 percent at those depths. And the gap widens with graph size. The fine-tuned models get worse. A-CBO stays flat. The architectures are diverging.

14:18Eric: There's one detail in here that I think is the most interesting empirical finding in the paper, and it isn't actually the headline number. It's the *direction* of the fine-tuned collapse. A fine-tuned model on a hard task usually degrades toward noise. Accuracy drifts toward 50 percent on a binary problem because the model becomes uncertain. That's not what's happening. The fine-tuned models here are confident and wrong. They're scoring 35 percent because they've learned a decision rule that systematically points the wrong way at high d. The paper's interpretation — consistent with the kernel-predictor story — is that surface features that correlated with the right answer on small graphs start anti-correlating on large ones. The model isn't out of its depth. It's been trained to lean on a signal that flips sign as the problem scales.

15:15Cassidy: Eric, that's the part I keep coming back to, because it changes what "scaling helps" means in this context. The conventional fine-tuning failure mode is "the model didn't learn enough." This one is "the model learned something that systematically misleads it." Those are very different failure modes, and they call for very different responses.

15:39Eric: Right. And let me push on the paper's framing now, because the result is strong, but there are load-bearing assumptions that deserve airtime. The biggest one — and the authors flag this themselves — is the oracle reliability assumption. The convergence theorem assumes the LLM answers each interventional query correctly with probability above one half. The paper sets that error rate to 10 percent throughout. Comfortable margin. But nobody measures what the actual oracle error rate is on these queries, especially for low-tier models or hard graphs. If the error rate drifts toward one half as the queries get harder, the log-n convergence guarantee becomes vacuous. You can essentially import the failure mode into the noise term and declare victory. The paper acknowledges this — they say low-tier models "approach random accuracy on large graphs" — but they don't quantify it. And the experimental results show exactly this stratification: high-tier models do well in A-CBO, low-tier ones don't. That's circumstantial evidence that oracle reliability is doing real work.
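A quick calculation shows why the one-half threshold matters so much. The sketch below uses two standard quantities, neither quoted from the paper: the error rate of a majority vote over independent answers, and the expected log-likelihood-ratio gain per query for the true hypothesis over a rival that disagrees on that query, which equals (1 - 2*eps) * ln((1 - eps) / eps). Treating repeated LLM samples as independent is an assumption.

```python
import math

def majority_error(eps, samples=3):
    """Error rate of a majority vote over `samples` independent answers (odd count),
    each wrong with probability eps."""
    needed = samples // 2 + 1
    return sum(
        math.comb(samples, k) * eps ** k * (1 - eps) ** (samples - k)
        for k in range(needed, samples + 1)
    )

def evidence_per_query(eps):
    """Expected log-odds gain (in nats) for the truth over a rival that disagrees
    on the query, given oracle error rate eps with 0 < eps < 1."""
    return (1 - 2 * eps) * math.log((1 - eps) / eps)

for eps in (0.10, 0.30, 0.45, 0.49):
    print(f"eps={eps:.2f}: majority-of-3 error={majority_error(eps):.3f}, "
          f"evidence per query={evidence_per_query(eps):.4f} nats")
```

At an error rate of 10 percent, a three-way vote brings the effective error down to 2.8 percent and each informative query carries well over one nat of evidence. As the error rate approaches one half, the evidence per query goes to zero, so the number of rounds needed grows without bound even though the log-n form of the guarantee is technically unchanged.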

16:51Cassidy: The second thing I'd push on is the gap between what's proven and what's demonstrated. The theorem is about kernel predictors. The empirical claim is about real LLMs. The NTK characterization of LLM training is theoretically standard, but whether real fine-tuning actually stays in the lazy regime is genuinely debated. Some empirical work suggests it doesn't. If practical SFT departs from lazy in important ways, the theorem still holds as a statement about a mathematical object, but its grip on the empirical phenomenon weakens. The right way to read the paper, I think, is: the impossibility theorem characterizes a clean obstruction in a stylized setting, and the empirical results show that something *shaped like* that obstruction is happening in practice. The shapes match. But you wouldn't say the theorem caused the experiments.

17:46Eric: And there's a question of what the benchmark actually measures. Corr2Cause and the extended version both feed the LLM an English description of statistical relationships and ask it to deduce structure. That's deductive reasoning over text, not statistical discovery from data. The authors address this and argue the task is legitimately causal discovery in the formal sense — defensible, but "reasoning about a structured natural-language description of a graph" and "discovering causal structure from observations" aren't quite the same activity. The gap is wider than the framing sometimes implies.

18:24Cassidy: A related point on the Extended benchmark specifically. The labels are all negative: the metric is whether the model correctly identifies that no valid causal relation holds. A model with a slight bias toward "no" could score well without doing the underlying reasoning. The fine-tuned baselines falling *below* random does argue against a trivial no-bias story, since they're clearly not just defaulting to "no", but A-CBO's numbers come from a benchmark with a particular structure, and "scaling to 24 variables" should be read with that in mind.

18:58Eric: One more. A-CBO assumes the candidate hypothesis set generated in phase one contains the true graph. If the LLM proposes sixteen candidates and the right one isn't in the set, the Bayesian loop concentrates on whichever wrong answer is closest to the data. The paper doesn't characterize that failure mode. For small graphs it probably doesn't matter. For 24 variables, the number of possible graphs is astronomical, and phase-one is essentially "ask the LLM to propose candidates." That's a place where the architecture could quietly fail in a way the empirical numbers wouldn't surface.
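That failure mode is easy to reproduce in a toy setting. In the sketch below, each graph is reduced to a tuple of yes/no answers to four intervention queries, the true graph is deliberately left out of the candidate set, and the oracle answers every query correctly. The encoding, the candidates, and the 10 percent error rate used in the likelihood are all illustrative assumptions.

```python
def posterior_after_answers(candidates, answers, error_rate=0.1):
    """Bayes posterior over candidates after observing one answer per query."""
    weights = {}
    for cand in candidates:
        w = 1.0
        for predicted, observed in zip(cand, answers):
            w *= (1.0 - error_rate) if predicted == observed else error_rate
        weights[cand] = w
    total = sum(weights.values())
    return {cand: w / total for cand, w in weights.items()}

# Hypothetical setup: the true graph's answers to four intervention queries.
true_graph = (1, 0, 1, 1)
# Phase one missed the truth; these candidates disagree with it on 1, 4, and 2 queries.
candidates = [(1, 0, 1, 0), (0, 1, 0, 0), (0, 0, 0, 1)]

posterior = posterior_after_answers(candidates, answers=true_graph)
best = max(posterior, key=posterior.get)
print(f"truth in candidate set: {true_graph in posterior}")
print(f"posterior concentrates on {best} with mass {posterior[best]:.3f}")
```

Because the posterior is normalized over whatever candidates exist, it still sums to one and still looks confident. Here roughly 90 percent of the mass lands on the nearest wrong graph, and nothing inside the loop signals that the right answer was never on the list.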

19:36Cassidy: Those are real. Let me steelman the paper a bit, because I think the contribution survives them. The impossibility theorem isn't trying to be a precise quantitative model of every fine-tuned LLM. It's trying to identify a class of tasks where the standard playbook — better data, better training, bigger model — provably cannot work, and to give that class a clean characterization. That's a useful piece of intellectual hygiene. It tells you where to stop pushing on a lever that isn't moving. And A-CBO is really a design pattern, not just an algorithm. The lesson: when your underlying model is a sophisticated similarity matcher, and the discrimination you need can't fit inside the similarity geometry, move the discrete decision out of the model and into an external loop that uses the model as an oracle for local queries. That pattern almost certainly applies beyond causal discovery — anywhere a frozen LLM is asked to make a global judgment between hypotheses that look textually similar.

20:41Eric: Yeah. That's the part I think will outlive the specific result. The framing of "stop asking the LLM to be the judge" is the most portable piece of the paper. Even if every theorem here gets refined or replaced, that architectural prescription stays useful.

20:58Cassidy: There's also something I appreciate about the methodological discipline. The paper does a clean ablation. Same frozen model, ask it the global question zero-shot, then wrap it in A-CBO. Every model improves by 45 to 60 points. The weights didn't change. The decision architecture changed. That's about as clean as ablation evidence gets in this space.

21:21Eric: So where does this land? The impossibility result is structural. For distinguishing observationally equivalent causal hypotheses from text descriptions, the standard tools — better fine-tuning, more data, scale — don't help. Worth knowing, because a lot of effort gets poured into approaches the theory says can't work. The constructive fix generalizes. It's not really a causal-discovery algorithm. It's a way of decomposing a hard global decision into easy local queries and letting a frozen LLM serve as a per-query oracle inside an external decision loop. Broadly applicable. And — the one I keep coming back to — fine-tuned models going below random on hard instances is not a soft failure mode. It's a model that has learned a decision boundary that systematically misleads on the hardest cases. If you care about reliability at scale in any domain where the input distribution shifts toward harder instances, that should change how you evaluate fine-tuned systems.

22:28Cassidy: The practical impact depends on whether the architectural lesson translates outside synthetic benchmarks. Real causal discovery from noisy data has messier oracles and more ambiguous interventions. The authors are clear that they only evaluate on synthetic textual premises about toy graphs. So what's proven is a clean impossibility in a stylized setting, and what's demonstrated is impressive numbers on a benchmark of structured English descriptions about toy graphs. Both are real contributions. They aren't the same as "LLMs can now do causal discovery in medicine." The paper, in places, leans toward that grander framing. The result doesn't quite carry it yet.

23:15Eric: What I find compelling, though, is that the paper doesn't just say "LLMs are bad at this." It says: here's exactly what's bad about it, here's a proof, and here's the architectural change that follows from the proof. That's a higher standard of argument than most of this literature reaches.

23:36Cassidy: The cleanest way to summarize the whole thing: when the model is a similarity matcher, ask it questions that don't require it to distinguish things that look the same. Chain versus fork. Identical correlations. One intervention. One bit. Done.

23:53Eric: Paper's linked in the show notes, along with a few related reads if you want to keep pulling on this thread.

24:00Cassidy: And if you want the full transcript with the technical terms tappable and definitions inline, plus the concept pages that connect this episode to the others we've done, that's all on paperdive dot AI.

24:13Eric: Thanks for listening to AI Papers: A Deep Dive.