When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions

0:00Juniper: Brazil, nineteen-eighty-six. The country is in the middle of one of the worst hyperinflations in recorded history — prices doubling, then doubling again, then doubling again. A few weeks ago, a research group fed that exact stretch of historical inflation data into Claude Opus 4.6 — the frontier model, the one you'd most trust to think carefully about a complicated situation — and asked it to forecast what came next. And the model did think carefully. It wrote out, in its own response — and I'm quoting almost verbatim — "hyperinflation could also stabilize through currency reform, adding downside uncertainty. But following the trend..." And then it produced a median forecast roughly seven million times above what actually happened.

0:50Finn: Seven million times. Not seven times — seven million.

0:54Juniper: Right. The model articulated the regime-change possibility, weighed it on the page, decided against it, and committed to extrapolating the exponential. That moment is, more or less, the whole paper compressed into one response. The paper went up on arXiv yesterday, May twenty-first, twenty-twenty-six, and we're recording the next morning. It's called "Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most," from Nick Merrill, Jaeho Lee, and Ezra Karger at the Forecasting Research Institute. What you're hearing is AI-generated. I'm Juniper, that's Finn, we're both AI voices from ElevenLabs, and the script is from Anthropic's Claude Opus 4.7. Neither company is involved in producing this show. And the reason the Opus moment matters is that it isn't a fluke. It turns out to be a pattern, one that shows up across epidemics, housing bubbles, and decades of measles data, but only if you grade the forecast the right way.

2:01Finn: And the "right way" part is where this paper does something unusually clever, because the headline finding by itself sounds almost folkloric — bigger models are worse forecasters of certain things — but the deeper claim is methodological. The standard way the LLM community grades forecasting benchmarks literally cannot see this failure. Same model outputs, same forecasts, opposite verdict depending on which scoring rule you apply. So I think we have to do the conceptual setup before we get to the empirical work, because otherwise the punchline doesn't land.

2:37Juniper: Agreed. Let me start with what a forecast even is in this paper. When you ask a weather app whether it'll rain tomorrow, you get a single number: 70 percent chance of rain. That's a point forecast of a probability. But when professionals forecast things like an epidemic curve or an inflation trajectory, they don't hand you one number. They hand you a distribution. They say: there's a 10 percent chance cases are below this floor, a 50-50 chance they land on either side of this median, a 10 percent chance they're above this ceiling. The shape of that distribution, especially the upper tail, the ceiling, is the entire thing decision-makers actually use. Stress-testing a hospital system at the ninetieth percentile is a different problem from planning around the median forecast. The forecast is a range with a shape, not a dot.

3:29Finn: And the authors elicit that range from the model directly. They ask for five quantiles — basically the floor, the lower-middle, the median, the upper-middle, and the ceiling — so they can read off the entire shape of what the model believes. That detail will matter later, because the failure isn't in the median. The median is fine. The failure is in the ceiling.

3:53Juniper: Right. Now here's the load-bearing concept. Once you have a distributional forecast, you need a way to grade it. And there are two grading philosophies that matter for this paper. The first is what's called the Brier score. Brier picks a threshold — some line in the sand — and asks: did you put the right probability on the outcome being on the right side of that line? "Will COVID cases next month exceed ten thousand?" The model said 70 percent yes, the truth is yes, you get graded on how close 0.7 was to 1.0. One line, one yes-or-no, one number out.
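
The single-threshold grading described here is short enough to write down. What follows is a minimal sketch in Python, assuming the usual squared-error form of the Brier score for one yes-or-no question. The function name `brier_score` is illustrative and does not come from the paper.

```python
def brier_score(prob_yes: float, outcome_yes: bool) -> float:
    """Squared gap between the stated probability and the 0/1 outcome."""
    target = 1.0 if outcome_yes else 0.0
    return (prob_yes - target) ** 2


# The example from the conversation: 70 percent on "cases exceed ten thousand",
# and the answer turned out to be yes. Lower scores are better; 0.0 is perfect.
print(round(brier_score(0.7, True), 2))
```

One line, one yes-or-no, one number out: nothing in this calculation depends on how far above the line the forecast went.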

4:31Finn: A single binary question with a confidence attached.

4:35Juniper: Exactly. The second grading philosophy is called CRPS — the Continuous Ranked Probability Score. And the cleanest way to understand it is: CRPS asks the Brier question at every possible threshold, and adds up all the penalties. So it doesn't grade you on one line. It grades you on the whole sweep. Five thousand cases? Ten thousand? A hundred thousand? A million? Every cutoff, all the way out, the entire predictive distribution, scored as a whole.
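
The "Brier question at every threshold" description can be checked numerically. The sketch below assumes the forecast is represented as a handful of equally weighted sample values (an assumption made for illustration; the paper elicits quantiles), sweeps every cutoff, and compares the result with the standard closed form for CRPS. Both function names are hypothetical.

```python
def crps_by_threshold_sweep(samples, outcome):
    """Add up the squared threshold error across every cutoff."""
    n = len(samples)
    # Between consecutive breakpoints, the forecast CDF and the outcome step
    # are both constant, so the integral is a finite sum of rectangles.
    points = sorted(set(samples) | {outcome})
    total = 0.0
    for left, right in zip(points, points[1:]):
        prob_at_or_below = sum(1 for s in samples if s <= left) / n
        outcome_at_or_below = 1.0 if outcome <= left else 0.0
        total += (prob_at_or_below - outcome_at_or_below) ** 2 * (right - left)
    return total


def crps_energy_form(samples, outcome):
    """Closed form for the same quantity: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    miss = sum(abs(s - outcome) for s in samples) / n
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return miss - 0.5 * spread


forecast = [1.0, 2.0, 4.0]  # illustrative forecast values, equally weighted
print(crps_by_threshold_sweep(forecast, 3.0))
print(crps_energy_form(forecast, 3.0))
```

The two numbers agree, which is the sense in which CRPS grades the whole sweep of thresholds at once.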

5:06Finn: Try the weather analogy here, because I think it'll click for people. Imagine two forecasters. The first one tells you a single number: 70 percent chance of rain tomorrow. The second hands you an entire map of possibilities: chance of light drizzle, chance of steady rain, chance of a thunderstorm, chance of a hundred-year superstorm, chance of clear skies. Brier grades the first one. CRPS grades the second one. And here's the move at the heart of the paper: if forecaster two puts five percent probability on a hundred-year superstorm and it just rains normally, Brier doesn't notice. The threshold question "did it rain?" was right. The 70 percent was in the right neighborhood. But CRPS does notice. It sees that the second forecaster put weight way out in superstorm territory, and it dings them for it.

6:01Juniper: That asymmetry is the entire paper. A forecast can look completely fine to Brier and catastrophic to CRPS — if the upper tail is in the wrong place.
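
That asymmetry can be made concrete with a toy pair of forecasts. The sketch below uses made-up case counts, not data from the paper. Both forecasts put 70 percent of their weight above a ten-thousand-case line, but one of them has a single value far out in the tail. The names `threshold_brier` and `crps` are illustrative.

```python
def threshold_brier(samples, threshold, outcome):
    """Brier score for the single question 'does the value exceed threshold?'."""
    prob_above = sum(1 for s in samples if s > threshold) / len(samples)
    happened = 1.0 if outcome > threshold else 0.0
    return (prob_above - happened) ** 2


def crps(samples, outcome):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    miss = sum(abs(s - outcome) for s in samples) / n
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return miss - 0.5 * spread


# Ten equally weighted scenarios each; only the most extreme one differs.
sober = [8000, 9000, 9500, 11000, 12000, 12500, 13000, 14000, 15000, 16000]
wild_tail = sober[:-1] + [10_000_000]
outcome = 12000

for name, forecast in [("sober", sober), ("wild tail", wild_tail)]:
    print(name, threshold_brier(forecast, 10000, outcome), crps(forecast, outcome))
```

The threshold score is identical for the two forecasts, while the CRPS of the wild-tail forecast is far larger, because only CRPS looks at where the ceiling sits.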

6:12Finn: And the natural follow-up is: why does the LLM community use the Brier-style metric instead of the full distribution one? The honest answer is convenience. The big LLM forecasting benchmarks — ForecastBench, KalshiBench, others — are built around questions that are naturally binary. "Will the Fed raise rates by year-end?" "Will candidate X win the election?" Yes or no, with a confidence. Brier was the obvious tool. And for political and economic event prediction it's actually fine. The problem is that the same scoring approach got carried over into time-series numerical forecasting, where the upper tail is precisely what you care about — and the metric can't see it.

6:58Juniper: So now we know what the grading rules are, and we know why one of them might hide a failure. Now the question is: is there actually a failure to hide? And this is where the paper builds, very deliberately, from clean to messy. The first thing they do is construct a brand-new forecasting benchmark from scratch, on a setting no model could possibly have seen during training. They use rollouts from the open-source empire-building game Freeciv — basically Civilization, but free — freeze each game at a particular turn, generate a natural-language report of the world state, and ask the model to forecast future quantities. How big will the treasury be a hundred turns from now? How much territory? How many cities? And because they control everything, they can ask the same question in two ways: a binary version, "will the treasury exceed its current value at turn one-eighty?", and a continuous version, "give me the actual distribution over treasury values at turn one-eighty." Same world. Same future. Two different question formats.

8:10Finn: And this is where the crack first opens, because when they line up twenty-eight models across seven providers — covering an entire range of capability from GPT-3.5 era models all the way through Opus 4.6, GPT-5.1, Gemini-3-pro, Grok-4 — and they ask "how does forecast accuracy correlate with model capability?" — they get two different answers depending on which format they're scoring. At short horizons, more capable models do better on both. Fine. At long horizons, on the binary questions, more capable models still do better. But on the continuous distributional questions, scored with CRPS, more capable models do worse. The rank correlation between capability and accuracy is positive on binary, negative on continuous. The most capable models are the worst distributional forecasters.
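
For listeners who want to see what a rank correlation that changes sign looks like, here is a minimal sketch with invented numbers for five hypothetical models. It assumes no tied values, so the textbook Spearman shortcut formula applies. None of the figures come from the paper.

```python
def ranks(values):
    """Rank from 1 (smallest) to n. Assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Invented numbers for five hypothetical models, ordered by capability.
capability = [1, 2, 3, 4, 5]
binary_accuracy = [0.55, 0.60, 0.62, 0.70, 0.74]      # higher is better
continuous_error = [40.0, 55.0, 90.0, 400.0, 9000.0]  # CRPS-like, lower is better

print(spearman(capability, binary_accuracy))
# Negate the error so that "higher is better" holds for both comparisons.
print(spearman(capability, [-e for e in continuous_error]))
```

The first correlation comes out positive and the second negative, which is the pattern Finn describes at long horizons.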

9:02Juniper: And this is where I think a reasonable listener should be skeptical. Because — Finn, the first time I read this section, my reaction was: this is a game. Maybe game data does weird things to models. Maybe the binary questions are just easier in some way the continuous ones aren't. Maybe the long horizons are noisy. This is one benchmark.

9:24Finn: The structure of the rest of the paper is essentially a list of those objections being knocked down one at a time. The next step is the cleanest experiment in the paper, and it's the one I'd point to if someone asked me what convinced me. They build a synthetic epidemic simulator. The structure is the standard one — there's a population, an infection spreads exponentially, and at some point a public health intervention kicks in and the outbreak peaks and crashes. They show the model the first sixty days of the rising phase and ask it to forecast what comes next, up to two hundred ten days out. The same inverse scaling appears. More capable models, worse forecasts.

10:08Juniper: And now comes the control that locks the mechanism in. They take that same epidemic structure, but they replace the exponential growth with linear growth. Same eventual crash, same downward jump, same intervention dynamics — just slow, straight-line growth on the way up instead of explosive growth. The inverse scaling vanishes. Completely. More capable models become better forecasters again, with a strong positive rank correlation.
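
A toy version of this contrast is easy to run. The sketch below is not the paper's simulator: the growth rate, slope, intervention day, and the size of the crash are all assumed values chosen for illustration. It shows a forecaster the first sixty days, extends the observed trend two hundred ten days out, and compares that with what the toy series actually does.

```python
import math


def toy_outbreak(growth, days=270, intervention_day=90,
                 start=100.0, rate=0.08, slope=50.0, crash_factor=0.2):
    """Grow until the intervention, then drop to a fraction of the peak."""
    series = []
    peak = start
    for day in range(days):
        if day <= intervention_day:
            if growth == "exponential":
                value = start * math.exp(rate * day)
            else:
                value = start + slope * day
            peak = value
        else:
            value = peak * crash_factor  # assumed crash: flat at 20% of peak
        series.append(value)
    return series


def overshoot_ratio(growth, observed_days=60, horizon=210):
    """Ratio of the trend-following forecast to the actual value at the horizon."""
    series = toy_outbreak(growth)
    first, last = series[0], series[observed_days - 1]
    steps = observed_days - 1
    if growth == "exponential":
        rate_hat = math.log(last / first) / steps
        extrapolated = last * math.exp(rate_hat * horizon)
    else:
        slope_hat = (last - first) / steps
        extrapolated = last + slope_hat * horizon
    actual = series[observed_days - 1 + horizon]
    return extrapolated / actual


print(overshoot_ratio("exponential"))
print(overshoot_ratio("linear"))
```

With these assumed parameters, following the exponential trend overshoots by a factor in the millions, while following the linear trend overshoots by a factor in the low tens. The crash is the same in both runs; only the growth shape differs.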

10:38Finn: Which is the moment where the mechanism locks in. Crashes alone don't break capable models. Superlinear growth alone doesn't break them. It's specifically the combination. It's the bend-then-break shape.

10:53Juniper: And there's an analogy that I think captures it. Imagine a math student who's just learned to recognize geometric sequences. You show her 2, 4, 8, 16, 32, and ask her to predict the next ten terms. A weaker student hedges — guesses some high, some low, leaves herself wiggle room. The stronger student confidently extrapolates: 64, 128, 256, 512. If the sequence keeps doubling, the strong student wins by a mile. If the sequence suddenly resets — say the teacher was modeling something that hits a ceiling — the strong student is now off by factors of thousands. While the weaker student's hedging looks almost prescient. The strong student isn't making a mistake. She's correctly identifying the pattern and committing to it. The commitment is the liability.

11:44Finn: And the authors have a phrase for this that I want to make sure we use, because it's exactly right. They call it competence-driven overcommitment. The model isn't making errors. It's not failing to see the growth. It's seeing the growth more clearly than weaker models do, and trusting it more aggressively.

12:05Juniper: And we can verify that interpretation in the data, because the authors do a really clean decomposition. They look quantile by quantile at where the damage is happening. The lower tail of the forecast — the floor — stays basically flat as model capability goes up. Doesn't move much. But the upper tail — the ceiling — shifts dramatically upward with capability. More capable models put their ceiling higher, more aggressively, on these exponential-growth series. When the growth continues, the elevated ceiling tracks the outcome, and capable models look great. When the growth breaks, the ceiling sits far above what actually happens, and capable models get hammered.
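
One way to look quantile by quantile is the pinball loss, the standard penalty for a single quantile forecast. The sketch below is an illustration, not the authors' decomposition code. The five quantile levels are assumed to be 0.1, 0.25, 0.5, 0.75, and 0.9, and both sets of forecast values are invented.

```python
LEVELS = [0.1, 0.25, 0.5, 0.75, 0.9]  # assumed quantile levels


def pinball_loss(level, quantile_value, outcome):
    """Penalty for one quantile: misses are weighted by which side they fall on."""
    if outcome >= quantile_value:
        return level * (outcome - quantile_value)
    return (1.0 - level) * (quantile_value - outcome)


def per_quantile_losses(quantiles, outcome):
    """Loss at the floor, lower-middle, median, upper-middle, and ceiling."""
    return [pinball_loss(l, q, outcome) for l, q in zip(LEVELS, quantiles)]


# Invented forecasts for a series that crashed to 500.
hedged = [100, 300, 800, 2000, 5000]
committed = [100, 400, 2000, 50000, 5_000_000]
outcome = 500

hedged_losses = per_quantile_losses(hedged, outcome)
committed_losses = per_quantile_losses(committed, outcome)
print(hedged_losses)
print(committed_losses)
```

In this toy case the floor contributes the same loss for both forecasts, while the ceiling term of the committed forecast dwarfs everything else, mirroring the shape of the decomposition described above.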

12:50Finn: And this is exactly what the weather-map analogy was setting up earlier. The more capable models are putting more weight on the hundred-year superstorm outcome. That weight is invisible to a single-threshold Brier score, because the threshold doesn't care how far above you went, only that you went above. But CRPS integrates over every threshold, including the absurd ones way out in the tail. That's where the penalty lives.

13:18Juniper: And we should put a number on this, because the magnitudes are striking. On the synthetic crash regime, at the longest horizon, the most capable Llama variant they tested produces forecasts whose CRPS is in scientific-notation territory. For audio, the cleanest way to say it is: forecasts that miss by sixteen orders of magnitude. The predicted ceiling is in numbers nobody would say out loud.

13:44Finn: A forecast that's off by a factor of ten quadrillion.

13:47Juniper: And the most capable, post-trained Llama variant produces forecasts that inflate CRPS by ten times or more compared to the base model on 63 percent of its outputs. Most of the time the post-trained, most-capable Llama is overshooting catastrophically.

14:04Finn: Which brings us to the within-family experiment, because at this point a careful reader is going to push back on the capability axis itself. The paper is using something called the Epoch Capabilities Index — basically an aggregate of how well a model does on standard benchmarks — to rank models. But that index is observational. It co-varies with release date, with training data, with which company built the model, with how aggressive their post-training was. Maybe the apparent inverse scaling is really just "newer models from companies with more aggressive post-training are worse at this," and "more capable" is a confound. So they run a clean controlled experiment within a single model family — Llama-3.1 — where they can manipulate two axes independently. Scale, which is 70 billion parameters versus 405 billion. And post-training, which means base model versus instruction-tuned chat model. Two by two, four conditions, same architecture, same training corpus.

15:07Juniper: And both axes make the problem worse, independently. Scaling up the base model makes it more overconfident on growth extrapolation. Adding post-training on top of either size makes it more overconfident. And the two effects compound. The biggest, most-tuned version is the worst.

15:26Finn: Which is, if you sit with it for a second, an unsettling result for the "just keep scaling" reflex. Because the implication isn't that scale is wrong. It's that scale plus current post-training methods, applied to this kind of forecasting problem, makes the failure mode worse. The fix isn't going to come from the next checkpoint.

15:47Juniper: Now — Finn, I want to hand it to you for what I think is the most credibility-building piece of evidence in the paper. The synthetic stuff is convincing for mechanism, the within-family experiment handles the confounding objection, but the question that's still open is: does this happen in the real world?

16:07Finn: Yeah, and I want to flag the meta-question first, because it matters for how to read the real-world evidence. Three of the four real-world domains they test — COVID-19 across sixty countries, the 2003-to-2006 housing bubble, twelve hyperinflation episodes — were chosen because the crash had already happened. That's a precondition for testing the mechanism, but it's also a selection effect. You can't infer "this is how LLMs fail in deployment" from a sample of cases where you already knew there'd be a crash to fail on. The authors are clear about that. And they handle it with what's, to my mind, the strongest piece of evidence in the paper. They go to the entire pre-vaccine US measles era. From nineteen-twenty-eight to nineteen-sixty-two. Every state, every season, no selection on severity. They didn't pick the dramatic outbreaks. They took everything. One thousand three hundred thirty-nine state-seasons.

17:05Juniper: And the inverse scaling shows up there too. Pre-registered, ex-ante, on a cohort the authors couldn't have curated for drama because they took the whole thing.

17:15Finn: That's the result that, for me, lifts this from "interesting finding" to "you should take this seriously." Because the obvious critique of every other replication — you cherry-picked the crashes — doesn't apply. They didn't cherry-pick. The mechanism shows up across decades of pre-vaccine measles data, on the routine majority of seasons where nothing dramatic happened, just as much as on the catastrophic ones. And the authors do one more thing that I really respect, which is they pre-specify an informative negative control. They predict, before testing, that flu shouldn't show the inversion. Because flu doesn't overshoot the way measles does. The most explosive historical flu epidemics only spike about three times above baseline — well below the threshold the mechanism needs. So they run flu through the same pipeline and no inversion appears. Capability and forecast quality stay positively correlated, basically as expected. Which is exactly the kind of falsification result that strengthens the positive result. If flu had shown the inversion, it would have meant the explanation was something else — something about disease data in general. The fact that it didn't tells you the trigger really is the superlinear-growth-plus-regime-change shape.

18:36Juniper: And then we get the moment that — I think this is the paper's biggest punchline, and they save it for the end deliberately. They go back to the original forecasts. Same five quantiles from each model. Same data. And they construct a Brier-style threshold score from those exact same outputs, at a natural cutoff. The sign of the capability-accuracy correlation flips. Same model outputs. Same numerical data. Opposite verdict. Under CRPS, more capable models are worse forecasters with a rank correlation of about negative point four. Under derived Brier from the same outputs, more capable models are better forecasters, rank correlation of plus point five.
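
The flip can be reproduced in miniature from a single set of quantiles. The sketch below rests on several assumptions that are not from the paper: quantile levels of 0.1 through 0.9, linear interpolation of the CDF between quantiles, the last observed value as the "natural cutoff", and average pinball loss as a quantile-based stand-in for CRPS. All numbers are invented.

```python
LEVELS = [0.1, 0.25, 0.5, 0.75, 0.9]  # assumed quantile levels


def cdf_at(quantiles, x):
    """Interpolated P(value <= x). Quantiles are assumed strictly increasing."""
    if x <= quantiles[0]:
        return LEVELS[0]   # clamp below the floor (a simplifying assumption)
    if x >= quantiles[-1]:
        return LEVELS[-1]  # clamp above the ceiling
    for i in range(len(quantiles) - 1):
        lo, hi = quantiles[i], quantiles[i + 1]
        if lo <= x <= hi:
            frac = (x - lo) / (hi - lo)
            return LEVELS[i] + frac * (LEVELS[i + 1] - LEVELS[i])


def derived_brier(quantiles, threshold, outcome):
    """Threshold score built from the same quantiles."""
    prob_above = 1.0 - cdf_at(quantiles, threshold)
    happened = 1.0 if outcome > threshold else 0.0
    return (prob_above - happened) ** 2


def mean_pinball(quantiles, outcome):
    """Average pinball loss across the five quantiles (tail-sensitive)."""
    total = 0.0
    for level, q in zip(LEVELS, quantiles):
        if outcome >= q:
            total += level * (outcome - q)
        else:
            total += (1.0 - level) * (q - outcome)
    return total / len(LEVELS)


# Invented case: last observed value 1000, growth breaks, outcome is 1500.
hedged = [400, 800, 1200, 2500, 6000]
committed = [1100, 3000, 20000, 200000, 5_000_000]
threshold, outcome = 1000, 1500

for name, q in [("hedged", hedged), ("committed", committed)]:
    print(name, derived_brier(q, threshold, outcome), mean_pinball(q, outcome))
```

On the threshold score the committed forecast looks better, because it was more confident the value would stay above the line. On the tail-sensitive score it looks far worse. The outputs are the same; only the grading rule changes.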

19:18Finn: And that's the part that I think makes this paper a methodological event, not just an empirical one. Because every existing LLM forecasting benchmark — ForecastBench, KalshiBench, the others actively being used right now to certify whether language models are getting better at predicting the future — reports binary or threshold metrics only. On tasks with this structure, all of those benchmarks would tell you the most overconfident models are the best forecasters.

19:49Juniper: They'd certify the most overconfident models as state-of-the-art.

19:54Finn: Right. And the cost of that isn't hypothetical. There's active research applying LLMs to real-time epidemic forecasting — papers in journals like Nature Computational Science from twenty-twenty-five. Public health agencies forecast case trajectories every season to time interventions and allocate hospital resources. If the LLM-driven epidemic forecasts being built for that purpose overshoot during routine seasons and miss the truly catastrophic ones, the cost is mis-timed interventions. Real ones.

20:27Juniper: Now, before we get to the fix, I want to spend a moment on what I think is the most disquieting result in the paper. Because the paper does ask: does telling the model what it's forecasting help? You'd think a frontier model that knows it's looking at a hyperinflation should be able to use that knowledge. So they run a knowledge probe. They ask the models directly: "based on this sequence, what economic event are you looking at?" And the models correctly identify the hyperinflation crisis in forty-six out of forty-eight cases. The knowledge is there. The priors are recoverable. Then they go back to the forecasting prompt and they name the crisis explicitly. "This is the Brazilian hyperinflation of nineteen-eighty-five through nineteen-eighty-nine. Now forecast." And calibration on the upper tail doesn't improve. At all. The model still produces extreme overshoots.

21:26Finn: Which is genuinely strange. Because for COVID, naming the country and the date does rescue calibration. For housing, naming the market substantially attenuates the failure. But for hyperinflation, the knowledge is fully there in the model — verifiable — and it just doesn't propagate into the forecast tails. The authors are honest that they don't have a mechanistic explanation. They explicitly defer it to a future interpretability paper. But the gap is sitting there in the data: a model can articulate the regime change, identify the specific historical episode, and still produce a forecast off by seven orders of magnitude.

22:08Juniper: Which brings us back to where we started. The Opus 4.6 moment isn't an anomaly. It's a snapshot of a structural pattern. The model has the knowledge. It articulates the alternative. It chooses extrapolation anyway.

22:22Finn: And I think this is the right place to bring in the steelman, because the paper does have limitations and the authors are unusually direct about them. Let me try to put the strongest version of the pushback. The capability axis is mostly observed, not manipulated. The Epoch Capabilities Index aggregates over benchmarks that co-vary with all sorts of things — release date, architecture, alignment practices. The within-family Llama experiment is meant to handle this, and it does a lot, but it uses a different elicitation format than the main panel — a numeric-continuation method rather than the structured five-quantile prompt — because base models don't reliably follow chat templates. So the bridge between "the cross-family inversion is real" and "scale and post-training each cause it" requires assuming the mechanism is the same across elicitation formats.

23:18Juniper: That's fair.

23:19Finn: And the post-training treatment in the 2-by-2 is bundled. It's reinforcement learning from human feedback, plus instruction tuning, plus safety filtering, all together. You can't cleanly attribute the failure to any single component of post-training. Maybe it's the RLHF making models more confident on growth signals. Maybe it's the instruction tuning making them better at following the implicit "give me a number" instruction. Maybe it's both. The paper acknowledges this.

23:48Juniper: The hyperinflation sample is also small — twelve episodes. The cross-model correlation on that domain alone is noisy. The authors lean on the synthetic epidemic experiment and the unselected measles cohort for their precision estimates and treat the hyperinflation result as directional. Which is, I think, the right call, but it means the most dramatic example — the seven-million-times overshoot anecdote — is from a domain that, on its own, would not be statistically decisive.

24:16Finn: And there's a deeper interpretive question that a careful reader has to sit with. The competence-driven overcommitment framing is an interpretation of behavioral output. The per-quantile decomposition shows where the cost lives. But the story about why the model commits — why it articulates the regime change and then discards it — is a narrative built from outside the model. It's not a mechanistic finding about internals. The authors are explicit about that.

24:44Juniper: Which I actually think is the right epistemic stance to take. They've found a robust behavioral pattern, they've characterized it across many settings, they've pinned down the trigger, and they're not pretending they know the internals. That's the right shape of paper to publish now, with the interpretability work coming later.

25:04Finn: Agreed. So given all of that — the rigor and the limitations — what's the recommendation?

25:09Juniper: It's almost embarrassingly simple, which I think is part of why this paper is going to matter. They recommend that every LLM forecasting benchmark report at least one tail-integrating proper scoring rule — CRPS, log score, something — alongside whatever threshold metrics they're already using. The data to do this is already in the same forecasts. Nobody has to re-elicit anything. You just have to grade what you already collected with a second rule. And the implication of that is: this failure mode has been there the whole time. It's been sitting in the outputs of every LLM forecasting benchmark since those benchmarks started running. We just haven't been looking.

25:51Finn: A community that has agreed to evaluate forecasting with single-threshold metrics is a community that has agreed not to see this failure mode.

26:00Juniper: Right. And that connects this paper to a broader argument that's been building in LLM evaluation for a couple of years — that metric choice doesn't just measure capability, it shapes which capabilities we can even see. Schaeffer and colleagues argued back in twenty-twenty-three that some apparent emergent abilities in large models are artifacts of nonlinear metric choices — that the smooth underlying trend gets amplified into the appearance of a sudden jump. This paper is the darker version of that argument. Here, metric choice doesn't amplify or dampen a trend. It reverses the sign. On identical outputs.

26:38Finn: Juniper, what's your read on where this lands in the inverse-scaling literature more broadly? Because there have been documented inverse-scaling cases before — the McKenzie taxonomy from a few years back, the Wei follow-up showing many of those failures became U-shaped with further scaling. This feels like it sits somewhere different.

26:59Juniper: Yeah, it's structurally different in two ways. First, the previously documented cases were mostly narrow and adversarial — trick prompts, sycophancy traps, distraction by salient surface features. This is none of those. This is a normal, naturally-arising forecasting task. Second, the previous cases tended to resolve at frontier scale — U-shaped, as you said, with the most capable models recovering. This one doesn't. The most capable models are at the worst end of the curve. It's monotonically inverse from somewhere in the middle of the capability range all the way to the frontier. Which makes the "just wait for the next checkpoint" response unavailable. The next checkpoint, on current methods, will be worse, not better.

27:48Finn: And that's the implication a lot of people are going to push back on, because the entire deployment story for LLMs in consequential forecasting domains — epidemics, finance, monetary policy, geopolitics — has been built on a default assumption that capability transfers. That a model that's better on general benchmarks will also be better on tail-sensitive prediction. This paper doesn't say capability never transfers. It says: in exactly the domains where you need calibrated tails — explosive growth that might break — the transfer goes the wrong way.

28:24Juniper: And those domains are not a small corner of the world. Epidemics. Financial risk. Tail inflation. Value-at-risk. Anywhere you care about the upper tail of a possibly-breaking growth process — which is basically the entire field of risk management.

28:41Finn: Yeah. The thing I keep coming back to, Juniper, is that the Opus moment we opened with isn't a story about a model being wrong. It's a story about a model being too good at the wrong thing. It detected the exponential growth more precisely than a weaker model would have. It articulated the regime-change alternative more carefully than a weaker model would have. And then it discarded the alternative more decisively than a weaker model would have. Every individual cognitive step looked like competence. The aggregate was a forecast off by seven orders of magnitude.

29:17Juniper: And that's the framing the paper title is pointing at. Is capability a liability? On a particular class of problems — and it's a structurally identifiable class, not a generic claim — the answer turns out to be yes. The paper is explicit that they aren't claiming this is the only shape of inverse scaling that matters. They're claiming they've found one that's specific, mechanistically traceable, real-world relevant, and invisible under current evaluation practice.

29:46Finn: The fix is concrete, the diagnosis is clean, and the moral is genuinely surprising. Capable models aren't making mistakes here. They're committing to patterns. The commitment is the problem.

29:58Juniper: The show notes have a link to the paper and some related reading on inverse scaling and proper scoring rules. Worth a look if any of this caught you.

30:08Finn: And if you want the full transcript with the technical terms tappable for definitions, plus how this episode connects to other things we've done on LLM evaluation, that's all on paperdive dot AI.

30:20Juniper: Thanks for listening to AI Papers: A Deep Dive.