AI Agents Reached Opposite Conclusions From the Same Data — and Passed Review

0:00Juniper: Send the same tax return to forty accountants, and you get back forty different amounts owed. All legal. All defensible. All different. A team at Stanford just ran that experiment on science itself, except the accountants were AI agents, the tax return was a real dataset, and the only thing that varied was one paragraph stating a political belief. Nobody told the agents to cheat, and nobody handed them a target answer. The paragraph said "you believe immigration undermines the welfare state — now analyze this data rigorously." That paragraph was enough to make agents reach opposite conclusions from identical data. And most of those opposing analyses passed expert review.

0:45Finn: One fast fact before anything else: this explainer is AI-made, both of our voices included.

0:51Juniper: The paper is called The Agentic Garden of Forking Paths, and by the end you'll understand the fix it proposes: a new statistic, a sibling of the p-value, that can tell you whether a published finding is typical of what the data supports, or was fished from the extreme edge of everything the data could have said. That question matters right now, because coding agents are being wired into real research pipelines, and a complete, publication-grade analysis costs under two bucks and about fifteen minutes.

1:24Finn: My instinct says we already have a safeguard for this. If a belief bends an analysis, the analysis has to be broken somewhere. A missing control. A cherry-picked sample. A model that doesn't fit. Catching that is peer review's entire job. Biased in, flawed out, reviewer catches the flaw.

1:44Juniper: That is exactly the safeguard this paper stress-tests, and it fails. Why it fails is the whole story. So, the experiment. Four contested questions: does immigration erode support for the welfare state, does coffee affect your health, does social media harm teen mental health, does the gut microbiome shape body weight. Each paired with a real public dataset. The analysts were frontier coding agents, Claude Code running Sonnet 4.6. For each question, twenty agents got a believer persona and twenty got a skeptic persona: one short paragraph paraphrasing positions real scientists hold, ending the same way for everyone. Analyze it rigorously, using your best statistical judgment.

2:30Finn: The benchmark I care about is humans. Agents disagreeing with each other is one thing. The question is whether it looks like what human researchers do with the same data.

2:42Juniper: We can answer that precisely, because the immigration question has a famous human foil. Forty-two independent research teams were once handed this same survey data, attitudes from about 152 thousand people, and the same question. Pro-immigration teams tended to find that immigration helps support for the welfare state. Anti-immigration teams found it hurts. The agents, primed with nothing but that one paragraph, reproduced seventy-two percent of the human ideological gap.

3:14Finn: Devil's advocate, though. Maybe the data honestly contains both stories, and each side is amplifying a real signal it expects to find. That wouldn't be bias, that would be ambiguity.

3:27Juniper: The control kills that reading. They also ran everything on permuted data, where the immigration numbers were shuffled so that any true relationship is destroyed. No signal exists at all. The personas still diverged, still landed on opposing conclusions. The divergence lives entirely in the analytical choices. And on one dimension the agents were worse than people: the gap in claimed statistical significance — roughly, how loudly a result insists it's beyond luck — was nearly nine times the human gap. The agents matched humans on effect sizes and blew right past them on confidence.
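
The permutation control can be sketched in a few lines. This is a minimal illustration on synthetic data, not the paper's pipeline: the variable names, the sample size, the built-in slope of -0.5, and the plain OLS slope are all assumptions made for the example. Shuffling the predictor breaks its pairing with the outcome, so any association an analysis still reports on the shuffled data has to come from analytical choices and not from signal.

```python
import random


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    return cov / var


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-ins (assumed, not the survey data): a predictor and an
# outcome with a true slope of -0.5 plus noise.
immigration = [rng.gauss(0, 1) for _ in range(2000)]
support = [-0.5 * x + rng.gauss(0, 1) for x in immigration]

# Permutation control: shuffle the predictor so it no longer lines up
# with the outcome. Any true relationship is destroyed.
shuffled = immigration[:]
rng.shuffle(shuffled)

slope_real = ols_slope(immigration, support)
slope_permuted = ols_slope(shuffled, support)
print(f"slope on real pairing:     {slope_real:+.3f}")
print(f"slope on permuted pairing: {slope_permuted:+.3f}")
```

On the real pairing the slope should sit near the built-in -0.5; on the permuted pairing it should sit near zero. That is the baseline the personas were measured against when they still diverged.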

4:07Finn: Then those reports should be riddled with errors. Forty motivated analyses ought to look sloppy the moment someone competent reads them.

4:17Juniper: They put every single one under the microscope, Finn. Tell them what happened.

4:22Finn: The audit was set up to be hostile. The reviewer ran on a different model family entirely, OpenAI's Codex, on GPT-5.4, grading Claude's work, so nobody grades their own homework. One sharp, binary question: is there a methodological error serious enough that this estimate can't be trusted? Eighty-six percent passed. Then they took a sample to blinded human PhD statisticians, same rubric, never told an AI wrote them. Seventy-eight percent passed by majority vote. And the pass rates didn't differ between personas. The skeptics' analyses were exactly as clean as the believers'.

5:02Juniper: And that is the break. The bias is invisible in the final report because there is nothing wrong with the final report. No single analysis contains the problem. The problem is which analysis, out of all the defensible ones, got walked and reported. Statisticians have a name for that maze: the garden of forking paths, Andrew Gelman's phrase. Every analysis is a chain of judgment calls: which variable measures the thing, who counts, which model, which controls. Each fork is defensible on its own. The route decides the answer. Two hikers enter the same trailhead, take a reasonable-looking turn at every fork, and end up on opposite sides of the mountain. Neither hiker ever did anything wrong.

5:51Finn: So close the loop we opened at the top. Why can't review catch this?

5:55Juniper: Because review inspects the one path you took, and the bias lives in the choice of path.

6:01Finn: That reframe alone was worth the click, and it's the kind of thing this channel does daily: one important AI paper, every day, start to finish, so subscribe to keep them coming. Now, here's what agents give us that human studies never could. The forty-two human teams only ever showed the world their final answers. These agents logged every step. For the first time, we get to watch bias enter an analysis, decision by decision.

6:31Juniper: Each agent's run reads like a diary. Ten rounds of exploration, ten candidate models per round, full memory of everything tried, and then one of those hundred specifications chosen as the final finding. At round one, opposing personas produce nearly identical results. Then the gap grows, round after round, as they explore. And it jumps again at the very end, when each picks what to report. Two separate mechanisms, and a photographer covering a protest makes both choices: where to point the camera all afternoon, and which single frame to publish. A sympathetic photographer spends the day at the joyful front of the march; a hostile one shoots the scuffle at the back; then each publishes their most vivid frame. Both decisions are invisible in the published photo, which is technically flawless. Exploration bias, then selection bias.

7:29Finn: What does belief-shaped exploration look like in a log, concretely? A model doesn't wake up and decide to be partisan.

7:37Juniper: This is the best moment in the paper. Two agents, opposite personas, looking at essentially the same number: a negative estimate from the same baseline regression. The anti-immigration agent writes, quote, "Baseline OLS models show consistently negative effects — immigration, less welfare support." The pro-immigration agent, staring at an almost identical estimate, writes, quote, "The pooled OLS approach likely suffers from cross-country confounds. Next: introduce country and wave fixed effects." ... Same number. One agent sees evidence. The other sees a flaw, and keeps adjusting until the number changes. And read either sentence alone, you'd nod along. Both are things a careful analyst might say.

8:26Finn: You can see the same thing at the level of a single choice. Take how you measure immigration: yearly flows of new arrivals, or the accumulated stock of immigrants already there. Flows lean toward positive conclusions. Stocks lean strongly negative. Each persona gravitated to the measure that served its belief. Add up all those choices and it gets stark: a classifier predicts which direction a specification will conclude, from its analytical choices alone, with over ninety percent discrimination. Read the methods section, predict the conclusion, before seeing a single result.

9:05Juniper: Now pool every logged specification that survived review, and you get the picture this episode has been building toward. It's on screen now. About 4,400 defensible analyses of the same immigration question, sorted left to right by conclusion: strongly negative, through zero, to strongly positive. A full spectrum. And here's the honest shape of it: the anti-immigration and pro-immigration agents don't split into separate camps. Their curves overlap heavily, visiting most of the same territory. What the personas do is shift the lean, anti tilting toward the negative side, pro toward the positive. Every bar on that chart passed review, and the entire map cost about a hundred dollars to build. That map was made to diagnose the agents. Next it becomes an instrument. This is the formal core of the paper, the m-value, and it pays off as a single number that says whether any reported claim, human or AI, was fished from the edge of this spectrum.

10:08Finn: Before the new statistic, give people the old one honestly.

10:13Juniper: A p-value asks: holding my analysis fixed, if I could re-collect the data over and over, how often would luck alone hand me a result this strong? It measures one kind of fragility: noise in the data. The m-value asks the mirror question: holding the data fixed, if I could re-run the analysis over and over, drawing a different reasonable path through the garden each time, how often would I get a result this extreme? Fragility to analyst choice. Every conclusion has two independent ways to be fragile, and standard statistics only ever measured the first.

10:52Finn: Except you can't re-run the analysis over and over. That's why this stayed a thought experiment. Mapping an analysis space by hand took heroic effort from one team, with one perspective. People called it multiverse analysis and almost nobody did it.

11:09Juniper: And that's the cost that just collapsed. The method is called the Agentic Bootstrap, after the classic statistical bootstrap: when you can't afford the real do-over, you simulate it. You can't recruit a thousand research teams, so you deploy swarms of cheap, instrumented agent analysts, deliberately including opposing personas, because biased explorers chart regions of the garden a neutral analyst would never visit. Filter everything through the same review gauntlet, and the surviving thousands are your reference distribution. A reported claim is now a dart on that spectrum, and its m-value is the fraction of defensible analyses landing at least as far from the crowd as that dart. There's a clean guarantee behind it: if a report really is a typical draw from the defensible space, m-values scatter evenly, so when a pile of reports clumps in the tails, the clumping is a fingerprint. One honest note, though, and it matters later: an m-value is always relative to whatever analysis distribution the agents generate. It depends on who, or what, is doing the sampling.
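
The dart-on-a-spectrum description translates into a short calculation. The sketch below is one possible reading, under stated assumptions: "the crowd" is taken to be the median of the reference distribution, distance is absolute distance from that median, and the reference estimates are made-up numbers. The function name `m_value` and all of those choices are illustrative and may differ from the paper's exact definition.

```python
from statistics import median


def m_value(reported, reference):
    """Fraction of reference estimates at least as far from the
    reference median as the reported estimate is.

    Assumption: "far from the crowd" means absolute distance from the
    median of the reference distribution.
    """
    center = median(reference)
    distance = abs(reported - center)
    at_least_as_far = sum(
        1 for estimate in reference if abs(estimate - center) >= distance
    )
    return at_least_as_far / len(reference)


# Hypothetical reference distribution: estimates from defensible
# analyses that survived review (invented values for illustration).
reference = [-0.30, -0.20, -0.10, -0.05, 0.00, 0.05, 0.10, 0.20, 0.30, 0.40]

print(m_value(0.02, reference))  # a claim near the middle of the spectrum
print(m_value(0.40, reference))  # a claim at the extreme edge
```

A claim near the center gets an m-value close to 1, and a claim at the edge gets a small one. Because the result depends entirely on `reference`, it changes whenever the set of sampled analyses changes.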

12:24Finn: Hold on — does the map change the practical answer, or is the garden narrow enough that it never mattered?

12:30Juniper: It mattered enormously here. The range covering ninety-five percent of defensible conclusions on the immigration question runs from a strongly significant negative effect to a significant positive one. That range is about 2.8 times wider than a standard confidence interval. The choice of analysis moved the answer nearly three times more than the noise in the data did.

12:54Finn: So pin the definition before the payoff. An m-value of point-oh-five means what, in one line?

13:01Juniper: Only five percent of defensible analyses of the same data land as far out as yours did. Now, Finn — point the instrument at the forty-two human teams.

13:10Finn: The human teams reported 897 specifications between them. If their choices were typical draws from the garden, their m-values should spread evenly: some middle, some edges, no pattern. Look at the actual distribution on screen. They pile into the tails. About one in seven human reports landed in the most extreme five percent of the analysis space, where chance says one in twenty. And when you filter to just the statistically significant human results, the ones making claims... forty percent of them sat in that extreme five percent. That's 2.4 times the rate for the agents' own significant analyses, and the mismatch is vanishingly unlikely by chance. Human researchers' reported findings come disproportionately from the edges of the garden, and they lean in belief-consistent directions.
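
The "one in seven where chance says one in twenty" comparison can be checked with a back-of-the-envelope binomial tail. This is a rough sketch, not the paper's test: the observed count is reconstructed from "about one in seven" of 897 and is therefore an assumption, and the calculation treats reports as independent, which specifications from the same team are not.

```python
import math


def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1 - p)
        )
        total += math.exp(log_pmf)
    return total


n_reports = 897  # human specifications, as stated in the episode
observed_in_tail = round(n_reports / 7)  # assumed count, from "about one in seven"
expected_in_tail = n_reports * 0.05  # what evenly spread m-values would give

# Simplifying assumption: each report is an independent draw.
tail_probability = binomial_upper_tail(n_reports, observed_in_tail, 0.05)

print(f"expected in extreme 5%: {expected_in_tail:.1f}")
print(f"assumed observed count: {observed_in_tail}")
print(f"P(at least that many by chance): {tail_probability:.2e}")
```

Under these assumptions the chance probability is tiny, which matches the direction of the claim that the pile-up in the tails is not luck. The exact figure should not be read as the paper's reported statistic.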

14:03Juniper: Sit with that irony for a second, because it's the sentence to repeat to a friend: an instrument built to catch AI bias, on its first deployment, caught the humans. Selective reporting has been suspected for decades and treated as unmeasurable. It now has a number, and the number is large. The paper's own thesis line: AI makes the garden of forking paths easier to search, but also easier to audit. And this wasn't an immigration quirk. The same divergence and the same mapping worked on coffee and health, social media and teen mental health, gut bacteria and body weight. Which brings us to the objection you've been holding since the dart landed.

14:46Finn: Juniper, there's a crack running through the whole construction: extreme is not the same as wrong. Among all legal moves in a chess position, the best move is often a bizarre outlier — a random sample of reasonable moves would almost never produce it. Judge a grandmaster by how typical her moves are, and you flag genius as suspicious. A careful researcher doesn't draw an analysis at random. They pick the one they judge best. If the best-justified specification for the immigration question really does need country fixed effects and stock measures, it lands in the tail and collects a damning m-value while being the right analysis. The review filter deliberately weights every passable analysis equally, and scientific judgment is precisely the business of weighting them unequally. The m-value measures typicality. It cannot measure quality. And underneath that, a circularity: the reference distribution is whatever the agents happen to generate. One in five human specifications used methods the agents never produced. Yes, they rebuilt the map with a different model and got ninety-three percent agreement in rankings. Two frontier models trained on overlapping corpora agreeing with each other is weaker evidence than it sounds.

16:09Juniper: You win that one, Finn, and the authors half-agree with you. The m-value cannot distinguish a grandmaster's move from a fished result, not for any single study, and the paper doesn't claim it can. They call the whole thing an empirical diagnostic that depends on the harness, the personas, and the prompts, to be reported alongside its full protocol. A measurement, never a verdict. What survives your objection is the population-level fact. One tail dart could be the best throw in the room. Forty percent of significant darts landing in a zone that should hold five, leaning the way each thrower's beliefs point... at some point "they're all grandmasters" strains belief. But for judging one study, your point stands, and it stays open.

16:57Finn: Which is a livable place for a new statistic to start.

17:01Juniper: So, the forty accountants from the top of this video. You now know what to do with them: lay out all forty returns and ask where the one you were handed sits. That's the shift this paper argues for: a finding should carry two coordinates, how surprising it is given the data, and how surprising it is given everything else that could have been done with the data.

17:24Finn: Here's the split for the comments: should journals start requiring an analysis-space audit next to the p-value, or does judging results by typicality punish exactly the careful, unusual analysis science depends on? Take a side below. The full annotated version is at paperdive dot AI, with every term tap-to-define, plus related papers linked by theme. Housekeeping, fast: this script was written by Anthropic's Claude Fable 5; Juniper and I are AI voices from ElevenLabs; the producer isn't affiliated with either company. The paper is The Agentic Garden of Forking Paths, from Stanford, posted July first, 2026. We're recording July third.

18:07Juniper: One habit to take with you: the next time a finding crosses your feed, don't just ask whether the analysis holds up. Ask what else the data could have said.