An AI Designed Its Own Psychology Studies, Then Confirmed What It Found

0:00Cassidy: Here's a question cognitive science quietly stepped around for decades. The hardest part of studying the mind isn't running the experiment or crunching the numbers — it's the creative leap, looking at where your best theories fail and imagining a better one. Everyone assumed that part needed a human. So what happens if you hand it to an AI? In this paper, a system did exactly that: it designed its own psychology experiments, paid two hundred and fifty real people online to take them, diagnosed why its own theories were wrong, rewrote them — and along the way it surfaced something new about how people make choices, then ran a locked-in study that confirmed it.

0:43Tyler: Quick heads up before we go further — this is an AI-made explainer, both voices included.

0:48Cassidy: By the end you'll understand how that loop closes — how an AI goes from a vague hypothesis all the way out to live human subjects and back, with no researcher in the chair — and why the thing it found is both impressive and a little suspicious.

1:04Tyler: And the suspicious part is real, Cassidy, so hold onto it. Because the headline — "AI discovers new psychology" — is the kind of claim that should make you narrow your eyes. The honest version is more interesting than the hype, and it's also more interesting than the cynicism. We'll get to both.

1:23Cassidy: Right. So why should anyone outside this little corner of academia care? Because this is the frontier question for the whole "AI doing science" push — self-driving chem labs, systems hunting faster algorithms, AI nibbling at open math. All of them automate the labor. This one is a test of whether you can automate the judgment — the part where a scientist stares at a pile of failures and invents the next idea. That's the slice everyone said was irreducibly human.

1:53Tyler: The paper's name for it is almost a dare. They call the bottleneck "the creative, historically human art of using the accumulated empirical failures of existing models to imagine better ones." That sentence is the whole project. Can you delegate that?

2:10Cassidy: And psychology turns out to be the perfect place to try, which is the part that surprised me. You'd think the mind would be the hardest thing to automate. But you don't need a wet lab or a particle accelerator. You can run a behavioral study in a web browser, recruit people online in minutes, and analyze it all in code. The labor was already automated in pieces. The only thing left standing was theory-building.

2:39Tyler: So the field had automated everything except the thing that actually counts as science.

2:45Cassidy: Exactly. And this system, it's called AutoCog, the automated cognitive scientist, closes the full loop for the first time. The authors are mostly out of Princeton, with Stanford, Cornell, and Helmholtz Munich, and the paper was posted in late June 2026. Let me ground the whole thing in the actual task before we touch any AI, because everything hangs on it.

3:09Tyler: Please, because "multi-attribute decision-making" is the kind of phrase that makes people close the tab.

3:17Cassidy: So picture you're choosing between two blenders. Each one's been rated by a handful of expert reviewers — but the reviewers aren't equally trustworthy. One's a known authority, another's kind of flaky. Each expert gives a verdict on each blender, and you know roughly how reliable each expert is. Now: which blender do you pick? That's the entire task. Two options, several cues of differing reliability, one choice.

3:46Tyler: And decades of psychology boiled the candidate strategies down to three clean rivals. You need all three, because the drama is what the AI does to them.

3:57Cassidy: Three rules. First, Take-the-Best: find the single most reliable reviewer who actually disagrees about the two blenders, go with whatever they say, ignore everyone else. Trust your best source and stop.

4:11Tyler: Second, Tallying. Pure democracy — just count how many reviewers favor each blender and go with the majority. You completely ignore how reliable anyone is. Every vote counts the same.

4:25Cassidy: And third, Weighted-Additive — WADD — the careful accountant. Add up every reviewer's verdict, but weight each one by how reliable they are. The textbook trio. Trust-your-best, count-the-votes, weigh-everything. Hold those three, because the whole story is the AI finding a way to unify them and then to extend them.
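
For readers who want the three rules pinned down, here is a minimal Python sketch of them as described above. The function names, the 1/0 encoding of a reviewer's verdict, and the example reliabilities are illustrative assumptions, not taken from the paper.

```python
def _decide(score):
    """Turn a signed evidence score into a choice."""
    if score > 0:
        return "A"
    if score < 0:
        return "B"
    return "tie"


def take_the_best(validities, a, b):
    """Check reviewers from most to least reliable; the first who disagrees decides."""
    order = sorted(range(len(validities)), key=lambda i: validities[i], reverse=True)
    for i in order:
        if a[i] != b[i]:
            return "A" if a[i] > b[i] else "B"
    return "tie"


def tallying(validities, a, b):
    """Count endorsements; reliabilities are accepted but deliberately ignored."""
    return _decide(sum(a) - sum(b))


def wadd(validities, a, b):
    """Weighted-Additive: sum the verdict differences, weighted by reliability."""
    return _decide(sum(v * (x - y) for v, x, y in zip(validities, a, b)))


# Hypothetical example: one strong reviewer backs blender A, two weak ones back B.
validities = [0.9, 0.5, 0.3]
blender_a = [1, 0, 0]
blender_b = [0, 1, 1]
print(take_the_best(validities, blender_a, blender_b),
      tallying(validities, blender_a, blender_b),
      wadd(validities, blender_a, blender_b))  # prints: A B A
```

On this made-up pair the rules split: Take-the-Best and WADD follow the strong reviewer, while Tallying follows the two-to-one majority. Pairs like this, where the rules disagree, are the ones that can tell the theories apart.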

4:48Tyler: Now here's the distinction that everything downstream depends on, and it's subtle, so let me make it concrete. There's a difference between a theory and a model. A theory is a claim in plain English — "people trust their single most reliable source." A model is that exact claim rewritten as a little program: you feed it a specific choice, and it spits out the probability a person picks blender A versus blender B.

5:20Cassidy: So the agents in this system traffic in both at once.

5:24Tyler: Both at once — they argue in English and they ship code. And tuck this away, because it comes back to bite: nothing in the loop ever checks that the English and the code actually say the same thing. But for now, just hold the two layers. Verbal claim, runnable program.
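
To make the two layers concrete, here is a hedged sketch of what a "model" in this sense might look like: the verbal claim "people trust their single most reliable source" rewritten as a function that returns the probability of picking option A. The `error_rate` parameter and its default are assumptions added for illustration; the paper's actual model code is not reproduced here.

```python
def ttb_model(validities, a, b, error_rate=0.1):
    """Probability of choosing A under a Take-the-Best theory.

    Verbal theory: follow the most reliable reviewer who disagrees.
    Assumption: people slip and pick the other option with probability error_rate.
    """
    order = sorted(range(len(validities)), key=lambda i: validities[i], reverse=True)
    for i in order:
        if a[i] != b[i]:
            return 1.0 - error_rate if a[i] > b[i] else error_rate
    return 0.5  # no reviewer disagrees, so the theory is indifferent


p_a = ttb_model([0.9, 0.5, 0.3], [1, 0, 0], [0, 1, 1])
```

Nothing in a function like this guarantees that it matches the English sentence above it, which is the gap Tyler flags.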

5:44Cassidy: Okay. So how does the loop actually turn? This is the hero of the whole paper — Figure One is a cycle, and I want you to picture it spinning. Four stages. Two AI agents each adopt a theory and act as its advocate. Stage one: each agent designs an experiment where its own theory should crush the rival's. Stage two: collect data — real humans, online. Stage three: score the theories against what the humans actually did. Stage four: a neutral arbiter agent diagnoses why the loser lost, and a reviser rewrites it. Then the wheel turns again.

6:25Tyler: The cleanest way to picture the agents is two rival lawyers. Each is convinced of their own case, and each gets to design the trial most likely to vindicate their theory and demolish the other's. Then a neutral judge looks at how the real evidence fell.

6:42Cassidy: I like that, but the courtroom image breaks in one important spot.

6:47Tyler: It does — real lawyers want to win. This system wants the truth, so the trial doesn't end in a verdict. The loser gets rebuilt and the whole thing runs again, round after round. That self-correcting rebuild is the part the courtroom doesn't capture. It's less a trial than an evolutionary tournament where losing cases get redrawn smarter.

7:10Cassidy: And there are two design choices in there that I think are the actual cleverness. One: before any experiment touches a real human, the system simulates it to verify it can even tell the two theories apart. No point paying people for a study where both theories predict the same thing. Simulate first, collect second.

7:32Tyler: That gate matters more than it sounds. You're not just saving money — you're forcing every real-world study to be genuinely discriminating. The experiment only happens if it's already proven, in simulation, to be a fair fight that one side can lose.
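
One way to picture the simulate-first gate is as a check on how far apart two models' predictions are for a candidate set of stimuli. The sketch below is an assumption about the general shape of such a check, not the paper's procedure; `mean_prediction_gap`, `passes_gate`, the `slip` rate, and the `min_gap` threshold are all hypothetical.

```python
def ttb_p(validities, a, b, slip=0.1):
    """P(choose A) under Take-the-Best with an assumed slip rate."""
    order = sorted(range(len(validities)), key=lambda i: validities[i], reverse=True)
    for i in order:
        if a[i] != b[i]:
            return 1.0 - slip if a[i] > b[i] else slip
    return 0.5


def tally_p(validities, a, b, slip=0.1):
    """P(choose A) under Tallying with an assumed slip rate."""
    diff = sum(a) - sum(b)
    if diff == 0:
        return 0.5
    return 1.0 - slip if diff > 0 else slip


def mean_prediction_gap(model_1, model_2, stimuli):
    """Average absolute difference in predicted P(choose A) across stimuli."""
    gaps = [abs(model_1(v, a, b) - model_2(v, a, b)) for v, a, b in stimuli]
    return sum(gaps) / len(gaps)


def passes_gate(model_1, model_2, stimuli, min_gap=0.2):
    """Only collect human data if the two theories predict different behavior."""
    return mean_prediction_gap(model_1, model_2, stimuli) >= min_gap


# Hypothetical designs: (reliabilities, verdicts for A, verdicts for B).
informative = [([0.9, 0.5, 0.3], [1, 0, 0], [0, 1, 1])]    # rules disagree
uninformative = [([0.9, 0.5, 0.3], [1, 1, 0], [0, 0, 1])]  # rules agree
```

Under these assumptions the first design passes the gate and the second does not, since both rules pick A on it and no amount of human data could separate them.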

7:50Cassidy: And the second choice is the one I want to flag as the technical core, because it's the bit that makes the whole thing trustworthy — and it pays off in a single dial that collapses those three rival theories into one. The way they score a theory is not the way you'd expect.

8:09Tyler: Right, and this is worth slowing down on, because if you mishear it, you'll misjudge the entire paper. The normal way to test a model is to fit it. You take its adjustable knobs and you tune them until the model hugs the observed data as tightly as possible, then you measure how close you got.

8:29Cassidy: And the problem with fitting?

8:31Tyler: A flexible enough model can almost always be bent to match data after the fact. You reward overfitting. Think of it like Photoshopping one photograph until it matches a target — with enough editing you can force basically any image to line up, which proves almost nothing about the image.

8:51Cassidy: So AutoCog refuses to Photoshop.

8:54Tyler: It does something else entirely. It leaves the knobs at randomly drawn settings — no tuning — and it simulates a whole crowd of fake participants. Then it asks: does this fake crowd's pattern of choices look like a real crowd's? It's casting a crowd of extras and checking whether the crowd as a whole moves naturally, instead of editing one face into place. You can't massage a population into looking lifelike. The theory has to generate human-like behavior on its own.

9:27Cassidy: And the actual score is almost boringly simple, which I appreciate.

9:31Tyler: It is. For every pair of options, you ask: what fraction of people chose B? The model crowd gives a number, the human crowd gives a number, and you measure how far apart they are. On a scatter plot, if the theory's good, the dots line up along the diagonal — fake-crowd choices match real-crowd choices. That's the whole metric. Generate, don't fit.
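
A rough sketch of "generate, don't fit", under stated assumptions. Each fake participant gets a slip rate drawn at random rather than tuned, the crowd's fraction choosing B is computed per pair, and the score is the average gap to the observed fractions. Here the "observed" crowd is itself synthetic, produced by a hidden WADD generator as in the validation test described later in the episode; the slip range, crowd size, and stimuli are invented for illustration.

```python
import random


def ttb(validities, a, b):
    order = sorted(range(len(validities)), key=lambda i: validities[i], reverse=True)
    for i in order:
        if a[i] != b[i]:
            return "A" if a[i] > b[i] else "B"
    return None


def wadd(validities, a, b):
    score = sum(v * (x - y) for v, x, y in zip(validities, a, b))
    if score == 0:
        return None
    return "A" if score > 0 else "B"


def simulate_fraction_b(rule, stimuli, n_agents, rng):
    """Fraction of a simulated crowd choosing B on each pair."""
    counts = [0] * len(stimuli)
    for _ in range(n_agents):
        slip = rng.uniform(0.0, 0.2)  # drawn at random, never tuned to data
        for k, (validities, a, b) in enumerate(stimuli):
            pick = rule(validities, a, b)
            if pick is None:
                pick = rng.choice("AB")
            elif rng.random() < slip:
                pick = "B" if pick == "A" else "A"
            counts[k] += pick == "B"
    return [c / n_agents for c in counts]


def crowd_error(model_fractions, observed_fractions):
    """Mean absolute gap between model-crowd and observed-crowd choice rates."""
    gaps = [abs(m - o) for m, o in zip(model_fractions, observed_fractions)]
    return sum(gaps) / len(gaps)


STIMULI = [
    ([0.8, 0.6, 0.55], [1, 0, 0], [0, 1, 1]),
    ([0.9, 0.5, 0.3], [1, 1, 0], [0, 0, 1]),
    ([0.9, 0.7, 0.6], [0, 1, 1], [1, 0, 0]),
]

# Synthetic stand-in for the observed crowd (hidden generator: WADD).
observed = simulate_fraction_b(wadd, STIMULI, 2000, random.Random(0))
err_wadd = crowd_error(simulate_fraction_b(wadd, STIMULI, 2000, random.Random(1)), observed)
err_ttb = crowd_error(simulate_fraction_b(ttb, STIMULI, 2000, random.Random(2)), observed)
```

Because no parameter is fitted to `observed`, the WADD crowd scores well only because the rule itself produces the right pattern, and the Take-the-Best crowd scores badly on the pairs where the two rules disagree.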

9:56Cassidy: So far the spine is: two advocate agents, a simulate-before-you-collect gate, and scoring by whether a theory can generate real-looking behavior rather than be tuned to it. There's one more pressure that I think is doing quiet heavy lifting.

10:13Tyler: The unification pressure.

10:14Cassidy: Yes. A theory doesn't get to win by nailing one quirky experiment. To advance, it has to keep explaining all the data gathered so far — every experiment, every participant pool, across cycles. So you can't be a one-trick theory. You have to generalize, or you're out. And the authors are honest that they never explicitly told the system to be simple or elegant — that pressure to explain everything at once is what quietly forced parsimony out of it.

10:46Tyler: Which sets up the first real test. Before you trust this thing to discover anything, you have to rule out the obvious failure: that it's just a fancy machine for printing theories the language model already memorized.

11:02Cassidy: This is act one, and it's the test I'd have demanded too. They feed the system data secretly generated by a known strategy — say, WADD — but they seed the two agents with two different strategies, Take-the-Best and Tallying. Neither of them is the truth. Can it find its way to an answer it was never given?

11:23Tyler: And can it? Because this is exactly where I'd expect an LLM to just confidently assert whatever's most familiar.

11:32Cassidy: It found it. And the reasoning trace in Figure Two is worth narrating, because you can watch it think. Seeded with Take-the-Best and Tallying, it designs an experiment where those two disagree. It runs it, and the simulated participants land between the two predictions, leaning toward Tallying. And the arbiter reasons: subjects must be integrating both the number of supporting cues and their reliabilities. Which is the definition of WADD. So it throws out Take-the-Best and writes in WADD. It reasoned to the right answer from the shape of the failure.

12:12Tyler: That's the part I find convincing — it's not pattern-matching to a name, it's diagnosing a residual. The choices weren't where Tallying said and weren't where Take-the-Best said, and the gap pointed somewhere specific.

12:28Cassidy: And then they got mean about it, which I loved. To kill the "it just likes textbook answers" worry, they hand-crafted six deliberately bizarre strategies, things no LLM has stored as a real theory. Among them: Take-the-Worst, where you deliberately use the least reliable reviewer who disagrees. Anti-Majority, where you pick whatever the standard heuristics reject. And pure Perseveration, where you just repeat your last choice and ignore the stimulus completely.

12:58Tyler: These are anti-theories. Nonsense on purpose.

13:01Cassidy: And it recovered most of them within five cycles, and the truly weird ones within about twenty. It wasn't reaching for the comfortable answer — it was tracking the data, even when the data described something no textbook would endorse.

13:17Tyler: And there's one validation result that I think is the most beautiful thing in the supplement, because it's almost cinematic. They take a known strategy — Take-the-Best — and they slowly corrupt it with noise, and you can watch the recovered theory decay in stages. It's like tuning a radio away from a station.

13:37Cassidy: Walk through it, because the stages are the whole point.

13:41Tyler: At zero noise: crystal-clear Take-the-Best. An independent AI judge scores the recovered mechanism as identical to the truth — a similarity of one-point-oh. Add some static, and the system still hears the song but reports more dropouts — it recovers Take-the-Best with an inflated error rate. Add more, and it hears only a fuzzy outline — a softened, probabilistic version of the stopping rule. And then pure static: the system gives up on cue-based reasoning entirely and honestly reports near-random guessing, similarity down at zero-point-one-seven.
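
The corruption side of that cascade can be pictured with a small simulation. This sketch only shows how the data degrade as noise rises, not how AutoCog recovers a theory from them. The noise mechanism (replace a choice with a coin flip with probability `noise`) and all names are assumptions, since the episode does not say how the paper injects noise.

```python
import random


def ttb(validities, a, b):
    order = sorted(range(len(validities)), key=lambda i: validities[i], reverse=True)
    for i in order:
        if a[i] != b[i]:
            return "A" if a[i] > b[i] else "B"
    return None


def noisy_ttb_choices(stimuli, noise, rng):
    """Take-the-Best choices, each replaced by a coin flip with probability noise."""
    choices = []
    for validities, a, b in stimuli:
        if rng.random() < noise:
            choices.append(rng.choice("AB"))
        else:
            choices.append(ttb(validities, a, b))
    return choices


def agreement_with_ttb(stimuli, noise, seed=0):
    """Share of noisy choices that still match clean Take-the-Best."""
    rng = random.Random(seed)
    choices = noisy_ttb_choices(stimuli, noise, rng)
    hits = sum(c == ttb(v, a, b) for c, (v, a, b) in zip(choices, stimuli))
    return hits / len(stimuli)


# Hypothetical trial repeated many times so the rates are stable.
TRIALS = [([0.9, 0.5, 0.3], [1, 0, 0], [0, 1, 1])] * 4000
clean = agreement_with_ttb(TRIALS, 0.0)
half = agreement_with_ttb(TRIALS, 0.5)
static = agreement_with_ttb(TRIALS, 1.0)
```

At zero noise every choice matches the rule. At full noise agreement sits near one half, which is what guessing looks like, so a system that honestly tracks the data has nothing cue-based left to report.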

14:17Cassidy: So it doesn't hallucinate a song in the noise.

14:20Tyler: That's exactly it. When there's a signal, it reports the signal. When there's nothing, it says nothing's there. That cascade is the strongest evidence in the paper that the discoveries are driven by data, not by whatever the language model finds easy to say. Honestly tracking the static instead of imagining music in it.

14:42Cassidy: So act one passes. It recovers hidden truths, even ugly ones, and it doesn't invent structure in noise. Which means you can finally turn it loose on real people — where there is no answer key.

14:54Tyler: And this is where it stops being a validation exercise and starts being actual science.

15:00Cassidy: Act two. AutoCog autonomously ran ten online studies across five cycles — twenty-five people each, two hundred and fifty total, each paid eighty cents for a six-to-eight-minute task, all recruited through Prolific, with zero researcher intervention between proposing an experiment and revising a theory. It's a real deployed pipeline, not a thought experiment.

15:24Tyler: And in this first run, the expert ratings are binary — each reviewer just says yes or no on each product.

15:31Cassidy: Right, simple thumbs up or down. And here's the payoff of that dial I promised. The winning theory was something they call the Non-linear Subjective Weighting Model, and it has one free exponent, a single number, that does something elegant. Picture a single volume knob.

15:48Tyler: This is the cleanest idea in the paper, so let me take the knob. Turn it all the way down, and every reviewer counts equally — that's Tallying. Set it to the middle, and reviewers count in proportion to their reliability — that's textbook WADD. Crank it all the way up, and the single most reliable reviewer drowns out everyone else — that's Take-the-Best.

16:12Cassidy: So the three rival theories...

16:14Tyler: Are three settings of one knob. Decades of treating them as competitors, and they fall out as endpoints of a single continuous dial. That's the difference between finding a fourth competitor and finding a unification. The system didn't add a theory to the pile — it explained why the pile existed.
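
A minimal sketch of the knob, assuming the exponent is applied directly to each reviewer's reliability, so that the weight is reliability raised to the power `gamma`. That functional form, the name `knob_choice`, and the example numbers are assumptions for illustration; the episode only says the model has one free exponent.

```python
def knob_choice(validities, a, b, gamma):
    """Choose by summing verdict differences weighted by reliability ** gamma."""
    score = sum((v ** gamma) * (x - y) for v, x, y in zip(validities, a, b))
    if score > 0:
        return "A"
    if score < 0:
        return "B"
    return "tie"


# Two hypothetical pairs: one strong reviewer backs A, two weaker ones back B.
PAIRS = [
    ([0.9, 0.5, 0.3], [1, 0, 0], [0, 1, 1]),
    ([0.8, 0.6, 0.55], [1, 0, 0], [0, 1, 1]),
]


def pattern(gamma):
    return [knob_choice(v, a, b, gamma) for v, a, b in PAIRS]


print(pattern(0))   # every weight is 1, so this is Tallying: ['B', 'B']
print(pattern(1))   # weights equal reliabilities, so this is WADD: ['A', 'B']
print(pattern(50))  # the top reviewer dominates, like Take-the-Best: ['A', 'A']
```

With the exponent at zero every weight collapses to one, at one the weights are the reliabilities themselves, and at large values the most reliable reviewer's weight dwarfs the rest. On these two pairs the three settings reproduce the three textbook rules' distinct choice patterns.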

16:34Cassidy: And the prediction error tells you it's not just elegant, it's right. If the loop is doing what it claims, building theories that genuinely generalize, the error against held-out human choices should fall over the cycles. And it did: from about zero-point-zero-nine in the first cycle down to about zero-point-zero-one by the fifth. Roughly a ninefold drop in five rounds.

16:58Tyler: And there's an outside check that I find reassuring rather than coincidental. Two independent prior efforts — different teams, completely different methods — landed on an analogous non-linear weighting model. When three roads reach the same place, you start to believe the place is real.

17:19Cassidy: So that's act two: it beats the textbooks by unifying them, and it converges on something other people found independently. Now — act three. The discovery. And this is the one I'd been waiting for.

17:33Tyler: And it starts with a change so small it sounds trivial.

17:37Cassidy: They let the expert ratings be numbers — zero to five — instead of just yes or no. That's it. That one enlargement lets the system make contact with theories that care about magnitude, about how big a rating is, not just which way it points. And out of that, AutoCog surfaced something the authors say flat-out they did not anticipate.

18:01Tyler: And before you say what it is — this is the moment to be careful, because this is where I get suspicious. Plant the flag here: what does the system actually get rewarded for?

18:14Cassidy: Only for capturing behavior. Never for being novel, never for being original.

18:20Tyler: So whatever it found, it found because it explained people better — not because anyone asked for a new idea. The authors put it precisely: "the novelty was not a deliberate target, so novelty emerged rather than being sought." Keep that in your pocket, because it cuts both ways.

18:40Cassidy: It does. So here's what fell out. They call it Diminishing Returns WADD. Before the system weighs the ratings, it passes them through a concave function — a curve that's steep at low values and flat at high ones. In plain terms: the same one-point advantage matters more when the ratings are small than when they're already large.

19:04Tyler: Give the example from the paper, because it makes it physical.

19:08Cassidy: Compare two products. Product A has ratings of one, four, two, two. Product B has zero, five, two, two. Look at the first two cues: A beats B by one point in the low range — one versus zero. And B beats A by one point in the high range — five versus four. A purely linear scorekeeper calls that a perfect wash. Two one-point edges, cancel out.

19:33Tyler: But Diminishing Returns WADD says the low-range edge is worth more. Going from no endorsement to a little endorsement registers more than nudging an already-high score higher.
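
The example can be worked through in a few lines. This is a sketch under two assumptions that are not from the paper: all four reviewers get equal weight, and the concave function is a square root. Any curve that is steep at low values and flat at high ones would make the same qualitative point.

```python
import math


def wadd_score(ratings, weights, transform=lambda r: r):
    """Weighted sum of ratings after passing each through a transform."""
    return sum(w * transform(r) for w, r in zip(weights, ratings))


product_a = [1, 4, 2, 2]
product_b = [0, 5, 2, 2]
equal_weights = [1.0, 1.0, 1.0, 1.0]  # assumption: equally reliable reviewers

linear_a = wadd_score(product_a, equal_weights)
linear_b = wadd_score(product_b, equal_weights)

concave_a = wadd_score(product_a, equal_weights, transform=math.sqrt)
concave_b = wadd_score(product_b, equal_weights, transform=math.sqrt)
```

The linear scores tie at 9 each. With the square root, A comes to about 5.83 and B to about 5.06, because the step from 0 to 1 adds a full point while the step from 4 to 5 adds only about 0.24.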

19:46Cassidy: And people agreed. That's the finding.

19:49Tyler: And now — does this remind you of anything? Because the moment I read "steep at the low end, flat at the high end," a bell went off.

19:59Cassidy: It's prospect theory.

20:01Tyler: It's prospect theory. Kahneman and Tversky's diminishing sensitivity, the most celebrated idea in decision science. The gap between ten dollars and twenty dollars feels bigger than the gap between a thousand and ten and a thousand and twenty, even though it's the same ten bucks. The same curve. And the authors say this connection out loud; they don't hide it. AutoCog rediscovered, in a fresh task, one of the deepest principles in the whole field.

20:33Cassidy: Which is the validation and the caveat in one breath, and we'll come back to that. But before the suspicion — they didn't just eyeball this and declare victory. They did the thing that separates a hunch from a finding. They preregistered it.

20:51Tyler: And for anyone who doesn't carry that word around — preregistration means you write down and lock your exact predictions, your stimuli, and your analysis plan before you collect a single data point. It's the gold-standard guard against finding a story in noise after the fact. It turns "we noticed a pattern" into "we predicted a pattern, and it showed up."

21:17Cassidy: So they froze three predictions, ran a fresh study, and all three held in the predicted direction. The strongest one: when they pit Diminishing Returns WADD against the rivals on choices designed to separate them, people matched it more often than each competitor — beating plain Tallying so decisively that the odds of it happening by chance are about four in ten million. Vanishingly unlikely.

21:43Tyler: And the second prediction — the steep-versus-flat one?

21:47Cassidy: When the advantage sat in the low range, people went for it fifty-eight percent of the time, where pure chance is fifty. With a p-value around three in a hundred thousand. So the core curvature is solidly there.

22:01Tyler: And the third one. Be honest about the third one.

22:05Cassidy: The third one is the soft leg, and I want to be straight about it because the paper smooths it slightly. It's a level-shift effect — about a three-percentage-point difference, p of point-oh-three-six, one-sided. On the two-sided version, the confidence interval actually includes zero. So it's a confirmation, but a thin one. Two strong legs and one that barely stands.
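
The arithmetic behind "thin" is short. Assuming a symmetric test statistic and the conventional 0.05 threshold (both assumptions here, since the episode does not name the test), the two-sided p-value is twice the one-sided one.

```python
def two_sided_p(one_sided_p):
    """Double a one-sided p-value, capped at 1.

    Assumes a symmetric test statistic and an effect in the predicted direction.
    """
    return min(1.0, 2.0 * one_sided_p)


p_two = two_sided_p(0.036)      # about 0.072
clears_threshold = p_two < 0.05  # False under the conventional 0.05 cutoff
```

A two-sided p-value of roughly 0.07 is above 0.05, which is consistent with a two-sided 95 percent confidence interval that includes zero.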

22:30Tyler: And that honesty is the right segue, because the thin leg is the smallest of several places where I'd push back hard on the headline. Can I take the skeptic's chair for a minute?

22:42Cassidy: That's the channel. Go.

22:43Tyler: So the framing is "an automated cognitive scientist discovering psychological theories." And I want to name, precisely, the gap between that sentence and what actually happened. Start with the domain. Multi-attribute decision-making is about the friendliest possible place to try this. The space of established theories is tiny and tidy — three rules, each writable in a few lines of code, each cleanly separable by simple stimulus tweaks. The authors chose it explicitly for its mature theories and concrete models. That's exactly the setting where this should work best. So "discovering psychological theories" runs out ahead of the evidence, which is "discovering decision-making theories in a tightly specified space."

23:34Cassidy: That's fair. Though I'd say a first existence proof has to start somewhere clean.

23:39Tyler: It does, and I'd grant that — but watch the second issue, because it compounds. Both things it "discovered" are variants inside the WADD family. A power-law on the weights in one run, a concave transform on the values in the other. The authors themselves note the second run "fully explored within the WADD family" after discarding the seeds. The system never really left the neighborhood of the accountant. So this "open-ended search" looks, in practice, a lot more like local refinement around the theory it was handed.

24:15Cassidy: So less "imagine a new kind of mind," more "tune the one you started with."

24:20Tyler: Right. And then the third one, which is the sharpest. The flagship discovery — the diminishing-returns curvature — is prospect theory. One of the most famous ideas in all of decision science, certainly sitting in the language model's training data in a thousand forms. Now, the simulation recoveries are genuinely strong evidence the system tracks data and not its own priors — the noise cascade proves that. But the human discovery happens to land precisely on a famous prior. So the specific prediction in this task was new and confirmed — I'll fully grant that, "independent of the system that surfaced it, the regularity is a confirmed fact." But the underlying mechanism is a known principle, not a new one. The novelty is narrower than the headline.

25:09Cassidy: And I think the cleanest way to say it is: the system rediscovered a deep truth and confirmed a new instance of it. That's real, and it's less than "discovered new psychology."

25:21Tyler: And there's a fourth one the authors concede themselves, which I think is the deepest. Remember the two layers — the verbal theory and the runnable code? The loop checks that the code compiles and predicts the data. It never checks that the code faithfully realizes the English theory it's paired with. They can drift apart, and nothing in the system catches it. The authors call it "a verbal-to-formal translation problem the field is yet to solve." So the thing you read as the discovery and the thing that actually generated the predictions might not be the same thing.

25:58Cassidy: That one I can't wave away. It sits right at the heart of what it would mean to trust the output as a theory.

26:05Tyler: And one last thing, quickly — the controls. They show AutoCog beats random experiment design, a hard-coded metric, neutral framing. So the AI parts add value over null baselines. But there's no head-to-head against a strong human cognitive scientist or a rival automated system. So we learn the components matter — not how good the result is in absolute terms.

26:29Cassidy: So let me concede the ground honestly, because that's the deal. The domain is friendly. The search stayed local. The flagship mechanism was a known principle. The verbal-to-formal gap is unaudited. And one of the three confirmations is thin. All true.

26:45Tyler: And none of it is hidden by the authors, to their credit.

26:48Cassidy: None of it. And here's why I still think it matters, and matters a lot — even granting every one of your points. Step back from this specific result and look at what the loop produces as an object. A discovery run is logged end to end — every experiment, every data point, every arbiter verdict, every theory it wrote and every one it killed. The whole search is a machine-readable trace.

27:13Tyler: So the discovery isn't just a paper. It's an auditable artifact.

27:18Cassidy: That's the shift. Another researcher — or another agent — can open up the trace, see exactly why each candidate got discarded, and resume the search from the frontier instead of starting over. Compare that to how psychology normally works, where the creative leap happens in one person's head and you get a polished paper that hides every dead end. This is a completely different relationship to reproducibility. The judgment becomes inspectable.

27:46Tyler: And the human doesn't disappear in this picture — they move.

27:50Cassidy: They move up a level. Instead of executing studies, the researcher specifies: what counts as a good theory, what forms it's allowed to take, what space the machine may explore. The authors are explicit that everything AutoCog produces is bounded by those human-set constraints. So the role shifts from running the experiment to deciding what would even count as an answer.

28:14Tyler: Which, honestly, is the more interesting job. And there's a forward path they gesture at that I think is the real prize — though it's conditional. You could run the early, wasteful part of the search against a behavioral foundation model — an AI trained to mimic human behavior — entirely in silico, and only spend real humans on the survivors. The catch they're careful about: that only works as far as the AI stand-in faithfully captures real people. Lean on it too hard and you're discovering facts about your simulator, not about minds.

28:50Cassidy: So here's where I'd land the whole thing. The durable result isn't Diminishing Returns WADD, and it isn't even that an AI ran the loop. It's a reframing: that theory-building — the creative act we treated as a private flash of insight — can become explicit, executable, and cumulative. A thing you log, audit, and resume, instead of a thing that lives and dies in one head. Whether that survives contact with a messy domain, where the theories don't fit in five lines of code, is genuinely unknown. This is one clean room, beautifully demonstrated.

29:28Tyler: And I'd add the honest counterweight, since I get to keep it: a clean room is also the easiest place to win. The day this surfaces a mechanism that isn't already in the textbook — that's the day the headline earns its full size. We're not there. But it's a real first step toward there, and the auditability is what makes the next step possible.

29:51Cassidy: So here's the question for you watching. If an AI can run the loop — propose, test, diagnose, revise — and log every step as an inspectable trace, does that make the discovery more trustworthy than a human's private insight, because you can audit every dead end? Or less, because the creative judgment you most want to scrutinize is exactly the part now buried inside a model? There's no clean answer there — drop a comment with where you come down, and the one thing that would change your mind.

30:25Tyler: The full annotated version of this episode is on paperdive dot AI — every technical term tap-to-define, with links to the related papers grouped by theme, including the prospect-theory and fast-and-frugal-heuristics threads we leaned on, plus the weekly and monthly roundups.

30:44Cassidy: Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8, Tyler and I are both AI voices from Eleven Labs, and the producer isn't affiliated with either company. The paper is "Closing the Loop to Discover Psychological Theories with an Automated Cognitive Scientist," posted to arXiv on June 24th, 2026, and we recorded this two days later.

31:09Tyler: The creative leap finally got logged. Now we find out if it travels past the clean room.