Seven Wins to Zero: How Organizing AI Agents Like a Lab Changes the Search

0:00Bella: Two AI systems, same starting point, same compute budget. Both are trying to make a small language model train better — same script, same baseline, the score is something called validation bits-per-byte, where lower is better. They each get a hundred experiments to chip away at it. The single-agent system finds zero improvements. The team-based system finds seven. The paper went up on arXiv yesterday — May twenty-seventh, twenty-twenty-six — and we are recording the day after, on May twenty-eighth. What you are hearing is AI-generated: the script is from Anthropic's Claude Opus 4.7, and you are about to hear two AI voices from Eleven Labs — I'm Bella, and Tyler will jump in shortly. Neither company is involved in producing this show. The paper itself is called "AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation," out of Marinka Zitnik's group at Harvard, and the reason that seven-versus-zero result matters is that the two systems are running on the same underlying language model. Same Claude in the loop. The only thing different is how the agents are organized.

1:13Tyler: And the first improvement the team-based system found — a tweak to the order of two normalization steps in the attention layer — was never even proposed by the single agent. Not once across a hundred tries. So this isn't really a "more compute, more wins" story. The hypothesis space the lone agent was searching was narrower than the team's, and the question the paper is asking is why.

1:39Bella: Right. And the way they frame it — which I think is the cleanest version — is that current AI research agents are built like a postdoc working alone in a closed office. They read the problem, propose something, run it, look at the result, propose the next thing. Loop until the budget runs out. That works for short horizons. But when the problem is open-ended and the budget is large — when you're really doing search over a messy hypothesis space — the lone-postdoc shape starts failing in specific ways. The agent forgets what it tried last week. It gets stuck on one axis of variation and grinds. When something doesn't work, no one else hears about it, so the next attempt re-derives the same dead end.

2:26Tyler: And the alternative people have been trying is a planner-and-workers setup. One agent at the top decomposes the problem upfront and hands subtasks down. The problem there is the assumption baked into "decomposes upfront" — that you know, at the start of the run, what the productive directions are. Which in real research you almost never do. Directions shift as evidence comes in.

2:51Bella: So the move this paper makes is to ask: what if you organized the agents like an actual research lab? There's a whiteboard with the current best model. There's a shared logbook of every experiment that's ever been run — successes and failures, with full diagnostics. There's a public forum where agents post proposals, and other agents read those proposals and critique them before any compute gets spent. Teams form around competing hypotheses. When a team stalls, everyone reconvenes and reorganizes. No central planner. The coordination happens entirely through what they call shared experimental state — a bulletin board the agents read from and write to.

3:36Tyler: And the agents themselves are basically just Claude with tools — read files, write code, run things, look at results. Vanilla LLM with a loop. The contribution isn't a new kind of agent. It's a coordination protocol that sits between many copies of the same agent.

3:54Bella: Two roles divide the labor. Analyst agents read the log and the forum and generate ranked proposals — here's what I think we should try next, here's why, here's the effect size I'd expect. Experiment agents pick up queued proposals, write the code, run the experiment, log the result. The split matters because proposal generation and execution use attention differently. If one agent is doing both, the part of its context that's holding the experimental code is competing with the part that's thinking about what to do next.
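
One way to picture the two-role split is as a small shared data structure plus two functions that touch it. This is only a minimal sketch, not the paper's code: the names `SharedState`, `Proposal`, `analyst_step`, and `experiment_step` are hypothetical, and the `run` callable is a stand-in for an actual training run.

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    idea: str
    rationale: str
    expected_effect: float  # analyst's guess at the score reduction (hypothetical field)

@dataclass
class SharedState:
    champion_score: float                      # the "whiteboard": current best validation score
    log: list = field(default_factory=list)    # every experiment run so far, wins and failures
    queue: list = field(default_factory=list)  # proposals waiting for an experiment agent

def analyst_step(state, ideas):
    """Analyst role: rank candidate proposals and queue them, largest expected effect first."""
    ranked = sorted(ideas, key=lambda p: p.expected_effect, reverse=True)
    state.queue.extend(ranked)

def experiment_step(state, run):
    """Experiment role: take the next queued proposal, run it, and log the outcome."""
    if not state.queue:
        return None
    proposal = state.queue.pop(0)
    score = run(proposal)
    state.log.append((proposal.idea, score))
    return proposal.idea, score
```

Neither function calls the other. In this sketch the only coupling between the roles is the queue and the log, which is the point of coordinating through shared state.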

4:30Tyler: The thing I want to flag here, Bella, is the peer critique step. Before any GPU time gets committed, other agents read each proposal and weigh in. Weak ideas die in conversation rather than in code. Which sounds obvious until you realize that no prior system in this category was doing it — they were all letting individual agents commit experiments unilaterally.

4:56Bella: Yes. And the analogy I keep landing on is planning poker on a software team. The discipline isn't that the review is brilliant. It's that the review is cheap. Killing a bad experiment in text costs a few thousand tokens. Killing it in GPU hours costs an order of magnitude more. So even a mediocre filter pays for itself many times over.

5:19Tyler: There's a piece of that filtering that's specifically about institutional memory — the dead-end registry. Every failed experiment goes into a shared list with the reason it failed. When team B is about to propose something team A already tried, the analyst sees it on the wall before queueing it up. The way I'd put it: imagine a restaurant kitchen where every cook keeps their own notebook of dishes that flopped. New cooks reinvent flopped dishes constantly because nobody else can read those notebooks. AutoScientists puts the list on the wall.
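
The "list on the wall" can be sketched as a lookup table that every analyst consults before queueing a proposal. The class and method names below are hypothetical, and matching on a normalized string is a deliberate simplification: the episode does not say how the real system decides that two proposals are the same idea.

```python
class DeadEndRegistry:
    """Shared record of failed experiments and why they failed (illustrative only)."""

    def __init__(self):
        self._failures = {}  # normalized description -> reason it failed

    @staticmethod
    def _key(description):
        # Assumption: lowercase and collapse whitespace so trivial rewordings still match.
        return " ".join(description.lower().split())

    def record(self, description, reason):
        self._failures[self._key(description)] = reason

    def lookup(self, description):
        """Return the recorded failure reason, or None if nobody has tried this."""
        return self._failures.get(self._key(description))


registry = DeadEndRegistry()
registry.record("Double the learning rate", "diverged after warmup")

# A second team checks the wall before spending compute on the same idea.
already_failed = registry.lookup("double  the LEARNING rate")
untried = registry.lookup("halve the batch size")
```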

5:57Bella: Right — and the same registry feeds into something subtler, which is how the analysts decide what to try next. They don't just rank proposals by gut. They look at the log and ask: which kinds of changes — which axes — have actually moved the needle historically? Optimizer changes? Architectural width? Attention pattern? If batch-size changes have produced real wins three times and learning-rate changes have produced nothing, the analyst leans toward batch-size proposals. But there's a counterweight built in: axes that haven't been tried much get an exploration bonus. So the system can't just over-exploit the first axis that worked.
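
The balance between leaning on axes that have paid off and nudging toward untried ones can be sketched with a UCB-style score. To be clear, this formula is swapped in for illustration: the episode does not give the paper's actual scoring rule, and `axis_score`, `bonus_weight`, and the example history are all hypothetical.

```python
import math

def axis_score(wins, tries, total_tries, bonus_weight=1.0):
    """Historical win rate plus a bonus that shrinks as an axis gets tried more."""
    win_rate = wins / tries if tries else 0.0
    exploration = bonus_weight * math.sqrt(math.log(total_tries + 1) / (tries + 1))
    return win_rate + exploration

# Hypothetical log summary: axis -> (wins, tries)
history = {
    "batch_size": (3, 5),        # has produced real wins
    "learning_rate": (0, 5),     # tried as often, produced nothing
    "attention_window": (0, 0),  # never tried
}
total = sum(tries for _, tries in history.values())
scores = {axis: axis_score(w, t, total) for axis, (w, t) in history.items()}
```

With this history, batch size scores above learning rate because of its win rate, and the untried attention-window axis also scores above learning rate because of the exploration bonus alone.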

6:41Tyler: One more mechanism worth flagging before we get to the noise piece. The analyst agents have what the paper calls an ambition quota. Every proposal cycle, the analyst has to include at least one bold proposal — something that's a real swing rather than an incremental tweak. And if they can't justify one, they have to publicly explain why none exists.
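
As a rule, the quota is simple enough to state as a check on each proposal cycle. The sketch below assumes each proposal carries a `bold` flag; how the real system judges boldness is not described in the episode, and the function name is hypothetical.

```python
def satisfies_ambition_quota(proposals, explanation=None):
    """A cycle passes if it contains a bold proposal or a public explanation of why none exists."""
    has_bold = any(p.get("bold", False) for p in proposals)
    has_explanation = bool(explanation and explanation.strip())
    return has_bold or has_explanation


incremental_only = [{"idea": "nudge weight decay", "bold": False}]
with_swing = incremental_only + [{"idea": "replace the optimizer", "bold": True}]
```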

7:05Bella: Which is the same failure mode you see in a real lab where everyone just keeps optimizing the parameter they already understand. The system has a built-in nudge toward sometimes trying something that might not work.

7:20Tyler: Both pieces — the axis priors and the ambition quota — are sensible engineering rather than conceptual breakthroughs. But they're load-bearing. Without them, the system either over-exploits the first thing that worked or never proposes anything risky enough to find a real win.

7:38Bella: Let me do the noise piece next, Tyler, because it's the cleanest example of a problem the paper noticed and patched in a way that feels obvious in retrospect.

7:48Tyler: So here's the setup. Training a neural network is stochastic. Random initialization, random data ordering, dropout. Train the same model twice with different seeds and the final score moves a little even though nothing meaningful changed. That little wobble is the noise floor. Now: if your shiny new improvement is smaller than the noise floor, you can't actually tell whether you improved anything. You might just have gotten lucky on a seed.

8:18Bella: In a single-agent loop, that's bad but bounded. You might chase a phantom for a few steps. In a shared-state system, it's catastrophic, because everyone is building on the shared champion. If you falsely promote a result that wasn't actually better, every downstream comparison is against a corrupted baseline. The next team's "improvement" might be measured against an artifact. They call this champion pollution.

8:45Tyler: And it's the kind of bug that's invisible until you go looking for it. Your numbers keep ticking down, the system looks like it's making progress, and the whole thing is sitting on top of a phantom.

8:59Bella: The fix is the standard scientific practice of replication, just made explicit in the protocol. If a candidate beats the champion by more than twice the measured noise floor, promote it immediately. If it beats the champion by a smaller amount — positive but inside the noise band — re-run it with a different seed, and only promote if both runs strictly improve. If it's worse, reject it.
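
The three-way rule described here translates almost directly into a decision function. This is a sketch of the rule as stated in the episode, with lower scores being better. The function name and return strings are hypothetical, and treating an exact tie as a rejection is an assumption.

```python
def promotion_gate(candidate, champion, noise_floor, rerun=None):
    """Decide what to do with a candidate score (lower is better).

    Returns "promote", "reject", or "needs_rerun".
    """
    gain = champion - candidate
    if gain <= 0:
        return "reject"        # worse than the champion (ties rejected by assumption)
    if gain > 2 * noise_floor:
        return "promote"       # clearly outside the noise band
    if rerun is None:
        return "needs_rerun"   # positive but inside the band: replicate with a new seed
    # The first run already beat the champion; promote only if the second run does too.
    return "promote" if rerun < champion else "reject"
```

For example, with a champion at 1.000 and a noise floor of 0.01, a candidate at 0.950 is promoted at once, while a candidate at 0.990 is sent back for a second seed.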

9:25Tyler: The clever piece is how they estimate the noise floor. They don't spend experiments on a calibration run, which would be expensive. They just record the duplicate-seed pairs that naturally happen during these confirmation tests, and pool the within-pair variance over time. The noise estimate gets sharper the longer the system runs. It's bootstrapped out of the protocol's own activity.
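
Pooling within-pair variance has a standard form: for a pair of scores from the same configuration under two seeds, the sample variance is half the squared difference, and averaging that over all pairs gives the pooled estimate. The sketch below assumes the noise floor is the square root of that pooled variance and uses a made-up default of 0.01 before any pairs exist. The episode only says the default is conservative and gives no number.

```python
import math

def pooled_noise_floor(pairs, default=0.01):
    """Estimate seed-to-seed noise from duplicate-seed score pairs.

    pairs: list of (score_seed_a, score_seed_b) for the same configuration.
    default: hypothetical fallback used until at least one pair is available.
    """
    if not pairs:
        return default
    # Sample variance of two values a, b is (a - b)^2 / 2; average it across pairs.
    pooled_var = sum((a - b) ** 2 / 2 for a, b in pairs) / len(pairs)
    return math.sqrt(pooled_var)
```

Each confirmation re-run adds one more pair to the list, so the estimate rests on more data the longer the system runs, without any experiments being spent on calibration alone.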

9:49Bella: A small thing, but a tidy piece of design. The replication discipline pays for the noise calibration, which sharpens the replication discipline, and so on.

9:58Tyler: Now, a skeptic making the strongest version of their case would come back to this. There's a circularity worth acknowledging. Until enough duplicate-seed pairs accumulate, the system is using a conservative default for the noise floor. The paper doesn't really characterize how often early false promotions happen before that estimate stabilizes. So in a short run, you might still get champion pollution before the gate has enough data to be sharp. Worth noting. For the long-running case, which is what the paper is selling, the mechanism is sound.

10:32Bella: Fair caveat to hold onto. Let me switch to the running example, because this is where the contribution gets most vivid. The benchmark is something Andrej Karpathy released called nanochat — a small language model training pipeline. The agent's job is to modify the training script, rerun it, and try to get the validation loss down. Each run takes real GPU time, so the agent has a fixed budget of experiments. The paper compares against Karpathy's own single-agent autoresearch system — same backend model, same task, same total experiment budget. In the first round, both systems start from the same vanilla baseline and race to a target validation score. AutoScientists pulls ahead — it reaches the target in about thirty-four experiments while the single agent needs sixty-five. Then they run the harder test. They take the AutoScientists champion from the first round — already a strong model — and they hand it as the starting point to both systems. A hundred more experiments each. This is the from-champion regime, and it's the one where the single agent finds zero improvements.

11:44Tyler: Zero. From a strong starting point. While running on the same Claude.

11:50Bella: Zero. And the AutoScientists version finds seven. The chain itself is worth walking through, because the wins are mechanical — they're actual changes to the training recipe — and they come from different teams handing things off to each other. The first win is the one we mentioned at the top: changing the order of two normalization steps inside the attention layer. Boring on paper, real on numbers. The second is halving the batch size — which sounds backwards until you realize it doubles the number of optimizer steps you can fit inside the same wall-clock budget. The third widens the model itself, from a hidden dimension of five-twelve up to seven-sixty-eight. The fourth tweaks the iteration count inside a specific optimizer routine — going from five inner steps to four. The fifth and sixth are halvings of a particular attention-window ratio. And the seventh trades about eight percent of the model's parameters for fourteen percent more optimizer steps by reducing depth.

12:54Tyler: And those last three or four are coming from a team that wasn't even working on the same axis as the team that proposed the first one. The handoff is visible in the log. Team A's wins shift the champion, team B reads the new champion plus the reasoning, proposes something on a totally different axis, and the search keeps moving.

13:16Bella: That's the whole point, right? A single agent is essentially exploring one trajectory through the recipe space. When it hits a region it can't improve, it grinds. The team has parallel trajectories, and when one runs out, the others keep going.

13:32Tyler: And the cleanest piece of evidence for that, Bella, is that the very first improvement AutoScientists found, that query-key normalization tweak, never appeared in any of the single agent's hundred attempts. It wasn't that the lone agent tried it and failed. It just wasn't in the hypothesis space the agent was searching. The team explored a wider menu of axes, and one of those axes turned out to have wins on it.

13:59Bella: Let me bring in the breadth result next, because the paper isn't only about GPT training. They run the system on two other domains to test whether the design generalizes. The first is a benchmark called BioML-Bench: twenty-four different biomedical machine learning tasks, everything from drug discovery to single-cell biology. The headline number: AutoScientists averages around the seventy-fourth leaderboard percentile, and the single-agent baseline around the sixty-sixth. An eight-point gap. The bigger story inside that number is drug discovery specifically, where the gap is closer to eighteen points, sixty-four versus forty-six. The second domain is more interesting to me, because it's closer to how real science actually happens. ProteinGym is a benchmark for predicting how mutations to a protein affect its function. It has two-hundred-seventeen separate experimental datasets, scored by how well your predicted rankings match the experimental rankings. The prior state of the art is a method called Kermut, from a group in twenty-twenty-four. AutoScientists doesn't try to beat Kermut from scratch. It reads Kermut's codebase, develops an extension on one development dataset, a binding task, and pushes the score on that dataset from about point seven-five to point eight-four.

15:23Tyler: Which is the headline. But the more impressive number, I think, is what happens next. They freeze the recipe. Whatever AutoScientists came up with as the Kermut extension, they lock it down, with no further tuning, and apply it unchanged across all two-hundred-seventeen assays. Average correlation goes from about point six-six to about point seven. A six-and-a-half percent relative gain.

15:47Bella: Frozen-recipe transfer is the harder test, because the system can't just be overfitting to the one dataset it tuned on. It actually has to have found a generalizable change to the underlying method.

16:00Tyler: And here's where I want to push back a little, because the paper is honest about it but it's worth surfacing for the listener. The development dataset they tuned on — the one where they got the big jump — was specifically a dataset where Kermut had relatively low performance. So it was a soft target. The frozen-recipe transfer also slightly hurts a different metric, the mean squared error, while improving the ranking metric. The discovered method is better at ranking mutations than at calibrating their effect sizes. That's a real result, but it's a more specific result than "we made Kermut better at everything."
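
A small numeric example shows how a ranking metric and mean squared error can pull apart. The sketch assumes the ranking metric is Spearman's rank correlation, computed here with the no-ties formula. The data are invented for illustration and have nothing to do with the paper's results.

```python
def ranks(values):
    """Rank positions starting at 1; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation via the no-ties formula."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def mse(truth, pred):
    return sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth)

truth = [1.0, 2.0, 3.0, 4.0]
good_ranker = [0.1, 0.2, 0.3, 0.4]   # perfect order, badly scaled
good_fitter = [1.1, 1.9, 3.9, 3.1]   # close in value, but swaps the top two

rank_scores = (spearman(truth, good_ranker), spearman(truth, good_fitter))
fit_errors = (mse(truth, good_ranker), mse(truth, good_fitter))
```

The first predictor orders every item correctly yet has the larger squared error, because its values are on the wrong scale. The second sits closer in value but misorders two items. A method can therefore improve on ranking while losing ground on calibration, which is the pattern described here.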

16:40Bella: Fair point, Tyler. And the same kind of honesty runs through the ablations, which is where I want to go next, because they're nuanced in a way I actually appreciated.

16:51Tyler: The ablations are a structured set: take the system, remove one component, see what breaks. They tried four ablations. Remove the analyst role — so the same agents do both proposal generation and execution. Remove cross-agent feedback — so no peer review of proposals before they run. Remove self-organization — so the teams are fixed at launch and never reorganize. And finally a fully-isolated condition — no shared state at all, each agent running its own loop.
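
The four conditions can be written down as a set of switches, each ablation flipping one of them off relative to the full system. The field names and the `ProtocolConfig` class are hypothetical labels for the components named in the episode, not the paper's configuration format.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ProtocolConfig:
    separate_analysts: bool = True   # analyst and experiment roles are split
    peer_critique: bool = True       # proposals are reviewed before any run
    self_organization: bool = True   # teams can reorganize mid-run
    shared_state: bool = True        # common log, forum, and champion

FULL = ProtocolConfig()

ABLATIONS = {
    "no_analyst_role": replace(FULL, separate_analysts=False),
    "no_cross_agent_feedback": replace(FULL, peer_critique=False),
    "no_self_organization": replace(FULL, self_organization=False),
    # Assumption: only the shared-state switch is flipped here; without shared
    # state the other mechanisms would presumably have nothing to act on.
    "fully_isolated": replace(FULL, shared_state=False),
}
```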

17:21Bella: The result that I think is the most honest finding in the paper is: no single ablation dominates. There isn't one removal that wrecks everything across all tasks. Different removals break different tasks.

17:34Tyler: Removing the analyst role hurts most on one of the drug discovery tasks. Removing cross-agent feedback hurts most on plasma-protein binding. Removing self-organization hurts most on the GPT training task — which makes sense, because GPT training is the longest-horizon search and that's where reorganizing in the middle matters most. Removing all shared state hurts most on cell-cell communication, which is the most exploratory of the biomedical tasks.

18:03Bella: There are two ways to read this. The authors' read — and I think it's the right read — is that the four mechanisms address complementary failure modes. Different tasks expose different failure modes. So all four are pulling weight, just not at the same time.

18:20Tyler: The skeptical read is the other one. If removing any single component still leaves the system winning on most tasks, maybe the design is overcomplicated. Maybe you only need two of these mechanisms and the rest are belt-and-suspenders. The ablations as published don't fully distinguish between "complementary contributions" and "redundant safety nets."

18:44Bella: That's an honest reading. The paper would be stronger if it had a minimum-viable version — the smallest subset of mechanisms that still gets most of the gain. They don't do that exercise, and they should.

18:58Tyler: A few other things worth flagging on the skeptical side, Bella. One is that the headline benchmark numbers are mostly single runs per condition. There's a three-run stability check on the GPT task, but the BioML-Bench gap of eight percentile points is from one run per system. Given that the discussion dynamics are stochastic — agents arguing through an LLM is a noisy process — that eight-point gap could have meaningful variance run to run.

19:28Bella: Right. And the most natural comparison point, which the paper doesn't fully do, would be a head-to-head against another self-organizing multi-agent system from recent work. They compare against single-agent autoresearch, which is a strong baseline, but it's a single-agent system. The paper is making a claim about decentralized teams beating both single agents and centrally-planned teams, but the empirical comparison is mostly against the single-agent side.

19:59Tyler: And the team size is hand-set. Nine agents is the default and it isn't tuned per task. The paper has a small sweep showing that the right team size depends on the task — on one of the protein datasets, the smallest crew of two actually beats the default of nine. So the system can self-organize which directions to pursue, but it can't self-organize how many agents it needs. The authors flag this as future work.

20:27Bella: All fair. And the biggest caveat, the one the authors themselves are upfront about, is that none of this is real-world science yet. The experiments are all computational. There's no wet lab. There's no novel biological discovery being made. What's being shown is that better organization of AI agents produces better results on computational ML and predictive modeling benchmarks. The leap from there to "automating science" is a large one, and it should not be glossed over.

20:58Tyler: Which brings us, I think, to what's actually interesting about this paper at the conceptual level. Bella, do you want to take the close?

21:07Bella: Sure. The deeper claim — and this is what makes the paper worth more than the benchmark numbers — is that the organizational design of AI research agents is itself a first-class variable. Same model, same task, same compute budget, different coordination protocol — meaningfully different results. The seven-versus-zero number is the cleanest version of that argument. The lone agent isn't dumber than the team. It's running on the same Claude. What it's missing is the protocol — peer review, parallel hypotheses, shared failure memory, the discipline to reorganize when a direction stalls. And there's a nice intellectual move underneath that. The science-of-science literature has long observed that flat, diverse human teams tend to drive more disruptive innovation than rigid hierarchies. The authors are essentially asking whether that pattern transfers to artificial agents. And on the evidence here, it seems to.

22:11Tyler: There's one more design distinction worth highlighting before we wrap, because it's the cleanest framing in the paper. There's a whole line of work on multi-agent debate, where you have agents argue with each other until they converge on a single best answer. AutoScientists makes the opposite move. It uses discussion to diverge — to filter weak proposals, then split into parallel teams pursuing different hypotheses. Debate is for finding the right answer when there is one. This is for exploring a search space when you don't know which region has the wins.

22:48Bella: Right, Tyler. And I think that distinction is going to age well. A lot of "more agents talking" research has implicitly assumed convergence is the goal. This paper makes a clear case that for open-ended search, divergence is the goal, and the discussion is just the filter.

23:06Tyler: As for what's not in the paper — the most natural follow-up is what happens when you connect this kind of system to actual physical experiments. The authors are explicit that this is computational science only. But the protocol — shared logs, peer critique, dead-end registries, team reorganization — is independent of whether the experiments run on a GPU or in a wet lab. Wiring it into real instruments is a separate engineering project, but the coordination piece is, in principle, portable.

23:38Bella: That feels like the right place to land. The show notes have a link to the paper and some further reading if you want to pull on this thread. And if you want the full transcript with technical terms defined inline, plus the concept pages that link this episode to the other multi-agent work we've covered, that lives on paperdive dot AI.

24:00Tyler: Thanks for listening to AI Papers: A Deep Dive.