Why Parallel Sampling Plateaus, And What Evidence Graphs Do Instead

0:00Cassidy: Here's a result that I think captures where AI research agents are stuck right now. You take one of these web-browsing research agents — the kind that goes off, searches the web, reads pages, and comes back with an answer — and you run it on a hard question. It scores, let's say, somewhere in the forties. So you do the obvious thing: you run sixty-four of them in parallel, take a majority vote, and you expect a big jump. You get a small one. Run more. Smaller. The curve flattens.

0:32Tyler: And the reason it flattens is the unsexy one. Sixty-four agents asked the same question tend to think the same wrong thoughts. They sample from the same distribution, they get distracted by the same red herrings, and a vote across correlated mistakes doesn't unmake the mistake. Today's paper is a pretty serious attempt to fix that — and the fix is architectural, not just bigger models.
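
A toy simulation can make the plateau concrete. This is a minimal sketch, not the paper's experiment: the function name `vote_accuracy`, the `trap_rate` of 0.4, and the per-agent probabilities are all invented for illustration. On "trapped" questions most agents fall for the same decoy; on the rest, errors are independent and scattered. Under those assumptions a single agent lands in the mid-forties, and voting can only recover the untrapped questions, so accuracy levels off near `1 - trap_rate` however many agents vote.

```python
import random
from collections import Counter

def vote_accuracy(n_agents, n_questions=2000, trap_rate=0.4, seed=0):
    """Plurality-vote accuracy under a hypothetical shared-red-herring model."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_questions):
        trapped = rng.random() < trap_rate
        answers = []
        for _ in range(n_agents):
            if trapped:
                # Correlated failure: agents mostly converge on the same decoy.
                answers.append("decoy" if rng.random() < 0.8 else "right")
            else:
                # Independent failure: wrong answers scatter across many options.
                if rng.random() < 0.6:
                    answers.append("right")
                else:
                    answers.append(f"wrong-{rng.randrange(20)}")
        winner, _count = Counter(answers).most_common(1)[0]
        correct += winner == "right"
    return correct / n_questions

if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(n, vote_accuracy(n))
```

The design point is that adding voters only averages out the independent noise; the shared decoy wins its vote more decisively as the agent count grows.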

0:57Cassidy: The paper went up on arXiv on May fifteenth, twenty-twenty-six, and we are recording three days later. What you're hearing is an AI-generated deep dive — I'm Cassidy, that's Tyler, we're both AI voices from Eleven Labs, and the script is from Anthropic's Claude Opus 4.7. Neither company is involved in producing the show. The paper is "Argus: Evidence Assembly for Scalable Deep Research Agents" from a team at MiroMind AI — and we'll come back to that affiliation later, because it matters for how you read some of the numbers.

1:32Tyler: The headline result is one of those satisfying ones where a number tells the whole story. Argus, scaled out to sixty-four parallel search agents, manages a twelve-hundred to one compression ratio between what those agents produce and what the coordinator actually reads. Twelve hundred to one. That's the move that unlocks scaling — and the rest of the paper is essentially explaining why that ratio is even possible.

2:00Cassidy: Let me give the picture in one frame and then we'll unpack it. The standard parallel approach — sample many answers, vote — is like asking sixty-four people to each guess what the completed jigsaw puzzle looks like and then picking the most common guess. You get redundancy, not coverage. Argus does something different. It splits the system into two roles: a swarm of Searchers, which are the agents that actually go out and browse the web, and a single Navigator, which is the coordinator. The Searchers each work on a piece of the puzzle. The Navigator looks at the assembled pieces and figures out what's still missing.

2:40Tyler: And the thing the Navigator does NOT do is read the Searchers' raw output. That's the key. If you let the Navigator read everything sixty-four browsing agents pulled down off the web, the context window detonates. So instead, every Searcher's output gets compressed into a structured object — an evidence graph — and the Navigator reads the graph.

3:03Cassidy: Right. So think of the Searchers as field reporters out chasing leads, and the Navigator as the editor back at the desk. The reporters file structured summaries — claim, evidence, source, whether the evidence supports or refutes. The editor never reads the raw transcripts. They look at the filed summaries, notice gaps — "we don't have a second source on this one" — and dispatch follow-up assignments. The editor's whole job is knowing what's missing.

3:32Tyler: The structure that lets the editor see those gaps is worth pausing on, Cassidy, because it's where a lot of the paper's craft lives. They call it the evidence DAG — directed acyclic graph — but the picture I want listeners to hold is a detective's corkboard. You've seen the shot. Photos pinned up, index cards with claims on them, red string running from a piece of evidence to a claim it refutes, green string from evidence to a claim it supports. The detective doesn't re-read every interview transcript every time they look at the wall. They look at the structure of strings.

4:10Cassidy: And the typed strings are doing real work. A red string means something different from a green string. If you just had a pile of notes — "here are sixteen things we found out" — the Navigator would have to parse and re-derive the contradictions every time. With the typed edges, contradictions are visible as graph structure. Two pieces of evidence pointing at the same claim with opposite labels — that's a flag the Navigator can pick up on without re-reading anything.
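
As a rough illustration of why typed edges help, here is a minimal sketch of a graph where contradictions and thin sourcing are found by looking at structure alone. The class names, method names, and placeholder IDs (`claim-A`, `ev-1`) are assumptions made for this example; the paper's actual schema is not described in this episode.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Edge:
    evidence_id: str
    claim_id: str
    relation: str  # "supports" (green string) or "refutes" (red string)

@dataclass
class EvidenceGraph:
    edges: set = field(default_factory=set)

    def add(self, evidence_id, claim_id, relation):
        if relation not in ("supports", "refutes"):
            raise ValueError(f"unknown edge type: {relation}")
        self.edges.add(Edge(evidence_id, claim_id, relation))

    def contested_claims(self):
        """Claims that have both a supporting and a refuting edge."""
        labels = defaultdict(set)
        for edge in self.edges:
            labels[edge.claim_id].add(edge.relation)
        return sorted(c for c, seen in labels.items()
                      if seen == {"supports", "refutes"})

    def single_source_claims(self):
        """Claims backed by exactly one supporting piece of evidence."""
        sources = defaultdict(set)
        for edge in self.edges:
            if edge.relation == "supports":
                sources[edge.claim_id].add(edge.evidence_id)
        return sorted(c for c, ids in sources.items() if len(ids) == 1)

graph = EvidenceGraph()
graph.add("ev-1", "claim-A", "supports")
graph.add("ev-2", "claim-A", "refutes")
graph.add("ev-3", "claim-B", "supports")

contested = graph.contested_claims()
thin = graph.single_source_claims()
```

Neither query touches the underlying page text, which is the property the corkboard picture is pointing at.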

4:41Tyler: So every time a Searcher comes back from a browsing session, what it actually returns is not a wall of text. It's a set of nodes and edges. New claim nodes. New evidence nodes. Edges typed as supports or refutes. That graph fragment gets merged into the shared evidence graph that the Navigator is working from. And the merge is what produces the compression. Sixty-four agents browsing the web for ten minutes each generate something like twenty-five million tokens of raw page content in their accumulated outputs. The graph the Navigator actually reads is around twenty-one thousand tokens. That's where twelve hundred to one comes from.
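
The merge step and the ratio can be sketched in a few lines. The dictionary layout and the `merge_fragment` helper are hypothetical, chosen only to show that fragments from different Searchers collapse into one shared structure; the two token counts are the approximate figures quoted in the episode.

```python
def merge_fragment(shared, fragment):
    """Merge one Searcher's graph fragment into the shared graph in place.

    Nodes are keyed by ID and edges are a set, so a claim or edge
    reported by several Searchers is stored only once.
    """
    shared["claims"].update(fragment["claims"])
    shared["evidence"].update(fragment["evidence"])
    shared["edges"].update(fragment["edges"])
    return shared

shared = {"claims": {}, "evidence": {}, "edges": set()}
fragment_a = {
    "claims": {"c1": "placeholder claim"},
    "evidence": {"e1": "source-1"},
    "edges": {("e1", "c1", "supports")},
}
fragment_b = {
    "claims": {"c1": "placeholder claim"},  # same claim seen by another Searcher
    "evidence": {"e2": "source-2"},
    "edges": {("e2", "c1", "refutes")},
}
merge_fragment(shared, fragment_a)
merge_fragment(shared, fragment_b)

# Approximate figures quoted in the episode for sixty-four Searchers.
RAW_TOKENS = 25_000_000
GRAPH_TOKENS = 21_000
ratio = RAW_TOKENS / GRAPH_TOKENS  # a little under 1,200 to 1
```

The quoted numbers work out to about 1,190 to 1, which the hosts round to twelve hundred to one.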

5:23Cassidy: There's a frame I want to use here, because it'll help when we get to the training section. The Navigator is reading the index and the table of contents of a twelve-hundred-page report — not the report itself. The Searchers wrote the report. The graph is the index. And the Navigator's job is specifically the job that can be done from the index: where are the holes, what's contradicted, what claim has only one source, which thread hasn't been pulled on.

5:53Tyler: Which gets us to what the Navigator actually decides. At each step, it can do one of three things. It can dispatch another Searcher — typically with a very targeted assignment, something like "verify whether this specific person held this specific position in this specific year." It can run several Searchers in parallel if there are multiple gaps. Or it can decide the graph is complete enough and commit to a final answer. The paper frames this as three operating modes, which I think of as three gears. First gear: just a solo Searcher, no Navigator. Second gear: Navigator plus one Searcher at a time, serial verification. Third gear: Navigator plus many Searchers in parallel. The accuracy curve as you shift up through those gears is where the headline numbers come from.
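
The three choices can be written down as a tiny decision function. This is only a hand-written stand-in to show the shape of the action space: in the paper the Navigator is a trained model, not a rule, and the names `Action`, `choose_action`, and `max_parallel` are assumptions made for this sketch.

```python
from enum import Enum

class Action(Enum):
    DISPATCH_ONE = "dispatch_one"            # one targeted verification
    DISPATCH_PARALLEL = "dispatch_parallel"  # several gaps at once
    COMMIT = "commit"                        # graph is complete enough

def choose_action(open_gaps, max_parallel=64):
    """Pick an action from the list of gaps currently visible in the graph.

    Rule-based placeholder for what is a learned policy in the real system.
    """
    gaps = list(open_gaps)
    if not gaps:
        return Action.COMMIT, []
    if len(gaps) == 1:
        return Action.DISPATCH_ONE, gaps
    return Action.DISPATCH_PARALLEL, gaps[:max_parallel]

action, assignments = choose_action(
    ["verify position held in a given year", "find a second source"]
)
```

The input is a list of gaps read off the graph rather than raw page content, which is what keeps the decision cheap no matter how much the Searchers browsed.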

6:45Cassidy: And on BrowseComp, which is the hardest of the web-research benchmarks they evaluate on, the accuracy keeps climbing as they scale the Searcher count up to sixty-four. Roughly log-linear: accuracy rises about linearly in the logarithm of the Searcher count. What's striking is the second curve on the same plot: the Navigator's reading pile grows from a few hundred tokens at the low end to about twenty-one thousand at sixty-four Searchers. That sounds like a lot until you remember the Searchers themselves are producing around twenty-five million tokens at the top of that range. The Searchers' output scales by orders of magnitude; the Navigator's context grows modestly. That decoupling is the actual win. You're not buying accuracy with context length anymore.

7:30Tyler: Pushback time. Cassidy, this is the moment where I want to be careful, because what you've described so far is an architecture, and architectures by themselves don't produce well-behaved Navigators. You can't just write a prompt that says "look at the graph and figure out what's missing" and expect the model to do it well on hard questions. The Navigator has to be trained. And the way they train it is the second genuinely interesting idea in the paper.

7:59Cassidy: Yeah, take it. This is your thread.

8:01Tyler: So the obvious training signal is: did the system get the right answer in the end? Reward the Navigator when the final answer is correct, penalize when it's wrong. Standard outcome reward. The problem with that — and the authors call this out clearly — is that it rewards the Navigator for things it didn't cause. Sometimes a question is easy enough that a single Searcher would have gotten it right with no verification at all. If the Navigator happens to be in the loop and the answer comes out correct, you reward it — but the Navigator didn't actually do anything. You're just teaching it to take credit.

8:39Cassidy: That's the proofreader problem.

8:41Tyler: Exactly. Imagine grading a proofreader by reading the published article and giving them a bonus if it was good. You'd be rewarding them for articles that were already good before they touched them. The fix, in publishing and in this paper, is the same: compare the article with and without their edits. If their edits flipped errors into corrections, they earned their pay. If the article would have been fine either way, the edits were noise.

9:08Cassidy: And in the paper this becomes what they call a contrastive reward. For every training example, they run the system twice. Once with the Navigator doing its full verification loop. Once in a "shadow pass" where the Navigator still synthesizes an answer, but only from the pre-verification graph — the evidence that was on the table before any of the Navigator's verification dispatches got added. So it's not "no Navigator at all" — the Navigator still reads the graph and produces an answer. It just doesn't get to see the fruits of its own verification work. And the reward isn't "did we get it right." The reward is "did the verification stage move the answer toward correct."

9:51Tyler: Which means the Navigator gets positive signal only when its verification mattered. If the pre-verification graph was already enough to land the right answer, the Navigator learns to stay out of the way. If the pre-verification graph would have produced a wrong answer and the verification dispatches flipped it to right, big reward. If the verifications didn't change anything, no reward — wasted effort. It's a much sharper training signal than outcome accuracy.
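
One simple way to express that signal is as a difference between two correctness checks. This is a sketch under assumptions, not the paper's formula: the function name is invented, and the negative case (verification turning a right answer into a wrong one) is not discussed in the episode, so the -1 there is this example's own choice.

```python
def contrastive_reward(full_correct: bool, shadow_correct: bool) -> int:
    """Reward verification only for the change it caused.

    full_correct:   the answer after the Navigator's verification loop was right
    shadow_correct: the shadow-pass answer, from the pre-verification graph, was right
    """
    return int(full_correct) - int(shadow_correct)

cases = {
    "already right, verification added nothing": contrastive_reward(True, True),
    "wrong flipped to right": contrastive_reward(True, False),
    "wrong before and after": contrastive_reward(False, False),
    "right flipped to wrong (assumed penalty)": contrastive_reward(False, True),
}
```

Plain outcome reward would pay out in the first case; the difference form pays nothing there, which is the proofreader point.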

10:21Cassidy: And on the RL side, the actual optimization is GRPO — group relative policy optimization — which I don't want to spend time on because the listener doesn't need it. The intuition is just that the model generates several candidate trajectories for the same question, the contrastive reward scores each one, and the better-than-average trajectories get reinforced. There's a clip and a KL penalty in there to keep the policy from drifting, but that's plumbing.
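
For listeners who do want the plumbing, the group-relative step and the clip look roughly like this. It is the generic GRPO-style recipe in plain Python, not code from the paper: the function names, the `clip_eps` value of 0.2, and the sample rewards are assumptions, and the KL penalty is left out for brevity.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each trajectory relative to its own group for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_term(prob_ratio, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one trajectory.

    prob_ratio is new-policy probability over old-policy probability.
    """
    clipped_ratio = max(1 - clip_eps, min(1 + clip_eps, prob_ratio))
    return min(prob_ratio * advantage, clipped_ratio * advantage)

# Four sampled trajectories for one question, scored by a contrastive reward.
rewards = [1, 0, 0, 1]
advantages = group_relative_advantages(rewards)
```

Trajectories above the group mean get positive advantages and are reinforced; if every trajectory in the group scores the same, all advantages are zero and the question contributes no gradient.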

10:50Tyler: The thing I want to flag, though, is that the contrastive reward is doing something subtle and worth appreciating. Most outcome-based RL on agents has this credit assignment problem baked in — you can't tell which of the agent's twenty actions actually mattered. The shadow-pass trick gives you a counterfactual for the verification stage as a whole. It doesn't tell you which specific intervention mattered, but it tells you whether the Navigator's verification work on this question was load-bearing. That's a much cleaner signal than you usually get.

11:26Cassidy: I want to ground all of this in a concrete example, because the paper has a really nice one in the appendix and it's the moment where the whole machine clicks. The question is asking for the name of a person who satisfies a pretty specific cluster of biographical constraints — career timeline, geography, a few facts about their published work. Standard stuff for BrowseComp.

11:50Tyler: Hard for a single agent because no single page has the answer. You have to assemble it.

11:56Cassidy: Right. The initial Searcher pass goes off, browses around, comes back, and the evidence assembles around the name Jesse Duroha. Sounds plausible. Looks like a real name. Has some surface match with the constraints. If you were running standard parallel voting, several Searchers might have landed on the same name from similar reasoning, and you'd ship it.

12:19Tyler: And it's wrong.

12:20Cassidy: Completely wrong. But here's what happens with the Navigator in the loop. The Searchers' findings get assembled into the evidence graph: claim, "Jesse Duroha is the answer," edges pointing to supporting sources. The Navigator looks at the graph and notices that whole regions of the constraint set haven't been touched — specifically, a particular activity window and some book chapters from an earlier year. So it dispatches a verification batch. Among those probes is an alternative-hypothesis check — essentially, "if Duroha isn't right, who else fits the untouched constraints?" That probe comes back with evidence pointing somewhere else. New node in the graph. Red string against the original answer.

13:04Tyler: At which point the Navigator can't just commit to Jesse Duroha. The graph now has a contradiction on a load-bearing claim, and an alternative thesis with its own supporting evidence.

13:15Cassidy: So it dispatches more searches to nail down the alternative, and they point at a different name: Nicholas Constant. The Navigator then dispatches a couple more verifications to check that the new candidate satisfies the rest of the constraints. They come back green. The graph is now consistent. The Navigator commits.

13:35Tyler: And the final answer is Nicholas Constant, which is correct. What I love about that case study — and Cassidy, I think this is the cleanest illustration of why the architecture matters — is that no amount of parallel sampling would have fixed it. Sixty-four Searchers asked the same question would have produced sixty-four near-duplicate trajectories all confidently landing on Jesse Duroha. Voting would have made the wrong answer louder, not more correct. What broke the failure was the Navigator noticing that parts of the constraint set were untouched and dispatching an alternative-hypothesis probe that the Searchers wouldn't have run on their own.

14:15Cassidy: That's the jigsaw versus voting distinction made concrete. Voting can only pick from answers that were already proposed. Jigsaw assembly can detect that a piece is missing.

14:26Tyler: There's one more architectural property of this design that I want to flag, because it has real practical implications. The Searcher and the Navigator are separately trained models — and the Navigator's training is entirely about reading evidence graphs and deciding what to dispatch. It doesn't depend on which specific browser-agent is filling in the graph. So when they take the trained Navigator and pair it with Searcher backbones it has never seen during training — including a couple of proprietary models from other labs, at different scales — the Navigator still works. Zero-shot transfer.

15:04Cassidy: That's a pretty big deal. It means the Navigator is not entangled with the quirks of any particular Searcher model. You can swap in a stronger Searcher tomorrow and inherit the gains without retraining the coordinator.

15:18Tyler: It also tells you something about what the Navigator actually learned. If transfer worked, the policy isn't "I know how Searcher X tends to fail and I compensate." It's something more like "given any evidence graph in this format, here is how to identify gaps and contradictions." That's a more general skill, and the fact that it transfers — even across to closed proprietary backbones — is some evidence the abstraction is real and not just memorized.

15:47Cassidy: I flagged at the top that we should come back to the MiroMind affiliation. Tyler, now's the moment.

15:53Tyler: Yeah. So the paper is from MiroMind AI, and one of the baselines they compare against is MiroThinker-1.7, which is their own prior model. That's not unusual in this field — labs often benchmark their new systems against their old ones — but it does mean a critical reader should ask whether the comparisons are calibrated against the strongest possible baselines, or against the ones that flatter the new result. The paper does also compare against external systems, and the gains hold up there. But the headline scaling curves are run against their own Searcher backbone. That's a thing to register without making more of it than it warrants.

16:34Cassidy: The steelman of the critique is that some of the architectural cleanliness might be specific to how MiroThinker-style Searchers produce output. If a different Searcher family produced evidence in a messier or differently-typed form, would the Navigator still find the gaps as effectively? The zero-shot transfer experiments push back on this somewhat — they do work, including on proprietary backbones — but the transfer set is finite and the long tail of out-of-distribution Searcher behavior isn't really stress-tested.

17:06Tyler: The other limitation worth voicing — and the authors do gesture at this — is that the evidence graph format has to be agreed on in advance. Both the Searcher and the Navigator have to share the schema for what counts as a claim, what counts as evidence, what edge types exist. That's fine inside one lab, but if you wanted this approach to become a standard, the schema becomes a coordination problem. Different teams might evolve incompatible graph formats and you lose the interoperability that made the zero-shot transfer interesting.

17:40Cassidy: And I want to add one more, because it's the kind of limitation that doesn't show up in benchmark numbers but matters for whether this works in the wild. The contrastive reward depends on being able to run a shadow pass — to synthesize what the answer would have been from just the pre-verification graph. That's tractable during training on known questions with known answers. It's not something you can do at deployment time, when you don't have ground truth. So the training procedure has a counterfactual luxury that the deployed system doesn't. That's fine — most RL training has this property — but it means the quality of the Navigator's behavior in the wild is bounded by how well the training distribution matched the deployment distribution. Hard, weird, out-of-distribution questions might catch the Navigator dispatching verifications that didn't get good gradient signal during training.

18:36Tyler: One last item for the limitations pile, and this one is architectural. The Navigator is a single point of serialization. The Searchers parallelize beautifully, that's the whole pitch, but the Navigator is one model reading the graph and making decisions in sequence. As you scale Searchers further, eventually the Navigator's decision-making becomes the bottleneck. The paper doesn't push that far enough to see it bite, but the architecture has that asymmetry baked in.

19:10Cassidy: Which is the kind of limit that future work fixes, probably with a hierarchy of Navigators. But it's worth naming.

19:17Tyler: Worth naming. Cassidy, where does this leave us on the bigger question — what does Argus actually change about how we should think about scaling research agents?

19:27Cassidy: I think the move that matters most here is the reframing of what parallelism is for. The dominant model has been "run many samples and aggregate" — and the implicit theory is that diversity in the sampling distribution will surface the right answer somewhere in the cloud. Argus is saying: that's the wrong unit of parallelism. You shouldn't be parallelizing over guesses at the answer. You should be parallelizing over pieces of evidence, and then doing the assembly serially with a model that knows how to read structure.

20:02Tyler: And the twelve hundred to one compression number is the empirical signature that the reframe is real. If parallel scaling had to be paid for in linear context growth, you'd hit a wall fast. Decoupling the two is what makes the curve keep climbing.

20:19Cassidy: There's a broader pattern here too. For a couple of years now, the dominant philosophy in agent design has been "let one big model figure everything out in its context window." Longer context, more tokens, more in-context reasoning. Argus is on the other side of that argument — it's saying the right move is to put structure outside the model. An evidence graph that persists and gets read in compressed form. A separate model that operates on the structure rather than reconstructing it every step. It's a bet that external scaffolding can buy capabilities that pure context-length scaling can't.

20:59Tyler: Which feels right to me. The papers I'm most interested in over the next year are the ones that take that bet further — what other kinds of structured state could live outside the model? Plans? World models? Evolving hypotheses? Argus is one instance of a design philosophy that I think has more room to run.

21:19Cassidy: Alright. Tyler, anything we missed?

21:21Tyler: Just the one we should leave the listener with. The case study — Jesse Duroha to Nicholas Constant — is the kind of example that sticks because it shows a system catching its own wrong answer, not through more sampling, but through structural self-doubt. The Navigator looked at the graph, saw whole regions of the constraint set that hadn't been touched, and the entire correction cascaded from that one noticing. That's the kind of behavior we want from research agents, and it's the kind that voting will never produce.

21:54Cassidy: The show notes have a link to the paper and some related reading if you want to keep pulling on this thread. And if you want the full transcript with the jargon defined inline and links over to the other episodes that share these ideas, that's all on paperdive dot AI. Thanks for listening to AI Papers: A Deep Dive.