A Router That Beats the Frontier Models It Calls

0:00Bella: A general contractor walks onto a job site and never picks up a hammer. They never lay a brick, never touch the wiring. And the house they finish is better than anything the plumber, the framer, or the electrician could have built alone — because their entire skill is knowing who to call, in what order, and who should check whose work. This paper builds that contractor for frontier AI. A system whose only job is to decide which top model to call for each piece of a problem — and it beats every one of those top models on some of the hardest benchmarks we have. On agentic coding, it edges out the best single system by around five to six percent, a relative gain of the size the authors say you'd normally expect from a whole new model generation.

0:49Tyler: Quick heads up before we get into it — this is an AI-made explainer, both voices included. And here's the part that should make you pause. Nothing new was trained to do the actual problem-solving. The system never touches the inside of GPT, Claude, or Gemini. It calls them the way you'd call a contractor — describe the job over the phone, hear back the result. By the end of this you'll understand how a model whose only talent is dialing the right number beats the experts it's dialing.

1:20Bella: Which matters far beyond one lab's product. For years the story of AI progress has been one axis — bigger models, more compute, more data. This paper argues there's a second axis hiding in plain sight: how cleverly you combine the models that already exist. And if that's real, the question of who gets to play at the frontier changes completely.

1:43Tyler: Though I want to flag something now, because it's the thread we'll pull hard at the end. The most jaw-dropping wins here lean on numbers the model providers reported themselves, under their own setups. And the one experiment with proper error bars shows the smallest gap of all. Hold that thought — it's the whole credibility question.

2:05Bella: Fair. Let's earn it first. The paper is the Sakana Fugu technical report, out June 19th, 2026, and the starting observation is one anyone who uses these models has half-noticed already. Frontier models have stopped being interchangeable. They've specialized. The paper's read is that the GPT line has become the math-and-physics workhorse, while the Claude Opus line has become the software-engineering and security specialist. And it goes finer than that. Inside competitive coding alone, one model is great at directly implementing an algorithm you already know, while another is better at the messy planning — combining several ideas to crack a problem nobody's solved cleanly.

2:50Tyler: So no single model is best at everything, but each is best at something. That's already an opening. But there's a second thread the paper braids in, and it's the one people underrate. A model's capability isn't just baked into its weights. It depends enormously on the scaffold wrapped around it. A scaffold — think Claude Code, or Codex — is the software harness that turns a raw model into an agent. It hands the model a code repository, lets it run things, read the error messages, edit, and try again in a loop. Same model, different harness, wildly different competence. The intuition: a brilliant programmer with no IDE, no debugger, no way to run their code is a fraction as effective as the same person with a full development environment. The scaffold is the development environment.

3:44Bella: So you've got two levers. Diverse specialists, and the harness you wrap around them. Put them together and you get the thesis: orchestration itself is a scaling axis. A way to push performance forward that doesn't require training a bigger model — just a smarter way to combine the ones you have.

4:04Tyler: And the cleanest way they describe what they're doing is "model merging at the behavioral level." Which is worth unpacking, because the contrast is the whole point.

4:15Bella: Right — classic model merging fuses models by their actual numbers. You average the weights, you stitch layers together, you blend two trained networks into one. It works. The lab behind this paper published its evolutionary model-merging work on exactly that in Nature Machine Intelligence. But it has a hard requirement: you need open access to the model's internals.

4:39Tyler: And that requirement rules out precisely the models that define the state of the art. You can't average the weights of a closed frontier model — you never see them. So Fugu sidesteps the entire problem. It never touches a weight or an activation. It combines models by their behavior — calling them through an interface, watching what comes back, and learning how to route, verify, and stitch their outputs together. Two consequences fall right out of that. It can mix closed models from completely different providers, which weight-merging can never do. And the moment a new model ships, you drop it into the worker pool — no retraining. That's structurally different from a monolithic model you can only improve by training it again from scratch.

5:26Bella: Okay, so that's the philosophy. But the paper ships two different systems, and keeping them straight is the one thing that'll trip people up. Tyler, you've got the cleaner map of these — walk us through it.

5:39Tyler: Happy to. Think of them as two answers to "how much latency will you trade for quality." The first one is just called Fugu, and it's the fast one. For each query — or each turn in a multi-turn task — it picks a single best worker and immediately hands off. So its speed is barely more than calling one model directly. Under the hood it's a small scoring head bolted onto a backbone model: it reads the model's internal state early, outputs one score per worker — basically "who should handle this?" — and dispatches. No expensive writing or reasoning by the orchestrator itself. They call it a decision, not a generation. That design choice matters later, so hold onto it. The second one is Fugu-Ultra, the heavy one. Instead of picking one worker, it writes out an entire workflow in plain language — a sequence of subtasks, each handed to a chosen worker, each with an explicit list of which earlier results it's allowed to see. Because it's free-form, it can express any shape: a best-of-several vote, a chain, or a tree where one model synthesizes the work of two others. It trades latency for quality on the hardest problems.
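
To make "a decision, not a generation" concrete, here is a minimal sketch of a scoring head that turns one hidden-state vector into one score per worker and dispatches to the top scorer. The worker names, the tiny weight matrix, and the plain linear head are all illustrative assumptions; the transcript does not describe the head's actual architecture.

```python
from typing import List, Sequence, Tuple

# Hypothetical worker pool; placeholder names, not the report's actual pool.
WORKERS = ["worker_a", "worker_b", "worker_c"]


def score_workers(
    hidden: Sequence[float],
    head_weights: Sequence[Sequence[float]],
    head_bias: Sequence[float],
) -> List[float]:
    """Linear scoring head: one score per worker from a single hidden state."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(head_weights, head_bias)
    ]


def route(
    hidden: Sequence[float],
    head_weights: Sequence[Sequence[float]],
    head_bias: Sequence[float],
    workers: Sequence[str] = WORKERS,
) -> Tuple[str, List[float]]:
    """Pick the highest-scoring worker. No text is generated by the router."""
    scores = score_workers(hidden, head_weights, head_bias)
    best = max(range(len(scores)), key=scores.__getitem__)
    return workers[best], scores


# Toy numbers, chosen only to show the mechanics.
hidden_state = [1.0, 0.0]
weights = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.3]]
bias = [0.0, 0.0, 0.0]
chosen, scores = route(hidden_state, weights, bias)
print(chosen)
```

The point of the sketch is cost: routing is one small matrix-vector product and an argmax, which is why the fast variant adds so little latency on top of the worker call itself.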

6:55Bella: And the thing I'd stress — both of these are learned. Neither is a hand-written rulebook that says "coding goes to Opus, math goes to GPT." The orchestrator discovers who's good at what through training, and adapts the structure to the task in front of it. That's the difference between a contractor with a fixed Rolodex and one who actually learned, job by job, which subcontractor delivers.

7:21Tyler: Which brings us to how you teach a thing like that. And this is the densest stretch of the paper — three different training methods across the two systems. I'll give you the intuition for each and what it's actually rewarding; the formal machinery you can let the diagrams carry. The payoff at the end of it is a genuinely surprising fact about what the router learns — something a benchmark score can't tell you.

7:48Bella: Good — because three training methods is exactly where people glaze over. Start with the fast router. How do you train something to just... pick?

7:58Tyler: Two stages. Stage one is supervised fine-tuning on single-step tasks — coding, math, reasoning problems where you know the right answer. For each task they run every worker model several times and measure how well each did. So now you've got a performance ranking over the pool. Here's the clever move. Instead of labeling "model three won, train the router to always pick model three," they convert those scores into a soft probability distribution over the workers — and train the router to match the whole shape of it.

8:33Bella: So it's the difference between teaching a sommelier "this wine is correct, every other wine is wrong" versus "this one's an eight, that one's a seven-and-a-half, this one's a four."

8:45Tyler: Exactly that. When two models are nearly tied, a hard "pick the winner" label throws away how close it was, and it makes the router brittle — it's agonizing over a coin flip. Teaching the full distribution preserves the margin and gives a much richer signal. The router learns the real shape of the quality landscape, not just the peak.
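
A small sketch of the soft-label idea, under stated assumptions: the transcript only says the per-worker scores become "a soft probability distribution", so the temperature softmax and the cross-entropy loss below are one common way to do that, not necessarily the report's. The scores echo the sommelier example (an eight, a seven-and-a-half, a four), rescaled to the 0 to 1 range.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with a temperature knob."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def soft_targets(mean_scores: Sequence[float], temperature: float = 0.1) -> List[float]:
    """Turn measured per-worker scores into a target distribution.

    Assumption: a softmax over mean scores. Lower temperature sharpens the
    target toward the winner; higher temperature flattens it.
    """
    return softmax(mean_scores, temperature)


def cross_entropy(target: Sequence[float], router_logits: Sequence[float]) -> float:
    """Loss that pushes the router's distribution toward the whole target shape."""
    probs = softmax(router_logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


mean_scores = [0.80, 0.75, 0.40]   # toy values, not measured results
targets = soft_targets(mean_scores)
hard_label = [1.0, 0.0, 0.0]       # what "pick the winner" would have taught
print([round(t, 3) for t in targets], hard_label)
```

With the soft target, the near-tie between the first two workers stays visible in the label, while the hard label treats the runner-up exactly like the worst worker.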

9:08Bella: But you flagged earlier — single-step tasks with clean answers aren't how these models actually get used.

9:15Tyler: That's stage two, and it's where it gets interesting. A clean little task with one right answer tells you nothing about how a model behaves inside a real coding harness — operating tools, editing files, reacting to execution feedback over dozens of turns. So they collect real multi-turn trajectories from environments like Claude Code and Codex, and they optimize the router to maximize one thing: did the whole task get completed? And they do it with evolution rather than gradient-based learning. Picture tuning a recipe where the only feedback is whether the dinner guests cleaned their plates — no per-ingredient notes, just the final verdict. You can't reason backward to "add more salt." So instead you cook many slightly varied versions, keep the ones that got eaten, and cook the next batch of variations from those. That's evolutionary search on a blunt, all-or-nothing signal — which is exactly what a "did the whole job succeed?" reward looks like.
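
The recipe loop can be sketched as a very plain evolutionary search: perturb the current parameters, score each variant by how many tasks it completes, keep the best, repeat. Everything here is an assumption for illustration, including the Gaussian mutation, the elite size, and the toy success function standing in for "did the whole task get completed"; the transcript does not say which evolutionary algorithm the report uses.

```python
import random
from typing import Callable, List, Sequence


def evolve(
    init_params: Sequence[float],
    fitness: Callable[[Sequence[float]], float],
    generations: int = 20,
    population: int = 16,
    elite: int = 4,
    sigma: float = 0.1,
    seed: int = 0,
) -> List[float]:
    """Gradient-free search driven only by an end-of-task success score."""
    rng = random.Random(seed)
    parents = [list(init_params)]  # the starting point being refined
    for _ in range(generations):
        # Cook many slightly varied versions of the surviving recipes.
        children = [
            [p + rng.gauss(0.0, sigma) for p in rng.choice(parents)]
            for _ in range(population)
        ]
        # Parents compete with children, so the best score never regresses.
        pool = parents + children
        pool.sort(key=fitness, reverse=True)
        parents = pool[:elite]
    return parents[0]


def toy_success_rate(params: Sequence[float]) -> float:
    """Hypothetical stand-in for the fraction of multi-turn tasks completed.

    Each toy task passes or fails outright; there is no partial credit.
    """
    tasks = [(1.0, 0.0, 0.5), (0.0, 1.0, 0.5), (1.0, 1.0, 1.5)]
    wins = sum(1 for a, b, need in tasks if a * params[0] + b * params[1] >= need)
    return wins / len(tasks)


start = [0.4, 0.4]
best = evolve(start, toy_success_rate)
print(toy_success_rate(start), toy_success_rate(best))
```

Note that the loop never asks why a variant failed. It only needs a score per variant, which is what makes it usable when the signal is a single pass or fail at the end of a long trajectory.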

10:16Bella: And it works here because the recipe didn't start from nothing — stage one already gave you a decent draft.

10:23Tyler: That's the key. Supervised fine-tuning put the parameters in a good neighborhood, so evolution only has to refine the seasoning, not invent the dish. They actually report this was more stable than trying to fine-tune directly on the messy end-to-end tasks. And now the surprise I promised. Those end-to-end trajectories reveal something a benchmark can't. A model's standalone score does not predict how well it behaves inside a scaffold. Some models reason beautifully in isolation and then fumble the moment they have to operate tools and react to feedback. Others look unremarkable on the leaderboard but are rock-solid inside an interactive loop. The router learns this practical, in-the-harness notion of capability — which is a different and frankly more useful thing than what any benchmark measures.

11:16Bella: That's a genuinely clean result. The thing you'd put on a slide — the benchmark number — is not the thing that wins real tasks. Okay, that's the fast router. What about Ultra, the one that writes whole workflows?

11:30Tyler: Ultra is trained with reinforcement learning — let it try things, reward good outcomes, let it learn which kinds of attempts pay off. And the reward is brutally simple, with just three levels. Zero if what it wrote can't even be parsed into a valid workflow. If it parses and runs, half credit for being well-formed — and full credit only if the final answer is actually correct. The specific method is GRPO, short for Group Relative Policy Optimization, and it has one trick worth knowing. Many reinforcement learning methods train a separate judge model, usually called a critic, to estimate how well an attempt should be expected to score. GRPO skips the judge entirely. It just compares each workflow proposal against a batch of sibling attempts at the same problem. A proposal gets reinforced if it beat the average of its peers. Cheaper, and it teaches the model "this kind of decomposition tends to work for this kind of problem."
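
Here is a minimal sketch of that reward and of the peer comparison. The three reward values follow the description above (zero, half credit, full credit). Dividing by the group's standard deviation is how GRPO is usually formulated; whether the report normalizes exactly this way is an assumption, and the function names are made up for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def workflow_reward(parsed_ok: bool, answer_correct: bool) -> float:
    """Zero for an unparseable workflow, half for well-formed, full if correct."""
    if not parsed_ok:
        return 0.0
    return 1.0 if answer_correct else 0.5


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Score each attempt against its siblings on the same problem.

    No learned critic is involved: the group mean plays the role of the baseline.
    """
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


# Four sibling workflow proposals for one problem (toy outcomes).
outcomes = [(False, False), (True, False), (True, True), (True, False)]
rewards = [workflow_reward(parsed, correct) for parsed, correct in outcomes]
advantages = group_relative_advantages(rewards)
print(rewards, [round(a, 2) for a in advantages])
```

Proposals with a positive advantage beat the group average and get reinforced, those with a negative one get pushed down, and a group where every sibling scored the same contributes no signal at all.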

12:26Bella: But there's a memory problem lurking in Ultra, isn't there? When you've got multiple agents all able to call tools, something has to keep track of who did what.

12:38Tyler: Right, and they hit a failure mode worth the whole segment. They call it orchestration collapse. Imagine a brainstorming meeting where the first person lays out a complete plan — and then everyone else, having heard it, just nods along and elaborates that same plan. Your "team" of specialists has quietly collapsed into one person's idea. That's exactly what happened when every agent in a workflow could see everything the first agent did. The first one's trajectory railroaded all the rest. They redundantly walked the path already laid down — which defeats the entire point of having multiple specialists.

13:19Bella: So how do you keep them genuinely independent without making them blind?

13:24Tyler: A careful split, and this is the load-bearing subtlety of Ultra. Inside a workflow, agents are isolated — an agent only sees another's work through that explicit access list, otherwise it sees only its own actions. That keeps them diverse. But across workflows there's a persistent shared memory, so an agent retains context from earlier turns and doesn't waste calls rediscovering facts it already established. Send everyone to separate rooms to think — but keep one shared notebook of what's already known. Isolate enough to stay diverse, share enough to stay efficient.
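
The separate-rooms-plus-shared-notebook split can be sketched as a data structure. This is an illustration under assumptions: the `Step` fields, the `run_workflow` helper, and the way the notebook is updated are all hypothetical names and choices, and the real system writes its workflows in plain language rather than as Python objects.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str
    worker: str
    instruction: str
    # Explicit access list: names of earlier steps this step may read.
    can_see: List[str] = field(default_factory=list)


# (worker, instruction, visible earlier results, shared notes) -> output
WorkerFn = Callable[[str, str, Dict[str, str], List[str]], str]


def run_workflow(
    steps: List[Step], call_worker: WorkerFn, shared_memory: List[str]
) -> Dict[str, str]:
    """Run steps in order, showing each worker only what its access list allows."""
    results: Dict[str, str] = {}
    for step in steps:
        visible = {name: results[name] for name in step.can_see}
        results[step.name] = call_worker(
            step.worker, step.instruction, visible, list(shared_memory)
        )
    if steps:
        # Assumption: the final result is what persists across workflows.
        shared_memory.append(results[steps[-1].name])
    return results


def echo_worker(worker: str, instruction: str, visible: Dict[str, str], notes: List[str]) -> str:
    """Stand-in for a real model call; it just reports what it was shown."""
    return f"{worker} saw {sorted(visible)}"


notebook: List[str] = []
tree = [
    Step("attempt_a", "worker_a", "solve independently"),
    Step("attempt_b", "worker_b", "solve independently"),
    Step("aggregate", "worker_c", "reconcile the attempts", ["attempt_a", "attempt_b"]),
]
print(run_workflow(tree, echo_worker, notebook))
```

In this toy tree the two attempts never see each other, which is the guard against orchestration collapse, while the aggregator sees both. The notebook carries forward to whatever workflow runs next.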

14:04Bella: Let me pull the spine together before we go to the evidence, because that was a lot. Frontier models specialize. Scaffolds amplify them. So you build a learned orchestrator that combines them by behavior, not weights. The fast one picks a single worker, trained by examples then refined by evolution. The heavy one writes whole workflows, trained by reinforcement to get the answer right. And the punchline running under all of it — what wins real tasks isn't the benchmark score, it's behavior inside the harness.

14:40Tyler: That's the build. Now does the routing actually adapt — or is it just always quietly calling the best model? Because that's the obvious objection, and the answer is the best visual in the paper.

14:53Bella: This is my favorite part. Let's start with the single most concrete result — Terminal Bench. The fast Fugu, remember, picks only one model per turn. And it beats GPT-5.5 on this benchmark. How? Watch the trajectory on screen: it alternates between GPT-5.5 and Claude Opus 4.8 through the solution — and it calls Opus in at specific, critical debugging moments. That's the entire thesis in one picture. Per-step routing extracting more than any single model could give on its own. And the heavy Fugu shows the same instinct as an actual workflow. There's a Terminal Bench task — build a working package server. Ultra has GPT build it, then deliberately brings in Opus to find the flaws. And Opus tears into it: it catches that GPT used a plain throwaway server instead of a real package server, that the hand-built package was fragile, that the Linux environment was mismanaged — and then the best catch, that GPT's own "the server is reachable!" check was bogus, the signal was coming from an orphaned background process. Relay that back to GPT, and it finishes the job correctly.

16:07Tyler: That's a newsroom. One reporter drafts the story, a sharp editor tears it apart looking for errors. GPT as builder, Opus as debugger. And it's so much more convincing than a benchmark bar, because you can see the division of labor doing real work.

16:24Bella: And it gets better, because the editor-in-chief seat moves. On a hard trivia question — a game-state puzzle from Humanity's Last Exam — Ultra built a tree with two models each attempting independently, and put Gemini at the top as the aggregator. Both attempts were partially wrong, but the aggregator stitched the correct pieces into a right answer. Now contrast a math problem: it put GPT at the root instead, and GPT-as-aggregator resolved a disagreement over a single integer by rederiving it from first principles.

16:59Tyler: So the synthesizer role itself adapts — Gemini for trivia, GPT for math. And that's precisely what fixed multi-agent systems can't do. The prior work in this space — Mixture-of-Agents, GPTSwarm — uses one fixed model as the aggregator, which bottlenecks the whole system on tasks outside that model's strength. Fugu's adaptive aggregator is its direct answer to that limitation.

17:27Bella: Which sets up the keystone visual — Figure 5. For each benchmark, there's a pie chart showing which worker the orchestrator picked most often. And the punchline is that the pies look completely different across benchmarks. Math leans one way, chemistry leans another, coding another. If the system were secretly just always calling its favorite model, every pie would look the same. They don't. That's the empirical proof it learned domain-specific routing rather than picking a house favorite.
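
The "house favorite" check behind those pies is simple to express: tally which worker was picked per benchmark and compare the shares. The sketch below uses a made-up routing log with placeholder benchmark and worker names purely to show the bookkeeping; none of these counts come from the report.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def routing_shares(log: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Per benchmark, the fraction of routing decisions that went to each worker."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for benchmark, worker in log:
        counts[benchmark][worker] += 1
    return {
        bench: {worker: n / sum(c.values()) for worker, n in c.items()}
        for bench, c in counts.items()
    }


def top_worker(shares: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """The most frequently chosen worker for each benchmark."""
    return {bench: max(s, key=s.get) for bench, s in shares.items()}


# Placeholder log of (benchmark, chosen worker) pairs, invented for illustration.
toy_log = [
    ("math", "worker_a"), ("math", "worker_a"), ("math", "worker_b"),
    ("coding", "worker_b"), ("coding", "worker_b"), ("coding", "worker_a"),
]
shares = routing_shares(toy_log)
favorites = top_worker(shares)
# If one worker were the house favorite, this set would have a single element.
print(favorites, len(set(favorites.values())))
```

A router that always called one model would produce identical shares on every benchmark, so differing top workers across benchmarks is the pattern the pies are meant to show.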

18:03Tyler: And now the headline numbers — but let me frame them as the prediction first. If orchestration genuinely composes specialists, you'd expect it to top benchmarks that reward different strengths, not just one. And it does. SWE-Bench Pro: 73.7, against the best single model's 69.2. GPQA-Diamond: both Fugu variants hit 95.5, beating Gemini's 94.3 — and beating the next-generation model class the providers haven't even shipped. The authors' framing is that orchestration is reaching performance "typically associated with the next iteration of model training."

18:45Bella: And there are two results I have to mention because they're so vivid. Rubik's cube — write a solver from scratch, standard library only, tested on 300 scrambles. Both Fugu models solve all 300. Of three frontier baselines, one solves all 300 — and two crash before solving a single cube. Ultra's solutions average 19.72 moves, within one move of God's number, the proven worst-case optimum of 20. The line that sticks: producing a solver that actually runs is exactly where two of the three frontier models fail.

19:23Tyler: And the most beautiful one isn't a benchmark at all. Classical Japanese letters, written in a scattered style where characters are deliberately spread across the page at different sizes, and you have to recover the reading order. No dataset exists. They hand-annotated 25 pages with a domain expert. And the framing is the part I'd read nearly verbatim: this is exactly the regime where data-driven learning does not apply — not because a learned model would be weak, but because the training data does not, and cannot readily, exist.

19:59Bella: So Fugu doesn't train a model on it. It writes a predictor — code that reasons over the character positions — and improves that procedure through search. It scores about 0.78 against the best frontier baseline's 0.64. That's the clearest illustration in the whole paper of something orchestration-plus-code-synthesis can do that training simply cannot. When there are no examples to learn from, you reason out a procedure instead.

20:27Tyler: Okay. This is where I have to be the skeptic, because the channel's whole identity is naming where the impressive thing is actually weak — and there's a real seam here. Start with the comparison itself. Fugu's scores are self-computed. Nearly all the baseline scores are provider-reported — each provider running its own model under its own optimal harness. But Fugu deliberately uses minimal evaluation harnesses to, in their words, best expose the underlying capabilities. The problem: a minimal harness might handicap a standalone frontier model relative to the rich production harness its provider used to report that model's number. So some of that five-to-six percent coding gap could be the harness, not the orchestration.

21:15Bella: That's a fair hit. Though it cuts both ways — a minimal harness handicaps Fugu's workers too, since they're the same models.

21:24Tyler: True, but it doesn't dissolve the deeper issue. Fugu beats GPT, Opus, and Gemini — and those three are its workers. Beating your own ingredients is the expected outcome of any decent ensemble. The interesting claim was never "does it beat them" — it's the magnitude. And the magnitude rides on exactly that harness question. Then there's the evidence asymmetry I flagged at the top. The marquee qualitative wins — the blindfold chess games, some of the design results — are, by the authors' own admission, selected illustrative examples, not win rates. Honest disclosure, but it means the most vivid evidence is a highlight reel. And the one experiment with proper seeds, error bars, and a real optimization curve — an autonomous research agent improving a training pipeline over 123 experiments — shows the smallest effect in the paper. The authors call it modest themselves: a fraction of a percent.

22:22Bella: So the pattern is uncomfortable. Where the evidence is rigorous, the effect is small. Where the effect is huge, the evidence leans on selected examples or provider-reported baselines.

22:35Tyler: That's the reservation, and I'd add one more. There's no test of whether the router discovered a counterintuitive specialization — something a human wouldn't have hand-coded. Figure 5 confirms the system learned the priors the authors already held: math to GPT, chemistry to Gemini. That's consistent with learning. But the strongest possible evidence — the router finding a non-obvious pairing that beats human intuition — isn't there. So "learning beats a hand-designed routing table" is asserted more than it's isolated.

23:09Bella: I'll concede all of that. The measured gap is genuinely entangled with the evaluation setup, and the cleanest experiment is the most modest. What I won't concede is that it sinks the idea — because the idea doesn't depend on the exact size of the gap.

23:26Tyler: Agreed, and that's the honest place to land. The framing survives even if some of the numbers are softer than the headline. Capability as a property of systems and scaffolds, not just weights — that's portable, and it's right.

23:40Bella: Which brings us to why this lands beyond one lab. The near-term payoff: you can get something close to next-generation performance today by composing this generation's models well — without the capital, the compute, or the data to train a frontier model yourself. A team that can't build a GPT-5.5 could still field a system that beats GPT-5.5, just by being smart about orchestration.

24:04Tyler: And that reshapes a strategic assumption the whole field has been resting on. A lot of current thinking takes for granted that frontier capability is gated by access to enormous training runs — which is the logic underneath export controls and the entire compute race. The paper notes Fugu delivers frontier capability without that exposure, and floats that treating orchestration as a first-class scaling axis could distribute the benefits of frontier AI more broadly, rather than concentrating them in whoever can afford the biggest training run.

24:39Bella: You don't have to fully buy that to find it consequential. If capability can be amplified by combining models — not only by training larger ones — then the strategic shape of the entire race shifts. And that's the real takeaway, bigger than either Fugu. The durable result here isn't a router or a workflow composer. It's the reframe: the next jump in capability might not come from a bigger model at all, but from a smarter way of conducting the models we already have. The contractor who never picks up a hammer can still build the better house.

25:15Tyler: So here's the question for you. Is orchestration a genuine new scaling axis — a second path to the frontier that doesn't need the biggest training run — or is it a clever way to squeeze the last few points out of models that still had to be trained the old, expensive way first? Pick a side and tell us in the comments.

25:38Bella: If you want to go deeper, the full annotated version of this episode is on paperdive dot AI — every technical term tap-to-define, with links to the related work grouped by theme, from evolutionary model merging to the multi-agent systems Fugu defines itself against.

25:57Tyler: Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8; Bella and I are both AI voices from Eleven Labs; and the producer isn't affiliated with Anthropic or Eleven Labs. The paper is the Sakana Fugu technical report, published June 19th, 2026, and we recorded this on June 23rd.

26:18Bella: Whatever wins the next round — a bigger model or a smarter conductor — the trick might just be knowing exactly who to call. We'll see you in the next one.