0:00Bella: There's a lake with almost no fish in it. You drop a line, nothing. So you buy a second rod, then a third, then a fourth — and if the lake is nearly empty, four rods barely beat one. The rods add up slowly; the emptiness works against you on every cast. That lake is what training a reasoning model on its hardest problems actually looks like — and a paper out this May proves, with two lines of arithmetic, that you cannot buy your way out of it. Quick heads up before we go further: this is an AI-made explainer, both voices included.
0:35Finn: And the claim is stronger than "it's wasteful," which everyone already suspected. It's that the waste is structural. For a hard problem, throwing more independent attempts at it is, by the math, almost useless past a certain point. More compute does not save you.
0:52Bella: So by the end of this you'll understand exactly why that's true, and then the genuinely elegant fix — because the same paper takes a fifty-year-old result from combinatorial optimization and uses it to rebuild how you collect training data. And the punchline of that fix is a little outrageous: a trick practitioners have been bolting onto these systems by hand for years turns out to fall straight out of the math.
1:19Finn: The entropy bonus.
1:21Bella: The entropy bonus. We'll build to it. Why should anyone who isn't training these models care? Because reinforcement learning on reasoning models is staggeringly expensive, and this paper shows the waste isn't spread evenly — it's concentrated exactly on the hardest problems, the frontier you most want the model to learn. The next round of gains may come less from bigger models and more from not pouring compute into an empty lake.
1:49Finn: Let's set up the lake properly, because the whole thing hinges on one mechanism. The training method here is GRPO — Group Relative Policy Optimization — the workhorse behind a lot of recent reasoning models. And the elegant idea in GRPO is that it doesn't need a separate judge network to score answers. It judges each attempt relative to its peers.
2:11Bella: Right — so you hand the model a problem, let it generate a batch of attempts at it. The paper calls that batch a group, and each full attempt — the whole chain of reasoning, the tool calls, the final answer — is one rollout. You score every rollout, and the learning signal for any single one is basically: how much better or worse was this than the group average?
2:35Finn: And there's the trap. That signal is a difference. If every attempt in the group gets the same score — all correct, or all wrong — there's no spread. The average equals every individual. So the difference is zero for all of them. And when the signal is zero, the nudge the model gets that step — the gradient, just the size of the update — is exactly zero. That training step taught the model nothing. You spent the compute, you generated the rollouts, and the model didn't move.
3:06Bella: And notice which problems do that to you. Easy ones collapse to all-success. Hard ones collapse to all-failure. Both produce a flat, useless group.
3:15Finn: Which is the quiet tragedy the paper opens with. The model ends up effectively training on a narrow band of medium-difficulty problems — the only ones where attempts still disagree. The easy stuff is solved, the hard stuff is hopeless, and both go silent. The frontier you actually care about goes dark.
3:35Bella: So the obvious response is: fine, on a hard problem the model rarely succeeds, so just collect more rollouts. Buy more rods. Eventually one attempt lands, you get disagreement, you get a signal. That's the intuition this paper sets out to demolish — and here's how it does it. They write down a single number for "did this whole group collapse" — the probability that all your rollouts come back the same way. And for independent sampling, it's almost embarrassingly simple. If the model's success rate on a problem is some small probability, the chance all your attempts agree is just the chance they're all wrong, plus the vanishing chance they're all right.
4:18Finn: And the structural fact buried in that simple expression is the cruel one. The probability of getting a useful, mixed group grows only linearly with your budget. But the difficulty pushes against you exponentially. Budget is a weak lever. Hardness is a strong one. That's the lake — rods add, emptiness multiplies.
4:38Bella: Let me put real numbers on it, because they're brutal. Take a problem the model solves one percent of the time. With sixteen rollouts, at least eighty-four percent of your training groups come back uniform — they teach nothing. So you quadruple the budget. Sixty-four rollouts. And you're still wasting more than half of them.
5:00Finn: Quadruple the compute, still wasting over half. On exactly the problems you most wanted to crack.
5:07Bella: And that's the money chart in the paper — the one figure to burn into memory. Plot the fraction of useful, non-collapsed groups against budget. Flat GRPO — independent sampling — flatlines near forty-five percent and just stops improving. You can pour budget in past eight, sixteen, thirty-two rollouts, and the curve does not climb. It's a ceiling, not a slope.
5:31Finn: And the reason that's such a strong opening move is the proof needs no heavy machinery. It's a binomial calculation and a union bound — the kind of thing you can check on a napkin. This isn't a fragile result hiding behind notation. It's almost rude how direct it is.
5:47Bella: So that's the diagnosis, and it reframes the entire problem. The question was never "how many rollouts should I collect." It's "how informative is the set I collect" — and that's not an engineering preference anymore. That's an optimization problem. Finn, this is where your half of the paper starts.
6:06Finn: It does. And the pivot is the cleverest thing in here. Go back to the lake for a second, because the analogy has a loophole built into it. Independent sampling is like sending sixteen people into sixteen separate rooms to solve a hard problem from scratch. If it's hard, you get sixteen identical wrong answers — sixteen empty casts. But what if they shared a whiteboard? Someone gets halfway to an interesting idea, and instead of everyone re-deriving the easy first steps, you spin off several people to push that one promising half-finished idea in different directions. You spend your effort where the branching actually matters. That's the move: stop sampling independently, and grow a tree of attempts instead. Expand a promising partial solution, branch from it, expand again.
6:54Bella: Okay, but the whiteboard hides the hard part, doesn't it? Somebody has to decide which half-finished idea is worth branching from. Sixteen people staring at a whiteboard still need to agree on where to dig.
7:08Finn: That is precisely the hard part, and it's the whole game. Which node in the growing tree do you expand next? Now — most papers would answer that with a hand-tuned heuristic. Some score, a few terms, tune the weights, ship it. This paper refuses to do that, and the way it refuses is the intellectual core of the whole thing. The technical heart is next, and I want to flag where it lands — because it pays off in a theorem from 1978 that hands them a near-optimal algorithm essentially for free, and then in that entropy bonus we promised. So here's the question they ask: is there a known mathematical structure that captures "a set whose value has diminishing returns as you add to it"? And there is. It's called submodularity.
7:53Bella: Give me the plain version of submodular before you build on it.
7:57Finn: It's more intuitive than the name. A quantity is submodular if adding one more item helps more when you have a little than when you already have a lot. Diminishing returns, formally. The first slice of pizza when you're starving beats the eighth slice when you're full. That's it. That's the whole property.
8:17Bella: And why does anyone care that a quantity has diminishing returns?
8:21Finn: Because of the free lunch attached to it. There's a famous result — Nemhauser and colleagues, 1978, nearly fifty years old — that says: if your objective is submodular, then a greedy algorithm is provably near-optimal. Greedy meaning the dumbest possible strategy: just repeatedly grab whatever gives you the biggest immediate gain, never look ahead, never reconsider. For a submodular objective, that stupid one-step rule is guaranteed to land within about sixty-three percent of the best choice you could have made with perfect foresight.
8:55Bella: So if you can prove that "how much this tree of attempts will teach the model" is submodular, you inherit a near-optimality guarantee without having to design a clever algorithm at all. You just go greedy.
9:07Finn: Exactly. The strategy is: prove informativeness is submodular, then let a fifty-year-old theorem do your algorithm design. And to prove it, they build the tree's informativeness out of three ingredients. I'll name them and keep moving — coverage, novelty, and contrast. Coverage asks whether the tree's leaves span the landscape of possible outcomes. Novelty asks whether they diversify which states you've actually visited. And contrast asks how often sibling branches disagree on the final answer.
9:38Bella: And each of those three has diminishing returns on its own.
9:42Finn: Each one individually, yes — and the meaning that matters is just this: if each piece has diminishing returns, the whole combination inherits it. So the full objective is submodular, and the guarantee applies. Now — Bella, this is the foreshadow I want on the record, because it comes back to bite. What they prove is submodular, and what they have the sixty-three percent guarantee on, is this score. This proxy. Whether that score actually tracks how much the model learns is a separate question, and they answer it with a measurement, not a proof. Hold that thought.
10:17Bella: Noted — proven about the score, measured about the learning. We'll come back to it. But first I want the payoff you teased, because this is the part that made me sit up. You take this submodular objective, and you ask the greedy question — which single node gives the biggest immediate gain in informativeness? What falls out?
10:37Finn: A selection rule. They call it UUCB, and it's just the marginal gain of that objective — the answer to "how much does expanding this one node add." Now, on paper it has five terms, and if I read you five terms your eyes glaze. So don't think of it as five terms. Think of it as a hiring manager deciding which candidate gets the final, expensive interview, asking three questions.
11:01Bella: Three questions.
11:03Finn: One: has this candidate done well so far? That's the value term — how good have rollouts from this node been on average. Two: have I under-tested this candidate, so their thin track record might be hiding something? That's a bonus for nodes you've explored less than their neighbors. And three: is the outcome here genuinely up for grabs, rather than obvious? Then subtract what the interview costs you. Promising, under-explored, genuinely uncertain — minus the cost. That's the whole rule.
11:34Bella: And that second question — under-explored, give it the benefit of the doubt — that's not a new invention.
11:40Finn: No, and this is the first satisfying click. That exact term is the upper-confidence-bound rule — UCB — the classic exploration bonus from bandit problems and game-tree search. It's the math behind "give the option you've tried less the benefit of the doubt." They didn't bolt it on. It fell out of the derivative of their objective. The famous exploration rule is just sitting inside the marginal gain, waiting.
12:07Bella: Okay. But that's not the one you've been dangling. Question three. "Is the outcome genuinely up for grabs." What's the math version of that?
12:16Finn: The math version of "genuinely uncertain about what to do next" is entropy. How spread out the model's probabilities are over its next move. High entropy means the model is torn — lots of live options. And measuring how torn the model is, at a node, is the third question.
12:34Bella: And here's the thing people need to feel the weight of. The entropy bonus — rewarding the model for being uncertain, for keeping its options open — is one of the oldest folk remedies in this entire field. Practitioners add it by hand. It's a tuning knob. A hack that "just works," and nobody could quite tell you why from first principles.
12:55Finn: And in this derivation, it isn't a hack. You write down "I want informative rollouts," you take the greedy marginal gain, and the entropy bonus appears — as a term you're forced to have. It's the chef who always added a pinch of salt to the caramel because it tasted better, and then a food scientist works out the chemistry and shows the salt was never optional seasoning. It was suppressing bitterness the whole time. The recipe is structurally worse without it. The practice was right; what changed is it's now derived instead of guessed.
13:30Bella: In this setup, the entropy bonus stops being a knob and becomes a consequence. That is a genuinely beautiful result.
13:38Finn: It is — and I'm going to make myself say the honest version, because it's my flag from earlier coming due. That clean correspondence comes out of a first-order expansion — a linearization that holds under stated assumptions. So it's "the entropy bonus falls out of the math in this regime," not "the entropy bonus is an eternal law of nature." Strong, real, and conditional. We'll see where the condition strains.
14:04Bella: Fair. Let me make all of this tangible, because so far it's been abstract, and the paper has one worked example that's the best thing in it. This is the part to actually watch on screen. It's a real competition math problem — counting ordered triples of integers fitting some condition. The model takes its first twelve rollouts at it. Ten of them confidently say the answer is two hundred four. One lonely rollout says seventeen — which is the correct answer. And one just gives up partway and aborts.
14:37Finn: So under flat GRPO, what happens to that group?
14:40Bella: Here's the subtle part. It's not a flat collapse — there's a tiny bit of spread, the reward standard deviation is around zero-point-two-seven, so the group technically survives. But the signal is dangerously lopsided. Ten-to-one toward the wrong answer, with one lone correct voice. Under GRPO that single correct rollout gets a big positive advantage against a mostly-negative group — so the whole update ends up dominated by one outlier. You're betting your entire training step on a single sample, which is exactly the brittle, high-variance situation you don't want.
15:17Finn: So it scrapes past collapse, but barely, and on a knife's edge.
15:21Bella: On a knife's edge. Now watch what the selection rule does instead. Picture the tree on screen. UUCB scores every node, and the node it lights up — the one it says is most worth expanding — is not a leaf. It's the exact point partway through the reasoning where the model switches strategy. Where it stops trying to brute-force enumerate every triple and starts setting up generating functions — a real mathematical shift to a completely different approach.
15:51Finn: It found the fork. The pedagogically important fork.
15:55Bella: It found the fork. The single moment where the problem actually turns, and where the model is genuinely uncertain which way to go — high value, under-explored, high entropy, all three questions screaming yes at once. So you branch off that node and push the generating-functions path. And here's the honest outcome — it is not a landslide. Of the four new attempts, one lands on the correct answer, seventeen; two still come back two-oh-four; one aborts. The group still holds more wrong answers than right ones. So it does not win by majority vote. What wins is where the credit lands: that fork is a spot where sibling branches sharply disagree, so the correct branch earns a strong positive advantage right there — and the wrong branches get penalized — and that's what flows back to reinforce generating functions over brute force. Flat GRPO, averaging one nearly-flat group, would have had nothing that sharp to reward.
16:56Finn: And that's the whole pipeline in one problem. The selector didn't just find a right answer hiding in the noise — it found the conceptual branch point and put a sharp, well-aimed learning signal exactly there, where flat GRPO's averaging would have smeared it away. That's a much better story than "we sampled more and got lucky."
17:18Bella: So let's check whether the story survives contact with nine benchmarks. And the cleanest result — the one that goes straight at the theorem — is the gradient signal itself. Remember the empty-lake chart, flat GRPO pinned at forty-five percent useful groups, never climbing? They run their system — they call it InfoTree — against that same axis. A naive fixed tree, branching without the smart selector, already pulls up to around seventy-two percent. And InfoTree tracks the theoretical ceiling to within four points. It closes almost exactly the gap the theorem opened.
17:56Finn: And there's a more direct confirmation, isn't there — they measure the actual nudge.
18:02Bella: They do, and this is the closest thing to proof that the method does what the theory promised. If you're injecting more informative rollouts, the model should get a bigger update each step. So they measure the size of that update directly. InfoTree produces roughly four times the gradient signal per group compared to flat GRPO. Four times the nudge, same budget.
18:26Finn: Predicted by the theory, confirmed on the machine. That's the kind of result I trust, because it's not a leaderboard number, it's the mechanism showing its work.
18:36Bella: And it does turn into leaderboard numbers. The biggest single win is on GAIA — that's the web-search agent benchmark, the model browsing and using tools to answer hard research questions. Plus eleven points over flat GRPO. And the efficiency story is just as good: on competition math, the number of training steps to hit a target accuracy drops from two hundred forty-eight to one hundred thirty-eight. Roughly a one-and-a-half-times speedup — and that's despite the tree-building adding overhead per step.
19:08Finn: Which it does — building a tree is sequential, you expand, then decide, then expand again, and that's slower than firing off independent samples in parallel. They claw most of it back with a speculative trick — workers guess at expansions on slightly stale information and roll back the bad guesses — and it cuts the wall-clock overhead from about fourteen percent down to under five, with accuracy basically untouched. I'll leave the gears in the paper. The point is the speedup survives.
19:39Bella: And there's one result in here that's a real eyebrow-raiser, and I want to set it up properly. We've spent this whole episode saying the selection rule wasn't hand-tuned — it was derived. So here's the natural challenge: if you're so confident in your hand-derived formula, why not just let a neural network learn the selector instead? Throw a small model at thousands of logged expansions, let it figure out the best scoring rule from data. Machine learning's whole promise is that learned beats hand-designed.
20:11Finn: And normally it does. So who wins?
20:14Bella: The hand-derived formula wins. The learned selector scores nearly a point below the analytic UUCB. The closed-form rule that fell out of the math beats the black box trained specifically to beat it.
20:27Finn: Which is a lovely little vindication of the whole thesis. The formula isn't a heuristic the network could improve on — it's the right answer, and the network can only approximate it.
20:39Bella: And only after all that do they pull the object apart to show each piece earns its keep. Knock out the contrast term — the "do siblings disagree" ingredient — and the rate of useful mixed-outcome groups collapses. Knock out the cost penalty and the model starts making about thirty percent more tool calls than it needs. Knock out the depth penalty and the trees grow uselessly deep. No two components fail the same way, which is exactly the argument for keeping all three.
21:09Finn: That's a clean ablation — each piece guards a different door. And there's a small rescue module I'll mention in one breath: for prompts whose tree is about to collapse anyway, a tiny network predicts whether one extra high-temperature shot would flip it from uniform to mixed, and fires only when it's worth it. It rescues the borderline cases — pushing useful groups from roughly three-in-five to three-in-four — for under five percent extra budget. Nice engineering, not the headline.
21:40Bella: So that's the case for the paper. Finn, you've been sitting on the objection since the middle. Time to make it.
21:47Finn: Time to make it. And I want to be precise, because the paper is mostly honest about this — it's the framing around it that leans triumphant. Here's the gap. Everything we just celebrated — the sixty-three percent near-optimality guarantee, the whole "provable" spine — is proven about the proxy score. The submodular objective. Not about the thing you actually care about, which is how much the model really learns, the gradient mass.
22:14Bella: And the bridge between the two is a measurement.
22:17Finn: A single correlation. Around point-eight, between the proxy score and actual gradient quality, measured at one checkpoint, on five hundred prompts. Now — point-eight is a respectable correlation. But think about what it means for the headline. It's like having a map that's certified accurate... of a slightly different city than the one you're standing in. The two cities have been shown to have very similar street layouts. But "the map is provably accurate" and "the map is provably accurate for your city" are different claims, and the distance between them is exactly that resemblance — strong, but measured, not guaranteed.
22:56Bella: So the theory is real, it just doesn't reach all the way to the finish line the abstract implies.
23:02Finn: Right. And it gets sharper, because we can name where the approximation should strain — and the paper's own stress test confirms it. Remember the entropy bonus only falls out under a linearization. That linearization should degrade on deep, narrow trees — long reasoning chains that don't branch until very late. And when they test exactly that regime, the greedy selector's margin over a smarter two-step lookahead shrinks from about one-and-a-half points down to a third of a point. The guarantee is weakest precisely in the regime that hard, long-horizon agentic tasks may actually live in.
23:39Bella: That's the one that gives me pause, honestly. The whole motivation is hard problems, and hard problems tend to be long chains.
23:47Finn: And two more, quickly, on calibration. First, baselines — that eleven-point GAIA win is over flat GRPO, the naive method. Against the strongest tree-based competitor, the margin on competition math is under a point, well inside the noise of some entries. Beating the empty lake by eleven is not the same as beating the field by eleven, and the episode shouldn't let those blur. And second — the entire analysis assumes sparse, verifiable rewards. Right or wrong, plus-one or minus-one, nothing in between. The moment you have richer, graded rewards, the authors concede their advantage shrinks — because graded rewards create disagreement on their own. They do anti-collapse work for free. On their own dense-reward test the margin roughly halves.
24:34Bella: Which is fair, and to their credit they say it out loud. So let me concede the shape of it: the guarantee is on the proxy, the strongest comparisons are against the weakest baseline, and it's all inside the verifiable-reward world.
24:49Finn: And I'll hold there — I don't think the correlation closes that last gap, and I don't think the deep-narrow-tree result is a footnote. It's a crack right where the load is heaviest.
25:01Bella: And yet — even granting every one of those — what the paper establishes still stands, and it's bigger than the method. So let me land the real takeaway, because it's not "InfoTree wins on GAIA." It's the reframe. For years, "all my rollouts came back the same" was a nuisance practitioners grumbled about. This paper turns that grumble into a clean impossibility result — you cannot sample your way out of an empty lake — and then imports a fifty-year-old piece of combinatorics to say what to do instead. And the most satisfying consequence is that a folk-wisdom hack, the entropy bonus, gets a principled derivation. Even if it's conditional, that's the kind of result that changes how people think about the problem, not just how they benchmark it.
25:51Finn: The deeper claim being: scaling up rollout count eventually just stops helping. The next gains don't come from more rods. They come from spending the same budget where the branching actually matters.
26:04Bella: So here's the question for you watching. Should agentic reinforcement learning get smarter about which attempts it collects — the way this paper does — or is the deeper problem the sparse, right-or-wrong reward itself, and we should be moving past that entirely? Pick a side and tell us in the comments.
26:24Finn: The full annotated version of this episode is on paperdive dot AI — every technical term tap-to-define, with links to the related papers grouped by theme, from the original GRPO work to that 1978 submodular theorem, plus our weekly and monthly roundups.
26:41Bella: Quick housekeeping on the way out: this script was written by Anthropic's Claude Opus 4.8, Finn and I are both AI voices from Eleven Labs, and the producer isn't affiliated with either company. The paper is "Maximizing Rollout Informativeness under a Fixed Budget," published May sixth, 2026; we recorded this on June twenty-third.
27:03Finn: So if your training run feels like it's burning compute and learning nothing — check whether you're casting into an empty lake. The trick is knowing where the branching actually is.