When RL Actually Teaches Agents Something New, And When It Doesn't

0:00Bella: Here's the result that wasn't supposed to be possible. Take a seven-billion parameter language model. Take that same model after reinforcement learning. Hand both of them a research question that requires looking up two things in sequence — what's the nationality of the director of this film, that kind of thing. Give each model exactly one shot. The base model wins, slightly. Thirty-six percent versus thirty-four. So far, nothing surprising. Now give each model sixty-four independent tries and ask whether at least one of them got the answer right. The base model climbs to seventy-seven percent. The RL model climbs to eighty-one. And the gap is still widening at the right edge.

0:44Brooks: That widening gap, Bella, is the exact opposite of what a widely-cited paper said happens when you do this. The pessimistic story from last year — Yue and colleagues at NeurIPS twenty-twenty-five — was that RL doesn't teach a model anything new. On math problems, if you plot the base model and the RL model against more and more attempts, the curves converge. The base catches up. RL just makes you more reliable at things you already kind of knew how to do. This paper says that conclusion was task-dependent — and they built a new measurement tool to prove it.

1:20Bella: The paper is "Does RL Expand the Capability Boundary of LLM Agents? A pass-at-k-T Analysis," from a team at Fudan, the Chinese University of Hong Kong, and Waterloo. It went up on arXiv in mid-April twenty-twenty-six, and we're recording in early May. Quick ground rules before we dig in: this episode is AI-generated. The script came from Anthropic's Claude Opus 4.7, and Brooks and I are AI voices from Eleven Labs. Neither company is involved in producing the show. With that out of the way — the reason task-dependence matters here isn't just that there's a new finding. It's that the authors run what I'd call the cleanest causal experiment on this question I've seen.

2:03Brooks: Right. And before we get to the experiment, we should make sure the prior result is clear, because the whole episode hinges on contrasting it. The Yue paper looked at math reasoning. They asked a basic question: when RL improves benchmark accuracy, is the model now solving problems it couldn't solve before, or is it just solving more reliably the problems it could already occasionally crack? Two very different things, both compatible with "scores went up."

2:33Bella: And the way they distinguished them was elegant. Imagine a student who's getting a C on a hard exam. Two roads to a B. You could make the student more reliable — same knowledge, fewer careless errors, consistently lands the problems they actually understand. Or you could expand what they know — teach them new material, so problems that were previously impossible become reachable.

2:58Brooks: Both raise the grade.

2:59Bella: Both raise the grade. But if you give the first student unlimited tries on each problem, they eventually approach a ceiling — the ceiling of what they actually understand. If you give the second student unlimited tries, the ceiling itself has moved. That's the test. Pass-at-k at large k is your unlimited-tries check. If the curves converge — base catches up — RL was efficiency. If the curves diverge — RL pulls away — RL was capability.
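
The converge-versus-diverge test can be seen in a toy model. The sketch below assumes each problem has a fixed per-attempt success probability p, so k independent attempts succeed at least once with probability 1 - (1 - p)^k. The ten-problem "exam" and all probabilities are invented for illustration; nothing here comes from either paper.

```python
# Toy model: each problem has a per-attempt success probability p.
# With k independent attempts, P(at least one success) = 1 - (1 - p)**k.

def expected_pass_at_k(probs, k):
    """Average, over problems, of the chance that at least one of k attempts succeeds."""
    return sum(1 - (1 - p) ** k for p in probs) / len(probs)

# Hypothetical 10-problem exam (all numbers are made up for illustration).
base        = [0.2] * 6 + [0.0] * 4                 # can sometimes solve 6 of 10
reliability = [0.6] * 6 + [0.0] * 4                 # same 6 problems, fewer slips
capability  = [0.2] * 6 + [0.05] * 2 + [0.0] * 2    # 2 previously impossible problems become reachable

for k in (1, 4, 16, 64, 256):
    print(
        k,
        round(expected_pass_at_k(base, k), 3),
        round(expected_pass_at_k(reliability, k), 3),
        round(expected_pass_at_k(capability, k), 3),
    )
```

In this toy, the "reliability" student leads at small k but the base curve catches up to the same ceiling, while the "capability" student's gap over base grows with k because the ceiling itself is higher.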

3:28Brooks: And on math, Yue's curves converged. Cleanly. At sixty-four samples, at two-hundred-fifty-six, the base model basically tied the RL model. That's where the pessimistic reading came from: every problem the RL model could solve, the base model could also solve, given enough rolls of the dice. RL was sampling efficiency in a fancy outfit.

3:50Bella: So this paper's authors look at that result and ask a different question: does it hold when the model isn't doing math, but acting as an agent? Because agents have something math problems don't. They get to interact. Search, read, search again based on what they read. Compose actions across multiple rounds.

4:10Brooks: And that's where re-sampling stops being a substitute. Bella, this is the hinge of the whole paper, right? The intuition is that some tasks have an internal structure that more attempts just can't recover.

4:23Bella: Exactly. Think of it like a treasure hunt where every clue is in a sealed envelope, and you can't open envelope two until you've read envelope one. No matter how many friends you send out at the same time, none of them can open envelope two without first opening envelope one. Sequential dependence. You can't parallelize your way around it.

4:45Brooks: And HotPotQA — the multi-hop benchmark this paper uses — happens to have two flavors of question that map exactly onto this distinction. Comparison questions: "Who was born earlier, Einstein or Darwin?" Two facts, but you can look them up in any order, independently. Bridge questions: "What's the nationality of the director of this film?" The second query literally cannot be formed until you've read the answer to the first.

5:12Bella: Bridge questions are the sealed envelopes. And that's where the standard pass-at-k metric breaks down. Pass-at-k treats every attempt as a black box — one shot, in, out. It can't see whether the model got to chain steps or not. So the authors propose a metric with two axes instead of one. Call it pass-at-k-T. K is what it always was — how many independent attempts. T is new — how many rounds of interaction the model gets per attempt.

5:40Brooks: Two boundary cases anchor the definition. At T equals zero, the metric collapses back to ordinary pass-at-k. As k goes to infinity, meaning unlimited attempts, you're asking the cleanest possible question: can this model ever solve this problem at depth T? That's the capability boundary. Capability expansion gets a precise meaning: there exists a problem the RL model can eventually solve at depth T that the base model cannot solve at depth T no matter how many times you re-roll.
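
One way to write these definitions down, assuming attempts are independent. The notation is introduced here for illustration and may differ from the paper's: \pi is a policy, x a problem, and p^{\pi}_{T}(x) the probability that a single attempt with at most T interaction rounds answers x correctly.

```latex
\[
\text{pass@}(k,T)(\pi, x) \;=\; 1 - \bigl(1 - p^{\pi}_{T}(x)\bigr)^{k}
\]
\[
\mathcal{C}_T(\pi) \;=\; \Bigl\{\, x : \lim_{k\to\infty} \text{pass@}(k,T)(\pi, x) = 1 \,\Bigr\}
\;=\; \bigl\{\, x : p^{\pi}_{T}(x) > 0 \,\bigr\}
\]
\[
\text{capability expansion at depth } T
\;\iff\;
\mathcal{C}_T(\pi_{\mathrm{RL}}) \setminus \mathcal{C}_T(\pi_{\mathrm{base}}) \neq \emptyset
\]
```

With T = 0 there are no interaction rounds, so the first line is just ordinary pass-at-k. In practice the limit is approximated by the largest sampling budget actually run, which in this episode is sixty-four attempts.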

6:11Bella: And to estimate that honestly without running infinite rollouts, they borrow the unbiased estimator from the original Codex paper. The math is hypergeometric — we don't need it. The intuition is just: I ran sixty-four attempts, here's what would happen if I ran any smaller k of them. You can compute that without bias.
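
For reference, the estimator Bella mentions has a short closed form: with n attempts of which c succeeded, the estimate is 1 - C(n - c, k) / C(n, k). The sketch below is a plain-Python version; the function name and argument checks are choices made here, and applying it per depth T (counting only successes among attempts capped at T rounds) is an assumption about how it would be used for pass-at-k-T.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts with c successes.

    Equals 1 - C(n - c, k) / C(n, k): the probability that a uniformly
    random size-k subset of the n attempts contains at least one success.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 64 attempts on one problem, 3 of them correct (made-up counts).
for k in (1, 4, 16, 64):
    print(k, pass_at_k(64, 3, k))
```

Averaging this per-problem value over the test set gives one point on the curve; repeating it for each k traces the whole curve from a single batch of sixty-four rollouts.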

6:33Brooks: OK, so we have a metric. What did they actually do with it?

6:37Bella: They set up a head-to-head with one base model (Qwen-two-point-five seven-billion-parameter instruct, the foundation a lot of agentic RL work uses) and three categories of task. Category A is pure math, MATH-500, no tools. That's a negative control: if RL on tool-use somehow makes you better at math too, we want to see it. Category B is comparison questions. Category C is bridge questions. Two-hundred training problems, one hundred test problems per category, one search tool over a small text corpus.

7:11Brooks: And here's the part of the design that's load-bearing. They train two variants of the model on the same two-hundred problems — but with two different learning signals. One variant gets supervised fine-tuning on expert demonstrations: literal worked-out solution trajectories built from HotPotQA's gold supporting-fact annotations. So SFT sees, in detail, how an expert chains the searches and reasons over them. The other variant gets reinforcement learning — GRPO, the algorithm from the DeepSeekMath paper — and the only signal it gets is binary correct or incorrect on the same two-hundred problems.

7:52Bella: Identical training data. Different feedback structure. SFT is "watch and copy." RL is "try stuff, keep what worked."
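
To make "try stuff, keep what worked" concrete: GRPO scores each rollout relative to the other rollouts sampled for the same question, by subtracting the group's mean reward and dividing by its standard deviation. The sketch below shows only that normalization step with binary rewards. It is not the paper's training code; the group size, the use of population standard deviation, and the small `eps` are assumptions made here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, not a confirmed detail
    return [(r - mu) / (sigma + eps) for r in rewards]

# Eight hypothetical rollouts on one bridge question: 1 = correct final answer.
rewards = [0, 0, 1, 0, 0, 1, 0, 0]
advantages = group_relative_advantages(rewards)
print(advantages)

# If every rollout in a group gets the same reward, there is nothing to compare.
print(group_relative_advantages([0, 0, 0, 0]))
```

Correct rollouts get a positive advantage and incorrect ones a negative advantage, so the update pushes probability toward whatever the model itself tried that worked. A group where every rollout fails (or every rollout succeeds) yields zero advantage and no learning signal for that question.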

8:00Brooks: It's like identical twins given different coaches. Same starting ability, same two-hundred practice problems. One twin gets shown the worked solutions and studies them. The other twin only gets told "right" or "wrong" after each attempt. If they end up with different skills, it can't be because they saw different material — it has to be the kind of feedback they got. That's the causal teeth.

8:28Bella: Right. And that design makes the asymmetric outcome the paper finds genuinely informative. Three categories, three models — let's walk the results.

8:38Brooks: Math first, since it's the negative control. RL on tool-use does effectively nothing on MATH-500. The RL model and the base model are within noise across all k and T. So tool-use RL doesn't accidentally make the model better at parametric reasoning. Sanity check passed, move on.

8:58Bella: On comparison questions, the easier of the two multi-hop types, both SFT and RL help modestly. That's the picture you'd expect. Both training signals work, the curves all sit above base, and the gaps stay roughly stable as k grows.

9:13Brooks: And then bridge questions. This is where everything happens. Bella, the headline number here is the one to land.

9:21Bella: On the hundred bridge test questions, the RL agent solves five problems that the base model cannot solve at any of sixty-four tries with five rounds of search. And it loses one problem the base model could solve. Net plus four. SFT — same two-hundred training problems, just trained by imitation instead of reward — gains three new problems and loses seven.

9:47Brooks: Loses seven.

9:47Bella: Right — seven that the base model used to handle on its own. SFT, on the hardest tasks, made the agent worse than not training it at all.

9:57Brooks: Let me sit with that for a second, because it's the result that shouldn't happen. If you take a base model and you show it expert demonstrations of exactly the task you want it to do, the floor expectation is "no harm, maybe a little help." Showing the model how shouldn't be a regression. And on the same two-hundred problems where binary reward made things better, watching expert solutions made things worse.

10:25Bella: And the asymmetry between the two trained models is even sharper than the comparison to base. RL solves nine bridge problems that SFT cannot. SFT solves only one that RL cannot. Nine to one. Same training data.

10:38Brooks: That's the cleanest number in the paper. It tells you something specific is happening with the learning signal — not the data, not the base model, not the tool. The signal.
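
The gained, lost, and nine-to-one numbers are all set differences between "problems solved at least once in sixty-four tries." The sketch below builds one arrangement of problem IDs that is consistent with every count quoted so far. The IDs themselves are hypothetical; only the set sizes match the episode.

```python
# Problem IDs are hypothetical; only the set sizes match the episode.
base = set(range(0, 77))                          # 77 of 100 solved at k = 64
rl = (base - {0}) | {77, 78, 79, 80, 81}          # loses 1, gains 5
sft = (base - set(range(0, 7))) | {77, 78, 82}    # loses 7, gains 3

def gained_lost(trained, reference):
    """Return (# solved only by trained, # solved only by reference)."""
    return len(trained - reference), len(reference - trained)

print("RL vs base: ", gained_lost(rl, base))
print("SFT vs base:", gained_lost(sft, base))
print("RL vs SFT:  ", gained_lost(rl, sft))
```

The arithmetic hangs together: plus four for RL and minus four for SFT relative to base implies a net difference of eight between the two trained models, which is exactly nine minus one.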

10:49Bella: Now, the curve shape on bridge questions is the visual that makes the capability-versus-efficiency distinction pop. At k equals one — single shot — the base model is at thirty-six percent, RL is at thirty-four. RL is slightly behind. At k equals four, the curves cross. By k equals sixty-four, RL is at eighty-one, base is at seventy-seven. And the gap is still widening at the right edge.

11:15Brooks: It's like two runners on a track. One starts ahead. The other has a finishing kick. Except the race here isn't run over distance; it's run over sampling budget. On math, the runner who started ahead, the RL model at k equals one, gets caught by the base model as the budget grows. They cross the line together. On bridge, the base model starts ahead, the RL model overtakes it, and the gap keeps stretching.

11:38Bella: That widening-as-k-grows is what capability expansion looks like measured directly. The base model is not getting there with more rolls of the dice. The RL model is genuinely past a boundary the base model is stuck behind.

11:52Brooks: OK, so we have the result. The next question is: what is RL actually doing differently? Because if SFT on expert demonstrations makes the model worse, and RL on the same two-hundred problems makes it better — without ever showing the model an expert solution — something specific has to be going on under the hood.

12:15Bella: And the mechanism story is the part I find most satisfying, because it doesn't say "RL discovered new search strategies." It says almost the opposite. RL is reweighting, not replacing.

12:28Brooks: Unpack that.

12:28Bella: Think of the base model as a chef with a repertoire. Some recipes for solving these bridge questions are good, some mediocre. The base model has the whole repertoire, but uses them in some default mix. RL doesn't teach the chef new recipes. It observes which recipes happen to lead to dishes the diners — meaning the verifier — actually like, and shifts the chef's habits to use those recipes more often. Same recipes. Different distribution.

13:00Brooks: And SFT?

13:00Bella: SFT is like firing the chef and replacing them with a new one who only knows the three recipes the expert demonstrator showed. If those three recipes are perfect for the test, fine. If the test rewards flexibility, you've just lost forty-seven recipes for the price of three. On bridge questions, that's roughly what seems to happen.

13:24Brooks: And the paper has receipts for this story — three diagnostics that all point the same direction. Bella, walk through them.

13:32Bella: First one is strategy diversity. You take each model, run sixty-four attempts on a given bridge problem, and count how many distinct query sequences the model produces. Base model: about forty unique sequences out of sixty-four, so high diversity. RL model: forty-five. Slightly more diverse than base, not less. SFT model: fifteen. Roughly a threefold collapse.

13:57Brooks: SFT collapsed onto a narrower set of strategies. RL didn't.
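
Counting distinct strategies is a small computation once each rollout is reduced to its ordered list of search queries. The sketch below is one plausible way to do it; treating queries as equal after lowercasing and trimming whitespace is an assumption made here, and the example queries are placeholders rather than data from the paper.

```python
def distinct_strategies(rollouts):
    """Count unique query sequences among one problem's rollouts.

    Each rollout is a list of search-query strings in the order issued.
    """
    normalized = {
        tuple(query.strip().lower() for query in queries)
        for queries in rollouts
    }
    return len(normalized)

# Four hypothetical rollouts on one bridge question.
rollouts = [
    ["director of film X", "nationality of person Y"],
    ["Director of film X ", "nationality of person Y"],   # same strategy, different casing
    ["film X", "director of film X", "nationality of person Y"],
    ["nationality of person Y"],
]
print(distinct_strategies(rollouts))
```

How strictly two sequences are considered "the same" changes the count, so any comparison across models needs the same normalization applied to all of them.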

14:02Bella: Right. Second diagnostic is a perplexity probe. You take the trajectories the RL agent produces — the search queries, the reasoning over the retrieved paragraphs — and you ask the original base model how surprising it finds those tokens. Low perplexity means "I would have generated this." High means "I would not have."

14:23Brooks: And the punchline?

14:24Bella: The split is what's interesting. On the search queries themselves — the actual things the RL agent types into the search box — the base model finds them mildly surprising at most. Roughly the kinds of queries the base model would have written. But on the reasoning over what comes back from the search — the part where the model integrates the retrieved paragraphs and works out the answer — the base model finds those tokens substantially more surprising. The novelty isn't in the searching. The novelty is in what the model does with what it finds.
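
The probe boils down to computing perplexity separately over two kinds of tokens. Perplexity is the exponential of the negative mean token log-probability. The sketch below assumes the base model's log-probabilities for the RL agent's tokens have already been obtained and tagged by segment; the tags and the numbers are hypothetical.

```python
from math import exp

def perplexity(logprobs):
    """exp of the negative mean log-probability (natural log)."""
    return exp(-sum(logprobs) / len(logprobs))

def perplexity_by_segment(scored_tokens):
    """Group (segment, logprob) pairs by segment and score each group."""
    buckets = {}
    for segment, logprob in scored_tokens:
        buckets.setdefault(segment, []).append(logprob)
    return {segment: perplexity(lps) for segment, lps in buckets.items()}

# Hypothetical base-model log-probs for tokens in one RL trajectory.
scored_tokens = [
    ("query", -0.4), ("query", -0.6), ("query", -0.5),
    ("reasoning", -1.8), ("reasoning", -2.2), ("reasoning", -2.0),
]
print(perplexity_by_segment(scored_tokens))
```

A higher value on the reasoning segment than on the query segment is the pattern Bella describes: the base model would have written similar queries but not similar reasoning.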

15:00Brooks: That's a more specific claim than I expected. RL didn't teach better searching. It taught better reading.

15:07Bella: Or more precisely: RL preserved the base model's existing ability to search well, and reweighted toward search trajectories whose downstream reading happens to nail the answer. And the third diagnostic backs this up: they look at how many of the trained models' query sequences are entirely novel from the base model's perspective. For SFT, almost ninety-eight percent of its query sequences never appear in any base-model trajectory. The base distribution has been replaced. For RL, only about eighty-four percent, meaning roughly one in six of RL's query sequences are still recognizably base-like. RL stayed closer to the base distribution and shifted weight within it.
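
The novelty number is a membership check: what share of a trained model's query sequences never show up among the base model's. The sketch below counts over all rollouts with exact matching; whether the paper counts over rollouts or over unique sequences, and how it matches queries, are not stated in the episode, so both are assumptions. The sequences are placeholders.

```python
def novel_fraction(trained_sequences, base_sequences):
    """Share of trained-model query sequences absent from the base model's set."""
    base_seen = {tuple(seq) for seq in base_sequences}
    trained = [tuple(seq) for seq in trained_sequences]
    return sum(seq not in base_seen for seq in trained) / len(trained)

# Hypothetical query sequences (letters stand in for search queries).
base_sequences = [["a", "b"], ["a", "c"]]
trained_sequences = [["a", "b"], ["d"], ["a", "c", "e"], ["x", "y"]]
print(novel_fraction(trained_sequences, base_sequences))
```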

15:50Brooks: Reweighting versus replacement. That's the line.

15:54Bella: And the version of this I keep coming back to is the paper's own phrasing — that RL has "effectively traded efficiency for capability: giving up a small amount of single-shot reliability in exchange for the ability to solve additional problems that lay outside the base agent's capability set at any sampling budget." That's the curve crossing. Slightly worse at one shot. Materially better at many.

16:20Brooks: OK. Now I want to push, because the paper is more honest about its limits than most, and a couple of the caveats are not trivial.

16:28Bella: Go ahead, Brooks — what's the strongest skeptical case?

16:32Brooks: First one is scale. We are talking about two-hundred training problems. One seven-billion model. One retrieval tool over a tiny corpus. Roughly fifty-five hours on a single GPU for the full RL run. This is a clean methodological demonstration, not a definitive empirical claim. The authors say so explicitly. If you scaled this to a fourteen-billion or seventy-billion model, with web-scale retrieval and a heterogeneous tool ecosystem, would the qualitative picture survive? We don't know.

17:04Bella: That's fair. And the SFT regression specifically might be partly a small-data overfitting artifact. Two-hundred expert trajectories is not many for sequential composition. A larger expert set might rescue SFT. The paper can't speak to that.

17:20Brooks: Second concern is the temperature confound, and this is the one I think is most worth flagging. Every evaluation in the paper runs at temperature zero-point-seven. RL training is known to lower a model's output entropy — the trained model becomes more confident, more deterministic. That mechanically inflates pass-at-k at low k, and can deflate it at high k by reducing diversity. The authors do not run a temperature sweep. They acknowledge it as the most urgent missing robustness check.

17:52Bella: Which means some fraction of the curve-shape difference might be entropy effects rather than capability boundaries.
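
The link between temperature and entropy that this concern rests on can be shown directly. Sampling temperature divides the logits before the softmax, and lower temperature concentrates probability on the top choices. The sketch below sweeps temperature over one made-up set of next-token logits; it illustrates the general effect only and says nothing about how the paper's models would behave under a sweep.

```python
from math import exp, log

def softmax(logits, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    weights = [exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical next-token logits
sweep = {t: entropy(softmax(logits, t)) for t in (0.3, 0.7, 1.0, 1.3)}
for temperature, h in sweep.items():
    print(temperature, round(h, 3))
```

Because a trained model can start from a sharper distribution than the base model, evaluating both at a single temperature such as zero-point-seven does not put them at matched entropy, which is why a sweep is the natural robustness check.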

17:59Brooks: Exactly. The most aggressive version of the skeptical reading is: "you found an entropy effect, and you're calling it a capability boundary." I don't think that's the right reading — the matched-data SFT-versus-RL asymmetry is hard to explain by entropy alone, because both are trained variants and SFT also reduces entropy — but until the temperature sweep is run, the alternative explanation isn't cleanly closed.

18:25Bella: Brooks, what about the mechanism diagnostics — are those airtight?

18:29Brooks: Mostly. The strategy-diversity counts and the perplexity probe are pretty robust. The piece I'd flag as weakest is what they call the cross-policy swap — they tried to causally attribute the gain by mixing one model's search trajectory with another model's final reasoning, to see which piece is doing the work. The result they report is a mild tilt toward the retrieval planning mattering. But all four conditions in that experiment cluster within the binomial noise band on the available sample size. So the causal-attribution piece of the mechanism story is the thinnest leg. The perplexity and diversity diagnostics are doing more work.

19:09Bella: And the effect sizes on the headline result?

19:11Brooks: Modest in absolute terms. Net plus four problems out of one hundred. The bootstrap confidence interval on the capability-expansion statistic is something like two to eight problems — well above zero, but a wide band around the five-to-one point estimate. The narrative — "RL expands the capability boundary!" — is doing more rhetorical work than the magnitudes warrant on this one model and benchmark.
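
A bootstrap over problems is one common way to get an interval like this. The sketch below resamples the hundred test problems with replacement and recomputes the net gain each time. It is a generic percentile bootstrap under assumptions made here (the statistic, the resample count, the seed); the episode does not say exactly which statistic the paper bootstraps, so this should not be expected to reproduce the quoted interval.

```python
import random

def bootstrap_sum(outcomes, n_resamples=2000, seed=0):
    """Percentile-bootstrap 95% interval for the sum of per-problem outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    stats = []
    for _ in range(n_resamples):
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample))
    stats.sort()
    low = stats[int(0.025 * n_resamples)]
    high = stats[int(0.975 * n_resamples) - 1]
    return sum(outcomes), low, high

# Per-problem outcome: +1 = solved only by RL, -1 = solved only by base, 0 = no change.
outcomes = [1] * 5 + [-1] * 1 + [0] * 94
point, low, high = bootstrap_sum(outcomes)
print(point, low, high)
```

With only a handful of non-zero outcomes among a hundred problems, any interval of this kind will be wide relative to the point estimate, which is the substance of Brooks's caution.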

19:36Bella: And bridge questions specifically might be a regime that uniquely rewards what RL is doing here. HotPotQA bridge questions have a fairly stereotyped structure — the bridge entity is usually extractable with light reasoning over the first paragraph. So "RL teaches better integration of retrieved text" is exactly the affordance the benchmark rewards. Whether this generalizes to other compositional task families is genuinely open.

20:05Brooks: All of which is to say: the claim is not "RL definitely expands capability for agents in general." The claim is "the pessimistic conclusion from math reasoning does not transfer to even this one tool-use setting, and we now have a measurement framework that can detect the difference." That's a narrower claim than the title suggests, but it's still a real one.

20:28Bella: And it's a useful reframing of the broader debate. The Yue paper isn't wrong. Their result holds — on math reasoning. This paper is also right — on bridge questions. The two findings are consistent if you accept that RL is a reweighting mechanism that lands different places depending on what the task structure rewards. On math, the base distribution already contains enough mass on correct reasoning chains that reweighting doesn't extend the boundary. On compositional tool use, the base distribution contains a long tail of strategies whose downstream reasoning would solve the problem if only it got picked more often — and that's exactly the regime where reweighting can land mass on genuinely new solutions.

21:14Brooks: Same mechanism. Different observable. Depending on task structure.

21:18Bella: That's the reconciliation. And the practical implication for people building agents is the part that translates immediately. If you're choosing how to spend a training budget on a compositional tool-use task — collecting expert demonstrations for SFT, or setting up an RL pipeline with verifiable rewards — those two roads were not as substitutable as they looked. On the same two-hundred problems in this paper, the RL road materially outperformed the SFT road. And SFT actively hurt on the hardest examples.

21:51Brooks: That's a real recommendation, even at this scale. With the obvious asterisk that this is one base model on one benchmark, and we should expect the picture to refine as the work scales up.

22:02Bella: The deeper contribution, I think, is the framing. Capability isn't one-dimensional for an agent. It's parametrized by how many tries and how deep each try is allowed to go — and those two axes interact in ways pass-at-k alone can't see. That move — saying out loud that we need a two-axis metric for compositional behavior — is going to outlive this specific seven-billion result.

22:27Brooks: And it's the move that lets the paper tell a coherent story instead of just an empirical one. Without that framework, "RL helps on bridge questions" reads as a narrow benchmark win. With it, the same finding reads as: here's a regime where the prior consensus story has a measurable gap, and here's the tool that lets you see it.

22:48Bella: The show notes have a link to the paper and related materials, if any of this caught you. Thanks for listening to AI Papers: A Deep Dive.