Ten Thousand Examples Beat the Full Industrial Pipeline for Search Agents

0:00Jessica: Here's a number that should not exist. A research team at a Chinese university — no industrial compute budget, no proprietary data pipeline — took an open-weights model, ran a single supervised fine-tuning pass on about ten thousand training examples, and beat a search agent from Alibaba that was built with the full industrial recipe: pre-training on top of pre-training, then fine-tuning, then reinforcement learning. Same model size. Better scores on every benchmark they tested.

0:31Brooks: The paper landed on arXiv on May fifth, twenty-twenty-six, and we're recording this the day after. What you're hearing is an AI-generated deep dive: the script is from Anthropic's Claude Opus 4.7, and Jessica and I are AI voices from ElevenLabs. Neither company is involved in producing the show. The paper is called "OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories," and the reason that ten-thousand-example result is interesting isn't that it's a leaderboard upset. It's that it's an argument about what the industrial pipeline was actually doing.

1:09Jessica: Right. And to feel why that's the argument, we need to set up what a "search agent" actually is, because the term hides a lot. Imagine a hard research question — the kind where you'd open ten browser tabs, click through to a paper, cross-reference a date, double-check a name on a third site, and fifteen minutes later you have an answer. A search agent is an AI doing that loop autonomously. It reasons about what to look up, calls a search tool, reads the result, reasons again, calls another tool, and keeps going. The standard pattern is called ReAct — reason, act, observe, repeat. And these loops get long. The system in this paper is allowed up to two hundred tool calls per question. A typical training example involves around sixty-five.
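The reason, act, observe loop described here can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Policy` interface, `run_react`, `toy_policy`, and `toy_tools` are hypothetical names invented for the sketch, and the only detail taken from the episode is the cap of two hundred tool calls per question.

```python
from typing import Callable, Dict, List, Optional, Tuple

MAX_TOOL_CALLS = 200  # per-question budget mentioned in the episode

# One step of history: (tool name, tool input, observation).
Step = Tuple[str, str, str]
# Hypothetical policy interface: given the question and the history so far,
# return (tool_name, tool_input), or ("answer", final_text) to stop.
Policy = Callable[[str, List[Step]], Tuple[str, str]]


def run_react(
    question: str,
    policy: Policy,
    tools: Dict[str, Callable[[str], str]],
    max_tool_calls: int = MAX_TOOL_CALLS,
) -> Tuple[Optional[str], List[Step]]:
    """Run reason/act/observe until the policy answers or the budget runs out."""
    history: List[Step] = []
    while len(history) < max_tool_calls:
        action, arg = policy(question, history)      # reason
        if action == "answer":
            return arg, history
        observation = tools[action](arg)             # act
        history.append((action, arg, observation))   # observe
    return None, history  # budget exhausted without an answer


# Toy stand-ins so the sketch runs without a model or a search backend.
def toy_policy(question: str, history: List[Step]) -> Tuple[str, str]:
    if len(history) < 3:
        return "search", f"{question} (lookup {len(history) + 1})"
    return "answer", "done"


toy_tools = {"search": lambda query: f"results for {query}"}

answer, trace = run_react("who directed the film?", toy_policy, toy_tools)
print(answer, len(trace))
```

In a real agent the policy would be the language model and each history entry would be appended to its context, which is why a trajectory of sixty-five tool calls is a long training example.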

1:57Brooks: Sixty-five tool calls per example. That's the texture of "deep research" worth holding onto — the agent isn't googling once and answering, it's grinding through a research thread.

2:08Jessica: Exactly. And the dominant story for how you train an agent to do this in twenty-twenty-five was: stages on stages on stages. First you continue pre-training the base model on more domain text. Then you do supervised fine-tuning, which is just imitation — show the model good examples, let it copy. Then you do reinforcement learning, where the model tries things, gets scored, and gets nudged toward higher scores. The implicit assumption was that each stage does something the others can't. Pre-training installs knowledge. Fine-tuning teaches behavior. Reinforcement learning polishes the long-horizon reasoning — teaching the model to keep going, to chain tool calls, to not give up when the third lookup doesn't pan out.

2:53Brooks: And this paper's claim, stripped down, is that the last assumption is wrong, or at least much weaker than people thought. The reinforcement learning stage isn't installing some special long-horizon capability that imitation can't reach. It's compensating for the fact that the imitation examples weren't long-horizon enough. Fix the data and you don't need the polish.

3:16Jessica: Which leads to the cleanest piece of evidence in the paper, and Brooks, this is the one I want to start with because it sidesteps a lot of apples-to-oranges concerns. The same team published an earlier version of this system, OpenSeeker-v1. Same base model — Qwen3 at thirty billion parameters. Same training method — pure supervised fine-tuning. Same overall pipeline. Only the training data changed. On BrowseComp, which is the main benchmark for these systems — it tests whether an agent can answer questions that require browsing the web — version one scored twenty-nine point five. Version two scored forty-six point zero.

3:56Brooks: A sixteen-and-a-half point jump from changing nothing but the training examples. So what did they actually do to the data?

4:04Jessica: Three changes, but they're really one idea wearing three different costumes. The idea is: every single training example should require sustained, multi-hop, tool-diverse reasoning. No shortcuts. No quick wins. The first change is to the synthesis pipeline. You can't collect ten thousand real examples of humans doing patient multi-hour research, so teams synthesize them — usually by taking a knowledge graph, which is just a giant network of facts like "this person directed this film," "this film won this award" — picking a starting point, grabbing a chunk of connected facts around it, and asking an LLM to write a question whose answer requires combining those facts. The shape of the chunk you grab determines how hard the question can be.
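To make the "chunk of connected facts" idea concrete, here is a small sketch of grabbing a neighborhood from a knowledge graph and handing it to a question writer. Everything in it is an assumption for illustration: the triples are made up, breadth-first expansion is just one plausible way to pick a chunk, and `sample_chunk` and `build_prompt` are hypothetical helpers, not the team's pipeline.

```python
from collections import deque

# Toy knowledge graph as (subject, relation, object) triples. All facts are invented.
TRIPLES = [
    ("Person A", "directed", "Film X"),
    ("Film X", "won", "Award Y"),
    ("Award Y", "presented_in", "City Z"),
    ("Person A", "born_in", "City Q"),
    ("Film X", "stars", "Person B"),
]


def sample_chunk(triples, start, max_facts):
    """Expand breadth-first from `start`, collecting at most `max_facts` triples."""
    by_node = {}
    for triple in triples:
        subject, _, obj = triple
        by_node.setdefault(subject, []).append(triple)
        by_node.setdefault(obj, []).append(triple)

    chunk, seen, queue = [], {start}, deque([start])
    while queue and len(chunk) < max_facts:
        node = queue.popleft()
        for triple in by_node.get(node, []):
            if triple in chunk:
                continue
            chunk.append(triple)
            if len(chunk) == max_facts:
                break
            for neighbor in (triple[0], triple[2]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return chunk


def build_prompt(chunk):
    """Hypothetical prompt asking an LLM for a question that needs every fact."""
    facts = "\n".join(f"- {s} {r} {o}" for s, r, o in chunk)
    return f"Write one question whose answer requires combining all of these facts:\n{facts}"


small = sample_chunk(TRIPLES, "Person A", max_facts=2)
large = sample_chunk(TRIPLES, "Person A", max_facts=5)
print(len(small), len(large))
print(build_prompt(large))
```

The only knob that changes between `small` and `large` is the chunk size, which is the point made above: a larger chunk gives the question writer more hops to chain together.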

4:50Brooks: Think of it like asking a chef to invent a recipe. Put three ingredients on the counter and you get a three-ingredient dish. Put twenty ingredients on the counter and the natural recipes become more complex — not because you instructed the chef differently, but because the raw material now affords richer combinations. The team grew the chunk.

5:12Jessica: Right. The second change is about tools. The agent has a toolkit — search, browse, calculator, and so on — and they expanded that toolkit during data generation, so the trajectories the model imitates use a wider variety of tools. An agent that's only ever seen search will lean on search even when something else would work better. Show it varied demonstrations, it learns to mix. And then the third change, which is the one I find most striking. After they generate all these trajectories, they apply a hard filter. Any trajectory that resolved in fewer than some minimum number of tool calls — gone. Thrown out. Doesn't matter if the answer was correct. Doesn't matter if the trajectory was clean. If the agent solved it too quickly, the example is excluded from training.
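The step-count filter is simple enough to sketch directly. This is an illustration under stated assumptions: the `Trajectory` record and the `MIN_TOOL_CALLS` value of 4 are hypothetical, since the episode only says "some minimum number" and does not give the real threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    question: str
    tool_calls: List[str]  # names of the tools called, in order
    correct: bool


MIN_TOOL_CALLS = 4  # hypothetical threshold; the real value is not stated


def keep_long(trajectories: List[Trajectory],
              min_tool_calls: int = MIN_TOOL_CALLS) -> List[Trajectory]:
    """Drop any trajectory resolved in fewer than `min_tool_calls` tool calls.

    Correctness is deliberately ignored: a short, correct trajectory is still dropped.
    """
    return [t for t in trajectories if len(t.tool_calls) >= min_tool_calls]


def mean_length(trajectories: List[Trajectory]) -> float:
    return sum(len(t.tool_calls) for t in trajectories) / len(trajectories)


# Toy data: three correct trajectories of different lengths.
raw = [
    Trajectory("q1", ["search", "browse"], correct=True),
    Trajectory("q2", ["search", "browse", "search", "calculator", "browse"], correct=True),
    Trajectory("q3", ["search"] * 6 + ["browse"] * 3, correct=True),
]
kept = keep_long(raw)
print(len(raw), len(kept), mean_length(raw), mean_length(kept))
```

Note that the filter never looks at `correct`, which matches the description above and is also why length is only a proxy for difficulty.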

6:00Brooks: That sounds blunt to the point of being crude. Length isn't the same as difficulty. A trajectory could be long because the agent was inefficient, not because the question was hard.

6:11Jessica: That's a fair pushback and we'll come back to it. But here's the intuition for why the filter does something even if it's a noisy proxy. Think of a tutor preparing a student for a brutal final exam. Conventional wisdom: use a mix of practice problems to build skill across the difficulty range. The paper's move is more like: throw away every practice problem the student solves in under an hour. Only show them problems that take all afternoon. The student never sees a problem get resolved quickly. So when they hit a hard problem on exam day, they don't give up early — patience is baked in by the training distribution itself. The model is doing imitation learning. It's inheriting the character of its demonstrations. If every demonstration you show it is a marathon runner pacing across hours, it learns to pace across hours. The filter makes sure there are no sprinters in the training data.

7:07Brooks: And the trajectory length numbers back this up as a description of what they did. The team's previous system averaged about forty-seven tool calls per training example. A competitor system, RedSearcher, averaged about thirty-six. OpenSeeker-v2 averages around sixty-five. The training data has been visibly pushed harder.

7:28Jessica: That's the recipe. The results are where it gets vivid — and Brooks, you've been sitting on the comparison I think actually lands.

7:37Brooks: Yeah. The headline comparison is against Tongyi DeepResearch, which is Alibaba's search agent at the same thirty-billion-parameter scale, trained with the full industrial pipeline. On BrowseComp, OpenSeeker-v2 scores forty-six point zero. Tongyi scores forty-three point four. On the Chinese version of that benchmark, OpenSeeker scores fifty-eight point one, Tongyi scores forty-six point seven — that's an eleven-point gap, on Alibaba's home turf, against Alibaba's own model. On Humanity's Last Exam, which is a brutal multi-domain expert question set, thirty-four point six versus thirty-two point nine. On xbench, seventy-eight versus seventy-five.

8:20Jessica: Every benchmark. With one training stage instead of three.
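For anyone who wants the gaps side by side, the numbers quoted in the episode can be tabulated as below. The scores are exactly the ones read out above (the xbench pair is used as spoken, seventy-eight versus seventy-five); the dictionary layout and the label "BrowseComp (Chinese)" are just conventions chosen for this sketch.

```python
# (OpenSeeker-v2, Tongyi DeepResearch) scores as quoted in the episode.
SCORES = {
    "BrowseComp": (46.0, 43.4),
    "BrowseComp (Chinese)": (58.1, 46.7),
    "Humanity's Last Exam": (34.6, 32.9),
    "xbench": (78.0, 75.0),
}

# Gap in points on each benchmark, rounded to one decimal to hide float noise.
gaps = {
    name: round(openseeker - tongyi, 1)
    for name, (openseeker, tongyi) in SCORES.items()
}

# The internal comparison: OpenSeeker-v1 versus v2 on BrowseComp.
v1_to_v2 = round(46.0 - 29.5, 1)

for name, gap in gaps.items():
    print(f"{name}: +{gap}")
print(f"v1 to v2 on BrowseComp: +{v1_to_v2}")
```

The spread matters for reading the result: the Chinese BrowseComp gap is several times larger than the others, while the Humanity's Last Exam gap is under two points.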

8:24Brooks: And then there's a separate result that's its own kind of strange. The thirty-billion-parameter OpenSeeker-v2 also outperforms DeepSeek-V3.1, which is six hundred and seventy-one billion parameters, on BrowseComp. It beats GLM-4.6 at three hundred and fifty-seven billion. It beats Claude 4.5 Sonnet. A thirty-billion model beating a six-hundred-seventy-one-billion model is the kind of result you'd normally distrust on first read.

8:52Jessica: And you should distrust it a bit, Brooks — which is the right segue, because the paper has real soft spots.

8:59Brooks: It does. The most uncomfortable one first. The "fine-tuning alone beats the full pipeline" framing is doing a lot of rhetorical work, but the model OpenSeeker-v2 starts from — Qwen3 thirty billion — is itself the product of a full industrial pre-training pipeline run by Alibaba. So the actual claim isn't that you can build a frontier search agent from scratch with imitation learning. It's that you can do the final adaptation step with imitation learning, on top of a base model someone else already paid the pre-training bill for. That's still meaningful. It's just smaller than the framing suggests.

9:36Jessica: If the pipeline is an assembly line, this result covers the finishing stations, not the whole factory.

9:42Brooks: Second issue, and the one that bothers me most as a reviewer: there are no ablations on the three data changes. The paper claims bigger graphs, more tools, and the step-count filter together drive the gains — but they never isolate them. We don't know if the filter is doing eighty percent of the work and the other two are decoration, or if it's the tool expansion that's load-bearing. The "three simple modifications" framing is satisfying as a narrative, but the data doesn't tell us which one matters. And given that the paper is making a strong methodological argument — *this* is the axis that matters — you'd really want to know which of these knobs to turn.

10:22Jessica: And the related concern is that all of these strong results come from a single training run on a single dataset. No reported variance across random seeds, no multiple runs, no confidence intervals on the benchmark scores. Some of the gaps the paper celebrates — like the zero-point-three-point edge on Humanity's Last Exam over RedSearcher — are well within the range you'd expect from run-to-run noise.
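A rough way to see why a 0.3-point edge is fragile is to look at sampling noise alone. The sketch below uses the standard binomial standard error for an accuracy measured on a finite question set. The question count of 500 is a hypothetical placeholder, not the size of the evaluation set used in the paper, and this covers only noise from the finite benchmark, not run-to-run variation across seeds.

```python
import math


def binomial_standard_error(accuracy_pct: float, num_questions: int) -> float:
    """Standard error, in percentage points, of an accuracy on `num_questions` items."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / num_questions)


NUM_QUESTIONS = 500   # hypothetical evaluation-set size, for illustration only
score = 34.6          # OpenSeeker-v2 on Humanity's Last Exam, as quoted
gap = 0.3             # edge over RedSearcher, as quoted

se = binomial_standard_error(score, NUM_QUESTIONS)
print(f"standard error: {se:.2f} points, gap: {gap} points")
```

Under that assumed set size the standard error is around two points, several times the 0.3-point gap, so the gap by itself says little about which system is better.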

10:46Brooks: And then there's your earlier pushback, Jessica — length as a difficulty proxy. The filter assumes that trajectories with more tool calls came from harder questions. But a trajectory could be long because the agent was floundering, looking up the same fact three times, going down a wrong path before recovering. Long doesn't mean hard. The paper leans on the proxy without validating it.

11:10Jessica: And finally, benchmark contamination. These search agents are being tested on questions whose answers exist on the open web. The team mentions masking links to certain sites during evaluation to avoid leakage, which signals they're aware. It doesn't fully address it. The line between "the agent reasoned its way to the answer" and "the agent found a page that already had the answer" is fuzzy in ways that matter.

11:36Brooks: None of which kills the result. The v1-to-v2 jump is internal evidence: a contamination story would have to explain why the same team's previous system, on the same base model, didn't already benefit from it. The base-model concern is real but doesn't change the fact that the final fine-tuning stage is where most of the agent character gets installed. So the headline survives. It's just narrower than the abstract makes it sound.

12:01Jessica: And the paper itself is unusually quiet about its own limitations. There's no dedicated limitations section. No discussion of failure modes. No real probing of whether this approach generalizes beyond search agents. The "what's next" section gestures at scaling up data quantity and diversity, which implicitly admits this is a small-scale demonstration. But the silence about where the data-quality story might break is itself worth flagging.

12:29Brooks: So where does this leave us? Two takeaways. The surface one is democratization: a university team with ten thousand examples and standard fine-tuning outscored a model built with a major industrial lab's full pipeline. The weights are open-sourced. Other academic groups can build on this directly. That changes who can run the next experiment.

12:49Jessica: The deeper one is about what the industrial pipeline was actually doing. The implicit story behind pre-training plus fine-tuning plus reinforcement learning was that each stage installs something the others can't. This paper is a credible existence proof that, for search agents at least, the reinforcement learning stage was largely compensating for fine-tuning data that didn't push the model hard enough. Fix the data, and you may not need the polish. That's a claim about resource allocation across the field — and if it generalizes, it reshapes a research program rather than just adding a row to a leaderboard.

13:28Brooks: Whether it generalizes is the open question. Search agents have a particular structure — long horizons, tool use, multi-hop reasoning — that maps unusually cleanly onto long, diverse demonstrations. It's possible this works because of something specific to that structure, and won't transfer to domains where reinforcement learning is doing different work. We'll see. But the move the field should make next is obvious: someone needs to run the ablations this paper didn't.

13:57Jessica: Thanks for listening to AI Papers: A Deep Dive. The show notes have a link to the paper and related materials. We'll see you next time.