Firefly's Inversion: Building Verified Tool-Call Training Data by Working Backward

0:00Juniper: Here's a small, uncomfortable fact about the way we've been training AI agents to use tools. You take a strong language model, you ask it to invent a task — book me a flight, look up this domain, query that database — and then you ask the same model to write out the sequence of tool calls that would solve it. You collect millions of these. You feed them to a smaller model. You ship it. And the entire time, nobody has actually checked that any of those tool calls work, against any real API, anywhere.

0:31Finn: The model is hallucinating both sides of the exam. It's inventing the question, inventing the answer, and there are no graders in the room.

0:40Juniper: Right. And the paper we're digging into today goes after exactly that problem with a move that, once you see it, is hard to unsee. It's called "Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs," it went up on arXiv on May seventeenth, twenty-twenty-six, and we're recording three days later. Quick ground rules before we get into it: this episode is AI-generated, the script is from Anthropic's Claude Opus 4.7, and you're hearing two AI voices from ElevenLabs. I'm Juniper, that's Finn, and the show isn't affiliated with either company. With that out of the way, the move Firefly makes is an inversion. Don't write a test and hope your student can solve it. Watch a student solve real problems, then write the test backward from what they did.

1:29Finn: Back up one step for me, Juniper, because I want to make sure we land the problem before the fix. When you say "tool-calling," what's the actual thing the model is being trained to do?

1:41Juniper: Picture an assistant that doesn't just talk about booking a flight — it actually hits the airline's API. It emits a structured request: function name, arguments, the whole envelope. The API answers. The model reads the answer and decides what to do next — maybe call a second tool, maybe combine outputs, maybe answer the user. A trajectory is a chain of those calls, sometimes branching, sometimes feeding one tool's output into another's input. That's the skill. And to teach a model that skill, you need examples — thousands of them — of tasks paired with correct sequences of calls and correct final answers.
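
Note: the "envelope" and "trajectory" Juniper describes can be pictured as a small data structure. The sketch below is illustrative only. The class names, the tool names (search_flights, get_fare), and the recorded values are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One structured request plus the response the API returned."""
    name: str                    # function name
    arguments: dict[str, Any]    # arguments sent to the tool
    output: Any = None           # recorded response


@dataclass
class Trajectory:
    """An ordered record of tool calls (hypothetical structure)."""
    calls: list[ToolCall] = field(default_factory=list)

    def record(self, name: str, arguments: dict[str, Any], output: Any) -> ToolCall:
        call = ToolCall(name, arguments, output)
        self.calls.append(call)
        return call


# Two chained calls: the second call's input comes from the first call's output.
traj = Trajectory()
first = traj.record("search_flights", {"origin": "SFO", "dest": "JFK"},
                    {"flight_id": "XY123"})
traj.record("get_fare", {"flight_id": first.output["flight_id"]}, {"usd": 412})
```

The point of keeping the output next to each call is that everything later in the pipeline can be read off this record instead of being guessed.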

2:20Finn: And the field's been trying to manufacture those examples synthetically because humans are way too slow and expensive to write a million of them by hand.

2:29Juniper: Exactly. The standard recipe goes forward. A strong model invents a plausible-sounding user task. Then it writes the tool calls that would supposedly solve it. Fast, scalable, and quietly broken — because the tools might not exist the way the model imagined them, the arguments might be wrong, the API might respond completely differently than the model assumed, and the "correct answer" was never verified against anything that actually happened. You end up training on confident-looking fiction.

3:01Finn: And the obvious alternative — just use real APIs and verify everything — runs into its own wall.

3:08Juniper: Real APIs change. Stock prices move. Search results drift. Servers go offline mid-training run. You cannot do reinforcement learning in an environment where the same call returns different things every time, or where your training data source vanishes during epoch four. So the question Firefly is trying to answer is: how do you get training data that's grounded in real APIs, verifiably correct, and stable enough to actually train against — all three at once?

3:37Finn: Okay. So the inversion. Walk me through what they do differently.

3:41Juniper: They flip the order. Instead of generate-then-execute, they execute-then-generate. They take a strong model — Claude Sonnet 4.5 in their setup — and they set it loose on about a thousand real tools across two hundred and forty real servers. And they tell it: just explore. Call tools. Chain outputs into inputs. Build little networks of real executions. Log every single call's input and output to a file. Don't worry about what task you're solving yet — just produce real, working trajectories.

4:14Finn: So at this point there's no "task" at all. There's just a record of: I called tool A with these arguments, it returned this, then I fed that into tool B, which returned this.

4:26Juniper: A flowchart of recorded calls. That's the raw artifact. Then — only then — does the system go back and ask: given this trajectory I just observed, what natural-language question would it answer? And it writes the question from the recorded outputs. The ground-truth answer is just whatever the tools returned, read off the log. The trajectory was real, so the task is provably solvable. The answer came from observed outputs, so the label is correct by construction.

4:57Finn: That's the move. Correctness isn't something you check after the fact. It's a property of how the data got made.

5:05Juniper: And there's a sentence in the paper that captures the whole thesis: this back-chaining construction makes label correctness a property of the data-generation process rather than a post-hoc filtering problem. Most synthetic data work treats validation as the last stage — generate, then filter. Firefly's inversion makes the filter unnecessary for the core label, because the label was extracted from a real execution.

5:33Finn: Give me the concrete example. The one in the paper that makes this click.

5:38Juniper: This is my favorite one, Finn, because it's almost embarrassingly simple. There's a whois lookup tool — you give it a domain name, it gives you back the registration record. During exploration, the AI made two calls to that tool. One for amazon.com, one for netflix.com. The recorded outputs included registration years: nineteen ninety-four for Amazon, nineteen ninety-seven for Netflix.

6:03Finn: And then?

6:04Juniper: And then the system looks at those two recorded outputs and writes a task. The task is: which domain was registered first, amazon dot com or netflix dot com, and how many years apart were they registered? The answer is sitting right there in the log. Amazon, nineteen ninety-four. Netflix, nineteen ninety-seven. Three years apart. That's the example. Two real calls, two recorded outputs, and a question written backward from what the tools actually returned.
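
Note: the whois example can be written as a few lines of code. This is a minimal sketch, assuming a log format and field names (registration_year, first_registered) invented for illustration. In the paper a language model writes the question from the recorded outputs; here a fixed string template stands in for that step.

```python
# Recorded exploration log. The years are the ones quoted in the episode;
# the dictionary layout is a hypothetical stand-in for the real log format.
log = [
    {"tool": "whois", "args": {"domain": "amazon.com"},
     "output": {"registration_year": 1994}},
    {"tool": "whois", "args": {"domain": "netflix.com"},
     "output": {"registration_year": 1997}},
]


def back_chain(log):
    """Write a task backward from two recorded whois calls."""
    a, b = log
    year_a = a["output"]["registration_year"]
    year_b = b["output"]["registration_year"]
    dom_a, dom_b = a["args"]["domain"], b["args"]["domain"]
    question = (
        f"Which domain was registered first, {dom_a} or {dom_b}, "
        "and how many years apart were they registered?"
    )
    # The label is read off the log, never predicted.
    answer = {
        "first_registered": dom_a if year_a <= year_b else dom_b,
        "years_apart": abs(year_a - year_b),
    }
    return question, answer


question, answer = back_chain(log)
print(question)
print(answer)
```

Nothing in back_chain consults a model for the answer, which is the sense in which the label is correct by construction.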

6:33Finn: And the beauty is — there's no universe in which that label is wrong. You're not asking a language model to guess what whois would say. You ran whois. The answer is whatever whois said.

6:45Juniper: And that's the entire central insight of the paper compressed into one example. Everything else — and there's a lot of everything else — is engineering to make that work at scale.

6:57Finn: Right, so let's go there. Because "set a model loose on a thousand tools" sounds great in a sentence, but if you actually do that naively, ninety-nine percent of the calls are going to be nonsense. You'd be passing weather forecasts into stock tickers, feeding image URLs into SQL queries. How do they keep the exploration from being mostly garbage?

7:19Juniper: This is your half of the episode, isn't it.

7:22Finn: Probably. The piece that does the heavy lifting here is what the authors call a tool compatibility graph, and I think the cleanest way to picture it is as a recipe book. Imagine you've got a thousand kitchen techniques — chopping, sautéing, whisking, deglazing — and for every pair of techniques, you want to know whether the output of one could plausibly become the input of the next. A whisked egg can go into a hot pan. A sautéed onion cannot be whisked. Most pairs don't compose. A few do. The recipe book is just a giant table of: which technique can follow which.

7:59Juniper: And in Firefly's case, the "recipe book" is built by asking a language model, for every pair of tools, whether the output schema of tool A could realistically feed into the input schema of tool B.

8:11Finn: With a confidence score. They run this judgment across all the pairs. They end up with about eighty-three thousand directed edges in the graph, sixty-four thousand of them at medium-or-higher confidence. Average of eighty-eight plausible successors per tool. And — this matters — the edges cross server boundaries, which is why almost forty percent of the final tasks end up requiring tools from multiple servers chained together.

8:39Juniper: So when the strong model is exploring, it isn't picking the next tool at random from a thousand options. It's picking from the small neighborhood the graph says is plausible.

8:50Finn: Right. It starts at some tool with at least two high-confidence successors — so the trajectory can branch — and then at every step it sees a random subset of compatible next tools. It can call multiple in parallel, chain one output into the next call's input, or combine outputs from earlier nodes. They run this until they hit a budget. The output is roughly ten thousand of these little execution graphs. Each one is a real, working trajectory against real servers.
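
Note: a toy version of graph-guided exploration, under several assumptions. The tool names, confidence values, and thresholds are made up, the walk is a single chain where the real trajectories can branch and merge, and a random choice stands in for the model deciding which tool to call next.

```python
import random

# Hypothetical compatibility graph: tool -> {successor: confidence}.
GRAPH = {
    "whois": {"dns_lookup": 0.9, "ssl_check": 0.8, "weather": 0.1},
    "dns_lookup": {"ip_geolocate": 0.85},
    "ssl_check": {},
    "ip_geolocate": {},
    "weather": {},
}


def successors(tool, min_conf=0.5):
    """Tools whose input could plausibly take this tool's output."""
    return [t for t, c in GRAPH.get(tool, {}).items() if c >= min_conf]


def start_tools(min_branches=2, min_conf=0.8):
    """Tools with enough high-confidence successors to allow branching."""
    return [t for t in GRAPH if len(successors(t, min_conf)) >= min_branches]


def explore(rng, budget=4, subset_size=2):
    """Walk the graph until the step budget or a dead end is reached."""
    current = rng.choice(start_tools())
    path = [current]
    while len(path) < budget:
        options = successors(current)
        if not options:
            break
        # Only a random subset of compatible tools is shown at each step.
        shown = rng.sample(options, min(subset_size, len(options)))
        current = rng.choice(shown)  # stand-in for the model's decision
        path.append(current)
    return path


print(explore(random.Random(0)))
```

In this toy graph the low-confidence edge from whois to weather is never followed, which is the behavior Finn describes: the explorer only sees the plausible neighborhood.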

9:20Juniper: And then the back-chaining happens on top of each graph — pick a subset of nodes, write a task whose answer can be read off those nodes' outputs, treat the other nodes as distractors so the model has to figure out which calls actually matter.

9:35Finn: With one detail that's load-bearing for the training step later: every task comes with a structured answer schema. A little JSON template — named fields, like "amazon_registration_year" or "years_apart" — and a natural-language answer template using those same fields. That structure is what makes binary rewards possible during RL, because you can do field-level exact matching instead of fuzzy text comparison.
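
Note: field-level exact matching is simple to sketch. The field names and the template wording below are illustrative assumptions, and the fallback to an LLM judge that comes up later in the episode is deliberately left out.

```python
# Hypothetical structured answer and natural-language template for the whois task.
REFERENCE = {"first_registered": "amazon.com", "years_apart": 3}
ANSWER_TEMPLATE = (
    "{first_registered} was registered first, "
    "{years_apart} years before the other domain."
)


def binary_reward(predicted: dict, reference: dict) -> int:
    """Return 1 only if every named field matches exactly, else 0."""
    if predicted.keys() != reference.keys():
        return 0
    return int(all(predicted[k] == reference[k] for k in reference))


def render(answer: dict) -> str:
    """Fill the natural-language template from the structured fields."""
    return ANSWER_TEMPLATE.format(**answer)


print(binary_reward({"first_registered": "amazon.com", "years_apart": 3}, REFERENCE))
print(binary_reward({"first_registered": "amazon.com", "years_apart": 2}, REFERENCE))
print(render(REFERENCE))
```

Because the comparison is per field and exact, there is no partial credit and no fuzzy text similarity to tune.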

10:01Juniper: After all that, an LLM judge filters out the tasks that aren't well-specified — anything ambiguous, anything where the answer isn't really determined by the trajectory, anything with opaque database IDs in the answer. About half the candidates fail. They end up with around five thousand validated tasks.

10:20Finn: Five thousand one hundred and forty-four, if we're being precise. Which is small compared to some other synthetic tool-use datasets, but the pitch is: every single one of these is verified against a real execution.

10:33Juniper: Okay. So now we have a dataset. We've got tasks, we've got ground-truth trajectories, we've got structured answers. But we still can't just train against live APIs, because the APIs will drift and the training loop will break. So how do they actually run the reinforcement learning?

10:50Finn: This is the third piece, and it's the part that I think is the cleverest engineering move in the paper, even though the inversion is the headline. They build a simulator. Picture a flight simulator versus a real plane. You don't train pilots on real planes because every flight has different weather and the consequences of mistakes are catastrophic. A flight simulator replays recorded conditions so every trainee can fly the exact same approach to the exact same runway in the exact same crosswind, over and over.

11:22Juniper: And Firefly does the same thing for API calls.

11:25Finn: Every call that was ever made during exploration sits in a cache. During RL training, the agent doesn't talk to the live servers. It talks to the simulator. The agent emits a call — function name, arguments — and the simulator checks: do I have this exact call in my cache? If yes, replay the recorded response. If no, do I have something similar? If yes, improvise from the nearest historical examples. If neither — return an error.
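
Note: the three tiers Finn lists can be sketched as a lookup with two fallbacks. This is not the paper's implementation. The class, the similarity threshold, and the use of string similarity are assumptions, and where the real simulator improvises a response from nearby examples, this sketch simply replays the nearest recorded response for the same tool.

```python
import json
from difflib import SequenceMatcher


class Simulator:
    """Replay recorded tool calls: exact match, then nearest match, then error."""

    def __init__(self, records, threshold=0.6):
        # records: iterable of (tool_name, arguments, response) from exploration
        self.cache = {self._key(n, a): r for n, a, r in records}
        self.threshold = threshold  # hypothetical similarity cutoff

    @staticmethod
    def _key(name, args):
        # Sorted keys so argument order does not change the cache key.
        return name + ":" + json.dumps(args, sort_keys=True)

    def call(self, name, args):
        key = self._key(name, args)
        if key in self.cache:
            return {"tier": "exact", "response": self.cache[key]}
        # Look for the most similar recorded call to the same tool.
        best_key, best_score = None, 0.0
        for cached_key in self.cache:
            if not cached_key.startswith(name + ":"):
                continue
            score = SequenceMatcher(None, key, cached_key).ratio()
            if score > best_score:
                best_key, best_score = cached_key, score
        if best_key is not None and best_score >= self.threshold:
            return {"tier": "fuzzy", "response": self.cache[best_key]}
        return {"tier": "error", "response": {"error": "no recorded data for this call"}}


sim = Simulator([("whois", {"domain": "amazon.com"}, {"registration_year": 1994})])
print(sim.call("whois", {"domain": "amazon.com"})["tier"])
print(sim.call("whois", {"domain": "amazon.co"})["tier"])
print(sim.call("weather", {"city": "Paris"})["tier"])
```

The property that matters for training is determinism: the same call always produces the same response, whichever tier resolves it.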

11:54Juniper: And there's a number from the actual training run that I think tells you something important. When they ran the trained agent through the simulator, forty-two percent of its calls were resolved by exact cache match. Fifty-eight percent by the fuzzy fallback. And zero percent fell into the "I don't have any data for this" bucket.

12:15Finn: Which is interesting in two directions. The authors present that zero percent as evidence of coverage — the cache was complete enough that the agent never hit a wall. And that's a reasonable read. The skeptic's read, which we should also voice, is that the agent has learned to stay close to the training distribution. It's never trying anything genuinely novel. We'll come back to that.

12:42Juniper: With the simulator in place, the actual RL is almost the boring part. They use a method called GRPO — the only thing the listener needs to know about it is that the model attempts each task eight times, the attempts that get the right answer get rewarded, and the parameters drift toward the strategies that worked. No critic network, no value function — just sibling attempts competing against each other.

13:08Finn: And one nice touch: once the model gets all eight attempts right on a particular task, that task gets dropped from training. So the model spends its compute on the things it still finds hard.
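
Note: the "sibling attempts competing" idea is usually written as a group-relative advantage, where each attempt's reward is compared with the mean of its group. The sketch below uses the commonly described normalization by the group's standard deviation. Whether the paper uses exactly this variant is an assumption, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Score each attempt relative to its siblings on the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every attempt got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_for_training(rewards):
    """Drop a task once every attempt in the group is correct."""
    return not all(r == 1 for r in rewards)


rewards = [1, 0, 0, 0, 1, 0, 0, 0]  # eight attempts, two correct
advantages = group_advantages(rewards)
print([round(a, 2) for a in advantages])
print(keep_for_training(rewards), keep_for_training([1] * 8))
```

When all eight attempts succeed, every advantage is zero and the task carries no learning signal, which is one way to see why dropping solved tasks costs nothing.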

13:21Juniper: Three hundred training steps. Batch size sixteen. Base model is Qwen 3, four billion parameters, the "thinking" variant. And then — Finn, this is where you've been waiting to land your number, isn't it?

13:33Finn: The four-billion-parameter model goes from twenty-eight point one percent on the held-out test set to forty-one point five percent after RL training. That is within a point of Claude Sonnet 4.6, which scores forty-two point two on the same test. A four-billion-parameter open model, essentially tied with a flagship frontier model on a tool-calling benchmark.

13:55Juniper: That's the headline. And at pass-at-eight — where the model gets eight shots and you count it as solved if any of them works — the trained four-billion model actually pulls ahead. Fifty-two point eight versus Sonnet's fifty point three. It starts approaching Claude Opus.
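
Note: pass-at-k as Juniper defines it, counted directly. The results table is toy data, not the paper's. Using only the first attempt for k equal to one is a simplification, since evaluations often average over all sampled attempts.

```python
def pass_at_k(attempts_per_task, k):
    """Fraction of tasks solved by at least one of the first k attempts."""
    if not attempts_per_task:
        raise ValueError("need at least one task")
    solved = sum(1 for attempts in attempts_per_task if any(attempts[:k]))
    return solved / len(attempts_per_task)


# Toy results: one row per task, one boolean per attempt.
results = [
    [False, False, True, False, False, False, False, False],
    [True] * 8,
    [False] * 8,
    [False, False, False, False, False, False, False, True],
]

print(pass_at_k(results, 1))
print(pass_at_k(results, 8))
```

The gap between the two numbers measures how often the model can solve a task but does not do so reliably.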

14:12Finn: Now. This is where I want to be careful, because the headline number is real but it deserves some scrutiny. Juniper, what's the catch on the in-distribution result?

14:22Juniper: The catch is the simulator. The Firefly test set is evaluated through the same simulator the model was trained against. And the simulator returns errors for calls that are far outside the cached region. Which means the evaluation environment is, by design, very kind to behavior that stays inside the training distribution. The Claude baselines are evaluated under the same protocol, but Claude doesn't know which calls are in the cache and which aren't. So there's at least a structural reason to suspect the home-field advantage is doing some work.

14:56Finn: Which is why the transfer benchmarks are the more honest signal. They take the trained four-billion model and they run it on benchmarks where the simulator isn't in the loop at all. Tau2-Bench Retail, which is a multi-turn customer service environment. Tau2-Bench Airline. Some MCPMark file system and database tasks.

15:16Juniper: And the gains are real, but smaller. Tau2-Bench Retail goes from forty-nine percent to about sixty-three. Airline from thirty-six to about fifty-two. MCPMark File System from forty percent to sixty. Postgres Easy from seventy to eighty. These are genuine improvements on benchmarks the model never saw, evaluated against real environments, and they're particularly interesting because the Firefly training data is all single-turn — one user message, then the model has to figure out the right multi-step tool sequence. The fact that it transfers to multi-turn dialogue benchmarks suggests it's learning something like real tool-use skill, not just memorizing the training distribution.

16:02Finn: Though a hard-nosed skeptic could still ask: how much of the transfer is genuine tool-use competence, versus learning to produce structured outputs in a particular format that happens to be rewarded across these benchmarks too? The paper doesn't fully separate those.

16:21Juniper: And on one of the transfer tasks — Tau2-Bench Telecom — the improvement is essentially in the noise. Eighteen point nine to twenty point four. So it's not uniform.

16:32Finn: There's a bigger structural worry I want to put on the table too, Juniper, because it's not in the headline critique but I think it's the most interesting one. Nearly every quality decision in this entire pipeline is being made by a strong LLM judge. The edges in the tool compatibility graph — LLM judgment. The task validation step that filters half the candidates — LLM judgment. The simulator's fuzzy-match fallback, when there's no exact cache hit — LLM judgment. The reward signal, when the structured field-matching fails — LLM judgment.

17:08Juniper: It's LLM judges all the way down.

17:11Finn: And none of those stages are independently validated against human annotation in the paper. The whole edifice rests on the assumption that Claude Sonnet 4.5 — which is what's doing all that judging — is well-calibrated across all these very different subjective tasks. Which it might be. But it's an assumption the paper doesn't really stress-test, and if it's wrong in subtle ways, the errors could be correlated in ways that are hard to detect from the inside.

17:43Juniper: It's a pipeline where the same family of model is generating, judging, exploring, and validating. The wins are real but the foundation has a single point of philosophical failure.

17:55Finn: The authors themselves flag two limitations directly. One is model scale — they only trained a four-billion-parameter model because of compute. The gap between pass-at-one and pass-at-sixteen is modest, and they read that as evidence that the four-billion model lacks the capacity to solve many of the harder tasks even with multiple attempts. A bigger model would probably raise both numbers, but we don't get to see that experiment.

18:25Juniper: The second is single-turn only. Every task in Firefly is a single user message. The multi-step structure is in the tool calls, not in the dialogue. Real tool use often involves the user clarifying, changing their mind mid-task, asking follow-ups. Extending this pipeline to multi-turn dialogues is genuinely open work.

18:48Finn: Right. So with all those caveats stated honestly — what's left? What's the thing to actually take away?

18:55Juniper: Two things, I think. One is practical, one is conceptual. The practical thing is a price tag. The entire Firefly dataset cost about forty-seven thousand dollars to generate. Twenty-three and a half billion tokens through Claude Sonnet 4.5 on AWS Bedrock. That's a one-time spend for an artifact that lets you train a four-billion-parameter open model to a level that's at least in the conversation with a frontier proprietary model on tool calling. Compare that to the cost of human annotation at that scale — which would be staggeringly more, and probably wouldn't produce data of equivalent verifiability. The dataset and simulator are released. So this is a meaningful shift in who can train a competitive agent.
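
Note: a quick back-of-envelope on the price tag, using only figures quoted in the episode. The per-token figure is a blended average, since the episode does not break the spend into input and output tokens, and dividing by validated tasks ignores the candidates that were filtered out.

```python
TOTAL_COST_USD = 47_000   # quoted dataset cost
TOTAL_TOKENS = 23.5e9     # quoted token count
VALIDATED_TASKS = 5_144   # quoted number of validated tasks

usd_per_million_tokens = TOTAL_COST_USD / TOTAL_TOKENS * 1e6
usd_per_validated_task = TOTAL_COST_USD / VALIDATED_TASKS

print(f"{usd_per_million_tokens:.2f} USD per million tokens")
print(f"{usd_per_validated_task:.2f} USD per validated task")
```

That works out to roughly two dollars per million tokens and a little over nine dollars per validated task.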

19:40Finn: And the conceptual thing?

19:42Juniper: The conceptual thing is the inversion itself. Correctness as a property of the generation process, not a filtering step. Once you see that move applied to tool calling, the same pattern almost certainly generalizes — any agentic domain where actions can be executed and observed should be amenable to the same flip. Explore the environment, record what happens, then back-chain a task whose answer is sitting in the recording.

20:09Finn: The whois example is the version of that idea I keep coming back to. The reason the answer to "which was registered first, amazon or netflix" is guaranteed to be correct isn't because someone checked it. It's because the answer was extracted from the actual whois call. Nobody had to verify it. Verification was structurally impossible to skip.

20:32Juniper: Right. Don't write a test and hope your student can solve it. Watch a student solve real problems, and write the test backward from what they did.

20:41Finn: There's a phrase in the paper I like for the simulator design too — the authors call the no-data error tier "conservative," and they defend it by arguing that calls far outside the cached trajectory are unlikely to be necessary for a correct solution and shouldn't be rewarded. Which is a principled defense, and also — as we said — the same property that makes the in-distribution number look generous. Both things can be true.

21:06Juniper: That's where I want to land it, actually. This is a paper where the central idea is genuinely clean — the kind of move that seems obvious in retrospect and a little embarrassing the field didn't do sooner. And the engineering around it is careful and clever. And the empirical result is striking. And there are real reasons to want to see it replicated at larger scale with independent evaluators before declaring victory. All of those are simultaneously true, and I think the episode is worth more if we hold all of them rather than picking one.

21:39Finn: The paper's linked in the show notes along with some related reading if you want to keep pulling on this thread. And if you want the full transcript with the technical terms defined inline, plus the concept pages that connect this episode to other things we've covered on tool-use and synthetic data, that all lives on paperdive dot AI.

21:59Juniper: Thanks for listening to AI Papers: A Deep Dive.