One Loop to Optimize Them All: A Universal API for LLM-Driven Discovery

0:00Cassidy: There's a moment in this paper where the authors quietly line up five different LLM-based optimization systems — AlphaEvolve, FunSearch, GEPA, ADAS, OpenEvolve — and point out something almost embarrassing. Every one of them has its own framework. Its own vocabulary. Its own conceptual scaffolding. AlphaEvolve evolves code. GEPA evolves prompts. ADAS evolves agent architectures. But if you squint at the algorithms, they're all doing the same thing. Serialize something as text. Score it. Ask a language model to make it better. Repeat.

0:36Finn: And the question the paper builds itself around is whether any of that specialization was actually load-bearing. Or whether the field has been living through its proprietary-connector era, one cable per device, locked to its manufacturer, when all along one cable could have carried it all.

0:56Cassidy: That's a useful frame. And it's worth situating in time, because this paper went up on arXiv on May nineteenth, twenty-twenty-six, and we're recording three days later. What you're hearing is AI-generated: the script is from Anthropic's Claude Opus 4.7. I'm Cassidy, that's Finn, and we're both AI voices from ElevenLabs. Neither company is involved in producing this show. The paper itself is called "optimize anything: A Universal API for Optimizing any Text Parameter," from a UC Berkeley team with collaborators at MIT, including some of the same authors who built GEPA, which is one of the systems this work generalizes.

1:39Finn: And the headline result is that the generalization works. One declarative API, no per-domain tuning of the search algorithm, hits state of the art across six different problem types. They beat AlphaEvolve's published number on circle packing with about a third of the budget. They lifted Gemini Flash's score on ARC-AGI from roughly a third to nearly ninety percent — starting from a ten-line agent and ending at a three-hundred-line pipeline the system designed itself. They got eighty-seven percent of generated CUDA kernels to match or beat hand-written PyTorch on KernelBench. Six domains. One loop. No framework swap.

2:19Cassidy: So that's the unification thesis. And it's worth pausing on what it would mean if it holds. Because for the last two years, the way you'd approach LLM-based discovery has been to pick your domain — "I want to evolve a CUDA kernel" — and then pick the framework built for that domain. You'd learn AlphaEvolve's island topology, or GEPA's reflection prompts, or ADAS's mutation templates. The paper is arguing that all of that is scaffolding. The actual engine is much simpler.

2:51Finn: Walk me through what they think the actual engine is, Cassidy.

2:55Cassidy: It's three lines, more or less. You hand the system a starting artifact — any piece of text, however bad. You hand it an evaluator, which is a function that takes the artifact, runs it on some example, and returns two things: a score, and what they call "side information." Then the system runs a loop. It samples a candidate from a pool. It runs the candidate through the evaluator. The language model reads the side information, proposes a fix. The fix goes back in the pool. That's the loop. The artifact can be a kernel, a prompt, a scheduling algorithm, an SVG, an entire agent architecture. The loop doesn't know or care.
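
A minimal sketch of the loop as described here, not the paper's actual API. The names `optimize`, `Evaluator`, and `Proposer` are hypothetical, the pool is sampled uniformly for simplicity, and the proposer is a toy function standing in for a language model.

```python
import random
from typing import Callable, Tuple

# Hypothetical types for illustration; the paper's real interface may differ.
Evaluator = Callable[[str], Tuple[float, str]]  # artifact -> (score, side_info)
Proposer = Callable[[str, float, str], str]     # (artifact, score, side_info) -> new artifact


def optimize(seed: str, evaluate: Evaluator, propose: Proposer,
             budget: int = 20, rng_seed: int = 0) -> Tuple[str, float]:
    """Run the sample -> evaluate -> propose loop and return the best candidate."""
    rng = random.Random(rng_seed)
    score, info = evaluate(seed)
    pool = [(seed, score, info)]
    for _ in range(budget):
        # Assumption: uniform sampling; the episode later describes a Pareto-weighted scheme.
        parent, parent_score, parent_info = rng.choice(pool)
        child = propose(parent, parent_score, parent_info)
        child_score, child_info = evaluate(child)
        pool.append((child, child_score, child_info))
    best = max(pool, key=lambda entry: entry[1])
    return best[0], best[1]


# Toy stand-ins so the loop can run without a model.
TARGET = "aaaa"


def toy_evaluate(artifact: str) -> Tuple[float, str]:
    """Score is the fraction of matching characters; side info names the first mismatch."""
    wrong = [i for i, (a, b) in enumerate(zip(artifact, TARGET)) if a != b]
    score = (len(TARGET) - len(wrong)) / len(TARGET)
    info = f"first wrong index: {wrong[0]}" if wrong else "all positions correct"
    return score, info


def toy_propose(artifact: str, score: float, info: str) -> str:
    """Stand-in for the LLM: reads the side information and repairs that one position."""
    if "first wrong index" not in info:
        return artifact
    i = int(info.rsplit(" ", 1)[1])
    return artifact[:i] + TARGET[i] + artifact[i + 1:]


best, best_score = optimize("zzzz", toy_evaluate, toy_propose, budget=50)
print(best, best_score)
```

The loop body never inspects what the artifact is. Everything domain-specific lives in the evaluator and in the text the proposer reads.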

3:36Finn: And the load-bearing word in that description is "side information," because that's the part that makes this not just trial-and-error with extra steps.

3:46Cassidy: Right. Side information is the move. And the analogy the paper itself reaches for is genuinely good — they call it the text-optimization analog of a gradient.

3:56Finn: Unpack that for a second, because gradients are doing very specific work in normal optimization.

4:03Cassidy: Here's the cooking version. Imagine you're teaching someone to cook by tasting each dish they make. You have two ways to give feedback. One: "six out of ten, try again." Two: "the sauce broke because you added the butter too fast — let it cool next time." Both of those are feedback. Only the second one tells the student which direction to move. Classical numerical optimizers are stuck with version one. They literally cannot consume anything but a number. The paper has a line I think is the sharpest sentence in the whole work. You cannot show a Bayesian optimizer a stack trace.

4:41Finn: That lands.

4:42Cassidy: But you can show an LLM a stack trace. You can show it a profiler dump. You can show it a rendered image of the SVG it just produced and ask whether it actually looks like a pelican on a bicycle. You can show it which three test cases it failed and what the expected outputs were. All of that is side information. The evaluator returns the score, sure, but it also returns whatever diagnostic the human writing the evaluator can produce. And the proposer LLM reads it like a senior engineer reading a postmortem.
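
To make the score-plus-diagnostics idea concrete, here is a hedged sketch of an evaluator that returns a stack trace and the failing cases alongside the number. The `CASES` list and the requirement that a candidate define `solve(x)` are invented for illustration and do not come from the paper.

```python
import traceback
from typing import Tuple

# Hypothetical test cases: the candidate must define solve(x) returning x squared.
CASES = [(2, 4), (3, 9), (0, 0)]


def evaluate(candidate_source: str) -> Tuple[float, str]:
    """Return (score, side_info) for a candidate Python snippet defining solve(x)."""
    namespace: dict = {}
    try:
        # Only safe for trusted toy code; real candidates should run in a sandbox.
        exec(candidate_source, namespace)
        solve = namespace["solve"]
    except Exception:
        return 0.0, "candidate failed to load:\n" + traceback.format_exc()

    notes, passed = [], 0
    for x, expected in CASES:
        try:
            got = solve(x)
        except Exception:
            # The stack trace itself becomes side information for the proposer.
            notes.append(f"solve({x}) raised:\n{traceback.format_exc()}")
            continue
        if got == expected:
            passed += 1
        else:
            notes.append(f"solve({x}) returned {got!r}, expected {expected!r}")
    return passed / len(CASES), "\n".join(notes) or "all cases passed"


score, info = evaluate("def solve(x):\n    return x * 2\n")
print(score)
print(info)
```

A score-only optimizer would see just the number. The proposer here also sees which input failed and what the expected output was.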

5:15Finn: And the ablation on this is where the paper earns the analogy, because they actually turn side information off and see what happens. Across three different domains, they ran the system with side information versus the system with only the scalar score. Score-only is what a classical black-box optimizer sees. With side information, convergence is four to six times faster. On the Facility Support Analysis prompt task, score-only needed around six hundred rollouts to reach what side-information hit in a hundred. And the final score was higher too. So it's not just that you arrive sooner — you arrive somewhere better.

5:55Cassidy: Which makes intuitive sense once you've heard the cooking analogy. But it reframes what optimization in this paradigm actually is. The hard part isn't the search algorithm. The hard part is designing the evaluator so that when something fails, the failure tells the system *why* it failed. The authors have a phrase for this — they say the framework "trades optimization expertise for domain expertise." You don't need to know how to build a Bayesian optimizer. You need to know what diagnostics matter in your domain.

6:29Finn: And that's a real trade, not a free lunch. We'll come back to that when we get to the critique, because "design good side information" is itself a craft. But before that, there's a second piece of machinery worth understanding — the one piece of the system doing structural work beyond just calling the LLM. And that's how they keep the population of candidates from collapsing into one mediocre local optimum.

6:55Cassidy: The Pareto frontier piece.

6:57Finn: Yeah. And the word Pareto can be a little intimidating, but the intuition is just about who survives in the candidate pool. Here's the cleanest way I've found to think about it. Imagine you're assembling a team to win at every individual Olympic event — sprinting, swimming, gymnastics, all of it. If you ranked athletes by their average performance across all events, you'd end up picking decathletes. Generalists. And you'd quietly cut all the specialists — the best sprinter in the world, the best swimmer in the world — because their averages aren't great. Your swimmer is terrible at gymnastics.

7:32Cassidy: And in this analogy the specialists are the candidates that are excellent at one thing and bad at others.

7:39Finn: Exactly. So instead of ranking by average, the Pareto frontier asks a different question. For each individual event, who's the current champion? Any candidate who's the best at *anything* — even just one thing — survives. The frontier is the roster of champions, one per event. When the system wants to propose a new candidate, it samples a parent from that roster weighted by how many events that parent is the champion of. You maintain a portfolio of complementary strategies, and the LLM proposer can recombine them. The candidate that's the best sprinter might inspire a refinement that's also a decent swimmer. You don't get there if you've already cut the sprinter for being a bad decathlete.

8:20Cassidy: And the events in the actual algorithm are individual test examples — or individual sub-scores returned in the side information. So when they're evolving a CUDA kernel and the evaluator returns "your speedup was one-point-two, you used this much shared memory, you had this register pressure, you hit this race condition on the small input" — each of those becomes a dimension on the frontier. A candidate that's the best at one of them is preserved even if it's mediocre elsewhere.
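
The roster-of-champions idea can be sketched in a few lines. This is an illustration under assumptions, not the paper's implementation: the candidate names and scores are made up, and crediting every tied candidate with a win is one possible tie-breaking choice.

```python
import random
from collections import Counter
from typing import Dict, List


def champions(scores: Dict[str, List[float]]) -> Counter:
    """Count how many examples ("events") each candidate wins; ties credit all tied candidates."""
    wins: Counter = Counter()
    n_examples = len(next(iter(scores.values())))
    for i in range(n_examples):
        best = max(per_example[i] for per_example in scores.values())
        for name, per_example in scores.items():
            if per_example[i] == best:
                wins[name] += 1
    return wins


def sample_parent(scores: Dict[str, List[float]], rng: random.Random) -> str:
    """Sample a parent from the frontier, weighted by the number of examples it wins."""
    wins = champions(scores)
    names = list(wins)
    return rng.choices(names, weights=[wins[n] for n in names], k=1)[0]


# Made-up per-example scores for four candidates on three examples.
scores = {
    "sprinter":    [0.9, 0.1, 0.2],   # mean 0.4, wins example 0
    "swimmer":     [0.1, 0.9, 0.2],   # mean 0.4, wins example 1
    "decathlete":  [0.6, 0.6, 0.6],   # mean 0.6, wins example 2
    "benchwarmer": [0.5, 0.5, 0.5],   # mean 0.5, wins nothing
}

print(dict(champions(scores)))
print(sample_parent(scores, random.Random(0)))
```

Ranking by mean and keeping the top two would retain the decathlete and the benchwarmer and cut both specialists. The per-example frontier keeps the specialists and drops the benchwarmer instead.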

8:48Finn: Okay. So you've got the loop, you've got side information as the gradient analog, you've got the Pareto frontier as the diversity mechanism. I think the listener has enough machinery now to follow what the system actually pulls off. And the ARC-AGI experiment is where I want to spend some time, because it's the most cinematic moment in the paper.

9:10Cassidy: Take it.

9:10Finn: So ARC-AGI is a puzzle benchmark. Small grids of colored squares. Each puzzle gives you a few input-output examples (say, here's grid A and it becomes grid B; here's grid C and it becomes grid D) and then it asks you to apply the same hidden transformation rule to a new input. It's designed to be easy for humans and brutally hard for LLMs, because pure pattern matching doesn't get you there. You actually have to induce the rule. It's the flagship reasoning benchmark.

9:40Cassidy: And the question the authors are asking on ARC-AGI is, frankly, weird. They're not trying to evolve a better prompt. They're not trying to evolve a better algorithm. They're trying to evolve the entire agent.

9:52Finn: Right. The artifact they're optimizing — the thing the LLM is rewriting on every iteration — is the agent itself. The source code of the program that takes an ARC puzzle in and produces an answer out. They start with ten lines. Literally ten. It's a single LLM call: "here's the puzzle, give me the answer." Gemini 3 Flash, running that ten-line agent, scores about thirty-two and a half percent on the ARC-AGI test set.

10:18Cassidy: Which is roughly the baseline for a frontier model doing zero scaffolding.

10:23Finn: Yeah. Now they kick off the optimization loop. The evaluator runs the agent on training puzzles, reports which ones it got right and wrong, includes the model's reasoning trace, includes which puzzles were even close. All of that is side information. The proposer LLM reads it and rewrites the agent. New agent goes back in the pool. They let this run.

10:45Cassidy: And what they end up with, at the end of the loop, is something nobody wrote.

10:50Finn: Three hundred lines of code. Four stages. The system has, on its own, discovered that it should first do rule induction — analyze patterns across the input-output pairs to hypothesize what transformation is happening. Then it should generate executable code that implements that hypothesis. Then it should run that code with a verification step — actual Python execution to check whether the generated code reproduces the training examples. Then if the code fails, there's an iterative debugging phase that tries up to two fix attempts. And if all of that fails, there's a structured fallback to direct LLM prediction. Verify-then-fallback. Iterative refinement. These are architectural patterns that human engineers usually spend weeks discovering by hand.
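
The control flow of that evolved agent, as described in the episode, might look roughly like the sketch below. This is a reconstruction of the shape only, not the paper's three-hundred-line pipeline: the `llm` callable, the prompt tags, and the JSON format of the fallback answer are all assumptions.

```python
import json
from typing import Callable, List, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]
LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def run_transform(code: str, grid: Grid) -> Grid:
    """Run candidate code that must define transform(grid). Sandbox this in real use."""
    namespace: dict = {}
    exec(code, namespace)
    return namespace["transform"](grid)


def reproduces_training(code: str, train: List[Pair]) -> bool:
    """Verification: does the code map every training input to its expected output?"""
    try:
        return all(run_transform(code, x) == y for x, y in train)
    except Exception:
        return False


def solve(train: List[Pair], test_input: Grid, llm: LLM, max_fixes: int = 2) -> Grid:
    rule = llm(f"INDUCE_RULE\n{train}")              # stage 1: rule induction
    code = llm(f"WRITE_CODE\n{rule}")                # stage 2: code generation
    for attempt in range(max_fixes + 1):
        if reproduces_training(code, train):         # stage 3: verify by execution
            try:
                return run_transform(code, test_input)
            except Exception:
                break                                # verified code crashed on the test input
        if attempt < max_fixes:                      # stage 4: up to max_fixes repair attempts
            code = llm(f"FIX_CODE\n{code}\n{train}")
    # Structured fallback: ask the model for the answer directly (assumed to be JSON).
    return json.loads(llm(f"PREDICT\n{train}\n{test_input}"))


def fake_llm(prompt: str) -> str:
    """Scripted stand-in for a model: the first code attempt is wrong, the fix is right."""
    if prompt.startswith("INDUCE_RULE"):
        return "reverse the row order"
    if prompt.startswith("WRITE_CODE"):
        return "def transform(g):\n    return g"
    if prompt.startswith("FIX_CODE"):
        return "def transform(g):\n    return g[::-1]"
    return "[[0]]"


train = [([[1], [2]], [[2], [1]])]
print(solve(train, [[3], [4]], fake_llm))
```

With the scripted model, the first program fails verification, the first repair passes, and the answer comes from executing the repaired code rather than from the fallback.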

11:38Cassidy: And the test-set number?

11:39Finn: Eighty-nine and a half percent. From thirty-two to nearly ninety. Validation reached ninety-three and a half.

11:46Cassidy: That's the thing about this result that's been sitting with me. It's not that the system found a better prompt or a smarter trick. It's that the *architecture* of the agent emerged from the optimization process. The system discovered separation of concerns. It discovered verification. It discovered fallback strategies. The only thing the human provided was the evaluator — "run this code, tell me which puzzles it got right and what the reasoning looked like" — and the score gradient that emerges from that evaluator was enough to push the population toward an architecture that, in retrospect, makes a lot of engineering sense.

12:27Finn: And I think the right way to picture this — to keep it from sounding magical — is a startup discovering its own org chart. Two people start a company, and one person does everything. Sales, code, support. As the company scales, structure emerges. Not because anyone planned the org chart in advance, but because the work itself demands separation of concerns. The ARC-AGI evolution is the same shape. The seed agent is the one-person startup. The final agent has the structure the work itself demanded — because at every step, the side information was pointing at something the previous version couldn't do, and the proposer was rewriting to fix it.

13:10Cassidy: That's a good framing, Finn. The intent in the system isn't a human plan. The intent is the score function. The structure is the score function's shadow.

13:21Finn: Okay, the second set piece. Circle packing. This one is the most controlled comparison in the paper and I think it's where the result is hardest to argue with.

13:31Cassidy: Walk through what circle packing even is, because the listener may not know.

13:37Finn: It's a classical optimization problem in math. You're given a unit square. You have to pack n non-overlapping circles inside it, and the score is the sum of their radii. For most values of n, the optimal packing is genuinely unknown. People have been working on these problems for decades. DeepMind's AlphaEvolve, last year, set a published record on circle packing for n equals twenty-six. They reported a sum-of-radii of about two point six three. Which was, at the time, a real result.

14:09Cassidy: And optimize anything goes after the same problem.

14:13Finn: Same problem, n equals twenty-six. They run their system using GPT-5.1 as the proposer. The artifact being optimized is a Python program that produces a circle packing. The evaluator runs it, scores the result, returns side information about which circles overlap, which circles have room to grow, the geometry of the failed configuration. The system runs for sixty-three evaluations.
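
A hedged sketch of what such an evaluator could look like for the unit-square problem. The function name, the zero score for invalid packings, and the wording of the diagnostics are assumptions. This version reports only violations, not which circles have room to grow.

```python
import math
from typing import List, Tuple

Circle = Tuple[float, float, float]  # (center x, center y, radius)


def evaluate_packing(circles: List[Circle], tol: float = 1e-9) -> Tuple[float, str]:
    """Return (sum of radii, diagnostics); an invalid packing scores 0 (an assumption)."""
    notes = []
    for i, (x, y, r) in enumerate(circles):
        if r <= 0:
            notes.append(f"circle {i} has non-positive radius {r}")
        if x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            notes.append(f"circle {i} leaves the unit square")
    for i in range(len(circles)):
        for j in range(i + 1, len(circles)):
            xi, yi, ri = circles[i]
            xj, yj, rj = circles[j]
            # Negative gap means the two circles intersect.
            gap = math.hypot(xi - xj, yi - yj) - (ri + rj)
            if gap < -tol:
                notes.append(f"circles {i} and {j} overlap by {-gap:.4f}")
    if notes:
        return 0.0, "\n".join(notes)
    total = sum(r for _, _, r in circles)
    return total, f"valid packing, sum of radii {total:.4f}"


print(evaluate_packing([(0.25, 0.25, 0.25), (0.75, 0.75, 0.25)]))
print(evaluate_packing([(0.25, 0.25, 0.25), (0.50, 0.25, 0.25)]))
```

The second call fails, and the diagnostic names the offending pair and the size of the overlap, which is the kind of geometry the proposer can act on.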

14:38Cassidy: That's a small number.

14:40Finn: That's a tiny number. Sixty-three evaluations, total cost about three dollars and eighteen cents. And the final score is two point six three six — beating AlphaEvolve's published number.

14:53Cassidy: Three dollars.

14:54Finn: And the controlled comparison is the part that makes it tight. They also ran OpenEvolve, which is the open-source reimplementation of AlphaEvolve, with the same proposer LLM, on the same problem, with more than three times the evaluation budget. Two hundred evaluations, about seven dollars. And OpenEvolve, at more than three times the evaluations and roughly twice the cost, doesn't even match optimize anything. It tops out at two point six three zero.

15:21Cassidy: So you can't blame the proposer. Same model, same problem. The architectural choices in optimize anything are doing real work.

15:29Finn: And there's a mechanism the appendix walks through that I think is the most interesting thing in the paper from a "what's actually new here" standpoint. It explains why this kind of result is even possible.

15:43Cassidy: The refiner leapfrog.

15:45Finn: Right. So when optimize anything is running on circle packing, it's actually tracking two different artifacts on the Pareto frontier. One is the code itself — the program that produces the circle packing. The other is the LLM prompt that refines the code. Both are being optimized in parallel, both are on the same frontier, both can leapfrog each other.

16:08Cassidy: Spell out the leapfrog.

16:10Finn: Okay. Early in the run, the code is doing some weak greedy heuristic — places circles one at a time, score around point nine eight. The refiner prompt, in parallel, hits on the idea of solving this with linear programming. The refiner itself isn't running the LP — it's prompting the proposer to try LP-based approaches. The refiner's score jumps, because when it gets applied to candidate code, the resulting code is much better. The proposer reads that, and the next version of the code absorbs LP directly. Code score jumps to two point six one. Now the refiner pushes forward, discovers sequential linear programming — an iterative refinement of LP — and gets to two point six three. The code, watching the refiner's diagnostics, absorbs SLP. And that's how you reach the record.

16:59Cassidy: So each module's advance becomes the foundation for the other module's next leap.

17:04Finn: Exactly. And this is something that AlphaEvolve-class systems structurally cannot do. They're optimizing one artifact. They don't have a separate refinement module that's also on the frontier. The two-artifact Pareto front is what creates the leapfrog dynamic. And the leapfrog is, I think, what's actually behind the cost efficiency — because every advance in one module gets compounded by an advance in the other.

17:30Cassidy: That's a much more interesting result than "we beat AlphaEvolve." Because "we beat AlphaEvolve" is a benchmark. The leapfrog is a mechanism. It says there's a kind of compounding that happens when you optimize multiple modules against the same diagnostic stream.

17:46Finn: And the framing the paper offers is — picture a few researchers in nearby offices each working on their own problem, but leaving their lab notebooks open in a shared room. One researcher finds a trick, writes it down. Another wanders in, sees it, adopts it for their own problem. Meanwhile the first reads the second's notebook and picks up something else. Techniques migrate. Each researcher still has their own final answer, but the methods get shared. That's what the Pareto frontier is doing — across modules, across tasks. Same mechanism.

18:22Cassidy: And that brings us to the third mode, which is multi-task. Because the same logic that makes refiner-and-code leapfrog each other also makes related problems share discoveries. So multi-task is the case where you have a batch of related problems — say, twenty different CUDA kernels to write — and instead of optimizing each one in isolation, you share a single Pareto frontier across all of them. A candidate that's the best at kernel seven and a candidate that's the best at kernel twelve both survive, and the proposer can be seeded with one while looking at the other.
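
One way to picture the shared frontier is a table of per-task champions that any proposal can draw on. The sketch below is illustrative only: the task names, the candidate labels, the speedup numbers, and both function names are hypothetical.

```python
from typing import Dict, List, Tuple

Frontier = Dict[str, Tuple[str, float]]  # task -> (champion candidate, best score)


def update_frontier(frontier: Frontier, candidate: str,
                    task_scores: Dict[str, float]) -> List[str]:
    """Make `candidate` the champion of every task where it beats the incumbent."""
    won = []
    for task, score in task_scores.items():
        incumbent = frontier.get(task)
        if incumbent is None or score > incumbent[1]:
            frontier[task] = (candidate, score)
            won.append(task)
    return won


def parent_pool(frontier: Frontier) -> List[str]:
    """Distinct champions across all tasks; any of them may seed a proposal for any task."""
    return sorted({candidate for candidate, _ in frontier.values()})


frontier: Frontier = {}
# Made-up speedups for two kernel strategies on two tasks.
update_frontier(frontier, "tiled", {"matmul": 1.8, "layernorm": 1.1})
update_frontier(frontier, "coalesced", {"matmul": 1.2, "layernorm": 1.5})
print(frontier)
print(parent_pool(frontier))
```

Both strategies survive because each is the best on one task, so a proposal aimed at one kernel can still be seeded with the trick that won on the other.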

19:00Finn: And on CUDA kernels, that wins clearly. At matched per-problem budget, multi-task outperforms single-task, and the gain grows with the number of related tasks. Which makes sense, because kernels share structure. A memory-coalescing trick that helps one kernel often helps another. A tiling pattern that wins for matrix multiply often wins for normalization. The frontier surfaces those transferable patterns, and they migrate across problems.

19:29Cassidy: But — and this is where the paper is unusually honest — multi-task doesn't always win.

19:35Finn: It hurts on circle packing across different values of n. They show this explicitly. If you try to optimize n equals ten and n equals twenty in the same run, the discoveries don't transfer. The optimal configuration for ten circles has essentially nothing to teach a configuration for twenty. The frontier ends up injecting noise instead of useful patterns. And so multi-task degrades performance compared to optimizing each n separately.

20:01Cassidy: Which I think is the right kind of negative result to publish. It tells you what multi-task is actually doing. It only helps when the problems share transferable structure. And identifying which problems share structure — that's a judgment call the user has to make. The framework doesn't tell you.

20:19Finn: Right. And Cassidy, this is a good seam to pivot into critique, because the paper is honest about a handful of caveats worth voicing alongside the impressive numbers. Let me take the critique side — there are three things a careful listener should hold in mind.

20:35Cassidy: Go.

20:36Finn: First. The proposer LLM is doing a lot of the work. There's a table in the paper where they swap from GPT-5.1 — the frontier proposer they used for the headline numbers — to GPT-5-nano, which is much cheaper. Circle packing drops from two point six three six to two point five one two. That's not a small drop. AIME drops from sixty percent to fifty. So when the paper says "this generic framework matches specialized systems," part of what's being measured is "frontier proposers are extremely good." The architectural contributions are real, and the OpenEvolve comparison with matched proposers shows they're doing real work. But a meaningful share of the headline numbers comes from the model, not the algorithm. A skeptic should hold both things at once.

21:22Cassidy: That's fair. And the authors don't hide it — that table is in the main paper, not buried in the appendix.

21:28Finn: Second. The "state of the art across six domains" framing leans on careful counting. On most domains it's a real claim. On AIME specifically — the math benchmark — the result is sixty percent, up from forty-seven percent for the seed prompt. That beats MIPROv2, which is a specialized prompt optimizer. But it doesn't beat GEPA, which is the authors' own prior prompt-optimization system. It matches it. Which means on prompt optimization, optimize anything's universal API isn't an improvement over the specialized predecessor. It just doesn't lose. That's still meaningful — "the generality doesn't cost you" — but it's a different claim than "this is state of the art on prompts."

22:10Cassidy: The universal API doesn't dominate the specialized one on the specialized one's home turf. It ties. Which is honestly what you'd hope for.

22:19Finn: And third — this is the one I think builders should think hardest about. The framework trades optimization expertise for domain expertise. That's the authors' own framing, and it's a good one. But the trade is real. The reason the ARC-AGI evolution worked is that someone designed an evaluator that returns rich diagnostic information — which puzzles failed, why, what the reasoning trace looked like. The reason circle packing worked is that someone designed an evaluator that returns the geometry of overlaps. The system is universal in the sense that the loop is the same. It is not universal in the sense that you can hand it any problem with any evaluator and expect magic. If your evaluator returns only print statements, you're leaving most of the value on the table. Side-information design is craft, and the headline numbers come from expert-designed side information.

23:12Cassidy: And that, I think, is the right place to land the critique. Because it tells you what the actual research agenda becomes. If you accept the unification thesis — if a single declarative loop really does handle all of these domains — then optimization in this paradigm stops being about "do I have the right specialized solver" and starts being about "how good is my evaluator, and how informative is the feedback it emits."

23:39Finn: Which is a fundamentally different question to be working on. And it's a more democratic one, in a sense. Anyone who deeply understands their domain can write a good evaluator. You don't need to be an expert in island topologies or MAP-Elites or any of the framework-specific machinery.

23:57Cassidy: And the proof of concept for that argument is sitting in the paper. Six domains, no per-domain framework changes, results that match or beat specialized systems. If the next year of work in this space looks more like "evaluator design as a discipline" and less like "yet another framework for yet another artifact type," I think this paper will look like it called the shift early.

24:20Finn: One more thing worth flagging before we wrap. The economic story. Circle packing optimized for three dollars and eighteen cents. ARC-AGI optimized for about a hundred and forty-five dollars — and almost all of that was the agent evaluator running on training puzzles. Only seventy cents of it was the LLM doing reflection. That ratio matters. It tells you the system is sample-efficient. It also tells you that if you have an expensive evaluator, optimization is going to be expensive regardless. The framework isn't doing magic on cost. It's doing magic on how few evaluator calls you need.
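
The ratio can be checked with quick arithmetic, using the approximate dollar figures quoted in the episode.

```python
total_cost_usd = 145.00      # approximate ARC-AGI run cost quoted in the episode
reflection_cost_usd = 0.70   # portion attributed to the LLM doing reflection

reflection_share = reflection_cost_usd / total_cost_usd
evaluator_share = 1 - reflection_share

print(f"reflection: {reflection_share:.2%}, evaluation: {evaluator_share:.2%}")
# prints "reflection: 0.48%, evaluation: 99.52%"
```

On those figures, under half a percent of the spend went to the proposer, so the cost of a run is set almost entirely by how expensive one evaluator call is.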

24:57Cassidy: And the four-to-six-x speedup from side information is the same story from a different angle. Better feedback means fewer calls means lower cost. The frontier of LLM-based search efficiency, the paper argues, sits in the quality of the diagnostic.

25:13Finn: Which brings us back to the gradient analogy, Cassidy. The reason gradient descent dominated numerical optimization for sixty years isn't because the algorithm was clever. It's because gradients carry an enormous amount of information per evaluation. Side information, the paper is arguing, is the LLM-era analog. Whoever figures out how to make the diagnostic signal as rich and as cheap as possible — that's where the next round of progress comes from.

25:40Cassidy: That's the read I came away with too. The unification is the headline. The gradient analogy is the idea. And the leapfrog mechanism on circle packing is the place where it's most clearly doing something genuinely new — not just running an LLM in a loop, but compounding discoveries across modules that are both watching the same diagnostic stream.

26:01Finn: That's a good place to stop.

26:03Cassidy: The show notes have a link to the paper and some related reading if you want to go deeper — there's a thread of work running through GEPA and AlphaEvolve and FunSearch that this paper is in direct conversation with.

26:16Finn: And if you want the full transcript with definitions baked in, plus links over to the other episodes where we've touched these ideas, that's all on paperdive dot AI. Every term we used is tappable, and the concept pages tie it back to the broader thread.

26:31Cassidy: Thanks for listening. This has been AI Papers: A Deep Dive.