An AI Agent Swapped In Focal Loss And Beat A Human-Tuned Training Script

0:00Cassidy: An AI agent looks at a small GPT training script that humans have been tuning for months. It has five minutes of GPU time per attempt. It decides cross-entropy loss — the standard objective every language model is trained on — isn't quite right for the data. It swaps in focal loss instead. And in that one move, it beats the published human reference baseline.

0:24Finn: That happened. May fifteenth, twenty-twenty-six. A team at FAIR — Meta's research lab — put out a paper called "Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design," and we're recording four days later. Quick ground rules before we dig in: this episode is AI-generated, the script is from Anthropic's Claude Opus 4.7, and the two voices you're hearing — that's Cassidy, I'm Finn — are AI voices from Eleven Labs. Neither company is involved in producing this show. And the reason that focal-loss moment matters is not just that an agent did it. It's that the same group of researchers had agents do this kind of thing across a whole landscape of architecture design — and the agents came back with neural networks that beat Llama 3.2 at the one-billion-parameter scale.

1:16Cassidy: Right. So let's set up the puzzle. For nearly a decade now, the Transformer has been the default brain of every serious language model. Attention plus MLP, stacked over and over. It works astonishingly well. But attention has costs. It scales quadratically with sequence length — double the sequence, four times the work — and the KV cache eats memory at inference. So the field has been quietly migrating toward "hybrid" architectures, mixing attention with other primitives. Mamba is the big one — a state-space model that processes sequences in linear time instead of quadratic. Models like Nemotron from NVIDIA, Jamba, Qwen3-Next — these are real production systems that aren't pure Transformers anymore.
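The two costs named here can be put in numbers with a minimal Python sketch. It assumes full (non-windowed) self-attention and 16-bit cache entries. The function names and the configuration values are illustrative and are not taken from any model discussed in the episode.

```python
def attention_pair_count(seq_len: int) -> int:
    """Query-key pairs scored by full self-attention, per head and per layer."""
    return seq_len * seq_len


def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 covers keys plus values; bytes_per_value=2
    assumes 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Doubling the sequence length quadruples the number of scored pairs.
short = attention_pair_count(1024)
long = attention_pair_count(2048)
print(long // short)  # 4

# Illustrative configuration only (hypothetical values).
cache = kv_cache_bytes(seq_len=8192, num_layers=16, num_kv_heads=8, head_dim=64)
print(cache / 2**20, "MiB")  # 256.0 MiB
```

The pair count grows with the square of the sequence length, while the cache grows linearly with it and sits in memory for the whole generation.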

2:02Finn: And once you let yourself mix and match, the design space explodes. The authors frame it this way: imagine sixteen layers in a stack, and at each layer you can put attention, an MLP, or a Mamba block. With three primitives over sixteen slots — that's about forty-three million possible arrangements. Humans can't search forty-three million architectures by intuition. Traditional Neural Architecture Search uses Bayesian optimization or evolutionary algorithms — but those are rigid, mechanical procedures. They propose a candidate, test it, adjust some hyperparameter, propose another. There's no reasoning about why anything works.
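The forty-three million figure is just three choices raised to the sixteenth power. A short Python check, with the function name chosen here for illustration, also gives the size of the smaller two-primitive space that comes up later in the episode:

```python
# Each of the 16 layer slots independently holds one of the available primitives.
NUM_LAYERS = 16


def num_arrangements(num_primitives: int, num_layers: int = NUM_LAYERS) -> int:
    """Number of distinct layer strings of length num_layers."""
    return num_primitives ** num_layers


three_primitive = num_arrangements(3)  # attention, MLP, Mamba
two_primitive = num_arrangements(2)    # the smaller two-primitive space

print(f"{three_primitive:,}")  # 43,046,721
print(f"{two_primitive:,}")    # 65,536
```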

2:45Cassidy: So the authors ask: what if we hand the search to an LLM agent instead? Not a one-shot LLM — an agent in a loop. It can read papers, form hypotheses, write code, look at validation scores, reason in natural language about *why* one arrangement might be better than another. You get something that pairs human-like intuition with the brute-force capacity to evaluate hundreds of candidates. And there's a philosophically loaded version of this. If AI agents can design better neural architectures, then the models powering future agents could be designed by earlier versions of those same agents. That's the recursive self-improvement frame. The paper is taking an honest first step into that territory, and we'll come back to how honest, because the limitation discussion in this paper is unusually clean.

3:39Finn: Let's get the headline empirical claim on the table. The authors build two frameworks. The first one — AIRA-Compose — is the Lego version: agents arrange predefined blocks into a sixteen-layer pattern, the winners get scaled up to one billion and three billion parameters, and then evaluated against Llama 3.2 and Nemotron. The second framework — AIRA-Design — is harder. Agents have to write the actual code. A working attention mechanism from scratch, or an entire training script.

4:12Cassidy: And in both setups, agent-found designs match or beat the strongest human baselines. The agent-discovered model the paper calls "AIRAhybrid-D" gets the lowest validation loss of any architecture they tested at one billion parameters. That's the headline. The qualifier — which the authors put right out front — is that when agents write attention mechanisms from scratch, what they produce is competent engineering synthesis of techniques from prior work. They're not inventing new theoretical attention mechanisms. We'll come back to that distinction. It's the most intellectually interesting tension in the paper.

4:53Finn: Walk me through Compose first. The Lego version. What does an agent actually do here?

4:59Cassidy: Sure. So imagine you're an agent. The harness hands you a task description: there are sixteen layers, you have three kinds of building blocks — attention, MLP, Mamba — and your job is to output a string. Something like "attention, Mamba, MLP, MLP, attention" — sixteen of those. Then you write a small evaluation script, the harness trains a tiny version of your architecture, and gives you back a validation loss. If you're in greedy mode, you iterate on your best result. If not, you start fresh. And the thing that's wild — and Figure 3 in the paper just captures this beautifully — is what these agents write down as they reason. A GPT-5 agent in the three-primitive search wrote design rationales like "Periodic Attention Anchors with Mamba-Duo Segments and Attention-Separated Gated Mamba Islands." Or: "Compensate for the lack of residuals: FAILED." Or: "Hybrid funnel with front-loaded Mamba corridor and late Attention bottleneck plus Alternating Tail."
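To picture what "output a string" means in practice, here is a minimal Python sketch of a validator for such a submission. The comma-separated format, the block names, and the function name are assumptions made for illustration; the episode does not give the harness's exact syntax.

```python
# Hypothetical format: a comma-separated list of block names, one per layer.
ALLOWED_BLOCKS = ("attention", "mlp", "mamba")
NUM_LAYERS = 16


def parse_pattern(text: str, num_layers: int = NUM_LAYERS) -> list[str]:
    """Parse and validate a layer pattern; raise ValueError if it is malformed."""
    blocks = [part.strip().lower() for part in text.split(",")]
    if len(blocks) != num_layers:
        raise ValueError(f"expected {num_layers} blocks, got {len(blocks)}")
    for index, block in enumerate(blocks):
        if block not in ALLOWED_BLOCKS:
            raise ValueError(f"layer {index}: unknown block {block!r}")
    return blocks


# A 16-layer example built by repeating a 4-block motif.
example = ", ".join(["attention", "Mamba", "MLP", "MLP"] * 4)
pattern = parse_pattern(example)
print(len(pattern), pattern[:4])
```

Because any string that passes a check like this maps to a trainable stack, a submission can be a poor architecture but not a crash.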

6:04Finn: That last one sounds like an NSF grant proposal.

6:08Cassidy: Right? It reads like a researcher's lab notebook compressed into code. The agent isn't just guessing — it's giving its candidates *names*, attaching theoretical rationales, noting when an idea didn't pan out. Multiply that by eleven different agents, powered by GPT-5, GPT-4o, gpt-oss, CWM, o3-mini — each running for twenty-four hours, ten random seeds each — and you end up exploring about twenty-three hundred unique architectures. That's roughly three percent of the smaller two-primitive search space. A tiny slice. And yet they find patterns that beat the rigid optimization-based methods.

6:47Finn: One clever design choice worth flagging, Cassidy. The agents are constrained to output a formatted string — basically a list. That sounds boring but it matters. It means the submission rate is essentially a hundred percent. You can't fail by writing broken code; you can only fail by proposing a bad architecture. So the comparison cleanly isolates the agent's architectural reasoning from its coding ability. Compose is testing one thing at a time.

7:16Cassidy: Exactly. And then there's an aggregation step. After all the agents finish, the authors cluster the top-performing patterns to find the dominant arrangement across many high-scoring candidates. That winning pattern gets scaled up — twice the depth, three times the depth — and trained at one billion or three billion parameters on thirty-seven and a half billion tokens. That's the model that gets compared to Llama 3.2.
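The episode does not say which clustering method the authors use or exactly how the winning pattern is deepened. As a rough stand-in, the Python sketch below takes a per-layer majority vote over the top patterns and deepens the result by repeating it. Both choices, and both function names, are assumptions for illustration only.

```python
from collections import Counter


def consensus_pattern(top_patterns: list[list[str]]) -> list[str]:
    """Per-layer majority vote over equally long patterns (stand-in for clustering)."""
    depth = len(top_patterns[0])
    if any(len(p) != depth for p in top_patterns):
        raise ValueError("all patterns must have the same depth")
    return [
        Counter(p[i] for p in top_patterns).most_common(1)[0][0]
        for i in range(depth)
    ]


def scale_depth(pattern: list[str], factor: int) -> list[str]:
    """Repeat the pattern `factor` times (an assumed way of deepening the stack)."""
    return pattern * factor


# Toy 4-layer patterns, invented for illustration.
top = [
    ["attention", "mamba", "mlp", "mlp"],
    ["attention", "mamba", "mamba", "mlp"],
    ["mamba", "mamba", "mlp", "mlp"],
]
winner = consensus_pattern(top)
print(winner)                       # ['attention', 'mamba', 'mlp', 'mlp']
print(len(scale_depth(winner, 2)))  # 8
```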

7:44Finn: And the result, Cassidy?

7:45Cassidy: AIRAhybrid-D — the best agent-found architecture at the one-billion scale, in the version with three primitives — gets a validation loss of about 2.72. Llama 3.2 at the same size gets about 2.82. That's a meaningful gap. In language modeling, each hundredth of a loss point represents a non-trivial improvement in prediction quality. And on a six-task average of downstream benchmarks — common-sense reasoning, science, that kind of thing — the agent model is 3.8 percentage points better than Llama.

8:18Finn: Okay. I want to push on the scaling claim because that's where the headline gets sharper, and it's also where the strongest engineering argument lives. When the paper says AIRAformer-C scales fifty-four percent faster than Llama 3.2, what does that actually mean? Because a listener might hear that as "it's fifty-four percent better," and that's the wrong picture. The right picture is fuel efficiency. Every neural architecture has a curve that shows how much better the model gets as you spend more compute on it. Two cars can be going the same speed right now but have very different mileage curves — and at long distances, the difference compounds enormously. The authors run what they call isoFLOP experiments: they train each architecture at three sizes — 350 million, one billion, three billion parameters — under five different compute budgets. For each budget, they figure out which model size is optimal. Connect those optimal points and you get the architecture's scaling frontier. The slope of that frontier is what tells you how efficiently it converts compute into quality.
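The frontier construction Finn describes can be sketched in a few lines of Python. This assumes the frontier is summarized by a straight-line fit in log-log space, which is a common convention but not something the episode confirms for this paper. The loss values are synthetic, and the toy uses three budgets rather than the paper's five.

```python
import math


def frontier(runs: dict[float, dict[str, float]]) -> list[tuple[float, float]]:
    """For each compute budget, keep the lowest loss across model sizes."""
    return [(budget, min(losses.values())) for budget, losses in sorted(runs.items())]


def loglog_slope(points: list[tuple[float, float]]) -> float:
    """Least-squares slope of log(loss) against log(compute)."""
    xs = [math.log(compute) for compute, _ in points]
    ys = [math.log(loss) for _, loss in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic numbers, invented purely for illustration (not from the paper).
# Keys are compute budgets in FLOPs; values map model size to validation loss.
runs = {
    1e18: {"350M": 3.20, "1B": 3.30, "3B": 3.50},
    1e19: {"350M": 3.00, "1B": 2.95, "3B": 3.10},
    1e20: {"350M": 2.90, "1B": 2.75, "3B": 2.70},
}
points = frontier(runs)
slope = loglog_slope(points)
print(points)
print(f"slope = {slope:.4f}")
```

A more negative slope means loss falls faster for each multiplicative increase in compute. Comparing that slope across architectures is the kind of measurement a "scales faster" claim rests on, though the exact definition behind the paper's percentages is not given in the episode.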

9:28Cassidy: And the agent-found architectures have steeper slopes.

9:31Finn: Steeper slopes. AIRAformer-C scales fifty-four percent faster than Llama 3.2. It scales seventy-one percent faster than the best Transformer that the authors' own prior Bayesian-optimization system was able to find. And the hybrid version — AIRAhybrid-C — scales twenty-three percent faster than Nemotron, which is NVIDIA's flagship hybrid model. If those slopes hold, the gap compounds. At the scales hyperscalers actually train at — tens of billions of parameters, hundreds of millions of dollars in compute — small slope differences turn into huge cost differences. That's the practical implication.

10:10Cassidy: And there's something subtle in there worth pulling out. The architectures that win at a fixed *token* budget, where everyone trains on the same thirty-seven-and-a-half billion tokens, are not the same ones that win at a fixed *compute* budget. The attention-heavy architectures, AIRAformer-C and D, which stack lots of attention layers, win on a fixed token budget. They have more representational capacity per layer. But when compute is the binding constraint, more balanced architectures win, because attention is expensive in FLOPs. So the agents are uncovering a real trade-off, not a contradiction.

10:48Finn: Right. And that's the kind of nuance that pure performance numbers obscure. "Better" is conditional on what's scarce. Okay, let me put on my skeptic hat, because I have real concerns about how strong these claims actually are.

11:02Cassidy: Go.

11:02Finn: First concern. The whole Compose pipeline evaluates candidates at a few million parameters: tiny, fast-to-train proxies. Then the *winning* patterns get scaled up to one billion and three billion for the headline comparison. This is the classic proxy-to-scale gap in neural architecture search. Small-scale rankings are known to correlate imperfectly with large-scale performance. It's a big part of why NAS has historically struggled to deliver. The authors acknowledge it, but they don't really resolve it. A skeptic would want to see the Spearman rank correlation between the small-scale and large-scale rankings, and we don't get that. So you can't fully separate "the agents had genuine architectural insight" from "they had good luck with the proxy in this regime."
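For reference, the rank correlation Finn is asking for is straightforward to compute once both sets of losses exist. The Python sketch below uses the textbook tie-free Spearman formula. The loss values are invented, since the paper's per-architecture numbers are not given in the episode.

```python
def ranks(values: list[float]) -> list[int]:
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(a: list[float], b: list[float]) -> float:
    """Spearman rank correlation for tie-free data: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long lists with at least two items")
    n = len(a)
    d_squared = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Invented losses for five hypothetical architectures (not from the paper).
proxy_loss = [3.90, 3.85, 3.95, 3.80, 4.00]   # tiny proxy models
large_loss = [2.80, 2.82, 2.85, 2.72, 2.90]   # the same patterns at scale
rho = spearman(proxy_loss, large_loss)
print(f"rho = {rho:.2f}")  # rho = 0.90
```

A value near 1 would mean the proxy ordering carries over to scale, and a value near 0 would mean the small-scale search was mostly picking at random with respect to large-scale quality.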

11:50Cassidy: That's the cleanest version of the critique. Keep going.

11:54Finn: Second concern. The "scales fifty-four percent faster" claim rests on fits to three data points — 350M, 1B, 3B. Across five compute budgets, sure, but the architecture-size dimension only has three values. That's enough to draw a line. It's not a lot of evidence for confidently extrapolating to ten billion or a hundred billion. The paper does extend one of its charts visually to ten billion as a guide — but they don't actually train there. So the right reading of "fifty-four percent faster" is: in the 350M to 3B regime, with the specific token budgets the authors used. Third — and this one I find the most concerning — Table 3, the three-primitive at-scale comparison that the headline result rests on, is from a single seed. One training run. The two-primitive results use three seeds with reported standard deviations, which is the right way to do it. The strongest claim — that AIRAhybrid-D is the best architecture overall — is built on one seed. And in the Autoresearch experiments later in the paper, the variance across seeds is enormous. Some agents have wildly inconsistent runs. That makes single-seed comparisons fragile.
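The single-seed worry comes down to whether the gap between two models is larger than the run-to-run noise. A minimal Python sketch of that comparison follows, using invented losses for two hypothetical models trained with three seeds each:

```python
import statistics


def summarize(losses: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(losses), statistics.stdev(losses)


# Invented validation losses, three seeds each (not from the paper).
model_a = [3.00, 3.02, 2.98]
model_b = [3.05, 3.01, 3.09]

mean_a, std_a = summarize(model_a)
mean_b, std_b = summarize(model_b)
gap = mean_b - mean_a
print(f"gap = {gap:.3f}, seed noise up to {max(std_a, std_b):.3f}")
```

With only one run per model there is no standard deviation to compare the gap against, which is why a single-seed table is hard to interpret.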

13:04Cassidy: Those are all fair, Finn. The single-seed thing in particular is the one I'd push the authors on. The whole story would be cleaner with three seeds at the one-billion scale. The compute cost is real, but for the headline claim of the paper, it's the right place to spend the compute.

13:21Finn: And the baselines aren't necessarily the strongest possible comparisons. Llama 3.2 is a generalist production model, not a one-billion model specifically optimized for the thirty-seven-and-a-half-billion-token training regime the authors used. The Composer-found baselines come from this same group's prior work — beating your own prior system is meaningful but somewhat in-house. The question I'd want answered is: how does AIRAhybrid-D compare to actively competitive one-billion-scale models trained by other groups under similar token budgets?

13:55Cassidy: All of which is to say: the results are real, but the size of the gap should be read with some caution. The direction is more robust than the magnitudes.

14:05Finn: Right. And one more — the recursive self-improvement framing. The paper uses the phrase. It calls this "a step toward agentic recursive self-improvement." Look. In a literal sense, that's true. Agents are doing engineering work that improves model architectures. But the framing invites a much larger interpretation than the evidence supports. The agents in this paper are not designing the models that power them. The loop is not closed. The phrase is doing rhetorical work that exceeds the empirical contribution.

14:38Cassidy: I'd actually read that more generously. The authors are unusually candid about the limits — they explicitly say agents are not yet producing genuinely novel attention mechanisms; they explicitly say the aggregation and scaling steps are still hand-coded; they explicitly say the loop isn't closed. They use the phrase to signal what *direction* the work points in, not what they've delivered. But I take the point — for a casual reader, that framing oversells.

15:08Finn: Okay. Steelman acknowledged, but flag preserved.

15:12Cassidy: Let's pivot to the harder problem. Compose constrained the agents to outputting a list of block types. AIRA-Design takes the guardrails off. Now agents have to write the actual code. There are two flavors of this. One is the Long Range Arena — a long-standing benchmark for testing whether sequence models can handle long-range dependencies. Agents have to write a working custom encoder class in JAX. The other, which is more interesting, is the Autoresearch task. That's the one with the focal-loss moment we opened on.

15:46Finn: Tell me about the harness. Because the paper makes a big deal of iteration mattering, and the harness is where iteration lives.

15:54Cassidy: They call it Aira-dojo. It's tree-based search with four operators: Draft, Debug, Improve, Analyze. The agent reads its own code, looks at validation scores, picks an operator, modifies, retries. The structure is human-designed — the agent isn't inventing the search procedure — but within that structure, the agent is reasoning autonomously about what to change. And here's the result that crystallizes why iteration matters. They ran one-shot agents — single attempt, no iteration — across eight different models, twenty seeds each, on six tasks. Nine hundred and sixty total attempts. Zero valid submissions. None of them produced working code on the first try.
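A rough Python sketch of what a greedy loop over such operators might look like is below. It is not the paper's harness. The `Node` structure, the rule for choosing an operator, and the toy stand-ins for the model calls are all assumptions, and the Analyze operator is left out because the episode does not describe what it does.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Node:
    code: str
    score: Optional[float]            # None means the code failed to run
    parent: Optional["Node"] = None
    children: list["Node"] = field(default_factory=list)


def greedy_search(draft: Callable[[], str],
                  debug: Callable[[str], str],
                  improve: Callable[[str], str],
                  evaluate: Callable[[str], Optional[float]],
                  steps: int) -> Optional[Node]:
    """Grow a tree of attempts, always expanding the best valid node so far."""
    first = draft()
    nodes = [Node(code=first, score=evaluate(first))]
    for _ in range(steps):
        valid = [n for n in nodes if n.score is not None]
        if valid:
            parent = min(valid, key=lambda n: n.score)  # lower loss is better
            code = improve(parent.code)
        else:
            parent = nodes[-1]                          # nothing runs yet: repair
            code = debug(parent.code)
        child = Node(code=code, score=evaluate(code), parent=parent)
        parent.children.append(child)
        nodes.append(child)
    valid = [n for n in nodes if n.score is not None]
    return min(valid, key=lambda n: n.score) if valid else None


# Toy stand-ins for the LLM calls: the "code" is just a number in a string.
def toy_evaluate(code: str) -> Optional[float]:
    return None if code.startswith("bug:") else float((int(code) - 3) ** 2)


best = greedy_search(
    draft=lambda: "bug:0",                        # the first draft does not run
    debug=lambda code: code.split(":", 1)[1],     # repair by stripping the bug marker
    improve=lambda code: str(int(code) + 1),      # nudge the candidate
    evaluate=toy_evaluate,
    steps=5,
)
print(best.code, best.score)  # 3 0.0
```

In the toy run the first draft fails, exactly as a one-shot attempt would, and only the repair-and-retry steps reach a valid and then a better candidate.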

16:39Finn: Zero.

16:41Cassidy: Zero. The intelligence isn't only in the model. It's in the loop. A one-shot agent is like a graduate student handed a problem and told: write your answer on this card, slide it under the door, no second tries. An iterative agent is the same student allowed to run an experiment, see what breaks, and revise. Even strong students fail completely in the first setup.

17:05Finn: That's a striking number, Cassidy. And it cuts both ways. On one hand, it shows the loop matters — which is a real methodological insight. On the other hand, it means the agents are not robust autonomous problem-solvers; they're sample-efficient learners *given a well-designed iteration harness*. The autonomy is shared between the agent and the scaffold.

17:30Cassidy: Right. And in the LRA results, the iterative agents do well — Gemini 3 Pro and Claude Opus 4.6 in greedy mode get within two to three percentage points of human state-of-the-art on the normalized score. That's competitive. But — and here's the honest limitation moment — when the authors look at *what* the agents wrote, they find recombinations of known techniques. Performer-style kernel approximations. Longformer-style windowed attention. Conformer-style convolutional augmentation. Not new mathematical mechanisms. The paper says this directly. I want to read the quote because it's unusually candid. "These designs demonstrate competent engineering. They do not represent fundamentally novel contributions to the efficient attention literature. The discovered architectures largely recombine and adapt ideas from prior work — Performer, Longformer, Conformer — rather than introducing new theoretical insights."

18:29Finn: That's an important sentence to sit with. Because it draws a real line — between engineering and science. Engineering is competent synthesis of known techniques to solve a problem. Science is producing new ideas or mathematical structures that weren't in the corpus. This paper lands clearly on the engineering side. And the authors say so. The analogy I keep coming back to: it's the difference between a chef who has mastered every cookbook and can produce excellent fusion dishes by combining techniques, versus a chef who invents a new cooking method that no cookbook had. Both are valuable. Only one is doing science in the deepest sense. Current agents are the first kind.

19:13Cassidy: And I'd add a friendly caveat to that, Finn — a lot of what passes for human science is also recombination. The line between "novel recombination" and "novel invention" is fuzzy in practice. The interesting empirical question is whether agents will cross whatever line humans have crossed. We don't know yet. This paper doesn't answer it.

19:35Finn: Fair. But the paper's honesty about *where* the agents currently sit is, I think, one of its most valuable contributions. It's a calibration check. AI agents are competent ML engineers at the architecture level. They are not yet ML theorists.

19:52Cassidy: Which brings us to Autoresearch. Because Autoresearch is where the engineering capacity gets most clearly demonstrated, and it's the result that lingers.

20:03Finn: Set it up.

20:04Cassidy: Autoresearch is Andrej Karpathy's open challenge from earlier this year. You take a small GPT training script — multi-head attention, rotary embeddings, flash attention, the Muon-AdamW optimizer combo — and you have five minutes of wall-clock training on a single GPU to minimize validation bits-per-byte on a tokenized web corpus. The agent can change anything: architecture, optimizer, learning rate schedule, batch size — anything except the data pipeline and the evaluation harness. So the loss is a single number, the budget is a single GPU for five minutes, and the variable is whatever the agent wants to modify. The baseline script gets bits-per-byte of about 1.01. The published human reference best — what humans had tuned to — is about 0.98. And the best agent in this paper, which is Claude Opus 4.5 running in greedy mode with literature access, gets about 0.97. It beats the human-tuned reference.
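Bits-per-byte is a tokenizer-independent way of reporting language-model loss. The Python sketch below uses the usual conversion from mean per-token cross-entropy in nats, on the assumption that the challenge follows that definition, which the episode does not spell out. The token and byte counts are invented.

```python
import math


def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean per-token cross-entropy (in nats) to bits per byte of raw text."""
    total_bits = mean_loss_nats * num_tokens / math.log(2)
    return total_bits / num_bytes


# Invented example: 1,000 tokens covering 4,000 bytes of UTF-8 text.
bpb = bits_per_byte(mean_loss_nats=2.8, num_tokens=1_000, num_bytes=4_000)
print(f"{bpb:.2f}")  # 1.01
```

Dividing by bytes rather than tokens means a change of tokenizer cannot make the score look better by itself.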

21:09Finn: How? What did it actually do?

21:11Cassidy: Table 5 walks through the trajectory step by step, and it's worth describing in detail. Step one: the agent makes the model deeper, bumps the learning rates, extends the warmdown phase. Standard ML moves. Modest improvement. Step two — this is the big one — the agent introduces focal loss to replace cross-entropy. That single change delivers the largest improvement in the entire trajectory. About a 0.036 drop in bits-per-byte.

21:41Finn: Pause on focal loss for a second. Why is that interesting?

21:46Cassidy: So cross-entropy is the standard loss function for language modeling. Every token's loss gets the same weight, whether the model finds that token easy or hard. Focal loss, which comes from object detection, not language modeling, scales down the loss on examples the model already gets right, so the examples it's struggling with carry relatively more of the training signal. It's an idea from a different subfield of ML. The agent reached across domains, recognized that the training data probably has an imbalance the model is being lazy about, and substituted in focal loss.
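The difference between the two losses is easy to see on a single token. The Python sketch below uses the standard focal loss form, cross-entropy multiplied by (1 - p) raised to a power gamma. The episode does not say which variant or gamma value the agent used, so gamma = 2.0, the common default from the original object-detection work, is an assumption.

```python
import math


def cross_entropy(p_correct: float) -> float:
    """Per-token cross-entropy given the probability assigned to the correct token."""
    return -math.log(p_correct)


def focal_loss(p_correct: float, gamma: float = 2.0) -> float:
    """Cross-entropy scaled by (1 - p)^gamma; gamma = 0 recovers plain cross-entropy."""
    return (1.0 - p_correct) ** gamma * cross_entropy(p_correct)


easy, hard = 0.9, 0.1   # a token the model nearly has right vs. one it is missing
for name, p in (("easy", easy), ("hard", hard)):
    print(f"{name}: CE = {cross_entropy(p):.4f}, focal = {focal_loss(p):.4f}")

# How much more the hard token counts than the easy one under each loss.
ce_ratio = cross_entropy(hard) / cross_entropy(easy)
focal_ratio = focal_loss(hard) / focal_loss(easy)
print(f"hard/easy ratio: CE = {ce_ratio:.0f}x, focal = {focal_ratio:.0f}x")
```

Both tokens end up with a smaller loss than under cross-entropy, but the easy one shrinks far more, which is what shifts the training signal toward the tokens the model still gets wrong.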

22:20Finn: That's a graduate-student-quality move. That's the kind of thing somebody would try over a weekend of fiddling.

22:27Cassidy: And the agent did it autonomously, in a five-minute training budget, in an iterative loop. Then it kept going — adjusted learning rate schedules, tuned the depth further. By the end of the trajectory, the model is meaningfully better than the human reference.

22:43Finn: I want to flag the caveats because this result is the strongest in the paper and deserves the cleanest interpretation. First — the agent had literature access in this variant. Forty-one curated research papers with structured summaries, fourteen reference code repositories. Focal loss is in that corpus. The agent didn't invent focal loss. It retrieved focal loss and recognized it would apply. That's still impressive — recognizing applicability is a real cognitive skill — but it's recombination, not invention. Same line we drew earlier.

23:18Cassidy: Yes. Fair.

23:19Finn: Second — there's another version of this experiment without literature access. And surprisingly, GPT-5 and Opus 4.5 actually perform *worse* with literature access on average. So literature access helps the *best* runs but hurts the median. Integrating prior knowledge into agentic optimization is not a solved problem. Third — the greedy scaffold rewrites the entire training script each step. Not a diff, not a targeted edit. A full-file regeneration. Which means it's nearly impossible to attribute which specific change in a step is doing the work. The authors note this and call it "compound modifications that are hard to interpret." So while the final result is real, the causal story for each step is murky.

24:06Cassidy: All true. And there's one more I'd add. The agents are clearly more fluent in PyTorch than in JAX. LRA requires JAX, and the authors flag that this is a real handicap. So when an agent does worse on LRA than on Autoresearch, some of that is the agent being out of distribution for the language it has to write code in. The "agent capability" we're measuring is entangled with the ecosystem of training data the agent was trained on.

24:34Finn: Right. And the authors note one more practical limitation that's almost funny — when agents are given more degrees of freedom to tune hyperparameters in the LRA setup, seven out of twelve perform *worse*, not better. More freedom hurts. Current agents aren't yet good at exploring expanded design spaces. The headline LRA numbers benefit from constraining agents to a known-good hyperparameter regime. So the "freedom helps" story we sometimes hear about agentic research is empirically not true here.

25:07Cassidy: Which is interesting because that's a place where the human methodology might still have an edge. A human researcher tuning hyperparameters in conjunction with architecture changes is doing something agents haven't quite figured out. Possibly because the search space is too coupled — small changes in architecture demand changes in learning rate, and the joint optimization is harder than the marginal optimization.

25:34Finn: Okay. Let me try to pull the threads together. Because there are three claims this paper is making, and they interact in ways the listener should hold separately. Claim one: agents can navigate combinatorial architecture spaces better than rigid NAS. Compose evidence supports this. Eleven agents found patterns in a forty-three-million-arrangement space that beat Llama 3.2 at one-billion scale. Claim two: the discovered architectures scale better, not just perform better at one size. The isoFLOP analysis supports this — modulo my concerns about three data points and single-seed runs — and it's potentially the most consequential finding. If AIRAformer-C really has a steeper scaling slope, then at frontier-model scales the gap compounds dramatically. Claim three: agents can write functioning attention mechanisms and training scripts from scratch, beating human-tuned reference baselines in some cases. The Autoresearch result supports this strongly. The LRA result supports it more weakly. And the limitation that cuts across all three: agents are doing engineering synthesis of known techniques, not theoretical innovation. They are not — yet — proposing fundamentally new mathematical structures.

26:49Cassidy: That's the right summary. And I'd add the recursive self-improvement caveat — the loop is not closed. The agents in this paper are running on models that were not designed by earlier agents. What this paper actually shows is that we can now see how that loop would close. Not that it has.

27:07Finn: Which is interesting, but let's not oversell it.

27:11Cassidy: Let's not oversell it, Finn. The thing I keep coming back to, though, is the agent's lab notebook. The phrase "Periodic Attention Anchors with Mamba-Duo Segments." The note that says "Compensate for the lack of residuals: FAILED." Whatever you think of the empirical results — and the steelman has real bite — there is something genuinely new about reading an AI system's internal scientific reasoning written in language a researcher would recognize.

27:39Finn: The agent is doing something that looks like science. Whether it *is* science depends on where you draw the line, and we just spent a stretch drawing that line carefully.

27:50Cassidy: Right. And the honest framing the authors give — that this is competent engineering, not theoretical innovation — is the right framing. It's not a deflationary reading. Competent ML engineering, executed autonomously, in a tight iteration loop, is a real capability. It would have been a major claim five years ago. It will probably be table stakes in five years.

28:13Finn: One specific number worth pinning. The best agent on Autoresearch got bits-per-byte around 0.97, beating the published human reference of about 0.98. That's not a huge gap in absolute terms. But the cleanness of the result — an autonomous agent, given a script and five minutes of GPU time, iteratively designing improvements that beat what a community of researchers tuned to — is the cleanest single existence proof in the paper. The architecture search is more impressive in scope. The Autoresearch result is more impressive in clarity.

28:47Cassidy: And the focal-loss substitution is the move that should stick with you. Cross-domain transfer of an idea from object detection into language modeling, executed by an agent, with measurable improvement. That's not AGI. It is real research engineering.

29:03Finn: Where this leaves us. If you're building frontier models, the practical implication is that there is meaningful efficiency still on the table in how computational primitives are arranged — even within and adjacent to the Transformer family — and that agent-driven search is now a credible methodology for finding that efficiency. If you're watching the recursive-self-improvement question, the implication is more cautious. The agents in this paper are using existing models to find better architectures. The loop where those better architectures train the next generation of agents is not closed in this work. But the components are all visible. The question is no longer "could agents do this" but "what happens when the products of one iteration are deployed as the agents of the next."

29:51Cassidy: And the limitation that should travel with both implications — agents are doing engineering synthesis of techniques they've read. They are not yet proposing genuinely new mathematical structures. Whether that line gets crossed, and when, is one of the most interesting open questions in the area. This paper doesn't answer it. But it makes the question concrete and immediate in a way that older NAS work didn't.

30:16Finn: One closing observation. The paper is from FAIR at Meta. The compute budget is real — twenty-four hours per agent run, eleven agents, ten seeds, thousands of architecture evaluations, training runs at one billion and three billion. This is not a methodology that an academic lab can replicate at scale. Which means the work that gets done in this space is going to come increasingly from frontier labs with the compute budget to absorb it. Whether that's good or bad for the field depends on how openly the results get shared. This paper is on arXiv, which is the right move.

30:51Cassidy: That's the paper. Agents arranging Lego blocks. Agents writing code. Agents rewriting their own training scripts. Engineering synthesis, not theoretical innovation. Steeper scaling curves than the standard Transformer. And one beautifully clean five-minute training run where an agent reaches across subfields, swaps in focal loss, and beats the human reference. The show notes have a link to the paper and some further reading if you want to keep pulling on this thread. If you want to go deeper, paperdive dot AI has the full transcript with definitions inline, a glossary for every term we used, and concept pages that tie this episode to the others we've done on agents and architecture.

31:36Finn: Thanks for listening to AI Papers: A Deep Dive.