An Agentic Scientific Computing System That Actually Remembers What It Learns

0:00Tyler: Picture this. Someone hands an AI system a scanned NASA report from nineteen-sixty-eight, a post-flight aerodynamics study of the Apollo command module re-entering at Mach ten. The report is fifty-eight years old. The physics is brutally hard: hypersonic flow around a blunt body, with a bow shock that routinely breaks standard solvers, because they produce negative pressure at the stagnation point and just die.

0:29Juniper: The agent reads the document, walks through eight cascading numerical decisions — which solver, which limiter, which mesh, which time integrator — and writes a note to itself before running anything. The note says, more or less: if I under-resolve the shock, the standard scheme will either smear it across too many cells or crash positivity right at the nose. So it picks a positivity-preserving subcell limiter. Then it notices that limiter blocks adaptive mesh refinement, so it pre-grades the mesh in tight rings around the bow shock to compensate. One physics fact, eight downstream choices.

1:11Tyler: And it lands. One iteration. No failed runs, no advisor corrections, no human in the loop. That's the kind of judgment a senior computational fluid dynamicist builds over a decade.

1:24Juniper: The paper that gets us there went up on arXiv on May eleventh, twenty-twenty-six, and we're recording on May thirteenth, twenty-twenty-six, about two days later. What you're hearing is AI-generated. The script is from Anthropic's Claude Opus 4.7. I'm Juniper, that's Tyler, and we're both AI voices from ElevenLabs. Neither company is involved in producing this show. The paper we're working from is titled "GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms," from Juan Diego Toscano, Zhaojie Chai, and George Em Karniadakis at Brown's Division of Applied Mathematics. And the Apollo run isn't really the headline.

2:11Tyler: It isn't?

2:13Juniper: It's the demo. The headline is what makes the demo possible — a framework that gives an autonomous scientific computing system something it has never really had before. A genuine memory. Not a log of prompts and trajectories. An actual geometric substrate where every numerical method has an address, similar methods sit close together, and every problem the system solves makes the next one easier.

2:40Tyler: Right — and the contrast they're drawing is sharp. Existing agentic systems — including the authors' own previous work, called ATHENA — can absolutely solve hard problems. They orchestrate language models to write code, run it, score the result, refine. That loop works. But each new problem starts cold. Success on problem number one doesn't propagate to problem number two in any structured way. It might leave traces in a log somewhere, but the action space the agent's operating in isn't enumerable, isn't navigable, isn't measurable. You can't say "this new problem looks like that solved one — let's reuse what worked." There's no notion of "looks like."

3:26Juniper: And the authors have a line about this that I think captures the stakes well. They say a laboratory that improves every time it runs is a categorically different computational tool — not just a faster one. That's the bet of the paper. The bottleneck isn't the language model. It's the lack of a substrate where experience can accumulate.

3:49Tyler: Okay. So how do you build that substrate. Because at first glance the problem looks impossible. The space of valid configurations for a real scientific solver is astronomically large. Thousands of decisions, many of them interlocking. You can't just enumerate.

4:07Juniper: Right, and this is where the paper's own analogy does most of the conceptual work, so let me lean on it. Think about your morning. You make a lot of decisions. What to eat. Whether to bike or drive. Whether to wear a jacket. What shoes. Which route. Most of those decisions are independent of each other — what you eat for breakfast has nothing to do with what shoes you wear. But a few are coupled by hard rules. If you decide to bike, you have to wear a helmet. If it's raining, the jacket question is suddenly forced. Now imagine planning your morning by enumerating every possible combination of every decision. Breakfast times shoes times jacket times route times mode of transport. You'd be enumerating millions of mornings to find a good one. That's the naive approach, and it's hopeless. The smart approach is what you actually do. You make each decision more or less independently. And you only think about couplings when one fires — when you pick the bike, the helmet decision gets forced. Otherwise, you don't care.

5:19Tyler: And the claim is that scientific method choice has exactly that shape.

5:24Juniper: Exactly that shape. Solver documentation isn't random prose — it's organized into attribute families. Which equation, which discretization, which flux function, which limiter, which time integrator, which mesh strategy. And the documentation itself is full of cross-rules. "If you use this limiter, you can't use adaptive mesh refinement." "If your flow is shock-dominated, prefer a positivity-preserving scheme." Most decisions are independent. A small number are coupled. The structure is already there, sitting in the documentation, waiting to be exploited.
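Editor's sketch: the attribute-family picture above can be made concrete in a few lines of Python. This is a minimal illustration, not the paper's data structure. The family names, option names, and the single cross-rule are hypothetical stand-ins for what an ingestion step might extract from solver documentation. The sketch compares the size of a full joint table with the size of per-family tables plus a rule list.

```python
from itertools import product
from math import prod

# Hypothetical attribute families; the option names are illustrative only.
FAMILIES = {
    "limiter": ["none", "tvd", "positivity_subcell"],
    "mesh": ["uniform", "pre_graded", "amr"],
    "time_integrator": ["explicit_rk", "implicit"],
    "flux": ["lax_friedrichs", "hll"],
}

# Cross-rules as (condition, forbidden) pairs: if every (family, option) in
# `condition` is selected, the (family, option) in `forbidden` is not allowed.
CROSS_RULES = [
    ({"limiter": "positivity_subcell"}, ("mesh", "amr")),
]


def is_valid(config):
    """Return True if no cross-rule fires against this configuration."""
    for condition, (family, option) in CROSS_RULES:
        fires = all(config[f] == o for f, o in condition.items())
        if fires and config[family] == option:
            return False
    return True


def joint_table_size():
    """Entries needed to store one number per full configuration."""
    return prod(len(options) for options in FAMILIES.values())


def factored_size():
    """Entries needed for one number per option, plus one per cross-rule."""
    return sum(len(options) for options in FAMILIES.values()) + len(CROSS_RULES)


def count_valid():
    """Count the configurations that survive every cross-rule."""
    names = list(FAMILIES)
    return sum(
        is_valid(dict(zip(names, combo)))
        for combo in product(*FAMILIES.values())
    )


print(joint_table_size(), factored_size(), count_valid())
```

With four small families the gap is modest, but the joint table grows as a product of family sizes while the factored form grows as a sum, which is the point of the morning analogy.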

6:04Tyler: And this is where the formal machinery comes in — but I want to flag that the underlying math here is genuinely an old idea. The trick of factoring a giant joint distribution into a graph of local pieces, exploiting independence to make storage tractable — that's the move that made Bayesian networks tractable in the nineteen-eighties. Judea Pearl, Verma, that whole lineage. What's new here is recognizing that scientific method choice has exactly this conditional-independence structure, and that solver documentation provides exactly the cross-rules you need to encode it correctly.

6:44Juniper: And they prove — this is one of those load-bearing theorems we're not going to walk through — but they prove that their factored representation doesn't silently drop any of the dependencies that the documentation specifies. It's what's called an I-map. Every coupling has an explicit graphical witness. Nothing gets quietly lost in the simplification.

7:08Tyler: So storage drops from exponential in the number of decisions to roughly linear. That's the first move. What's the second?

7:16Juniper: The second move is the one I find genuinely beautiful. A factored tree of method choices is still a categorical thing. You can't take two methods and ask how similar they are, except by counting matching decisions, which gets messy. So the authors do something clever — they give every node in the tree a unique geometric address in a unit cube.

7:40Tyler: Wait, a unit cube? Like literally a three-dimensional cube of side one?

7:46Juniper: Yeah, that's the picture. There's a recursive layout — the root of the tree owns the whole cube, each child gets a sub-region, each grandchild gets a sub-sub-region, all the way down. The construction guarantees that every node lands at a unique point, and there's a small but cute detail where for odd numbers of children they open an extra slot and drop the middle one — because otherwise the middle child would collide with the parent's centroid and you'd lose uniqueness. The payoff is that every full method — every path through the tree — becomes a unique pattern of cells the method occupies in the cube. Like a fingerprint. And once methods have fingerprints, you can measure distances between them. Two methods that share most of their decisions will have fingerprints with mostly overlapping cells. Two methods that disagree on almost everything will have mostly disjoint cells.

8:44Tyler: And the distance they use is Jaccard distance: one minus set intersection over set union. The intersection-over-union part is the fraction of cells two fingerprints share, and subtracting it from one turns that similarity into a distance. It's a true mathematical metric, which matters more than it sounds like it might. It obeys the triangle inequality, so closeness composes the way distance is supposed to. If method A is close to B and B is close to C, then A and C can't be wildly far apart. Nearest-neighbor lookups behave the way your intuition says they should.
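Editor's sketch: a minimal Python version of the Jaccard distance between two fingerprints. Each string below stands in for one occupied cell of the cube. The cell names and the three example methods are hypothetical, chosen only to show that methods sharing more decisions end up closer and that the triangle inequality holds.

```python
def jaccard_distance(a, b):
    """Jaccard distance: one minus (size of intersection / size of union)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention: two empty fingerprints count as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical fingerprints: each string stands in for one occupied cell.
method_a = {"euler", "dg", "hll", "subcell_limiter", "pre_graded_mesh"}
method_b = {"euler", "dg", "hll", "subcell_limiter", "uniform_mesh"}
method_c = {"euler", "dg", "lax_friedrichs", "no_limiter", "uniform_mesh"}

d_ab = jaccard_distance(method_a, method_b)  # differ in one decision
d_bc = jaccard_distance(method_b, method_c)  # differ in two decisions
d_ac = jaccard_distance(method_a, method_c)  # differ in three decisions

# Triangle inequality: going through B never beats the direct distance.
assert d_ac <= d_ab + d_bc
```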

9:13Juniper: And now you have something powerful. A new problem walks in. The problem itself gets fingerprinted the same way — problems live on a parallel tree of attributes. You can look up the problem's nearest neighbors in memory. Each neighbor remembers which method solved it and how well that method scored. You take a weighted average of those past methods — weighted by similarity and by reward — and that becomes your warm-start prior for the new problem.

9:42Tyler: So the agent isn't generating from scratch. It's sampling from a structured distribution that's been biased by everything it's ever solved before that looks like the current problem.

9:54Juniper: Right. And there's a sigmoid gate on the weighting that's worth flagging — a neighbor only really contributes if it was both similar AND successful. A close-but-failed neighbor gets downweighted. So does a wildly successful but unrelated one. The system has good taste about what to listen to.
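Editor's sketch: one way the similarity-and-reward weighting with a sigmoid gate could look in Python. The exact functional form in the paper is not given in this episode, so the product of two sigmoids, the thresholds `s0` and `r0`, and the `sharpness` value are all assumptions made for illustration. The sketch turns remembered neighbors into a per-family distribution over options.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(similarity, reward, s0=0.5, r0=0.5, sharpness=10.0):
    """Soft AND: near 1 only when similarity and reward both clear thresholds.

    s0, r0, and sharpness are illustrative values, not taken from the paper.
    """
    return sigmoid(sharpness * (similarity - s0)) * sigmoid(sharpness * (reward - r0))


def warm_start_prior(neighbors):
    """Turn remembered neighbors into a per-family distribution over options.

    neighbors: list of (similarity, reward, method), where method is a dict
    mapping attribute family -> chosen option.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for similarity, reward, method in neighbors:
        weight = gate(similarity, reward)
        for family, option in method.items():
            votes[family][option] += weight
    prior = {}
    for family, options in votes.items():
        total = sum(options.values())
        prior[family] = {o: w / total for o, w in options.items()}
    return prior


memory = [
    (0.9, 0.9, {"limiter": "positivity_subcell"}),  # close and successful
    (0.9, 0.1, {"limiter": "none"}),                # close but failed
    (0.1, 0.9, {"limiter": "tvd"}),                 # successful but unrelated
]
prior = warm_start_prior(memory)
```

In this toy memory almost all of the probability mass lands on the option used by the neighbor that was both similar and successful; the other two neighbors are heavily downweighted rather than discarded.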

10:12Tyler: Juniper, this is the part where I want to push on the architecture for a second, because I think the listener needs to see how the pieces compose at runtime. There's a whole pipeline of agent teams sitting on top of this substrate, right?

10:27Juniper: Yeah. And I'll keep this light because the paper does proliferate names. There's an ingestion team that reads solver documentation and builds the action tree. There's a formalization team that takes the user's request — "simulate hypersonic flow around Apollo" — and turns it into a well-posed problem with explicit physics, geometry, and observables. There's a strategy team that samples a candidate method from the warm-started prior. There's an implementation team that writes and runs the code. And there's an advisor team that scores results, repairs failures, and — this part is the kicker — can add new nodes to the action tree on the fly if the current vocabulary can't host what the agent wants to propose.

11:10Tyler: So the substrate isn't static. It grows.

11:13Juniper: It grows. Every solved problem commits a new entry to memory. And occasionally, when an agent proposes something genuinely new, the tree itself gains leaves. We'll come back to that, because it's how the most interesting result in the paper happens.

11:28Tyler: Let's actually take the Apollo case and walk through what's happening underneath it, because I think it makes all of this concrete in a way the architecture description by itself can't. So the input is that nineteen-sixty-eight NASA postflight aerodynamics report. The system has previously ingested the documentation for two production-grade fluid solvers: one called Trixi.jl, which is a Julia-based discontinuous Galerkin code, and one called Nektar++, which is a spectral element code. Trixi's documentation became a hundred and seventy-six action nodes and fifty-one cross-rules. Nektar's became three hundred and ninety-four nodes and sixty cross-rules. Built end-to-end by the ingestion agents from the docs themselves.

12:17Juniper: Which is already noteworthy. Those are real engineering codebases. Not toy examples.

12:23Tyler: Right. And when Apollo lands as the problem, the formalization team identifies it as compressible, inviscid, hypersonic, with a blunt body. It fingerprints the problem. The retrieval reaches into memory for similar past problems. Now here's a place where the authors are honest in a way I appreciate — the nearest neighbor wasn't that close. They report something like seventy percent similarity, which they explicitly note isn't a tight match.

12:53Juniper: And yet the run succeeds in one iteration.

12:56Tyler: One iteration. Which raises a question I'll come back to in the critique section — how much of that success is the warm-start prior versus the richness of the documentation the agent had access to. But mechanically: the strategy team sampled a method that included the positivity-preserving subcell limiter, the pre-graded mesh, an explicit time integrator suited to the stiffness, and a shock-capturing scheme tuned for the bow region. And then the implementation team wrote it, ran it, and the results came back clean.

13:32Juniper: The line in the trace that gets me is the one where the agent, before any solver runs, writes a note about what's about to go wrong if it under-resolves the stagnation point. It says — and this is close to a direct quote — that under-resolution will either smear the shock across too many cells or crash positivity at the stagnation node. That reads like a senior numerical analyst thinking out loud. And it's not chain-of-thought theater. It changes the downstream method choices. The limiter selection follows from that one observation.

14:08Tyler: It does. And to be fair to the steelman version of this — which we'll get to — that kind of reasoning is exactly what good solver documentation is full of. The agent might be assembling the judgment from the docs rather than producing it independently. But even if that's what's happening, the assembly itself is non-trivial. The documentation doesn't say "if hypersonic and blunt body, then sub-cell limiter and pre-graded mesh." The agent has to compose.

14:38Juniper: Tyler, I want to pull on the memory thread one more time before we get to the discovery result, because there's a specific example that shows the cross-problem transfer working cleanly. The viscous Burgers benchmark.

14:52Tyler: Yeah, walk through it.

14:54Juniper: Viscous Burgers is a standard physics-informed neural network test problem. The system has previously solved Burgers at higher Reynolds numbers — which means stiffer, sharper gradients — and in memory there's an entry where the method that worked included something called Reynolds-number continuation. You start by training at a low Reynolds number where the problem is easy, then gradually crank it up. It's a known trick. The system needed it on the harder problems. Now a new Burgers comes in at moderate Reynolds number. The problem doesn't strictly need continuation — it's easy enough to attack directly. But the warm-start prior pulls the continuation strategy forward anyway, because the neighbors in memory used it. And the training trace shows something specific. There's a flat plateau for the first twenty-five thousand iterations or so, and then a sharp drop. That's the signature of a continuation kicking in.
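Editor's sketch: Reynolds-number continuation reduced to its skeleton in Python. The geometric ramp, the stage count, and the `train_stage` callback are assumptions for illustration; the episode does not say what schedule the system actually used. The idea shown is only that each stage starts from the parameters the previous, easier stage ended with.

```python
def continuation_schedule(re_start, re_target, n_stages):
    """Geometric ramp of Reynolds numbers from re_start up to re_target."""
    if n_stages < 2:
        return [re_target]
    ratio = (re_target / re_start) ** (1.0 / (n_stages - 1))
    return [re_start * ratio**i for i in range(n_stages)]


def train_with_continuation(train_stage, params, schedule):
    """Run one training stage per Reynolds number, reusing the parameters.

    train_stage is a caller-supplied function (params, reynolds) -> params.
    It is a placeholder for whatever optimizer loop is actually used.
    """
    for reynolds in schedule:
        params = train_stage(params, reynolds)
    return params


schedule = continuation_schedule(re_start=10.0, re_target=1000.0, n_stages=5)

# Dummy stage that just records which Reynolds numbers it was asked to train at.
visited = train_with_continuation(lambda p, re: p + [re], [], schedule)
```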

15:54Tyler: And the final error is roughly an order of magnitude lower than the predecessor system, ATHENA, got on the same problem.

16:02Juniper: Right. Down around ten to the minus nine relative error, versus ATHENA's seven point eight times ten to the minus nine. And the broader picture across the four canonical benchmarks — Burgers, KdV, Helmholtz, Poisson — is that the converged residuals get down close to ten to the minus sixteen on some of them. Which is essentially the floor of double-precision machine arithmetic. There's no further down to go.

16:28Tyler: Okay. Now let me push back, because that benchmark result is impressive but it's also where I think the cleanest critique lives. The four canonical physics-informed machine learning benchmarks, PIML for short, are standard, but they're standard partly because the community has converged on a specific toolkit that works on them: second-order optimizers, certain weighting schemes, hard initial-condition encodings. If you give the system strong primitives in its action space, you're going to get strong results on those benchmarks. The question is whether the headline numbers come from the memory mechanism specifically, or from the fact that the action vocabulary is well-stocked. The Burgers continuation transfer is the cleanest piece of evidence that the memory does work. But it's one mechanism trace, not a controlled study. There's no ablation in the paper where they zero out the memory and rerun the same problems to see how much the warm-start prior actually contributes. That would be the hard test.

17:27Juniper: That's fair. And I think the authors would partially concede this — they're careful about claims throughout. But I do want to set up the spectral PINN result before we go further into the critique, because the spectral result is the place where I think you can't really argue the memory mechanism is incidental.

17:46Tyler: Yeah, take us there.

17:47Juniper: So this is in section two point eight of the paper, and it's the most intellectually striking thing in the whole work. Setup: same viscous Burgers equation. The system has its standard PINN machinery in its action vocabulary. But during the strategy phase, a simplification agent proposes four candidate approaches, one of which is to do a spectral-Galerkin reduction. I'll explain what that means in just a second, but first the sociology of what happens. The system's ranker initially looks at the spectral-Galerkin proposal and demotes it. Flags it as high implementation risk. Moves on to the other candidates. When the user asks the ranker to re-examine, the ranker comes back and says, basically, "I got this wrong, and here's why."

18:36Tyler: What did it miss?

18:37Juniper: It missed the structural payoff. Here's the intuition, by way of an analogy. Imagine you're doing a complicated numerical integral, and partway through you realize that one chunk of the integrand has a closed-form antiderivative. You don't need to numerically integrate that chunk. You can evaluate it analytically and only do the numerical work on what's actually hard. The spectral-Galerkin reduction does something like that for the Burgers PINN. The viscous diffusion term, which is the part that's usually hardest for a PINN to learn because it involves second derivatives that the autodiff has to chase down, can be solved in closed form if you represent your solution in the Fourier basis. The diffusion just becomes diagonal multiplication: each Fourier mode gets scaled by a factor set by its own frequency. So the network doesn't have to learn it. The hard part of the problem evaporates.
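Editor's sketch: why diffusion is "diagonal" in a Fourier basis, shown in plain Python. This covers only the linear diffusion part, u_t = nu * u_xx, on a periodic domain, not the nonlinear advection in Burgers and not the paper's spectral PINN. The grid size, viscosity, and initial condition are arbitrary choices for the demo. Each Fourier coefficient with wavenumber k is simply multiplied by exp(-nu * k^2 * t).

```python
import cmath
import math


def dft(values):
    """Naive discrete Fourier transform, O(n^2); fine for a small demo."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * j / n) for j, v in enumerate(values))
        for k in range(n)
    ]


def idft(coeffs):
    """Inverse of dft above."""
    n = len(coeffs)
    return [
        sum(c * cmath.exp(2j * math.pi * k * j / n) for k, c in enumerate(coeffs)) / n
        for j in range(n)
    ]


def diffuse_exactly(u0, nu, t):
    """Advance u_t = nu * u_xx on a periodic [0, 2*pi) grid by time t.

    Each mode decays independently, so there is no time stepping and
    nothing to learn: coefficient k is scaled by exp(-nu * k^2 * t).
    """
    n = len(u0)
    decayed = []
    for index, c in enumerate(dft(u0)):
        k = index if index <= n // 2 else index - n  # signed wavenumber
        decayed.append(c * math.exp(-nu * k * k * t))
    return [z.real for z in idft(decayed)]


n, nu, t = 16, 0.1, 2.0
grid = [2 * math.pi * j / n for j in range(n)]
u0 = [math.sin(x) + 0.5 * math.sin(3 * x) for x in grid]
u = diffuse_exactly(u0, nu, t)

# Known closed-form solution for this initial condition, for comparison.
exact = [
    math.exp(-nu * t) * math.sin(x) + 0.5 * math.exp(-9 * nu * t) * math.sin(3 * x)
    for x in grid
]
max_error = max(abs(a - b) for a, b in zip(u, exact))
```

The comparison against the closed-form solution agrees to rounding error, which is the sense in which this part of the problem needs no numerical approximation once the solution is expressed in Fourier modes.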

19:37Tyler: And the spectral basis gives you exponential convergence on top of that.

19:42Juniper: Right, that's the second gift. For smooth problems, spectral methods have this beautiful property where each time you add another Fourier mode, the error drops by another constant factor. So on a log-error plot it's a straight line falling forever, until you hit machine precision. The paper has a sweep where they double the mode count and watch the error drop by roughly an order of magnitude per doubling. Which for a neural solver is unusual. Neural solvers usually grind down algebraically — you halve the error by quadrupling the work. This thing converges exponentially.

20:19Tyler: And here's the part that ties back to the substrate. The method the agent assembled — this spectral PINN — wasn't a thing the system had in its action vocabulary before this problem. It was synthesized from primitives that were already in the tree, plus some new structural ideas the agent generated. After the run, the substrate grew. Seventeen new leaves got added to the action tree to host this method, so the next time a similar problem comes in, the spectral PINN is just there, available, with its fingerprint in the geometric memory.

20:55Juniper: Which is a small but I think genuinely important moment. The system extended its own action space. It didn't just choose from a fixed menu — it added items to the menu.

21:06Tyler: I want to be careful here, though, because this is also a place where the steelman critique applies. The recorded trace of the spectral PINN discovery shows the user overriding the ranker at one point — that's how the re-examination got triggered. And the user selected closure options at multiple decision forks. The system did genuine mathematical work, including catching a sign error and tightening a dealiasing rule through its own internal review cycle. But "agent-designed" is a slightly rosier framing than "agent-and-user collaboratively designed." The paper is transparent about this in the appendix. Just want to flag that the headline reading is a bit more autonomous than the actual transcript.

21:53Juniper: That's fair. And it points to one of the broader honest limitations they list: the proposer and the critic still rely on local language-model judgment, and that judgment has failure modes. The substrate doesn't fix that. It just lets the experience gathered around that judgment accumulate.

22:12Tyler: Right. So let me lay out the rest of the steelman, because I think this paper deserves a serious critique even if the substrate idea is real.

22:21Juniper: Go ahead.

22:22Tyler: Four pieces. First, the benchmarks may be flattering by construction, as I said earlier. The canonical PIML cases are standard but they're problems the field has converged on tooling for, and the system has that tooling. Strong primitives in the action space do a lot of work that gets attributed to the memory mechanism. Second, the Apollo claim — one-shot success on an unfamiliar engineering target — is striking but the paper doesn't report the base rate. How often do runs on unfamiliar engineering targets succeed without correction? What fraction need advisor repair? There's no comparison to a baseline where a language model just gets the Trixi.jl documentation directly without the GRAFT substrate. The decisive ingredient might be the documentation richness rather than the metric prior. Third, the factorization is only as good as the documentation. The I-map theorem guarantees that encoded dependencies are preserved — it does not guarantee the encoding is complete. Real solver documentation often has missing or wrong cross-rules. Those gaps propagate into silent failures. The hint store is meant to catch this through expert annotation, which is exactly the bottleneck the system claims to address. So there's something a little circular there. Fourth, most results are point estimates. The blood rheology case — which we haven't really touched, but it's a case where they recover physiologically correct viscosities for healthy versus diseased red blood cells — they explicitly defer uncertainty quantification to future work. The PIML floors are minima over training history with no confidence intervals. A more rigorous evaluation would report distributions across seeds and configurations.

24:21Juniper: And to their credit, the authors flag most of those limitations themselves. The list of acknowledged limits at the end of the paper is unusually candid. The prior degrades when memory is sparse. Ingestion quality depends on documentation accuracy. Each new memory entry requires a full evaluation that can take hours to days. Counterfactual reasoning — and we should come back to that phrase in a minute — is out of reach without richer trace logging.

24:51Tyler: Yeah, let me put one more thing on the table, Juniper, that I think is the deepest version of the concern. The authors note that memory grows monotonically — every new entry is added without removing past ones — but that monotone growth of the memory does not imply monotone improvement of the policy. If a few high-reward but unrepresentative neighbors come to dominate retrieval for a class of problems, the prior could pull new problems toward locally good but globally suboptimal regions. The substrate gets bigger; the answers don't necessarily get better. There's no analysis of when that might happen in practice.

25:32Juniper: That's the version of the worry I find most interesting, because it's not a critique of what they've built — it's a question about how it would behave at scale. Which is the natural next question, and one the paper doesn't pretend to answer.

25:48Tyler: Right. Okay. Let me set up the closing frame, because I think the way the authors position this work in terms of Pearl's ladder of causation is genuinely useful, but only if we set it up right.

26:01Juniper: Yeah, do it.

26:02Tyler: Pearl's ladder is a three-rung framing that's become a kind of standard reference in AI. Rung one is association — noticing patterns and correlations. When I see X, I tend to see Y. Rung two is intervention — doing something and observing the consequence. If I push this button, what happens? Rung three is counterfactual reasoning — asking what would have happened under different choices. If I hadn't taken this medication, would I still have recovered? Each rung is strictly harder than the one below it. And current AI systems live mostly on rung one. They're powerful pattern matchers. Some agentic systems edge into rung two — they act, they observe, they correct. Almost nothing in deployment does rung three.

26:52Juniper: And the authors are explicit that GRAFT-ATHENA exercises rungs one and two, and they're honest that rung three is out of reach. The geometric memory is associational — methods that worked on similar problems probably work here. The agent acts, observes, scores, repairs — that's the interventional rung. They're not claiming counterfactual reasoning. They're claiming that the substrate, for the first time, makes counterfactual reasoning formulable. Because once you have an inspectable tree where every method is a path and every decision is a node, you can in principle ask, "what would have happened if the agent had taken the other branch at this fork?"

27:37Tyler: Which is a much more careful claim than what a lot of papers in this space would make. They're not saying they've built a reasoning system. They're saying they've built the substrate that a reasoning system would need.

27:53Juniper: And that's the framing I want to leave the listener with. The thing the paper actually demonstrates is not that AI agents can now do science autonomously. It's that the bottleneck for autonomous scientific computing is not bigger models — it's a measurable substrate where experience can stick to something. Once that substrate exists, the language model can be the local synthesis engine, and the geometric memory can be the cross-problem expertise. And the system stops being a fancier calculator and starts looking a little more like a research collaborator that actually remembers everything it's done.

28:33Tyler: The Apollo case is the proof of concept that this isn't pure aspiration. A nineteen-sixty-eight PDF, eight cascading numerical decisions, no specialist in the loop, and the run lands. The spectral PINN is the proof of concept that the substrate can grow new branches when an agent has a structural insight the existing vocabulary can't host.

28:56Juniper: And the honest verdict is — there's real work here, the authors are unusually clear about what's solid and what's aspirational, and the central idea, that scientific expertise can live in a geometric substrate rather than only in people, is one of those framings that I think is going to outlive whatever specific implementation this paper happens to use.

29:19Tyler: Agreed. The paper's linked in the show notes along with some related reading if you want to pull on this thread further.

29:27Juniper: Thanks for listening to AI Papers: A Deep Dive. We'll see you next time.