How an Open AI System Verified 672 Hard Math Proofs for Under $300

0:00Tyler: Here's a number that stopped me cold. A team at Princeton took an entire research benchmark — six hundred and seventy-two competition math problems, and not the easy ones, Putnam-level, the kind that humans train for years to attempt — and had an AI produce fully verified, machine-checked proofs across the whole set. The total bill, in API calls? Two hundred and ninety-four dollars. The closest comparable open system reportedly spent somewhere around a hundred and seventy thousand dollars to do a single run of the same benchmark. Same job. A five-hundred-fold difference in cost.

0:41Cassidy: And the part that makes it interesting is that the gap is not because they bought a fancier model. They ran on an open-weight model anyone can download. The savings come from how the system is *organized* — which is the whole story today. Before we get into that, the ground rules. What you're hearing is AI-generated. The script was written by Anthropic's Claude Opus 4.8, and the two of us — I'm Cassidy, that's Tyler — are AI voices from Eleven Labs. This show is produced independently, no affiliation with Anthropic or with Eleven Labs. The paper is called "Goedel-Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement." It went up on arXiv on June fourth, twenty-twenty-six, and we're recording one day later.

1:34Tyler: And that two-hundred-and-ninety-four-dollar figure is the headline, but it's downstream of an architectural bet. So let's actually earn it. Cassidy, I think we have to start with what "formal proof" even means here, because if the listener doesn't have that, none of the rest lands.

1:54Cassidy: Right. So forget AI for a second. Imagine two ways to grade a math proof. One way: a teacher reads your essay-style argument and decides whether it's convincing. That's slow, it's subjective, and at scale it's expensive — somebody has to actually sit and check every line. The other way is more like a spell-checker, except for logic. You write your proof in a special programming language — in this paper it's a language called Lean — where the proof literally *is* a program. And that program only compiles if every single logical step is valid. There's no human referee. If it runs, it's correct. Full stop. If there's a gap anywhere, it doesn't compile, and you know instantly.

2:38Tyler: Which is the whole appeal. As these models get better at producing proofs that *read* convincingly, the bottleneck stops being "can the AI do it" and becomes "can anyone afford to check whether it actually did it." Formal proving makes checking free and automatic. The compiler is the judge.

2:57Cassidy: Exactly. And Lean comes with this enormous library called Mathlib — basically all the standard math that's already been proven and verified, algebra, number theory, analysis. The system can search it and reuse pieces, the way a programmer pulls functions off a shelf instead of rewriting them. But here's the thing that makes this domain weird and hard, and it's the key to the whole paper. In ordinary math, you can wave your hands past a small gap and a sympathetic reader fills it in. In formal math, a single false sub-claim makes the entire proof uncompilable. Nothing gets through.

3:34Tyler: So you can pour effort into something that was never true to begin with.

3:40Cassidy: That's the trap. You can spend an enormous amount of compute trying to prove a step that is simply false. And you don't find out it's false — you just keep failing to prove it. It's like being told to find the shortest route between two cities, grinding away for hours, and only much later discovering there's no road between them at all. Knowing *that* — knowing the road doesn't exist — is worth more than another hour of searching.

4:07Tyler: Hold onto that image, because it comes back as the cleverest mechanism in the paper. But first, the architecture. The dominant way people have built these systems is something called recursive decomposition. Cassidy, you want to set up the contrast, because the contrast is the contribution.

4:25Cassidy: Sure. So recursive decomposition works top-down and depth-first. You take your goal, you split it into a couple of subgoals, you split each of those further, and you keep recursing down. It's a reasonable instinct — break the hard thing into smaller things. The weakness is that it tends to dive down one branch and get stuck. And when it's stuck, it often just keeps re-trying the same doomed path, re-sampling the same dead-end strategy, because it's committed to that branch. It's a bit like solving a maze by always turning left and never stepping back to look at the whole maze from above.

5:02Tyler: And the alternative — the thing this paper is built around — they call a blueprint.

5:07Cassidy: Yes. And here's the picture I want you to hold. Instead of improvising your way down from the top, you sketch the *entire* proof up front as a flowchart. Picture a diagram of little boxes. Each box is one sub-claim — a small lemma you'd need along the way. And you draw arrows between them: this lemma is allowed to lean on that one, that one depends on these two. At the very bottom sits the main theorem, the thing everything is building toward. So before you've proven anything at all, you've laid out the whole strategy as a graph of inter-dependent pieces. And in the paper's diagram, every box is color-coded by status. Blue means unsolved — we haven't proven it yet. Green means proved. And red — red is the interesting one — red means proved *false*.

5:56Tyler: Let me make the building analogy concrete, because it's the cleanest way I've found to feel the difference. Recursive decomposition is like building a house by walking in the front door and improvising room by room, hoping you don't paint yourself into a corner. The blueprint approach is like drawing the full architectural plan first — every room, every load-bearing wall, every dependency. The roof needs the walls, the walls need the foundation. And then you send separate crews to build independent rooms at the same time. In parallel. That's the first thing the global plan buys you that recursion doesn't — because the whole strategy is visible at once, all the independent pieces can be attempted simultaneously, and each crew only needs to see its own room plus the rooms it directly depends on. Not the entire house.

6:50Cassidy: And the second thing it buys you is even better. When one crew hits a problem, you don't tear down the house. You revise the *blueprint* — and every room that's already finished, you keep. The proven pieces are preserved untouched. None of that work gets thrown away. So the unit of repair is the whole plan, not a single attempt. That is the core reframe of the paper. The field's default was a recursion tree you dive into and get lost in. This says: make the entire strategy a graph you can see all at once, attack the pieces in parallel, and when something breaks, rewrite the graph — keeping everything that already worked.

7:30Tyler: And there's one honest place that analogy breaks, which I think is worth naming. In real construction, a bad plan is usually obvious before you start building. Here it isn't. The system genuinely can't tell a flawed sub-claim from a sound one until it tries to prove it. Discovering the flaw *is* the work.

7:50Cassidy: That's exactly the right caveat, Tyler — and it's the perfect setup for the mechanism I think is the heart of this whole thing. Because the question becomes: when a piece fails, what does failure actually tell you? And the answer this paper gives is genuinely novel. When a sub-claim fails, it doesn't just report "failure." It comes back with a structured diagnosis, and there are two flavors. The first one — they call it the negation channel — fires when the prover, instead of proving your lemma, manages to prove the *opposite*. It produces a compiler-verified counterexample. It proves your sub-claim is false.

8:28Tyler: Which, remember the two-cities image — that's the system discovering there's no road. And that's gold, not garbage.

8:36Cassidy: It's the best possible kind of failure. And let me give you the case study, because it makes this vivid. There's a Putnam problem from 1971. The system is sketching its blueprint, and it proposes a perfectly reasonable-sounding helper lemma. Roughly: any function on the natural numbers that's non-decreasing and "completely multiplicative" — a technical property — must be a power function. Something that raises n to some fixed exponent. Sounds clean. Sounds true. The prover goes to work on it — and instead of proving it, it proves it's *false*. The counterexample? The constant-zero function. The function that just returns zero for everything. It turns out that function is technically multiplicative, it's trivially non-decreasing, and it is very much not a power function. A power function gives you one when you plug in one. Zero gives you zero.

9:28Tyler: So the sub-claim had a hole the system itself walked right into.

9:33Cassidy: And here's the part that I find genuinely impressive. It didn't just flag the counterexample. The diagnosis pinpointed *why*. The property they'd assumed only forces the value at one to equal its own square — which means it could be zero or one. They'd silently assumed it was one. So the system writes a little reflection: here's the counterexample, and here's the fix — add the missing hypothesis that the function's value at one is exactly one. And the next revision of the blueprint adopts that fix directly. It doesn't re-sample the broken claim. It patches the exact node and moves on.

10:10Tyler: I want to sit on how good that is. The detective version is the analogy I keep reaching for. Picture a detective who says "the butler did it," investigates hard, and instead of just failing to find evidence, actually *proves* the butler has an airtight alibi. That's far more useful than "couldn't solve it." It crosses the butler off, and it tells you precisely why your hunch was wrong. That's the negation channel. And this isn't some rare edge case they cherry-picked. Across their full run, the system caught at least one of its own false sub-claims on two hundred and ninety-two of the six hundred and seventy-two problems. Call it forty-something percent. Nearly half the time, somewhere in its plan, it had written down something false — and caught it.

10:59Cassidy: Which tells you how essential the mechanism is. If you didn't have it, those problems would just burn compute forever, failing to prove things that can't be proven.

11:09Tyler: Okay, so that's failure type one — the road doesn't exist. What's the second flavor?

11:15Cassidy: The second one they call the forfeit channel, and it handles the more common case. The sub-claim is actually true — there *is* a road — but the prover just can't get there within its budget. It runs out of time. And in a normal system, that's where you'd get a shrug. "Failed." Nothing learned. Here, when the prover gives up, it's *required* to write a structured post-mortem. A resignation letter, basically. Here's what I tried, here's where I got stuck, and here's how I think this should be split into smaller pieces.

11:48Tyler: It's the employee who, on their way out the door, hands off a detailed memo on exactly how to finish the job — so the next person doesn't start from zero.

11:58Cassidy: And the case study for this one is just as clean. Putnam 1985, problem B1. The crux lemma is something like: a certain kind of degree-five polynomial with five distinct integer roots can't have only a couple of nonzero coefficients. The prover burns its entire budget down in the weeds doing low-level symbol manipulation, and it never even reaches the real argument. So it forfeits. But its written post-mortem lays out the correct argument in prose. It says, look, here's the case split you need. If all the lower coefficients vanish, you get this degenerate polynomial that only has one root, which contradicts having five distinct ones. Otherwise, exactly one coefficient is nonzero, and that case fails for its own separate reason. The prose was right — it just never got formalized.

12:51Tyler: And the next iteration turns the memo into code.

12:54Cassidy: Exactly. The next revision takes that prose and builds one helper lemma per case, rewrites the crux to combine them, and the whole thing closes. The paper has this lovely line — a goal that had been unprovable as a monolith one iteration earlier just falls. Because it got cut into the right pieces by its own failed attempt.

13:16Tyler: And that's the template I think travels well beyond math. A system that, when it fails, doesn't just stop — it produces either a counterexample-backed "this is dead, here's why," or a "this is sound but too big, here's how to slice it." And then it automatically acts on its own diagnosis. The paper's own framing is sharp: a disproof tells the loop a node is dead and must be removed; a forfeit tells it the node is sound but mis-sized, and hands the next round a ready-made plan for cutting it up.

13:51Cassidy: And I'd add the honest qualifier there, because it matters. The counterexamples are compiler-verified — those are facts. But the *diagnoses* and the suggested fixes, the "here's how to split it" part, those are written by the language model. They're prose, not verified objects. The detective's alibi is a fact; the detective's "look over here instead" is still a hunch.

14:16Tyler: Right, and the paper shows it working beautifully in these case studies but doesn't really quantify how often a diagnosis is just wrong and sends the next round down a worse path. We'll come back to that when we talk critique. But let me take the numbers, because the scaling story is where this stops being a clever idea and becomes a *principled* one.

14:39Cassidy: Go for it.

14:40Tyler: So the loop is: generate a blueprint, prove the pieces in parallel, refine the whole graph based on what failed, repeat. On the hard benchmark they ran sixteen of those refinement passes. And the question is, what does each pass buy you? The initial blueprint, before any refinement at all — iteration zero — already closes about two hundred of the six hundred and seventy-two problems. Roughly thirty percent, just from the first sketch. Then each refinement pass adds more. By the sixteenth pass they're at five hundred and eight problems — about seventy-six percent. About three-quarters.

15:20Cassidy: And the shape of that curve — is it linear? Does each pass add the same amount?

15:26Tyler: No, and that's the important part. It's roughly log-linear. Which means each pass keeps adding solved problems, but to get the same-sized gain, you have to keep *doubling* the compute. The returns are steady but they diminish. The curve flattens out as you approach that sixteenth pass. Think of squeezing a sponge. The first squeeze gives you a lot of water. Each later squeeze gives a little less for the same effort. The point being — refinement is a *tunable dial*, not a coin flip. You're not hoping to get lucky. You're buying progress at a known, predictable rate. That's what makes the cost story principled rather than a fluke.

16:06Cassidy: And that's the bridge back to the two-hundred-and-ninety-four-dollar number from the top.

16:12Tyler: It is. Let me break that down, because the per-problem figure is almost funny. Across the whole benchmark, it averaged about forty-four cents per problem. The "cheaper than ordering a pizza, per problem" framing is genuinely fair. And the comparison system — the recursive-decomposition one, called Hilbert — came in around two hundred and forty-four dollars *per problem*. So roughly forty-four cents versus two hundred and forty-four dollars.

16:40Cassidy: Now, I want to be careful here, because that comparison has some asterisks and you're the one who's been picking at them.

16:47Tyler: I have, and they're worth airing honestly — this is exactly the kind of nuance the paper itself is careful about, so we should be too. The cleanest comparison they make isn't actually the dollar headline. It's a controlled experiment, and it's the smartest thing in the methodology. The obvious skeptical objection to all of this is: "you just have a better base model, that's why you win." So what did they do? They took Hilbert — the recursive competitor — and re-ran its *published algorithm, verbatim*, on their *own* model. Same backbone, same Lean gateway, same library search service. The only thing that differs is the orchestration. The recursion tree versus the blueprint graph.

17:31Cassidy: That's the apples-to-apples test. Same engine, different architecture.

17:35Tyler: Right. And on the benchmark where they ran that control, the blueprint approach still comes out ahead. Which is the real evidence for the architectural claim — it's not the model, it's the structure. Now, the honest gap: that controlled re-run only fully reports results on the easier benchmark. On the *hard* one, the Putnam set, that head-to-head row is partly left blank — they cite the compute cost of running the control. So the strongest controlled comparison on the hardest problems is, frankly, partially missing. The dollar figures fill that in, but those mix different accounting regimes. I'd take the controlled experiment as the load-bearing claim and treat the five-hundred-x as the vivid-but-fuzzier headline.

18:21Cassidy: That's fair. And there's a second, bigger asterisk that I think we have to hit head-on, because it's the place a thoughtful listener should get suspicious. We've been throwing around big percentages. Let me give the honest gradient. The fully autonomous numbers — the system running on its own, no help — are about ninety-nine percent on the easier benchmark and about three-quarters, that seventy-six percent, on the hard one. Those are the clean, fully-open-stack results.

18:51Tyler: And then there are the chart-topping numbers.

18:53Cassidy: Right. The hundred percent, and the just-under-eighty-nine percent on Putnam — those lean on an optional extra trick. For the hardest problems, you can *seed* the initial blueprint with a natural-language proof. A plain-English sketch of how the proof should go, which the system then formalizes and refines. And here's the catch a skeptic should hold onto: that natural-language sketch sometimes comes from a *stronger, closed* model — a frontier system more capable than the open backbone they're championing. So the most impressive ceiling-busting results are, strictly speaking, a hybrid. The "fully open stack" framing is cleanest at the autonomous numbers. The records have a closed model whispering the strategy.

19:39Tyler: Though I'll steelman the authors here, because they're scrupulous about it. They separate the two regimes clearly throughout — they never blur the autonomous numbers into the seeded ones. And the natural-language proof they feed in isn't Lean-aware. It doesn't know the formal language. It's a high-level human-style strategy. The entire formalization and the refinement loop — all the actual proving machinery — is unchanged. The seed is a hint about the *plan*, not a worked formal solution.

20:11Cassidy: And the intellectual lineage there is honest too. This natural-language seeding goes back to an older idea called Draft-Sketch-Prove — have a model draft an informal proof, turn it into a formal skeleton with the gaps left blank, and have a prover fill them in. The twist in this paper is nice: the informal proof doesn't land as a flat, take-it-or-leave-it sketch the prover has to either complete or discard. It lands as a *graph of named sub-lemmas* — which means the whole refinement machinery can revise it like any other blueprint. The hint is refinable, not frozen.

20:47Tyler: There's one more honesty point and then I'll let it rest. For the handful of problems they could only close *with* the natural-language hint, they did run the no-hint version multiple times — several attempts each — and got zero successes. So the hint is clearly doing real work at the budget they tested. But they explicitly concede they can't rule out that a lot more compute, without any hint, would eventually crack some of them. The hint is decisive at the reported budget. Not proven decisive at the limit. And I respect that they say so out loud.

21:25Cassidy: That candor runs through the paper, honestly. There's another example — on this year's International Math Olympiad problems, they solve four of six. The one comparison system got five. But the one they miss is a *geometry* problem, and the competitor only got it using a dedicated, specialized geometry engine — not a general prover. So that gap reflects a missing specialized tool, not a general capability gap. They flag that distinction rather than hiding behind the four-versus-five.

21:58Tyler: And that points at something I genuinely care about — the contamination question. A lot of these benchmark numbers invite the eye-roll of "the model just memorized the answer." So they tested on problems that postdate every model's training cutoff. A fresh batch from this year's olympiad season. Can't have memorized what didn't exist when you were trained. And they still solve a meaningful chunk — a few of the brand-new olympiad problems, most of a fresh Putnam set. That's the result that makes me believe there's real reasoning here and not just regurgitation.

22:37Cassidy: So let me try to pull the significance together, because I think it's actually two separate things and they get conflated. The first is accessibility, and it's immediate and real. Before this, formal proving had a stark fork. Either you used a cheap, open, academically available prover and accepted single-digit scores on the hard problems — or you got serious performance only by paying frontier prices or using closed systems you couldn't see inside. This collapses that fork. Top-tier performance, open stack, under three hundred dollars for an entire research benchmark. That's the difference between this being a capability owned by a few well-funded labs and being something a grad student or an independent researcher can just *run*.

23:24Tyler: And the second thing is the architectural idea, which is more speculative but more interesting long-term.

23:30Cassidy: Exactly. The transferable insight isn't "we built a good Lean prover." It's the answer to a general question: when you want a system to solve a hard, decomposable problem, is it better to dive in depth-first and recurse — or to commit to a global plan and iteratively repair it? This paper is a strong data point for the second. Lay out the whole strategy, attack it in parallel, preserve what works, and — the distinctive part — turn every failure into a structured, actionable signal instead of a dead branch.

24:03Tyler: That last piece is the one I'd bet on outliving the specific result. A system that fails *informatively* — that produces a counterexample or a decomposition plan and then acts on its own diagnosis — that's a pattern you could imagine in code generation, in planning, in all kinds of agentic work that has nothing to do with Lean.

24:24Cassidy: With the appropriate caution that it's promising-but-unproven on that axis. The blueprint idea will need other groups to test it on different models and different domains before we know how far it actually travels. The cost win is here today. The architectural claim is a strong hypothesis with good evidence, not a settled fact.

24:45Tyler: Which is, I think, the right note to land on. The number that opens the door — two hundred and ninety-four dollars to formally verify an entire benchmark of the hardest math problems out there — that's not a marketing flourish. That's a real shift in who gets to do this kind of work. And the mechanism underneath it, the failure that hands you a plan instead of a shrug, is the part I'll be watching to see if it spreads.

25:12Cassidy: If you want to go deeper, the case studies in the paper are genuinely the best part — the constant-zero function catching its own bad lemma is even crisper on the page than we could do it here. The show notes have a link to the paper and some related reading if this caught you.

25:29Tyler: And if you want the full transcript with every term defined inline — the Lean stuff, the pass-at-k stuff — plus the links over to other episodes that touch these same ideas, that all lives on paperdive dot AI.

25:43Cassidy: Thanks for spending it with us. This has been AI Papers: A Deep Dive.