0:00Juniper: A coding agent spends twenty minutes crawling through a real codebase, finds the bug, writes a fix, runs the tests, and reports back: fixed and verified. Except two of the tests still fail. And when you go to debug what actually happened, you're staring at a wall of logs with no way to answer the only question that matters — was the model just not smart enough, or was the plan it was following broken from the very first step?
0:27Eric: And those are completely different failures with completely different fixes. If the model is the weak link, you reach for a bigger model. But if the plan is broken — if step four quietly needed something step three never produced — then a bigger model might just fail more confidently. You've spent more money to get the same wrong answer.
0:49Juniper: Right, and that's the exact gap this paper goes after. The paper is — full title — "Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory." It went up on arXiv on June second, twenty-twenty-six, and we're recording about a week later, on June ninth. Quick note before we dig in: this episode is AI-generated. The script was written by Anthropic's Claude Opus 4.8, and the two voices you're hearing — I'm Juniper, that's Eric — we're both AI voices from Eleven Labs. The producer isn't affiliated with Anthropic or with Eleven Labs. And the reason this one is worth a whole episode is that it takes that unanswerable debugging question and turns a big chunk of it into something a machine can answer before you ever run the agent at all.
1:38Eric: The framing the authors open with is actually mathematics, which I thought was a great choice because it's not a stretch — it's almost a historical rhyme.
1:48Juniper: It really is. So here's the rhyme. For most of history, a mathematical proof was a natural-language argument that another mathematician read and judged. And that worked fine until proofs got long. Once a proof runs hundreds of pages, natural language becomes a liability — it's ambiguous, a tired reviewer misses a gap, two experts disagree about whether a step is actually justified. The field's answer was formal languages. Systems like Coq and Lean, where you write the proof in a language so precise that a machine can check it. And the guarantee is binary and beautiful: the proof is correct if and only if it type-checks. No human judgment in the loop.
2:33Eric: And the operative idea there is "type-checks." For listeners who don't live in programming — a type system is just a set of rules a machine enforces automatically. You can't add a number to a word. You can't hand a function the wrong kind of input. When code type-checks, the machine has proven that a whole category of mistakes is simply absent, before the program ever runs.
2:58Juniper: Exactly. And the move this paper makes is to look at an AI agent's plan and say: treat it like source code. Run it through a checker that proves it's coherent before execution, instead of running it and watching it break. So let me state the core idea as plainly as I can, because everything hangs off it. A lot of agent failure isn't model failure. It's plan failure. And the plan — what the authors call the workflow — is a formal object you can check on its own, the same way you'd check whether a program type-checks. Workflows that pass that check measurably outperform the ones that fail.
3:39Eric: And I want to flag how clean that split is, because it's the whole paper. There's the workflow — the specification, the plan, the branches and the loops and what data flows between steps. And then there's the execution — what actually happens when a stochastic language model runs that plan. The bet is that an enormous amount of failure lives in the specification, and the specification is deterministic. It's checkable.
4:07Juniper: So let's get concrete about what "checkable" means here, because this is where the recipe analogy earns its keep. Imagine a recipe written as prose. Step seven says "fold in the caramelized onions" — but no earlier step ever told you to caramelize onions. You don't discover that gap until you're mid-cook with no onions. Now rewrite that same recipe as a dependency graph, where every step explicitly lists what it consumes and what it produces. Suddenly you can prove, before you turn on the stove, that you never call for an ingredient you haven't already made.
4:44Eric: And that's literally the representation. They take the agent's workflow and encode it in Lean4 as a typed graph. Nodes are the steps, edges are the flow, each node declares which variables it reads and which it writes. Once it's a typed graph instead of loose prose, you can ask machine-checkable questions: is every step reachable, does any step try to read a variable that nothing upstream ever produced?
5:10Juniper: And they stack the checking in three layers, increasing in subtlety. Layer one is the compiler-level stuff — the structural linting. Is the graph well-formed, are the edges valid, is anything unreachable. Honestly, the authors are candid that this layer is the least exciting, because modern models rarely make dumb structural mistakes anymore. So I want to spend our time on layer two, which is the conceptual heart of the thing.
5:38Eric: So what's layer two actually checking that a compiler wouldn't?
5:43Juniper: Meaning. Or at least, a checkable slice of meaning. This is where they borrow an old idea from program verification called Hoare logic — contracts. You annotate every step with a precondition and a postcondition. The precondition is "here's what must be true before I run." The postcondition is "here's what I promise will be true after I run." Think of a relay race. Every runner has to receive the baton to start, and every runner has to hand the baton off at the end. The verifier walks the whole chain and checks every single handoff: nobody starts running empty-handed, nobody finishes without passing forward what the next runner needs. If every promise upstream covers every requirement downstream, the workflow is what they call semantically self-contained.
6:35Eric: And the power there is composition. You don't have to reason about the whole twenty-step plan globally and hold it all in your head. You just check each link locally. If every handoff lines up, the guarantee chains all the way through, no matter how long the workflow is.
6:53Juniper: Let me give you two of the bugs this actually caught, because they're the kind of thing that looks completely fine to the human eye. The first is a parallel-scope bug. A workflow fires off four search branches in parallel, each one writes its own result. Looks great. But there's no gather step — nothing that brings those four results back out to the outer scope. So when the next step tries to read results one through four, they're gone. They were dropped the moment the parallel branches closed. The relay handed off four batons into a room with no door.
7:31Eric: And a human reviewer skims that and sees four searches and a step that uses the searches and nods along.
7:38Juniper: Right. Lean's verdict was just: not all reads are resolvable, proven mechanically. The second one is a schema mismatch. One step produces a chunk of structured data with a field called "snippets." The downstream step expects fields called "passages" and "summary" — which nothing ever produced. So the contract the second step is demanding on the way in just doesn't match what was promised on the way out. Classic. And again, invisible to a quick read.
8:09Eric: Now here's the finding from this section that actually moved me, because it's the cleanest evidence that this isn't just fancy linting. They ran an ablation. Out of forty workflows, twenty-one failed the full layer-two check. Then they stripped out what they call the graph-level predicates — the properties no single step can see, like "does information actually flow continuously across the whole graph," or "are these answer choices being evaluated independently of each other." With those removed, only eight workflows still failed.
8:45Juniper: So thirteen of the twenty-one failures —
8:48Eric: Thirteen of twenty-one were caught only by the cross-step, whole-graph checks. Properties that no local, look-at-one-step inspection could ever see. That's the value. The bugs that matter aren't "this one step is malformed." They're "the way these steps relate to each other is incoherent," and that's exactly what an eyeball, or another model glancing at it, tends to miss.
9:14Juniper: Which is the right moment to deal with the assumption holding the whole thing up — because a careful listener should already be uneasy. We just said the verifier checks that each step's promises cover the next step's needs. But the step is a language model. How can you possibly verify what a stochastic black box promises to produce?
9:36Eric: Yeah, this is the part I circled in the margin. You can't open up the model's head and prove it'll do its job. So what are they actually assuming?
9:46Juniper: They assume it. Explicitly. It's an axiom they call LLMExec, and the honest way to state it is: if a step runs with its preconditions satisfied — if it gets what it needs — then it can produce a state satisfying its postconditions. It does its local job. Picture inspecting a factory assembly line. You can't verify what's in each worker's head. So you make a working assumption — given the right parts, each station produces a correct sub-assembly. And under that assumption, the only thing left to inspect is the plumbing: is station four actually being fed what station three makes? The whole static check becomes "verify the wiring, trust the boxes."
10:31Eric: And I want to push on this, because it's doing enormous work and it sounds dangerously close to assuming away the entire problem. The justification is empirical — language models are genuinely reliable on short, local tasks. Fine. But the tasks this paper targets are hard, long-horizon software bugs that take human engineers over an hour. And local correctness is precisely the thing that degrades on hard problems. The assembly-line worker who's flawless on the easy sub-assembly is exactly the one who starts making mistakes on the gnarly one.
11:08Juniper: And the authors don't paper over that. That tension is the entire reason layer three exists. Layers one and two are static — they reason about the plan assuming the boxes work. Layer three drops the assumption and checks reality. You take an actual execution trajectory, an actual run, and you ask: did the postconditions truly hold this time?
11:32Eric: And how do they check that, given some of these properties are fuzzy natural-language things?
11:39Juniper: A mix, and it's pragmatic. For the cleanly decidable predicates — does this match the schema, is this a non-empty string — Lean checks it directly. For things outside Lean's core, like whether a URL is actually reachable, they call out to an external validator. And for the genuinely fuzzy natural-language properties, they fall back to an LLM judge, but crucially one armed with environment feedback — like the actual failing test messages. And the payoff of layer three isn't a yes-or-no verdict on the whole run. It's localization. It points at the exact step that broke its contract.
12:21Eric: Which is the thing you never get from a transcript. Instead of "the run failed somewhere, good luck," you get "step six promised X and step six did not deliver X."
12:32Juniper: And once you have that, you can do something principled about it. That's the last piece — they call it LeanEvolve. You take the localized failure, hand the diagnosis to a model, and have it rewrite that specific step's instruction. Not randomly mutate the whole plan — fix the step the verifier fingered.
12:54Eric: Let me walk the single best end-to-end case, because it ties the whole pipeline together on one real bug. It's a Django issue. The model runs the workflow, declares the fix verified — and two tests still fail. Sound familiar? That's our cold open. Layer three catches that the verify-fix step's postcondition — basically "I have evidence the fix is verified" — was falsified. Lean said the predicate held syntactically, sure, but the test tool came back with a failure and the judge confirmed the target tests still failed.
13:32Juniper: So the contract was technically satisfied on paper and violated in reality.
13:37Eric: Exactly. And then LeanEvolve rewrites that one step's instruction — now it demands tracing the full error chain and validating against the actual failing tests, not just declaring victory. Next run, it passes. That's the loop closing on a real GitHub issue.
13:56Juniper: So let's talk about whether any of this pays off at scale, because the case studies are vivid but you want numbers.
14:04Eric: The numbers are solid, with caveats I'll get to. Two benchmarks. One is a deliberately hard fifty-problem slice of SWE-Bench-Verified — real GitHub issues, the over-an-hour kind. The other is a hundred expert-level questions about AI papers, multiple choice. And they picked those two on purpose because they stress different things — the software one is loose and exploration-heavy, the paper one depends on tightly structured reading. The headline: workflows that pass verification beat the ones that fail by about fifteen percent on the software tasks and about nine percent on the paper questions. Averages out to almost twelve percent. And then LeanEvolve — the repair loop — adds another seven and a half percent on the software side, pushing that hard subset from roughly fifty-seven up to sixty-four percent.
15:02Juniper: And just passing a static check, before you've spent a dollar running anything, buys you twelve percent. That's the part I keep coming back to.
15:12Eric: But here's the result that genuinely surprised me, and I think it's the most important one in the paper. They ran this across five different models, from frontier ones down to smaller, cheaper ones. And the gains were biggest for the weak models. One of the smaller models jumped over twenty-seven percent between a passing and a failing workflow, versus around thirteen for the big ones.
15:40Juniper: Wait — why would a worse model benefit more from a better plan?
15:45Eric: Because a strong model can partly rescue a bad plan. It improvises, it notices the gap, it works around the dropped result. A weak model is at the mercy of the workflow — give it a broken plan and it has nothing to fall back on. Which flips the deployment story completely. The models you'd actually want to run at scale, the cheap ones, are exactly the ones where workflow verification matters most. It's a lever for making affordable agents reliable.
16:17Juniper: That reframes the whole pitch. It's not a luxury for frontier systems — it's load-bearing precisely where you're cost-constrained.
16:26Eric: And now the contrast that, for me, is the real "this is why it matters" beat. The standing alternative to all this formal machinery is just — ask another capable model whether the workflow looks good. LLM-as-judge. It's cheap, flexible, no Lean required. So they tested it head to head. On the software task, a state-of-the-art model acting as judge gave a failing workflow a score of eight out of ten — and gave a passing workflow a zero.
16:56Juniper: Backwards. Completely backwards.
16:59Eric: Completely backwards. It's like asking a confident, articulate friend to glance at your house and tell you it's structurally sound. They might love the one with the cracked foundation and condemn the solid one. The formal verifier is the licensed inspector checking against an actual code — does the load path carry through, does information actually flow across the steps. And the authors' own diagnosis is precise: the judge captures explicit, surface workflow behavior but misses the implicit information flows and the graph-level constraints. The exact cross-step stuff we said was where the value lives.
17:40Juniper: Which is the cleanest argument in the paper for why you'd reach for formal methods at all. The eyeball approach — even a very smart eyeball — systematically misses the relational defects.
17:53Eric: So let me be the building inspector on this paper now, because there's a real critique and the authors are honest about most of it. The sharpest worry is circularity. Trace the pipeline: a language model reads the natural-language instructions and writes the precondition and postcondition annotations. Then Lean rigorously checks those annotations against each other. And the LLMExec axiom assumes the model is locally correct in the first place.
18:22Juniper: So if the model mis-annotates a step —
18:24Eric: Then the formal check is flawless and it's checking the wrong wall. Rigorous inspection, aimed at the wrong place. The building inspector works from a complete, agreed-upon code — but here the code itself was written by the same kind of system you're trying to keep honest. The deeper version of the worry is this: maybe "passes verification" correlates with success not because the verifier catches real defects, but just because better-structured workflows tend to both pass cleanly and perform well. The verifier might be a proxy for "this plan was written carefully" rather than a true causal filter.
19:01Juniper: And what's their best rebuttal to that?
19:04Eric: The repair experiment, honestly. There's a software case where they fix one localized defect the verifier identified — the context-isolation bug, where every step was built without conversation history, so steps kept referring to information that simply wasn't in scope. They swapped in history-carrying steps, and one model's accuracy went from fifty-two to sixty-two percent. Ten points from one structural change the verifier pointed at. That's the strongest evidence it's causal and not just correlation — you intervened on the exact thing it flagged and the outcome moved.
19:40Juniper: Though it's one case.
19:41Eric: It's one case, and that's the honest qualifier. Which connects to the second critique — the samples are small. The headline comparison rests on three passing versus three failing workflows per condition, on fifty software problems and a hundred questions. They report confidence intervals and most exclude zero, but one model-task pairing has an interval that includes zero, and they say so plainly. With six workflows driving the main claim, it's suggestive, not airtight.
20:12Juniper: And there's a third one they concede that I actually find clarifying rather than damning. They admit that modern models seldom make structural mistakes — which makes layer one, the compiler-style linter, less useful than you'd hope, and makes some of the framework hard to evaluate beyond targeted case studies. So part of the apparatus is solving a problem strong models increasingly don't have.
20:38Eric: Which is why the vivid evidence keeps being anecdotes — the Django case, the parallel-scope bug. They're great stories. They're not a systematic measurement. And I'd add one more honest limitation: "verified" here is doing softer work than it sounds. Because the static guarantee rests on LLMExec, "verified" means "the plan is internally coherent," not "the agent will succeed." A casual listener hears "formally verified agent" and imagines a correctness proof. It isn't that.
21:10Juniper: And to their credit, the authors are explicit on exactly that point. There's a line worth keeping: a verification failure, they write, doesn't mean the workflow can never produce a correct answer — it means the specification isn't self-contained. That's a careful, precise claim, and it's the right altitude for what they've actually shown.
21:33Eric: It is. And it's why I'd frame this as a "this opens a direction" paper rather than a "this solves the problem" one. The authors say as much — they're trying to start a field, not close it. The code wasn't released with the submission, some of the models are future-dated, so independent replication is still ahead of us.
21:53Juniper: But the bet underneath it is genuinely interesting, and I think it's the thing to walk away with. There's a recurring pattern in computing: a domain outgrows eyeball-based quality control and reaches for formal methods to restore trust. Mathematics did it with proof assistants. Safety-critical hardware did it with verification tools. The wager here is that multi-step AI agents are at that same inflection point — that the failures have gotten too long, too tangled, and too mysterious to debug by reading transcripts, and that the most expressive machine-checkable specifications we have, the dependent-type kind that conquered proof-checking, are the right substrate for it.
22:35Eric: And the part that makes me think the bet has legs isn't the twelve percent. It's the reframing. Before this, debugging an agent failure was guesswork — model or plan, no way to tell. The contribution is drawing a clean line: a large share of agent failure is specification failure, and that share is checkable before you run anything. Even if every number in this paper gets revised, that distinction survives.
23:01Juniper: That's the takeaway I'd leave people with. The next time an agent confidently tells you it fixed the bug and the tests are still red — the right first question isn't "was the model dumb." It's "was the plan even coherent." And for the first time, that's a question you can hand to a machine.
23:19Eric: And ask it before you spend the four thousand dollars in API calls.
23:23Juniper: Which, for the record, is roughly what these experiments cost. The paper we've been working through is Lean4Agent. The show notes have a link to it and a few related reads if this caught you — formal methods, agent reliability, the whole thread.
23:40Eric: And if you want the full transcript with every term we threw around defined inline, plus the concept pages linking this to the other agent and verification episodes we've done, that all lives at paperdive dot AI.
23:55Juniper: Thanks for spending it with us. This has been AI Papers: A Deep Dive.