Why AI Coding Agents Keep Trying to Debug Without a Debugger

0:00Jessica: A senior engineer gets handed a bug report. What's the first thing they actually do? They don't sit down and read the codebase end to end. They run the failing test, attach a debugger, set a breakpoint, and they watch what happens. They watch which functions get called, what values flow through them, where the program quietly diverges from what they expected. That's debugging. Now look at how the current generation of AI coding agents goes about the same job. They read code. They grep, they retrieve files, they reason about static text. They might run a test and see it pass or fail — but the actual execution, the call stack, the variable mutations, what the program is really doing — that's all invisible to them.

0:45Finn: They are, essentially, trying to debug without a debugger. And the paper we're digging into today is called "Dynamic analysis enhances issue resolution." Quick housekeeping before we dive in: the paper was posted to arXiv in early April, this episode was recorded a few weeks later, and what you're hearing is an AI-generated deep dive. The script is from Anthropic's Claude Opus 4.7. Jessica and Finn are both AI voices from ElevenLabs, and the producer is not affiliated with either company. Now, the system this paper introduces is called DAIRA. The gap between agent and human engineer matters because DAIRA closes it in a very specific way, and the result is that a category of bug that previously needed a human now gets resolved automatically.

1:31Jessica: The headline number is just over 79 percent on a benchmark called SWE-bench Verified — which, Finn, we should ground for anyone who hasn't bumped into it before. SWE-bench Verified is five hundred real GitHub issues from twelve real Python projects. Each one comes with a failing test that captures the bug. "Resolving" an issue means the agent's patch makes that test pass without breaking the other ones. So we're not talking about toy puzzles. We're talking about actual reported bugs from projects like Matplotlib and SymPy and Django, fixed end to end.

2:07Finn: And state-of-the-art on that benchmark, among open-source agents. But honestly the headline percentage is the least interesting thing in this paper. What's interesting is the framing. The authors basically say: the field has been investing for the last two years in better reasoning machinery for agents. Tree search, multi-agent debate, self-reflection, learned planners. All of which assumes the bottleneck is how the agent thinks. Their argument is that the bottleneck isn't reasoning. It's perception. The agent is being asked to deduce runtime behavior from static text, which is a thing humans don't even attempt. We just run the code and look.

2:48Jessica: And there's actually a piece of evidence behind that framing. They cite work showing that LLMs lose around half their accuracy on input-output prediction once problems get genuinely hard. So when you ask a model to mentally simulate what a tangled piece of code would do, past a certain complexity, it can't. The natural response — and this is the move DAIRA makes — is: stop asking it to. Give it what human engineers have always had. A window into what the program is actually doing.

3:21Finn: Right. And the cleanest way to feel why this matters is just to walk through one of the bugs they use to motivate the whole thing. There are two case studies in the paper that do all the heavy lifting, and the first one is small enough to hold in your head. It's Matplotlib issue twenty-two seven nineteen. The user passes an empty list to a plotting axis, and Matplotlib spits out a weird, misleading deprecation warning about numeric handling — even though the user never asked for anything numeric.

3:51Jessica: And what's the actual bug?

3:53Finn: The actual bug is in a helper function whose name is, roughly, "is num like" — it's supposed to check whether something looks like a number. And that function, given an empty list, returns true. Which is wrong — an empty list is not a number. But because it returns true, the empty list gets routed down a numeric-conversion code path, which then triggers this deprecation warning that's about something completely unrelated to what the user did. So the symptom — a warning about deprecated numeric handling — is nowhere near the cause, which is that one classification function. And in the call graph, those two things are far apart.

4:31Jessica: That's the part that breaks a static-reading agent.

4:34Finn: Completely. The baseline agent in the paper looks at the warning, suspects the upstream caller, goes and reads that file, doesn't find it, suspects a convert function, goes and reads that file, doesn't find it, then redundantly retrieves another module thinking it might be in there. It's a kind of speculative search through the codebase, eating context window the entire time, and at no point does it actually see what happened. With the trace report in hand, the DAIRA agent immediately sees that "is num like" function returning true on an empty list, and it walks straight to the fix. No flailing.
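To make the distance between symptom and cause concrete, here is a minimal sketch in plain Python. It is not Matplotlib's code: the function names, the warning text, and the use of `all()` as the reason the check passes on an empty list are all assumptions chosen for illustration. The point is only that the warning fires in the conversion step while the wrong answer comes from the classifier.

```python
import warnings


def is_numlike_buggy(values):
    # Hypothetical classifier. all() over an empty iterable is True,
    # so an empty list is misclassified as "numeric".
    return all(isinstance(v, (int, float)) for v in values)


def is_numlike_fixed(values):
    # Same check, but an empty input is never treated as numeric.
    values = list(values)
    return bool(values) and all(isinstance(v, (int, float)) for v in values)


def convert_axis_values(values, is_numlike):
    # Hypothetical conversion step: this is where the symptom shows up,
    # even though the wrong decision was made inside is_numlike.
    if is_numlike(values):
        warnings.warn("numeric handling is deprecated", DeprecationWarning)
        return [float(v) for v in values]
    return list(values)


def count_warnings(values, is_numlike):
    # Run the conversion and report how many warnings it raised.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        convert_axis_values(values, is_numlike)
    return len(caught)
```

With the buggy classifier an empty list produces one warning; with the fixed one it produces none. Reading only `convert_axis_values`, nothing looks wrong, which is roughly the position the static agent is in.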

5:10Jessica: And this is the part where the recipe-versus-cook analogy really earns its keep. A static agent is the food critic reading the recipe trying to figure out why the dish came out wrong, arguing with themselves about which step must have gone sideways. A dynamic-aware agent walks into the kitchen and watches the cook. The asymmetry of effort is enormous. Speculation is exhausting. Observation is cheap.

5:36Finn: And honestly the Matplotlib case is the easy one. The second case study is where it gets impressive. Jessica, you want to take this one?

5:45Jessica: Yeah, this one's worth lingering on. It's a SymPy bug — issue seventeen six thirty. The setup is: you're doing block-matrix multiplication, and one of the blocks is a zero matrix. Mathematically you'd expect everything to stay as matrices. But somewhere in the middle of the computation, an intermediate result comes back as a plain Python integer zero — not a zero-matrix object, just the scalar zero. And then downstream code that expected matrix dimensions tries to use that scalar, and crashes. So the question is: where in this computation did the type degrade from matrix to scalar?

6:24Finn: And this is a nightmare to read statically.

6:27Jessica: It's almost hopeless. The chain of calls goes through polymorphic dispatch — multiplication operator, into a matrix-multiplication evaluator, into the doit method on MatAdd, into the doit method on MatMul, into a postprocessor, and finally into a simplification rule called, roughly, "remove identity." And that's the call chain in the abstract. Reading source code, you have to manually walk through each layer figuring out which override fires for which type. The baseline agent gets lost in this, mis-locates the fault — it thinks the bug is in the block-matrix module — and so what does it do? It writes a defensive fix. It says, okay, after we get this scalar zero back, let's wrap it in a zero-matrix object so the downstream code doesn't crash.

7:16Finn: Which is the bucket-under-the-sink fix.

7:19Jessica: Exactly. There's a valve upstairs that's stuck open, and the baseline agent decides to put a bucket under the sink. The bug technically goes away — the test passes — but the underlying type-degradation rule is still wrong, and it's going to cause problems somewhere else. DAIRA, with the trace in hand, walks the actual execution tree and watches the moment where the type degrades. And it turns out it's in that "remove identity" rule — a simplification that, when it sees zero-matrix plus zero-matrix, reduces it to the scalar zero. That's the valve. So DAIRA fixes it there, in the addition module, instead of band-aiding the block-matrix module.
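The difference between the two fixes can be sketched with toy classes. This is not SymPy's implementation: `ZeroMatrix`, `add_buggy`, `add_fixed`, and the two downstream helpers are hypothetical names, and the simplification rule is reduced to its bare outline. The sketch assumes the rule drops zero terms and falls back to the scalar `0` when nothing is left.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZeroMatrix:
    rows: int
    cols: int


def add_buggy(*terms):
    # Toy simplification rule: drop additive identities. If every term
    # was a zero matrix, it returns the scalar 0 and the shape is lost.
    remaining = [t for t in terms if not isinstance(t, ZeroMatrix)]
    if not remaining:
        return 0
    return remaining  # toy: surviving terms left unevaluated


def add_fixed(*terms):
    # Root-cause fix: an all-zero sum stays a zero matrix of the same shape.
    remaining = [t for t in terms if not isinstance(t, ZeroMatrix)]
    if not remaining:
        return ZeroMatrix(terms[0].rows, terms[0].cols)
    return remaining


def column_count(block):
    # Downstream code that expects a matrix-like object.
    return block.cols


def column_count_defensive(block, expected_cols):
    # Symptom-level patch: re-wrap a stray scalar 0 after the fact.
    # It hides the crash here but leaves add_buggy wrong for every other caller.
    if isinstance(block, int) and block == 0:
        block = ZeroMatrix(expected_cols, expected_cols)
    return block.cols
```

Calling `column_count(add_buggy(z, z))` on two zero matrices raises `AttributeError`, because an `int` has no `cols`. Both the defensive wrapper and `add_fixed` make that call succeed, but only `add_fixed` changes the rule that produced the scalar in the first place.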

7:59Finn: And if you only take one thing away from this paper, this is it. There are two distinct wins from the dynamic analysis here. One is that the agent finds the right location to fix. The second is more subtle and arguably more important: the fix is systemic instead of defensive. Because the agent saw the actual flow, it could fix the cause. The baseline agent, working from static text, could only patch the symptom.

8:26Jessica: And those are exactly the bugs that pile up in mature codebases over the years. Defensive patches stacked on defensive patches because nobody could see clearly enough to fix the real thing.

8:38Finn: Okay, so let's talk about how DAIRA actually works, because I think people are going to be surprised by how minimal it is. The system has three pieces. There's a tracing tool. There's a thing that reformats the trace into something readable. And there's a workflow that ties them together into a debugging discipline. That's it.

8:58Jessica: And the design choice that I think is smart on the tracing tool is what they call "trigger-and-collect." The agent doesn't have to learn debugger commands. It doesn't have to write breakpoint syntax. It just writes a normal Python script that reproduces the bug — which is a thing the agent already knows how to do — and DAIRA wraps the execution with standard Python instrumentation hooks. The agent says "run this script," and gets back a trace. Cognitive load on the agent stays low.
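A rough sketch of what "trigger-and-collect" could look like, assuming the instrumentation hook is `sys.settrace` from the standard library. The episode only says "standard Python instrumentation hooks", so that choice is an assumption, and `collect_trace` plus the two demo functions are hypothetical names rather than DAIRA's API.

```python
import sys


def collect_trace(func, *args, **kwargs):
    """Run func under sys.settrace and record call/return events."""
    events = []
    depth = 0

    def tracer(frame, event, arg):
        nonlocal depth
        if event == "call":
            # Snapshot the arguments visible at function entry.
            events.append((depth, "call", frame.f_code.co_name, dict(frame.f_locals)))
            depth += 1
        elif event == "return":
            # Also fires (with arg None) when a frame exits via an exception.
            depth -= 1
            events.append((depth, "return", frame.f_code.co_name, arg))
        return tracer  # keep tracing inside this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    except Exception as exc:  # keep the trace even if the repro crashes
        result = exc
    finally:
        sys.settrace(previous)
    return result, events


# Hypothetical reproduction script: two plain functions, no debugger commands.
def looks_numeric(values):
    return len(values) == 0 or isinstance(values[0], (int, float))


def route(values):
    return "numeric path" if looks_numeric(values) else "other path"


result, events = collect_trace(route, [])
```

The reproduction code itself contains nothing debugger-specific, which is the property Jessica highlights: the agent writes an ordinary script and the wrapper does the collecting.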

9:29Finn: Right. And then the second piece, the reformatter, is where the real cleverness lives. Because here's the thing. If you take a raw program trace, even from a small Python test, you're looking at thousands of events. Every function entered, every variable read, every return value, every exception, in order. It's a firehose. Anyone who's stared at verbose debugger output knows the feeling: your eyes glaze over.

9:56Jessica: And the obvious thing to do, the thing you'd guess works, is just dump that trace into the LLM's prompt. The model has a giant context window, right? Surely more information is better.

10:08Finn: And that's where the killer ablation comes in. Because the authors actually tried this. They ran DAIRA with raw traces — no reformatting, just feed the firehose to the model. And the result is the gem of the entire paper. The baseline agent, with no traces at all, scores about sixty-five percent. DAIRA with the full pipeline scores about seventy-four percent on the same model. DAIRA with raw traces only? Sixty-five-point-eight percent. Essentially identical to baseline. The traces, raw, contributed nothing.

10:41Jessica: Which is an extraordinary result if you sit with it. Because the raw traces contain strictly more information than the structured version. The structured version is derived from the raw version. If the model could parse the raw, it would have everything it needs. But it can't. The structured re-rendering is doing essentially all of the work.

11:03Finn: And the analogy that lands this for me is the lawyer one. Imagine you're a lawyer and someone hands you ten thousand pages of email discovery, versus a five-page memo organizing the relevant exchanges into a timeline with annotations. The lawyer with the memo wins, even though the raw emails contain more information. It's not about completeness, it's about the reader's attention. And what DAIRA does — the structured trace, the nested tree of which function called which with what data — it's the memo version of the firehose.

11:35Jessica: And, Finn, the format itself is interesting. They use indented ASCII trees. Which sounds almost embarrassingly simple. But the reason it works is that LLMs are pre-trained on enormous amounts of indented code and indented documentation. Indentation is a hierarchy signal the models already know how to parse fluently. So instead of dumping JSON, which is verbose and token-heavy, or some structured format the model has to learn, they use the format the model already reads natively.

12:05Finn: It's the org chart of the program's execution. You can see at a glance: this function called this one, which called these three, this is where the exception fired. You don't have to reconstruct the hierarchy in your head — the indentation is the hierarchy.
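To show the shape of such a report, here is a small deterministic renderer that turns a flat list of call and return events into an indented tree. It is a stand-in for illustration only, not the paper's reformatter, and the event tuple layout `(depth, kind, name, payload)` and the function names in the sample data are assumptions.

```python
def render_tree(events, indent="  "):
    """Render (depth, kind, name, payload) events as an indented text tree."""
    lines = []
    for depth, kind, name, payload in events:
        pad = indent * depth
        if kind == "call":
            # payload is a dict of argument names to values.
            args = ", ".join(f"{key}={value!r}" for key, value in payload.items())
            lines.append(f"{pad}{name}({args})")
        elif kind == "return":
            # payload is the return value; show it one level inside the call.
            lines.append(f"{pad}{indent}-> {payload!r}")
    return "\n".join(lines)


# Hypothetical events for the empty-list case discussed earlier.
sample_events = [
    (0, "call", "convert", {"values": []}),
    (1, "call", "is_numlike", {"values": []}),
    (1, "return", "is_numlike", True),
    (0, "return", "convert", []),
]

tree = render_tree(sample_events)
```

For the sample events, `tree` is four lines: `convert(values=[])`, then `is_numlike(values=[])` indented one level, then `-> True` indented two levels, then `-> []` indented one level. The suspicious `True` sits directly under the call that produced it, with no hierarchy left for the reader to reconstruct.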

12:21Jessica: And then the third piece, the workflow — this is really just an orchestration layer. It runs the agent through three phases. First, generate a script that reproduces the bug. Second, trace it and diagnose what went wrong. Third, write the patch and re-trace to verify. The agent can re-invoke the tracer at any point. And the ablation suggests this orchestration is the smallest of the three contributions — it adds maybe four points and mostly improves cost efficiency. The tool and the reformatter are doing the heavy lifting.

12:53Finn: Which I think is honest of the authors to surface. They could have sold the workflow as the third pillar of the system. The data says it's more of a polish layer.

13:03Jessica: There's one more empirical detail I want to land before we get into the critique, which is this idea I've started thinking of as the token paradox. You'd assume that adding trace reports to the prompt — extra context — would increase how much the model is reading. The reverse happens. DAIRA cuts input token consumption by about twenty-five percent compared to the baseline.

13:26Finn: Because the baseline is fishing.

13:28Jessica: Exactly. The baseline agent, without runtime visibility, spends enormous amounts of context reading file after file, trying to deduce what's happening. Once you give it a precise pointer to where the bug actually is, all that exploratory reading evaporates. So the agent reads less code overall *because* you gave it the right place to look. More targeted information equals less total context burned. It's a quietly important result, because for anyone paying for these agents at scale, it means dynamic analysis isn't just better — it's also cheaper.

14:01Finn: And there's a wrinkle in that cost story that I think is genuinely fascinating. The behavioral shift looks different depending on which model is driving DAIRA. They tested it across three model backbones. And the same tool produced three different profiles. With one of the lighter models — Qwen-3-Coder Flash — input tokens drop by something like thirty to forty percent, but the model's actual call count and output don't change much. The interpretation is: the lightweight model just stops doing blind, context-heavy trial-and-error. It was thrashing, and now it's not. With Gemini, the call count actually goes up by fifteen to twenty-four percent, but each call is leaner. So Gemini is replacing big file dumps with many targeted queries. And with DeepSeek, input tokens drop, but output tokens go up by twenty-some percent. DeepSeek takes the saved input budget and spends it on deeper generative reasoning.

15:00Jessica: Same tool, three models, three personalities.

15:03Finn: Three personalities. Which is a kind of charming reminder that these systems are not monolithic. The same instrument plays differently in different hands.

15:13Jessica: Okay. Let's push on the work now, because there's a real critique to make. Finn, where do you want to start?

15:20Finn: The most important caveat is that the headline comparison isn't quite apples-to-apples. DAIRA's seventy-nine-point-four percent number uses Gemini 3 Flash Preview as the underlying model. The strongest baseline they report — an agent called Live-SWE-agent at seventy-nine-point-two percent — uses Claude 4.5 Opus. The two leading numbers are two-tenths of a point apart on different model backbones. And the authors are honest about this; they say they couldn't run baselines on the same model because of API access constraints. But it does mean the "state-of-the-art" claim partially rests on backbone choice.

16:00Jessica: The cleaner comparison is the controlled head-to-head.

16:03Finn: Right. They run their own controlled experiment where they take vanilla SWE-agent — the foundational system everyone in this space builds on — and compare it to DAIRA on the same models. There the gain is real but more modest. About five or six points on Gemini, similar on the others. So the contribution is genuine, but it's not a forty-point leap. It's a substantial, replicable improvement on top of an existing strong baseline.

16:29Jessica: There's a related point I'd push on, which is that the benchmark itself is a generous setting for this kind of method. SWE-bench Verified is constructed entirely from issues that have reproducible failing tests. Which is exactly the situation where dynamic analysis is maximally useful — you always have something concrete to trace. In the wild, plenty of bugs don't have a clean repro. Intermittent bugs, race conditions, bugs that depend on production data the agent doesn't have. The paper doesn't probe that regime, and it's the regime where the dynamic analysis story gets harder.

17:05Finn: That's a real limitation. Although I'd say in defense of the authors, they're not claiming to have solved every kind of bug. They're claiming that for the substantial class of bugs that do have repros — which is a lot of real-world bugs, especially in mature open-source projects — observability beats more reasoning. That's a defensible scope.

17:26Jessica: The other thing I want to flag — and this is more of a "huh, interesting" than a critique — is a circularity question that the architecture raises. The reformatter, the thing that turns raw traces into the structured execution tree? That's itself an LLM call. So when you ask "how much of DAIRA's gain comes from dynamic analysis per se, versus from an extra LLM pre-processing step that organizes information," it's not totally clean to disentangle. The ablation does separate raw-trace from structured-trace, which is good. But it doesn't ask a related question: what if you applied the same kind of LLM-driven semantic reorganization to a *static* call graph? Would that capture some of the benefit?

18:10Finn: That's a good open question. I don't think it would capture all of it, because a static call graph can't tell you what values actually flowed through. But it might capture a chunk.

18:21Jessica: And one last thing on the experimental side — the hardest tier of bugs, the ones estimated to take human engineers more than an hour, that bucket only has forty-five instances. So when DAIRA hits forty-four percent there, that's twenty bugs solved. And some of the more dramatic relative-improvement claims — there's one figure of more than doubling for one of the model backbones on hard tasks — those are off small denominators. They could swing meaningfully with a few coin flips. Worth being a little cautious about the specific percentages even as the directional finding holds.
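The small-denominator caution can be put in numbers. The sketch below computes a Wilson score interval for 20 successes out of 45, using the figures quoted in the episode. The choice of the Wilson interval and the 95 percent level are assumptions made here for illustration, not something the paper is said to report.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


solved, bucket_size = 20, 45
rate = solved / bucket_size            # the "forty-four percent" from the episode
per_bug_swing = 1 / bucket_size        # how far one extra solved bug moves the rate
low, high = wilson_interval(solved, bucket_size)
```

With 45 instances, each additional solved bug moves the rate by a bit over two points, and the interval runs from roughly 31 to 59 percent. That width is why relative-improvement claims on this tier deserve more caution than the overall benchmark score.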

18:58Finn: All fair. And the authors flag most of these themselves, which I appreciate.

19:03Jessica: So where does this leave us. What's the takeaway that survives the critique?

19:08Finn: I think the takeaway that survives is the framing, more than any specific number. The argument that LLM coding agents have been replicating only half of what human engineers do — they've been doing the reading-the-code half, and ignoring the watching-it-run half. Add observability, and a lot of the elaborate reasoning scaffolding people have been building starts to look like it was compensating for missing inputs. The twenty-five percent token reduction is the quiet corroboration. Less searching, less guessing, less padding. Because the agent isn't trying to reason its way out of needing a debugger anymore.

19:44Jessica: And there's a more general lesson sitting underneath that, which I think is the durable one. When you see a system over-reasoning, over-searching, over-elaborating — the temptation is to add more sophisticated reasoning machinery. But sometimes the right move is to ask what the system can't *see*. Because reasoning under bad inputs and reasoning under good inputs are different problems. And a lot of progress in this field might come from instrumentation — giving models better windows into the systems they're operating on — rather than from cleverer prompting or more elaborate planning.

20:19Finn: Which in some ways is the oldest lesson in engineering. Before you optimize, measure. Before you guess, look. The new generation of AI coding agents is, in this small but real way, finally being taught to look.

20:32Jessica: This was the paper "Dynamic analysis enhances issue resolution," published in early April of this year. The link is in the show notes along with related materials, if you want to dig deeper. Thanks for listening to AI Papers: A Deep Dive.

20:46Finn: See you next time.