Why Frontier Agents Ask for Clarification at Exactly the Wrong Moment

0:00Cassidy: Picture this. You give an AI agent a task — "generate the quarterly report." It dives in. Pulls tables. Joins data. Formats charts. Thirty actions deep, with the report nearly assembled, it finally surfaces enough context to realize you meant *fiscal* quarters, not *calendar* quarters. Everything it built is wrong. And the punchline of the paper we're talking about today is that a single clarifying question at action two would have prevented all of it.

0:31Eric: It's a great opening shot because it points at a question nobody had actually measured. The literature treats clarification as a yes-or-no capability: the agent either asks for help, or it doesn't. But anyone familiar with the last fifty years of decision theory or human-computer interaction would tell you that the *timing* of when you ask has to matter at least as much as whether you ask.

0:58Cassidy: Right. And the paper is called "Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?" It went up on arXiv earlier this month, and we're recording a few days after that. Before we get into it: this is an AI-generated deep dive. I'm Cassidy, that's Eric, and we're both AI voices from ElevenLabs. The script is from Anthropic's Claude Opus 4.7, and the show isn't affiliated with either company. With that out of the way: the reason the fiscal-versus-calendar quarter example matters is that it's not just a cautionary tale. It's the seed of an actual experimental design.

1:41Eric: And the design is the interesting move here. Because if you wanted to study clarification timing the obvious way, you'd give an agent an ask-the-user tool, drop it on some ambiguous tasks, and measure what happens. But that tangles two questions together. One: does the agent *notice* something is ambiguous? Two: when, in the course of execution, is asking actually worth something?

2:07Cassidy: And those are different abilities. The authors — a team out of PricewaterhouseCoopers in the U.S. — wanted to isolate the second one. They wanted the pure timing effect. So they did something kind of clever: they decided not to let the agent ask at all. Instead, they pretended the agent had already noticed the ambiguity, and they tested what happens if a clarification just *arrives* at a specific moment.

2:36Eric: The analogy I keep reaching for is the difference between testing a smoke detector and testing a fire department. If you want to evaluate how well a fire department responds to alarms, you don't sit at the station measuring whether they spot the smoke themselves. You trigger the alarm yourself, and you time the response. The authors trigger the alarm.

3:00Cassidy: Exactly. They call it forced injection. The agent gets an underspecified task — the prompt is missing something — and somewhere mid-execution, a synthetic user message lands in the conversation. Something like, "By the way, I should have mentioned: the target format is CSV, not JSON." It arrives at the next clean turn boundary, no fanfare. And the agent just absorbs it and keeps going.

3:27Eric: The thing that makes this work as an experiment is the calibration trick. Tasks vary wildly in length — some take three actions, some take over a hundred. So "inject at action twenty" means very different things for a short task versus a long one. To make timings comparable, the authors first run an oracle trial — that's the agent with the full, unambiguous prompt — and they measure how long the task usually takes. That becomes the action budget. Then injection points are percentages of that budget. Ten percent of the way through. Thirty. Fifty. Seventy. Ninety.

4:06Cassidy: So injecting at fifty percent means roughly the same thing for a ten-action task and a fifty-action task, in terms of how committed the agent already is to its current path. That's what lets them draw what they call value-of-information curves — VOI — one per type of missing information.
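
The calibration step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `injection_points` and the round-up rule are assumptions, since the episode only says that injection points are percentages of the oracle action budget and that the message lands at the next clean turn boundary.

```python
# Injection points named in the episode, as whole percentages of the budget.
INJECTION_PERCENTS = (10, 30, 50, 70, 90)


def injection_points(oracle_actions: int, percents=INJECTION_PERCENTS) -> list[int]:
    """Return the 1-based action index after which each injection would land.

    oracle_actions is the action budget measured on the oracle (full-prompt) run.
    Rounding up is an assumption; it guarantees at least one action has happened.
    """
    if oracle_actions < 1:
        raise ValueError("oracle_actions must be at least 1")
    # Integer ceiling of percent * budget / 100, which avoids float rounding surprises.
    return [(p * oracle_actions + 99) // 100 for p in percents]


print(injection_points(10))  # short task
print(injection_points(50))  # longer task
```

With this rule, the fifty-percent injection lands after action 5 of a ten-action task and after action 25 of a fifty-action task, which is the sense in which the timings are comparable across task lengths.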

4:26Eric: And that's the other piece of the design, Cassidy — the fact that they don't just have one curve. They have four. Because they sliced "missing information" into four kinds.

4:37Cassidy: Right. So think about what could be missing from a task description. The *goal* — what are we actually trying to produce? Fiscal or calendar quarters, CSV or JSON, an executive summary or a deep-dive. The *inputs* — where does the data live, which file, which folder. The *constraints* — rules the output has to satisfy, like "all figures must be in thousands." And the *background context* — domain knowledge or environment details the agent would need to make sensible choices.

5:09Eric: And the prediction — before they ran anything — was that these four dimensions should behave differently. A missing goal poisons every downstream decision. A missing input only matters once you reach the data-fetching step. A missing constraint might not bite until you've already violated it. So the decay curves shouldn't be a single shape — they should be four different shapes.

5:35Cassidy: This is where the one piece of math in the paper does real work, and it's actually pretty intuitive. The authors define a notion of "commitment" — at any point in the trajectory, some fraction of the actions you've taken are causally locked in to a particular interpretation of the missing information. Once you're committed, undoing that work costs more than the clarification could possibly save you.

6:02Eric: The analogy that nails this for me is writing an essay in the wrong direction. You're eight paragraphs in, and a friend looks over your shoulder and says, "Wait — the prompt was about X, not Y." The most that correction can save you is whatever you *haven't* written yet. Everything already on the page under the wrong interpretation is, at best, salvageable for parts. The earlier your friend speaks up, the more essay you have left to benefit from the correction.

6:33Cassidy: And that's the whole formal claim, basically. The value of a clarification at any moment is capped by the fraction of your work that isn't yet locked into the wrong path. Different information types commit at different rates — and so different information types should have different decay curves.
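
One way to write that cap down, as a sketch rather than the paper's own notation: let $t \in [0, 1]$ be the fraction of the action budget already used, let $C_d(t) \in [0, 1]$ be the fraction of work so far that is locked into an interpretation of dimension $d$ (goal, input, constraint, or context), and let $\mathrm{VOI}_d(t)$ be the value of a clarification about $d$ at time $t$, normalized so that 1 means recovering the full gap between the no-clarification run and the oracle run. All three symbols are assumptions introduced here for illustration.

```latex
\mathrm{VOI}_d(t) \;\le\; 1 - C_d(t)
```

Because $C_d(t)$ can only grow as the agent acts, the cap can only shrink, and a dimension whose commitment jumps early (goal) loses its value much sooner than one that commits late (input).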

6:54Eric: Goal commits everything from the very first action. Input commits only when data-fetching happens. Constraint can commit almost nothing until you actively violate the rule. Context is somewhere between goal and input — early background knowledge tends to cascade. Those are the predictions. Then they ran the experiment.

7:16Cassidy: They ran a lot of experiment. Eighty-four underspecified tasks, three benchmarks, four frontier models — GPT-5.2, Claude Sonnet 4.5, Gemini 3 Flash, and DeepSeek V3.2 — and over six thousand total trials. About forty-two hundred dollars in API costs. And the curves come out roughly as predicted, with one really striking specificity.

7:39Eric: The goal curve is a cliff.

7:41Cassidy: It's a cliff. Here's the cleanest number from the paper. On the benchmark with the strongest signal — MCP-Atlas, which is a tool-use benchmark with relatively short trajectories — the oracle baseline, where the agent has the full prompt, gives about eighty percent success on pass-at-3. The no-clarification baseline — the agent just guesses — gives about forty percent. And if you inject the missing goal information at the ten-percent mark, you get seventy-eight percent. Basically the oracle.

8:15Eric: So a clarification ten percent of the way through is worth almost the full value of having had the goal from the start.

8:23Cassidy: But by seventy percent of the way through, the same clarification is indistinguishable from never asking at all. The window is *that* narrow. There is a tiny early region where asking about the goal is worth a fortune, and then it's worth nothing.
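
To make "basically the oracle" concrete, the MCP-Atlas goal numbers quoted above can be expressed as the share of the oracle gap that a clarification recovers. This normalization and the function name `gap_recovered` are assumptions made for illustration; the episode does not say how the paper normalizes its VOI curves.

```python
def gap_recovered(success_at_injection: float, oracle: float, no_clarification: float) -> float:
    """Share of the gap between never asking and the oracle that an injection recovers.

    1.0 means the clarification was worth as much as having the full prompt;
    0.0 means it was worth nothing; negative values mean it hurt.
    """
    gap = oracle - no_clarification
    if gap <= 0:
        raise ValueError("oracle must exceed the no-clarification baseline")
    return (success_at_injection - no_clarification) / gap


# Goal dimension on MCP-Atlas, pass-at-3, using the approximate figures from the episode:
# oracle about 80, no clarification about 40, injection at ten percent gives 78.
print(gap_recovered(78, oracle=80, no_clarification=40))
```

On these rounded figures the ten-percent injection recovers 38 of the 40 available points, and an injection that is "indistinguishable from never asking" would score close to zero on the same scale.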

8:39Eric: And the GPS analogy is the one that locks this in for me. If your navigation is set to the wrong city, every turn after that compounds the error. You're not slightly off — you're going somewhere else. Correcting the destination in the first minute costs nothing. Correcting it after two hours of driving means two hours of wasted driving. Goal information conditions every downstream decision the same way.

9:07Cassidy: The input curve looks completely different. At ten percent, asking about a missing input gives you about forty-six percent. At thirty to fifty percent, it's still around thirty-six. It's a gentle slope, not a cliff. You have until roughly the halfway mark before input clarifications stop being clearly worthwhile.

9:28Eric: And then there's a detail I want to flag, Cassidy, because it's almost counterintuitive. By the ninety-percent injection point for input information, success actually drops *below* the no-clarification baseline. Twenty-five percent versus thirty-three for never asking at all. Late input clarification is worse than no clarification.

9:50Cassidy: Because by that point the agent has either inferred something, made do, or built around the gap — and a late-arriving correction forces it to discard work it had already integrated. The clarification stops being help and starts being interference.

10:06Eric: Constraints behave the strangest, as predicted. The paper gives a budget-report example: the task omits "all figures must be in thousands." When that constraint arrives at the fifty-percent mark, the agents try to retroactively rescale their output — and sometimes they introduce rounding errors that drop them below where they would have been with no constraint at all. Late constraint clarification can be actively destructive.

10:35Cassidy: There's a contractor analogy that handles all of this nicely. If your kitchen renovation is three weeks in and you say "wait, I wanted the sink on the *other* wall" — that's catastrophic. The whole layout was conditioned on the wrong assumption. But if you say "I wanted brushed nickel handles, not chrome" — those go on at the end anyway. No problem. Sink location is a goal-level question; handle finish is an input-level question. Same renovation, completely different decay profiles.

11:07Eric: And the cleanest concrete example in the whole paper is in the appendix — the CSV-versus-JSON case study. The prompt says: "Export the data in a format suitable for spreadsheet analysis." All three models default to JSON without clarification.

11:24Cassidy: Which, fair — JSON is suitable for many things. Just not spreadsheets.

11:28Eric: Right. Inject "CSV" at ten percent, and they get it right immediately. Inject at fifty percent, and they have to throw away three to five actions to restart. Inject at ninety percent, and they're already done — wrongly. One missing word — CSV — and the cost of telling them ranges from zero to "rerun the whole thing."

11:50Cassidy: There's also a check the authors run to make sure these curves aren't just artifacts of one model. They use a rank correlation called Kendall's tau, and for our purposes the only thing to know is that it's a number between minus one and one, where high positive values mean two rankings line up well. They find values in the 0.78 to 0.87 range across models on the matched task subset.

12:17Eric: Which means the different frontier models broadly agree on which tasks benefit most from early clarification. The timing pattern is a property of the *task*, not a quirk of any particular model. That's a load-bearing finding, because it means the curves describe the structure of the problem, not the personality of one system.
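
For readers who want to see what that agreement check computes, here is a small sketch of Kendall's tau in its simplest form (tau-a, which assumes no tied values). The per-task benefit numbers are invented for illustration and are not taken from the paper, and the episode does not say which tau variant the authors use.

```python
from itertools import combinations


def kendall_tau(x: list[float], y: list[float]) -> float:
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # A pair is concordant when both sequences order tasks i and j the same way.
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical benefit of early clarification on four tasks, as seen by two models.
model_a = [0.40, 0.25, 0.10, 0.05]
model_b = [0.35, 0.30, 0.02, 0.08]
print(kendall_tau(model_a, model_b))
```

In this toy case the two models disagree on only one of the six task pairs, so tau comes out at four sixths. Values like the ones reported in the episode would mean even fewer disagreements.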

12:40Cassidy: Okay. So that's the demand side — when *would* a clarification help. And the paper could have stopped there and it would already be a useful contribution. But there's a second study, and this is where the episode really turns.

12:55Eric: This is the part that lit me up when I read it. They flip the setup. In the forced-injection runs, the ask-the-user tool was disabled — agents knew they couldn't ask. In the second study, they turn the tool back on, they don't force any injection, and they just watch what the models do when they're on their own with an ambiguous task. Three hundred sessions. One benchmark. A simulated user with access to the ground truth, who responds when asked.

13:25Cassidy: And what you do with the data is overlay the natural ask timings on the empirical VOI curves you already drew. So you can literally see where each model chooses to ask versus where asking would actually have been worth something.

13:41Eric: None of them land in the right window. Not one frontier model.

13:46Cassidy: Let me give you the three numbers, because they're sharp. GPT-5.2 asks in fifty-two percent of sessions — so a majority of the time. Mean first-ask is at forty-three percent through the trajectory. Average of one-point-seven asks per session. Per-session success rate: three percent.

14:05Eric: Claude Sonnet 4.5 asks in twenty-three percent of sessions, less than half as often as GPT. Always exactly once. Mean timing fifty percent through. Per-session success: eleven percent.

14:17Cassidy: Gemini 3 Flash asks in zero percent of sessions. Zero out of one hundred. Never asks.

14:23Eric: There are basically three coworker archetypes here. The over-asker who asks late and often — too late for the questions to help with goal-level issues, and too often for the user to want to deal with them. The selective asker who asks rarely but doesn't quite hit the goal window either. And the silent one who just delivers whatever they came up with.

14:48Cassidy: And the detail that's going to bother smart listeners, Eric, is the Claude-versus-GPT comparison. Claude asks less often, asks slightly *later*, and succeeds more than three times as often. Eleven percent versus three.

15:03Eric: Yeah. The authors are careful about this — they flag it explicitly as suggestive rather than proven. The framing they offer is that "question quality may matter more than frequency." But it's a between-model comparison where the asking strategy is confounded with everything else — pretraining, reasoning style, tool use, how they handle the rest of the trajectory. You can't cleanly attribute the success gap to the asking pattern alone.

15:34Cassidy: It is striking, though. It suggests that the field's instinct to push models toward asking *more* might be exactly the wrong instinct. Of the two models that ask at all, the one that asks less often, and slightly later in raw terms, is the one doing better.

15:51Eric: Which is consistent with the broader story. None of the models are calibrated to the actual shape of the demand curves. GPT-style over-asking pays a cost in user friction and latency. Gemini-style never-asking pays a cost in catastrophic early errors. Claude's selective approach is closer to the right behavior but still misses the narrow goal window.

16:16Cassidy: And this is where the intellectual reframing the paper offers comes through. The industry's current answer to "should the agent ask?" is essentially a single confidence threshold. If the model feels uncertain enough, it asks. Otherwise, it presses on. The paper's argument is that the single-threshold framing has the wrong *shape*.

16:39Eric: The right question isn't "am I uncertain enough to ask." It's "am I uncertain about something that still has positive expected value to clarify *at this point in the trajectory*." Goal ambiguity at action three is gold. Goal ambiguity at action thirty is worthless. Input ambiguity at action three is okay. Input ambiguity at action twenty is still okay. You need a *typed* gate that knows which dimensions are still worth resolving and which have already aged out.

17:10Cassidy: Clarification stops being a capability the model has and becomes a time-sensitive resource the model has to spend carefully. That's a reframing that opens up a bunch of adjacent problems — human-in-the-loop design, agent self-monitoring, interruption policies — in a way that the previous binary framing couldn't.

17:31Eric: Okay. I want to push on the limitations, because the paper is genuinely strong but it has cracks, and the authors are pretty honest about most of them.

17:41Cassidy: Please.

17:41Eric: First, the cleanest signal in the paper comes from one benchmark: MCP-Atlas. That's where you get the textbook cliff for goal information and the gentle slope for input. The other two benchmarks (TheAgentCompany, which is the enterprise workflow one, and SWE-Bench Pro, the code repair one) show the same patterns, but messier. TheAgentCompany has floor effects: the oracle pass rates are only twenty to twenty-nine percent, so there's not a lot of dynamic range to detect a timing effect in.

18:15Cassidy: And SWE-Bench Pro has the strange property that some injection conditions actually exceed the oracle baseline. Which shouldn't happen if the framework is clean.

18:26Eric: Right. The authors attribute that to two things. One is sampling variability — the per-cell sample sizes aren't huge. The other is more interesting and more uncomfortable: a synthetic injection message — "by the way, the target format is CSV" — may be more *salient* to the model than the same information buried inside a longer original prompt. The forced-injection protocol might systematically over-state the value of clarification because the injected information is more conspicuous than the same information delivered up front.

19:01Cassidy: That's worth dwelling on. Because if it's true, the VOI curves aren't just upper bounds — they're *biased* upper bounds. Real-world clarification might be a bit less valuable than the experimental curves suggest, because real-world clarification doesn't get the salience boost.

19:20Eric: The second confound is structural. In the forced-injection runs, the ask-the-user tool is disabled. Agents *know* they can't ask. They might plan differently when they know they're on their own — exploring more aggressively, hedging more conservatively. So the "optimal window" you derive from forced injection might not be the optimal window for an agent that has access to its own ask tool, because the agent in the natural-ask condition is operating in a different strategic environment.

19:53Cassidy: The authors flag this. They explicitly call the VOI curves upper bounds, not point estimates of value.

20:00Eric: Third — and this one's a real one — the statistical machinery behind the headline "deferring past mid-trajectory drops you below never asking at all" is thinner than the prose suggests. They find a significant "point of no return" only for goal and constraint dimensions on the outcome-critical task subset, at the thirty-percent mark. For input and context, they don't find a statistically significant threshold. The directional finding is supported by point estimates, but you can't always distinguish it from sampling noise — especially for input, where the ninety-percent injection sits at twenty-five percent and the no-clarification baseline is thirty-three.

20:45Cassidy: And sample sizes for context are genuinely tiny. Five tasks on MCP-Atlas. Twelve on SWE-Bench Pro. The context-dimension findings are essentially anecdotal, and the paper does say so. But in the abstract, context gets grouped with goal under the "front-loaded" claim without that caveat.

21:05Eric: To the authors' credit, they're forthright about all of this. The acknowledged limitations section reads honestly rather than defensively. And the broader contribution survives the critiques — the *direction* of the findings is robust even where the magnitudes aren't.

21:23Cassidy: There's one more limitation that's almost philosophical rather than empirical, and I think it's the important one to name. This paper measures the *demand* side of clarification. How much would a well-timed question help? When does its value peak and decay? It does not tell you how to *build* an agent that actually asks at the right moment.

21:47Eric: Right. There's no policy here. No training procedure. No new architecture. The paper draws the demand curves and then shows that no existing model matches them. Closing the gap is left as future work.

22:00Cassidy: Which I think is actually the right scope for this paper. Generations of agent research have been gesturing vaguely at "agents should ask when they're uncertain" without empirical grounding for *when* asking would actually pay off. This paper gives the field a concrete target to optimize against.

22:20Eric: A concrete, *typed* target. Not "ask when you're uncertain," but "ask about the goal within the first ten percent of the trajectory or don't bother, ask about inputs by the halfway mark, treat constraints as a separate decision because late constraint information can actively hurt." That's the kind of operationally specific guidance that earlier theoretical frameworks couldn't provide.

22:45Cassidy: There's a secondary metric in the paper I want to mention briefly because it makes the cost story tangible. They define a "wasted compute" measure — basically, the fraction of pre-injection actions that don't appear in the oracle trace. Actions the agent took that, with full information, it wouldn't have needed to take.

23:05Eric: And on TheAgentCompany, wasted compute rises pretty cleanly from zero percent at the ten-percent injection point to about twenty-two percent at the ninety-percent injection point. So the later you ask, the more of the agent's compute was spent doing the wrong thing. Which corroborates the VOI story from a different angle: it's not just that success rates drop, it's that wasted work keeps accumulating the longer you wait.

23:31Cassidy: On MCP-Atlas the wasted compute story is even starker, because those trajectories are so short that even the very first action commits something. Wasted compute starts at thirty-eight percent at the earliest injection and plateaus around fifty-three percent by the ninety-percent point. There's no clean zero-cost regime on short-trajectory tasks — by the time you've taken one action, you've spent something irreversible.
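
The wasted-compute measure is simple enough to sketch directly from its verbal definition: the fraction of pre-injection actions that do not appear in the oracle trace. The code below assumes actions can be compared by exact equality of a string label, which is a simplification introduced here. The episode does not say how the paper matches actions between traces, and the action labels are hypothetical.

```python
def wasted_compute(pre_injection_actions: list[str], oracle_actions: list[str]) -> float:
    """Fraction of actions taken before the injection that the oracle run never took."""
    if not pre_injection_actions:
        return 0.0  # nothing done yet, so nothing wasted
    oracle_set = set(oracle_actions)
    wasted = sum(1 for action in pre_injection_actions if action not in oracle_set)
    return wasted / len(pre_injection_actions)


# Hypothetical traces for the quarterly-report example from the top of the episode.
oracle_trace = ["list_files", "read:fiscal_q3.csv", "join_tables", "render_chart", "export"]
agent_so_far = ["list_files", "read:calendar_q3.csv", "join_tables", "render_chart"]
print(wasted_compute(agent_so_far, oracle_trace))
```

Here one of the four actions taken so far (reading the calendar-quarter file) has no counterpart in the oracle trace, so a quarter of the pre-injection work counts as wasted. Exact matching is strict: a real implementation would likely also count downstream actions that reused the wrong file.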

23:59Eric: I want to take one step back and place this in the broader conversation, Cassidy, because I think it matters where this work sits. The field is in the middle of a real transition from language models as one-shot question-answerers to language models as agents that take many actions over long horizons. And the moment you stretch the horizon, a new family of problems shows up that didn't exist in the chatbot era. Cascading errors. Wasted compute. Irreversible commitments. When to involve a human.

24:33Cassidy: And the older literatures had a lot to say about all of this. Decision theorists have had formal tools for value of information since the 1950s and 60s — Howard, Lindley. HCI researchers have known for decades that the timing of interrupting a human at a task matters enormously, and that poorly-timed interruptions have recovery costs much larger than the interruption itself. Adamczyk and Bailey in 2004. Mark and colleagues in 2008.

25:03Eric: This paper is, in a real sense, porting both of those observations into the agent setting and showing they hold empirically. Clarification timing in LLM agents has the same decay structure that information acquisition has in classical decision theory, and the same recovery-cost dynamics that interruption costs have in human task performance. The field is gradually realizing that the problems of long-running autonomous systems aren't all new — some of them are recapitulations of problems that decision theory and HCI worked on for decades, and the old conceptual tools need to be retrofitted rather than reinvented.

25:46Cassidy: And I think that's the genuinely satisfying intellectual move here. It's not "we invented a new framework." It's "the framework already existed; nobody had drawn the empirical curves; here are the curves." Modest, careful, and useful.

26:02Eric: So what does a builder do with this?

26:05Cassidy: A few things. If you're shipping an agent product, the first practical implication is that "should I ask" is the wrong gate. You want a typed gate that knows what kind of ambiguity it's detecting and where in the trajectory the agent currently is. Goal ambiguity gets a hair trigger — ask immediately or commit. Input ambiguity gets a longer leash. Constraint ambiguity needs its own logic because the late-arrival case is actively destructive.
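
A typed gate of the kind described here could look roughly like the sketch below. Everything in it is illustrative: the names `Dimension`, `ASK_DEADLINES`, and `should_ask` are made up, the goal and input cutoffs echo the rough guidance in the episode (first ten percent, halfway mark), the constraint cutoff borrows the thirty-percent point of no return mentioned earlier, and the context cutoff is a pure placeholder. Detecting which dimension is ambiguous is assumed to happen elsewhere.

```python
from enum import Enum


class Dimension(Enum):
    GOAL = "goal"
    INPUT = "input"
    CONSTRAINT = "constraint"
    CONTEXT = "context"


# Latest fraction of the action budget at which asking is still treated as worthwhile.
# Illustrative values only; real thresholds would need per-domain calibration.
ASK_DEADLINES = {
    Dimension.GOAL: 0.10,
    Dimension.INPUT: 0.50,
    Dimension.CONSTRAINT: 0.30,
    Dimension.CONTEXT: 0.30,  # placeholder, no figure is given in the episode
}


def should_ask(dimension: Dimension, actions_taken: int, action_budget: int,
               deadlines: dict = None) -> bool:
    """Return True if a question about this dimension is still inside its window."""
    if action_budget < 1:
        raise ValueError("action_budget must be at least 1")
    if deadlines is None:
        deadlines = ASK_DEADLINES
    progress = actions_taken / action_budget
    return progress <= deadlines[dimension]


# With a hypothetical 40-action budget:
print(should_ask(Dimension.GOAL, 3, 40))    # early goal question
print(should_ask(Dimension.GOAL, 30, 40))   # late goal question
print(should_ask(Dimension.INPUT, 20, 40))  # mid-trajectory input question
```

The design point is that the gate takes two arguments a single confidence threshold does not have: the type of ambiguity and the position in the trajectory. A fuller version would combine this with the agent's uncertainty estimate instead of replacing it.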

26:34Eric: Second — and this is the uncomfortable one — the calibration data you'd need to build that gate well doesn't really exist yet for most agent domains. The paper provides curves for three benchmarks. If you're operating in legal workflows, or scientific computing, or customer service automation, the *shape* of the decay curves is probably similar, but the *thresholds* — exactly how narrow the goal window is, exactly where the input slope flattens — those are going to be domain-specific. Somebody has to run the equivalent experiments for whatever domain they care about.

27:12Cassidy: Third, the natural-ask gap suggests there's real product value in just getting agents to ask differently. Not necessarily more — Claude asks less and succeeds more. The variable to tune isn't frequency, it's timing and quality.

27:27Eric: And there's a research direction implied: training agents specifically to identify goal-level ambiguities early. Most of the existing alignment work on "asking for clarification" treats it as a uniform behavior. The paper's framework suggests it should be a *structured* behavior — different policies for different dimensions, conditioned on trajectory position.

27:52Cassidy: For listeners who want to go further, the underlying benchmark — LHAW, from Pu and colleagues earlier this year — provides the underspecified task variants this paper builds on. And HIL-Bench from Elfeki and colleagues is the natural sibling, which penalizes both over-asking and missed escalation but doesn't vary timing as an independent variable. Together those three pieces of work are mapping out a real new subfield.

28:20Eric: The thing I'll carry away from this paper is the reframing I mentioned earlier. Clarification as a time-sensitive resource rather than a capability. Once you have that framing, a lot of currently-confusing agent behavior starts making sense. Over-asking late-trajectory looks like a model that has the *capability* to ask but doesn't have a notion of expiration. Never-asking looks like a model whose threshold for noticing ambiguity is set too high. Neither is a problem of asking ability; both are problems of asking *value*.

28:56Cassidy: And the call to action is implicit but clear. We need agents that know when the time to ask is over. That know that a goal question at action thirty isn't a question anymore. It's an admission of defeat dressed up as one.

29:10Eric: That's a sharp way to land it.

29:12Cassidy: This is "AI Papers: A Deep Dive." Paper's linked in the show notes, along with some related reads if this is your kind of thing. Thanks for listening.