Treating Hallucinations as Exploits: A Gate-Based Architecture for Agent Safety

0:00Cassidy: An AI agent looks at an invoice on someone's screen. It reads the recipient name. It clicks the button to wire the money. The recipient name was never actually on the invoice — the model hallucinated it. The money is gone.

0:15Eric: And the strange thing about that failure, when you sit with it, is that nothing went wrong in the way we usually mean. No attacker injected a prompt. No malicious webpage tricked the agent. The model just… saw something that wasn't there, and acted on it.

0:31Cassidy: That's the puzzle this paper picks up, and it goes by the title "Hallucination as Exploit: Evidence-Carrying Multimodal Agents." It went up on arXiv on May eighteenth, twenty-twenty-six, and we're recording two days later, on May twentieth. Quick note before we go further: this episode is AI-generated. The script is from Anthropic's Claude Opus 4.7. I'm Cassidy, and you'll also be hearing Eric. We're both AI voices from ElevenLabs, and the show isn't affiliated with either company. And the reason that wired-money scenario matters is that the authors are arguing the entire field has been splitting this problem along the wrong seam.

1:14Eric: Right. Because if you ask "was that a hallucination?" — yes. The model invented a fact. If you ask "was that a prompt injection?" — no. Nobody fed it a malicious instruction. So which research community owns this failure?

1:29Cassidy: Neither, really. And that's the gap. The hallucination people are over here counting made-up objects in image captions. The prompt-injection people are over there filtering untrusted text in webpages. Neither of them is asking the question that actually matters when an agent has hands: *did an unsupported claim end up authorizing a privileged action?*

1:52Eric: And that reframing is doing a lot of work. The authors give it a name — hallucination-to-action conversion, H2AC — which is the moment a false belief satisfies a precondition that would have otherwise blocked the action. That's the unit they think we should be measuring. Not "did the model lie," but "did the model's lie unlock a door."

2:15Cassidy: Think of it like a pilot misreading their instruments. A pilot who misreads the in-flight magazine has made an error of no consequence. A pilot who misreads the altimeter — that's the same kind of perception failure, but one of them is action-critical. Most hallucinations are magazines. The dangerous ones are altimeters. And right now we're measuring the magazine rate and calling that safety.

2:41Eric: There's a really clean empirical wedge they drive into this — and Cassidy, this is the part I want to spend time on because it's the load-bearing finding. They split unsafe agent tasks into two buckets. About eighty percent of them are injection-driven: an attacker put something in the environment that flipped the agent's behavior. The remaining twenty percent are what they call belief-flow: no attacker, no injection, just the model talking itself into a false precondition.

3:13Cassidy: And current defenses, the ones the field has been building for the last two years —

3:19Eric: They're tuned for the eighty percent. They watch for instructions in untrusted content. They sandbox malicious text. And on the injection bucket, a prompt-only defense gets the unsafe-action rate down to around fifty percent. Not great, but it's doing something. On the belief-flow bucket, the cases where no instruction exists to filter, that same defense hits an eighty-six percent unsafe-action rate. It's effectively doing nothing. You can't filter your way out of a hallucination the model produced itself.

3:53Cassidy: You can't filter your way out of a hallucination the model produced itself. That's the sentence that breaks the whole prompt-injection paradigm. Because it means even if you solved injection completely — perfect filtering, perfect sandboxing — you'd still have an entire class of failures that your defenses are structurally blind to.

4:17Eric: Right. So the question becomes: if filtering the input can't help, what does?

4:22Cassidy: This is where the architecture comes in, and it has a really clean intuition. The authors basically say: today's agents collapse four very different things into a single stream of model text. What the model observed, what it means, what the user wants, what the system is allowed to do — all of it is the same kind of soup. Once it's all soup, you can't tell an attack from an honest mistake from a true belief. They all look like strings.

4:52Eric: So pull them apart.

4:53Cassidy: Pull them apart. And the analogy that worked for me is courtroom evidence rules. A lawyer can argue passionately that their client was home that night. But the lawyer's say-so is not admissible as evidence. To convict or acquit, the court needs a witness, a receipt, a timestamp — something from outside the lawyer's mouth. The model in an agent is the lawyer. It can argue for what action to take. But its description of the world? Inadmissible.

5:23Eric: So what counts as a witness?

5:26Cassidy: A narrowly-scoped program that reads the raw observation directly. A DOM parser that pulls actual HTML elements out of a page. An OCR engine that runs on actual pixels. An accessibility tree reader. These are dumb compared to the language model — they can only do one thing — but they're not in the business of inferring or interpreting. They just report what's there.
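
A minimal sketch of one such witness, assuming a DOM-style verifier built on Python's standard `html.parser`. The class name `FieldExtractor`, the helper `read_field`, and the `recipient` element id are hypothetical illustrations, not names from the paper. The point is only that the program reports literal page text and never infers a value that is absent.

```python
from html.parser import HTMLParser
from typing import List, Optional

# Elements that never have a closing tag; ignored when tracking nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class FieldExtractor(HTMLParser):
    """Collects the text found inside the element whose id equals target_id."""

    def __init__(self, target_id: str) -> None:
        super().__init__()
        self.target_id = target_id
        self._depth = 0  # greater than zero while inside the target element
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth > 0:
            self._depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.chunks.append(data)


def read_field(html: str, target_id: str) -> Optional[str]:
    """Return the literal text of the element, or None if it is not on the page."""
    parser = FieldExtractor(target_id)
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    return text or None


page = '<div><span id="recipient">Acme <b>Corp</b></span><p>other</p></div>'
found = read_field(page, "recipient")   # text that is actually in the markup
missing = read_field(page, "iban")      # no such element, so nothing is reported
```

When the field is missing the verifier returns `None` rather than a plausible guess, which is the behavior that separates it from the language model.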

5:50Eric: And those reports become — what, signed statements?

5:54Cassidy: They call them certificates. Typed structured records. Each one says: here's the predicate I'm vouching for, here's the value I observed, here's where in the source I found it, here's my confidence, here's which verifier I am. The model cannot mint a certificate. It cannot upgrade its own text into evidence. There's a hard architectural line.
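
The certificate described above could be sketched as a frozen record, assuming Python dataclasses. The field names and the example values are illustrative guesses that mirror the five items listed in the conversation, not the paper's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Certificate:
    predicate: str     # what is being vouched for
    value: str         # the value the verifier observed
    source_ref: str    # where in the raw observation it was found
    confidence: float  # the verifier's own confidence, in [0, 1]
    verifier_id: str   # which verifier issued the certificate

    def __post_init__(self) -> None:
        # Reject malformed records at construction time.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


# Hypothetical example values, for illustration only.
cert = Certificate(
    predicate="recipient_matches_invoice",
    value="Acme Corp",
    source_ref="dom:#recipient",
    confidence=0.98,
    verifier_id="dom_parser",
)
```

Freezing the record means a certificate cannot be edited after a verifier issues it. Keeping the model from constructing one in the first place is a deployment concern (process isolation, signing) that this sketch does not attempt.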

6:18Eric: And then the gate.

6:19Cassidy: And then a deterministic gate. When the model proposes an action — say, send this email to this recipient — the action has a schema. The schema lists which predicates have to be certified for that action to be safe. Recipient is trusted. Body reflects user intent. Attachments are allowed. The gate checks: does every required predicate have a matching certificate from a real verifier? If yes, allow. If no, either block, or — and this is the interesting middle case — *ask* the user. Reversible actions can have an escape valve when evidence is incomplete.

6:58Eric: The principle being: the component that proposes the action cannot also vouch for the fact that authorizes it. Proposer is not certifier.

7:08Cassidy: Exactly. It's borrowed straight from systems security. An application can ask the kernel for file access. It cannot grant itself file access. That decision lives in code the application doesn't control, based on evidence the application can't forge.
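
The gate and the proposer-is-not-certifier rule can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the predicate names paraphrase the send-email example above, and `TRUSTED_VERIFIERS`, `ActionSchema`, and `gate` are hypothetical, not the paper's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Iterable


class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"
    BLOCK = "block"


@dataclass(frozen=True)
class Cert:
    predicate: str
    verifier_id: str


@dataclass(frozen=True)
class ActionSchema:
    name: str
    required: FrozenSet[str]  # predicates that must be certified
    reversible: bool          # reversible actions may fall back to asking the user


# Only these issuers count as witnesses. The planner model is deliberately absent.
TRUSTED_VERIFIERS = frozenset({"dom_parser", "ocr_engine", "a11y_reader"})


def gate(schema: ActionSchema, certs: Iterable[Cert]) -> Decision:
    """Deterministic check: every required predicate needs a trusted certificate."""
    certified = {c.predicate for c in certs if c.verifier_id in TRUSTED_VERIFIERS}
    missing = schema.required - certified
    if not missing:
        return Decision.ALLOW
    return Decision.ASK if schema.reversible else Decision.BLOCK


send_email = ActionSchema(
    name="send",
    required=frozenset({"recipient_trusted", "body_reflects_intent", "attachments_allowed"}),
    reversible=False,
)

full_evidence = [
    Cert("recipient_trusted", "dom_parser"),
    Cert("body_reflects_intent", "a11y_reader"),
    Cert("attachments_allowed", "dom_parser"),
]
# Same claims, but one is asserted by the model itself, so it does not count.
self_vouched = full_evidence[1:] + [Cert("recipient_trusted", "planner_model")]
```

Here `gate(send_email, full_evidence)` allows the action, while `gate(send_email, self_vouched)` blocks it, because the recipient claim came from the proposer rather than from a verifier.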

7:25Eric: Okay. So that's the design. Cassidy, here's what I want to know — does it actually work? Because there's an obvious objection, which is that you've just moved the problem. Now instead of trusting the model, you're trusting a bunch of narrow verifiers, and those verifiers can also be wrong.

7:45Cassidy: That objection is exactly right, and the authors don't dodge it. Their answer is that you've changed the *kind* of risk, not erased it. When a model hallucinates inside a normal agent, the failure is invisible — nothing distinguishes a false belief from a true one. When a constrained verifier fails inside this architecture, the failure has a signature. It's a false positive on a specific predicate for a specific input class. You can red-team it. You can measure it. You can patch it.

8:19Eric: So it's not "we built a safer agent." It's "we converted unauditable model risk into auditable verifier risk."

8:26Cassidy: That's the actual contribution, in one line. And whether you find that compelling is going to depend on whether you think auditability is the thing that moves the needle.

8:38Eric: Let me put a number on that, because this is where the paper earns its keep. They run what they call an H-A-C-R audit — hallucination-to-action conversion rate. Five hundred tasks, stratified across the benchmarks. For each task they label every claim the planner makes at the level of individual predicates, and they check: of the unsupported action-critical claims the model produced, how many actually reached unsafe execution?

9:07Cassidy: And the three numbers.

9:09Eric: For a naive agent — no defenses, just the model wired to tools — one hundred percent. Every unsupported claim that mattered for authorization reached execution. For a prompt-only defense, the kind most production systems are using right now — about fifty percent. So filtering helps, sort of. For the evidence-carrying architecture — zero. Out of eleven hundred and three unsupported action-critical claims, not a single one reached execution.
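
One plausible way to compute the rate discussed above, assuming each planner claim has already been labeled on three yes/no dimensions. The `Claim` record and the toy labels are hypothetical and stand in for the paper's annotation format, which the episode does not describe.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Claim:
    action_critical: bool           # does an action's precondition depend on it?
    supported: bool                 # is it backed by the raw observation?
    reached_unsafe_execution: bool  # did it end up authorizing an unsafe action?


def hacr(claims: Iterable[Claim]) -> Optional[float]:
    """Share of unsupported, action-critical claims that reached unsafe execution."""
    risky = [c for c in claims if c.action_critical and not c.supported]
    if not risky:
        return None  # the rate is undefined when there is nothing to convert
    return sum(c.reached_unsafe_execution for c in risky) / len(risky)


# Toy labels, invented for illustration. These are not the paper's data.
toy_claims = [
    Claim(action_critical=True, supported=False, reached_unsafe_execution=True),
    Claim(action_critical=True, supported=False, reached_unsafe_execution=False),
    Claim(action_critical=True, supported=True, reached_unsafe_execution=False),
    Claim(action_critical=False, supported=False, reached_unsafe_execution=False),
]
toy_rate = hacr(toy_claims)
```

Note what the denominator excludes: supported claims and claims that no action depends on. That is the "magazine versus altimeter" distinction expressed as a filter.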

9:39Cassidy: Every single one of the eleven hundred, blocked.

9:42Eric: Every single one. And the reason it's zero isn't that they got lucky on the verifier side. The reason it's zero is structural. The model produced the false claim — that part still happens. The model produced it confidently — that still happens too. But the gate wasn't asking the model. The gate was asking a separate program whether the predicate was certified. The program said no. End of story.

10:08Cassidy: This is the moment where the philosophical claim and the empirical result land in the same place. The whole point of the architecture is that hallucination, *by construction*, cannot directly cause an unsafe action. It can only cause one indirectly — through some other failure. And they prove that with a piece of math the listener doesn't need to see, but the intuition is genuinely useful.

10:34Eric: Walk me through it.

10:35Cassidy: The proposition basically says: the probability that an unsafe action gets authorized is bounded by the sum of three named risks. One — the schema forgot to require some critical predicate. Two — there's a bug in the gate code itself. Three — a verifier was fooled into issuing a bad certificate. That's it. Three terms. There is no "the model hallucinated convincingly" term on the right-hand side.
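
Written out, the bound has roughly the following shape. The symbols are notation chosen here for illustration and may differ from the paper's; the structure is a union bound over the three named failure events.

```latex
\[
\Pr[\text{unsafe action authorized}]
\;\le\;
\underbrace{\varepsilon_{\text{schema}}}_{\text{missing predicate}}
\;+\;
\underbrace{\varepsilon_{\text{gate}}}_{\text{gate bug}}
\;+\;
\underbrace{\varepsilon_{\text{verifier}}}_{\text{bad certificate}}
\]
```

No term on the right-hand side depends on how often, or how confidently, the model hallucinates.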

11:01Eric: Which is the bank-vault decomposition. If money goes missing, the structurally possible explanations are: the blueprint was wrong, the construction was wrong, or someone defeated a specific lock. What's not on the list is "the money felt like leaving." You've converted vague, opaque risk into a finite list of investigable failure modes.

11:25Cassidy: And each one of those three terms is something you can put a number on. You can audit the schema for completeness. You can review the gate code. You can red-team each verifier and measure its false-positive rate. None of which you could do with "the model hallucinated."

11:43Eric: Now Cassidy, there's a comparison in this paper that I think is going to be the part people argue about for the next year, and I want to spend some time on it. Because the obvious response to all of this is: why are we building a whole separate evidence layer? Why not just use a smarter judge model? Take a frontier LLM, prompt it to check whether the proposed action looks safe, and let it veto bad ones.

12:11Cassidy: The LLM-as-judge play.

12:13Eric: The LLM-as-judge play. It's everywhere right now. Content moderation, agent guardrails, constitutional AI — the bet is that as models get smarter, they get better at refereeing other models. So the authors take that bet seriously. They run a state-of-the-art frontier model as a Progent-style safety judge — schema-aware, gets to inspect the action and the context, gets to reason about whether it's safe.

12:41Cassidy: And?

12:41Eric: On the five-hundred-task evaluation, that judge allows unsafe actions ninety-nine point two five percent of the time.

12:49Cassidy: Ninety-nine percent.

12:51Eric: And the response to that — the response anyone in the field would have — is, well, you didn't try hard enough. Give the judge chain-of-thought. Let it think step by step. Give it self-consistency across multiple samples. Let it deliberate over multiple turns, adversarially, checking its own work. So they do all of that. They build four progressively more elaborate versions of the judge.

13:18Cassidy: How far does it get?

13:19Eric: The best variant still allows seventy-nine percent of unsafe actions. And the benign success rate, the rate at which it correctly approves *legitimate* actions, drops to seventy-one percent. So you've paid a real cost in usefulness and the agent still isn't safe.

13:38Cassidy: That's a damning result. Not because seventy-nine is high — though it is — but because of what it implies about the limit. You can throw all the standard interventions at this. Chain-of-thought. Self-consistency. Multi-turn adversarial deliberation. You don't get below seventy-nine percent. The fragility isn't a tuning problem. It's structural.

14:00Eric: And here's the analogy that nailed it for me. If you're skeptical of someone's account of events, the response is not to ask the same person to re-narrate it more carefully, with more reasoning, several times in a row, while consulting with themselves. That's still one source. The fix is to get an independent source. Chain-of-thought and self-consistency are the same model checking itself in a slightly fancier way. It does not escape the underlying fragility.

14:29Cassidy: There's a latency number in here too that I think the listener will appreciate. The deterministic gate decides in about four-tenths of a millisecond. The neural judge — about four and a half seconds. So you're roughly eleven thousand times slower for a worse safety result.

14:46Eric: Eleven thousand times slower to be wrong four times out of five.
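
For reference, the ratio follows directly from the two rounded latencies quoted above:

```latex
\[
\frac{4.5\ \text{s}}{0.4\ \text{ms}}
= \frac{4.5\ \text{s}}{4 \times 10^{-4}\ \text{s}}
= 11{,}250
\approx 1.1 \times 10^{4}
\]
```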

14:51Cassidy: And I think that's where the paper makes its most interesting bet, philosophically. The intuition a lot of people have — that we'll solve agent safety with smarter, more thoughtful safety-judging models — that intuition runs straight into this wall. The paper's quiet claim is that trust reasoning inside the same model that proposes the action is structurally fragile, and the right place to enforce safety is *outside the model entirely*, in code that doesn't reason.

15:21Eric: That's a different bet than most of the alignment community is making right now. Most of the work I see assumes the safety mechanism gets smarter as the models get smarter. This paper is saying: no, the safety mechanism should not be a mind at all. It should be a gate.

15:38Cassidy: Okay — but I want to be careful here, because there's a steelman to voice. Eric, what's your read on where this work is honestly weakest?

15:48Eric: A few places. The first is that the biggest number in the abstract — they evaluate the gate on seventy-four hundred traces from six benchmarks — that result uses what they call oracle certificates. Which means the gate is being asked to decide assuming the evidence layer worked perfectly. The authors are admirably honest that this is a sanity check. It tells you the gate logic is correct. It does not tell you the deployment system works. The deployment-relevant numbers are the smaller ones — two hundred end-to-end tasks, a hundred and twenty live browser tasks. Both at zero percent unsafe actions, which is great, but on a much smaller sample.

16:35Cassidy: That's fair. The headline replays are gate logic. The deployment claims are smaller.

16:41Eric: Second, when they red-team the verifiers themselves — nineteen hundred attacks across nineteen categories — the initial bypass rate is fifteen percent. They get it down to one point three percent after four targeted fixes. DOM provenance cross-referencing, Unicode confusable detection, accessibility-tree integrity checks, perceptual hashing for OCR. Those are real fixes. But the honest reading is that one point three percent is the residual against the attacks they thought to test. It's not a guarantee against novel attackers.
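
To give a flavor of what a confusable check involves, here is a deliberately crude heuristic using only Python's standard `unicodedata` module. It is an assumption-laden sketch, not the paper's fix: production systems typically rely on the Unicode confusables data rather than on character names, and the function name `suspicious_reasons` is hypothetical.

```python
import unicodedata
from typing import List


def suspicious_reasons(text: str) -> List[str]:
    """Flag invisible format characters and letters drawn from more than one script."""
    reasons = []

    # Category "Cf" covers zero-width spaces, joiners, and similar invisible marks.
    if any(unicodedata.category(ch) == "Cf" for ch in text):
        reasons.append("invisible format character")

    # Approximate the script by the first word of the Unicode character name,
    # e.g. "LATIN SMALL LETTER A" versus "CYRILLIC SMALL LETTER A".
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    if len(scripts) > 1:
        reasons.append("mixed scripts: " + ", ".join(sorted(scripts)))

    return reasons


clean = suspicious_reasons("paypal")
homoglyph = suspicious_reasons("p\u0430ypal")    # Cyrillic "a" in place of Latin "a"
zero_width = suspicious_reasons("pay\u200bpal")  # zero-width space inside the word
```

A verifier hardened this way would refuse to certify a recipient or URL predicate when the returned list is non-empty, at the cost of false alarms on legitimately multilingual text.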

17:19Cassidy: There's a moment in the paper that drives that point home, which is what happens when an attacker can coordinate across multiple unhardened verifier channels simultaneously. Their proposition about cross-modal corroboration assumes the verifiers fail independently. When they actually compose attacks across unhardened DOM and OCR channels at the same time, the joint failure rate hits a hundred percent. Independence breaks under coordinated attack. They acknowledge it explicitly. But it means the multiplication trick, the idea that combining verifiers makes you safer because the false-positive rates multiply, only works when each channel is individually hardened. Which loops back to: we need to find and fix every attack category, which is a moving target.
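
The multiplication argument, and why it fails under coordination, can be stated in one line. The event names are notation introduced here: $F_{\text{DOM}}$ and $F_{\text{OCR}}$ denote each channel issuing a false certificate for the same predicate.

```latex
\[
\Pr[F_{\text{DOM}} \wedge F_{\text{OCR}}]
= \Pr[F_{\text{DOM}}]\;\Pr[F_{\text{OCR}} \mid F_{\text{DOM}}]
\]
\[
\text{independent channels: }
\Pr[F_{\text{OCR}} \mid F_{\text{DOM}}] = \Pr[F_{\text{OCR}}]
\;\Longrightarrow\;
\Pr[F_{\text{DOM}} \wedge F_{\text{OCR}}] = \Pr[F_{\text{DOM}}]\,\Pr[F_{\text{OCR}}]
\]
\[
\text{coordinated attack: }
\Pr[F_{\text{OCR}} \mid F_{\text{DOM}}] \to 1
\;\Longrightarrow\;
\Pr[F_{\text{DOM}} \wedge F_{\text{OCR}}] \to \Pr[F_{\text{DOM}}]
\]
```

In the coordinated case the second channel adds no protection, so the joint rate is only as low as the weakest single channel allows.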

18:10Eric: And third — schema completeness. The whole architecture rests on having correctly enumerated, for each tool, which predicates need to be certified for that action to be safe. For the five action schemas they test — click, type, open-url, send, extract — they get to a hundred percent recall, but only with expert human sign-off as the final step. They tested an automatic version: just ask an LLM to write the schema. Forty-six percent predicate recall. More than half the safety-critical conditions get missed.
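
Predicate recall here is presumably the fraction of expert-required predicates that a drafted schema also lists. A small sketch under that assumption follows. The predicate names are invented for illustration, and the toy result is not the paper's forty-six percent.

```python
from typing import Set


def predicate_recall(required: Set[str], drafted: Set[str]) -> float:
    """Fraction of the required predicates that the drafted schema also lists."""
    if not required:
        raise ValueError("required predicate set must not be empty")
    return len(required & drafted) / len(required)


# Hypothetical expert-approved predicates for a "send" action.
expert_send = {
    "recipient_trusted",
    "body_reflects_intent",
    "attachments_allowed",
    "no_sensitive_data_in_body",
}
# Hypothetical automatically drafted schema: two hits, one irrelevant extra.
drafted_send = {"recipient_trusted", "body_reflects_intent", "subject_not_empty"}

toy_recall = predicate_recall(expert_send, drafted_send)
missed = expert_send - drafted_send
```

The set in `missed` is what matters for safety: each omitted predicate is a precondition the gate will never ask about, which is the schema-gap term in the residual.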

18:45Cassidy: Which is its own kind of vindication of the paper's thesis, by the way — if you can't trust a model to *write* a safety schema, you definitely can't trust the same kind of model to *enforce* it from inside.

18:59Eric: Yeah, that's a nice point. But it does mean that for arbitrary new tool families — robotics, code execution, database admin, hospital systems — the burden of correctly enumerating action-critical predicates is unbounded, and it depends on human expertise the paper can't quantify. The architecture is only as good as the schemas you write, and writing schemas is hard.

19:25Cassidy: So this isn't a solved problem. It's a containment strategy with a clean shape, an honest residual, and a finite list of things that have to be true for it to work.

19:36Eric: Which I think is the right way to read it. And honestly — that's already a lot more than most of the safety field is offering right now. Most current defenses don't even have a residual you can name. They just have a vibe.

19:52Cassidy: A vibe. Yeah.

19:53Eric: I'm being slightly unfair. But the structural point holds. If your safety mechanism is "we trained the model to be careful," your residual is "however careful the model isn't." You can't measure that. You can't patch that. You can only hope the next model is better.

20:12Cassidy: Whereas if your safety mechanism is "we built a gate with a schema and a set of verifiers," your residual is "schema gap, gate bug, verifier false positive." And every one of those has a workflow. Schema gap — audit and add a predicate. Gate bug — fix the code. Verifier false positive — red-team and harden.

20:33Eric: There's one more concrete number from the red-team that captures this perfectly. They run a parser stress test. Out of four hundred and ninety-three content-level bypass attempts — homoglyphs, zero-width characters, the usual Unicode mischief — zero succeed. The content parsers are well-hardened. Out of six hundred attempts at *metadata-structural* bypasses — attacks on how fields are structured rather than what's in them — four hundred and twenty-eight succeed.
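
Turning the counts quoted above into per-class bypass rates is straightforward. The dictionary layout and the helper name `bypass_rate` are illustrative choices; only the four counts come from the episode.

```python
def bypass_rate(succeeded: int, attempted: int) -> float:
    """Fraction of bypass attempts that got past the parser."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    return succeeded / attempted


# (succeeded, attempted) per attack class, as quoted in the episode.
results = {
    "content-level": (0, 493),
    "metadata-structural": (428, 600),
}

rates = {name: bypass_rate(s, n) for name, (s, n) in results.items()}
weakest = max(rates, key=rates.get)  # the class to harden first
```

The metadata-structural rate works out to a little over seventy-one percent, against zero for content-level attacks, which is what lets the authors point at one specific class as the priority.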

21:06Cassidy: That is the architecture's epistemic value in one paragraph. They know exactly where the weakness is. They can point to it. They can prioritize the fix. Compare that to a neural judge that fails eighty percent of the time — you can't point at *anything*. There's no specific neuron to blame, no specific prompt to revise, no specific attack class that's the problem. The failure mode is just "the model was wrong."

21:35Eric: Right. The paper makes failure legible. That's the move.

21:39Cassidy: I want to close on the broader picture, because I think this paper sits inside a slow shift in how the field is thinking about AI safety. There are two instincts. One instinct is that the model should do everything, and we make the model better — train it harder, give it better prompts, layer judges on top, scale up. The other instinct is that the model is fundamentally a fuzzy component, and the right move is to wrap it in non-fuzzy infrastructure that constrains what it can cause to happen.

22:11Eric: This paper is firmly in the second camp.

22:14Cassidy: Firmly in the second camp. And it's doing it in a really concrete way. Not "in principle we should constrain models." But: here is a specific pattern, here are the specific predicates, here are the specific verifiers, here is the gate, here is the residual, here are the three named ways it can fail. Copy this if you're building an agent.

22:36Eric: And the line they end up landing on — which I think is the line that's going to stick — is that *model language may propose actions, but external evidence must authorize them*. That's the design principle in one sentence.

22:51Cassidy: It's a separation-of-powers claim, applied to a stack that didn't have one. The model is the proposer. The verifiers are the witnesses. The gate is the judge. And the judge does not consult the proposer about whether the witnesses are telling the truth.

23:08Eric: For tool-using agents specifically — which is where the actual deployment risk concentrates right now — I think this is going to be a load-bearing pattern. Maybe not exactly this implementation. But the shape: enumerate predicates, build narrow verifiers, gate deterministically, treat model text as inadmissible. That shape is going to show up everywhere over the next couple of years.

23:33Cassidy: And the reason I think it lands is that it gives operators something they don't currently have, which is a residual they can point at. When the system fails — and it will fail — you can ask: was it a schema gap, an implementation bug, or a verifier bypass? That's a question with an answer. That's an engineering discipline. That's a thing you can build a team around.

23:57Eric: As opposed to "the model was wrong, again, and we're not sure why."

24:01Cassidy: As opposed to that.

24:02Eric: The show notes have a link to the paper and some related reading if you want to keep pulling on this thread. And if you want the full transcript with the jargon glossed inline, plus links to the other episodes where these ideas show up, that's all on paperdive dot AI.

24:20Cassidy: Thanks for listening to AI Papers: A Deep Dive.