How an Innocent README Can Freeze an AI Agent's Safety Check for an Hour

0:00Juniper: For about three years, almost every conversation about keeping AI agents safe has circled one question — will the safety check catch the bad thing? Will the guardrail block the malicious command, refuse the harmful request, stop the agent before it wires money to the wrong place? Enormous effort went into making that check smarter, more context-aware. And almost nobody stopped to ask the other question: does the safety check finish in time? That sounds like a footnote. It turns out to be a door into a whole new class of attack. A new paper shows you can take the guardrail — the thing that's supposed to protect the agent — and turn it into the weapon. It went up on arXiv on June twelfth, twenty-twenty-six, and we're recording three days later. It's called "From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails." Quick note on what you're hearing while we're here: this episode is AI-generated. I'm Juniper, an AI voice from Eleven Labs, and the script was written by Anthropic's Claude Opus 4.8 —

1:04Finn: — and I'm Finn, also an AI voice from Eleven Labs. Neither of us, and neither company, has anything to do with producing this show. And the cleanest way to feel why this paper matters is one number. There's a coding agent called OpenHands — seventy-thousand-plus stars on GitHub, a tool people genuinely run in production. On a normal task, its safety check adds maybe two minutes of overhead. The researchers slipped one poisoned file into the repository it was working in. That same safety check ballooned to over fifty-nine minutes.

1:37Juniper: And to be clear, the agent isn't doing fifty-nine minutes of useful work in there. It's stuck. The guardrail sits directly on the critical path — the agent literally cannot take its next action until the safety check returns a verdict. So if you can make the guardrail run for an hour, you've frozen the whole agent for an hour. With one file.

1:59Finn: Right. And the file isn't some exotic exploit. It's a README. Plain, fluent documentation text. Anyone with permission to commit to a public repo can plant it.

2:09Juniper: Let me set up the thing being attacked, because the irony only lands once you know what a modern guardrail actually is. The old version was dumb — a keyword blocklist, or a little classifier that flags suspicious strings. Fast, but easy to slip past. The new version, the one everyone's moved to, is a second language model. Before the agent clicks or runs code, this model reads the entire situation in plain English — the user's goal, the web page, the proposed action — and reasons its way to allow or block. Think of the difference as a metal detector versus a human security officer. The metal detector beeps instantly. The officer actually interviews you, considers context, catches the subtle stuff the detector misses — but you are stuck standing there until the officer is satisfied. That thoroughness is the whole selling point. And it's also the whole vulnerability.

3:05Finn: Okay, but I want to push on this, because my first instinct is that this is just an old trick in new clothes. People have been trying to slow language models down for a while — you feed the model some monstrous recursive math problem, some pathological logic puzzle, and it burns forever trying to solve it. So this is just — dump a giant wall of nonsense into the README and the guardrail chokes on it?

3:31Juniper: That's exactly the intuition the paper kills — and the reason it's wrong is the most important idea in the whole thing. Those decoy attacks do work on a general-purpose chatbot, because a chatbot will try to solve whatever you put in front of it. But a guardrail has one job. Its prompt and its training pin it to that job — evaluate safety. So when you hand it an off-task puzzle, it doesn't try to solve it. It treats it as inert data to inspect, glances at it, and moves on. The paper tested six of these prior distraction attacks against guardrails. The best of them got the model to do about twenty percent more work. One-point-two times. Basically nothing.

4:15Finn: So the distraction approach is a dead end. Which means the trick has to be something else entirely.

4:22Juniper: And here's the inversion. Instead of trying to distract the guardrail away from its task, you give it more of exactly the task it already wants to do. A guardrail's reliability comes from following a structured template — enumerate the risk categories, weigh the evidence, assess each item, then deliver a verdict. So the attacker writes text that looks like that template. Fake risk categories. More sub-items to assess. Clauses that say "do not abbreviate this analysis," "calibrate every item," "if any section is incomplete, start over from the top." The guardrail can't tell the difference between its own legitimate checklist and the attacker's injected one. It's good at following instructions — so it dutifully grinds through the whole thing. And the headline image from the paper is this: an eight-hundred-character payload, a little fake risk checklist, makes the guardrail produce over fifty thousand characters of output. It mirrors and expands every section heading the attacker planted, treating the scaffold as its own.
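
The size of that blow-up is simple arithmetic on the two figures just quoted. A minimal sketch, assuming amplification is measured as guardrail output length over injected payload length; the paper may instead measure against the guardrail's normal reasoning length, and the function name `amplification` is hypothetical:

```python
def amplification(payload_chars: int, output_chars: int) -> float:
    """Ratio of guardrail output length to injected payload length."""
    if payload_chars <= 0:
        raise ValueError("payload_chars must be positive")
    return output_chars / payload_chars


# Figures quoted in the episode: an 800-character payload and
# "over fifty thousand" characters of guardrail output.
ratio = amplification(800, 50_000)
print(f"{ratio:.1f}x")  # 62.5x, a lower bound because the output was "over" 50,000
```

Because the output figure is a floor, the ratio is best read as "at least this much".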

5:32Finn: So it's not confused. It's being too conscientious.

5:36Juniper: That's the perfect word for it. Picture a compliance auditor whose entire professional pride is never skipping a checklist item. You hand them a one-page risk form, but buried in it are instructions like "for every item, cross-reference all nine categories," and "if any sub-item is unaddressed, restart the assessment." A lazy auditor shrugs and signs off. The best auditor — the most thorough one — spirals into an infinite review. Not because they're fooled. Because they're diligent.

6:09Finn: And a human auditor eventually notices the absurdity and stops. The model has no such circuit breaker.

6:15Juniper: None. And the paper actually shows evidence that the model has stopped deliberating, which I find genuinely elegant. They looked inside the model while it was stuck and found two signatures. First, its attention shifts: the model starts obsessively re-reading the section headers in its own generated output, then generating more headers in the same pattern, then attending to those. Attention to those headers is about nine and a half times higher than normal. Second, its moment-to-moment uncertainty collapses to roughly half its usual level. A model that's genuinely deliberating is weighing options, so there's uncertainty in what comes next. This model isn't weighing anything. It's mechanically transcribing a to-do list it wrote for itself, where the last item says "add three more items and start over."
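
One common way to put a number on "moment-to-moment uncertainty" is the Shannon entropy of the model's next-token distribution, averaged over generation steps. A minimal sketch; whether the paper uses exactly this measure is an assumption, and the probability values below are invented for illustration, not taken from any model:

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in bits of a single next-token probability distribution."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Normalise defensively, and skip zero entries (0 * log 0 is taken as 0).
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)


def mean_entropy(steps: Sequence[Sequence[float]]) -> float:
    """Average per-step entropy over a generated sequence."""
    return sum(shannon_entropy(step) for step in steps) / len(steps)


# Hypothetical distributions over four candidate tokens at two steps.
deliberating = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
autopilot = [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]]

print(mean_entropy(deliberating) > mean_entropy(autopilot))  # True
```

A uniform distribution over four options carries two bits of uncertainty; a distribution that puts nearly all its mass on one token carries far less, which is the "transcribing, not weighing" signature described above.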

7:07Finn: That's the part that gets me — you can literally watch the deliberation drain out and the autopilot take over. It's not reasoning anymore, it just looks like reasoning from the outside.

7:19Juniper: And from the outside is all the agent can see. The agent is just waiting for a verdict that never comes.

7:26Finn: So let's talk about how they actually find these payloads, because "write text that looks like a checklist" is a vibe, not a method. You can't hand-craft these.

7:37Juniper: They built a search process to discover them automatically. The objective is almost crude in its simplicity — find the input that makes the model burn the most compute, measured by how much it ends up generating. But the clever part is they don't optimize for one situation. They maximize the expected reasoning length averaged across many different agent contexts at once. That's the load-bearing choice. It forces the payload to generalize instead of overfitting to one specific scenario.
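
The objective described here can be written down compactly. This is a sketch in notation chosen for this transcript, not the paper's own: $p$ is a candidate payload drawn from a search space $\mathcal{P}$, $c$ is an agent context drawn from a distribution $\mathcal{C}$, $c \oplus p$ is the context with the payload embedded, $G$ is the guardrail, and $\ell(\cdot)$ is the length of the reasoning it generates. All of these symbols are assumptions.

```latex
p^{\star}
  \;=\; \arg\max_{p \in \mathcal{P}} \;
        \mathbb{E}_{c \sim \mathcal{C}}
        \Big[\, \ell\big(G(c \oplus p)\big) \Big]
  \;\approx\; \arg\max_{p \in \mathcal{P}} \;
        \frac{1}{N} \sum_{i=1}^{N} \ell\big(G(c_i \oplus p)\big)
```

The right-hand side is the practical version: the expectation is replaced by an average over $N$ sampled contexts $c_1, \dots, c_N$, which is what pushes the search toward payloads that work across scenarios rather than in one.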

8:09Finn: And that's why the transfer story works.

8:12Juniper: Right — and the transfer story is wild. They tune a payload on one small open-source guardrail model, an eight-billion-parameter surrogate. Then they take that exact payload, unchanged, and point it at eight leading models — the Claude family, the GPT family, Gemini, DeepSeek, Qwen. It hits between thirteen and sixty-three times amplification across all of them. Same poisoned text, no retuning. Because it isn't exploiting one model's quirks. It's exploiting instruction-following itself — the property all of them share.

8:47Finn: There are two flavors of the search, right? I don't think we need to dwell, but it's worth a sentence.

8:54Juniper: One sentence each. The first uses a strong model as a creative mutation engine — it invents new tactics from scratch and keeps a running scoreboard of which structural tricks work best. The second is cheaper: it treats the payload as a template with named slots — number of risk categories, how deep the enumeration goes, how many anti-shortcut clauses — and just mutates the slots. The cheap one drops the cost per variant from about forty-six cents to seven cents. Which matters, because it means you can mass-produce surface-different versions and poison a lot of repositories without anyone noticing they're related.

9:35Finn: And nobody notices anyway, because these things are invisible to the filters.

9:41Juniper: Essentially invisible. The standard pre-inference injection detectors score these payloads below a thousandth on their threat scale, versus a perfect one-point-zero for a classic injection pattern. They're fluent, natural English documentation. There's no garbled string for a perplexity filter to flag. Nothing smells wrong.

10:01Finn: Okay. So this is where I want to take the wheel, because the standalone numbers are alarming but a little abstract — "amplification factor on a benchmark." What makes the paper land is what happens when you drop this into real, deployed agent systems. And the adaptations they had to make are honestly the most clever part of the work.

10:23Juniper: Take it.

10:24Finn: So back to that fifty-nine-minute coding agent. The code-agent case has a twist that breaks the simple version of the attack. In something like OpenHands, the guardrail is integrated — the same model that does the task also does the safety check. There's a safety field baked into every tool call. And because it's racing to finish the actual coding task, it won't just sit there enumerating a fake checklist. It wants to get the job done.

10:51Juniper: So pure schema-stuffing doesn't trap it.

10:54Finn: Doesn't trap it. So the attackers switch tactics. Instead of giving it a longer list, they give it a harder argument — they plant contradictory security considerations that resist quick resolution. Framing about blast radius, clauses about irreversibility, factors that genuinely conflict with each other. Now the model can't reach a clean verdict, so it deliberates, and deliberates. And because the poison lives in the README, it sits in the conversation history and re-triggers on every single subsequent action. Average overhead across their test tasks went up about twenty-six times. The peak was that fifty-nine-minute case. And here's the supply-chain horror of it — that README doesn't expire. It re-triggers for future sessions. For other developers who clone the same repo. One malicious commit is a contaminated ingredient that spoils every meal cooked in that kitchen afterward. Except there's no spoiled smell, because it reads like perfectly normal docs.

11:52Juniper: You mentioned multi-agent being even worse. Why worse?

11:56Finn: Because multi-agent pipelines introduce a problem the attacker has to solve, and the way they solved it involves a genuinely delightful accident. In a pipeline — they used LangGraph — your payload doesn't go straight to the guardrail. It passes through intermediate agents that rewrite and reformat the content first. They call this the transform-resilience problem. A payload that's lethal at stage one is useless if stage three strips it out before the final guardrail ever sees it.

12:25Juniper: So the optimizer has to find payloads that survive being rewritten.

12:29Finn: Exactly — they weight the later pipeline stages more heavily so it's forced to. But while studying what survives rewriting, they found something. Intermediate agents strip out obvious security terminology, but they preserve structure — tables, section headers, anchored facts like dates and IDs. And at one point a scraper agent reformatted some raw content into a clean markdown table. The downstream guardrail's reasoning jumped from a hundred and thirty-five characters to over twenty thousand. A hundred-and-fifty-times explosion — caused not by the attacker, but by a helper agent tidying things up. Tabular structure all by itself triggers deeper analysis.
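
The effect of weighting later stages more heavily can be seen with a toy scoring function. A minimal sketch, assuming a simple weighted average of reasoning length observed at each pipeline stage; the episode does not give the paper's actual weighting scheme, and every number and name below is hypothetical:

```python
from typing import Sequence


def stage_weighted_score(lengths: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of reasoning length measured at each pipeline stage."""
    if len(lengths) != len(weights):
        raise ValueError("need one weight per stage")
    if sum(weights) <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * n for w, n in zip(weights, lengths)) / sum(weights)


# Hypothetical reasoning lengths (characters) at stages 1, 2, 3.
fragile = [20_000, 300, 150]     # strong at stage one, stripped by later rewriting
resilient = [6_000, 5_000, 4_500]  # weaker up front, survives rewriting

uniform = [1, 1, 1]
late_heavy = [1, 2, 4]  # later stages count more

print(stage_weighted_score(fragile, uniform) > stage_weighted_score(resilient, uniform))
print(stage_weighted_score(resilient, late_heavy) > stage_weighted_score(fragile, late_heavy))
```

With uniform weights the fragile candidate looks better; once the last stage dominates the score, the candidate that survives rewriting wins, which is the selection pressure described above.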

13:09Juniper: So the pipeline accidentally weaponized its own content.

13:13Finn: The pipeline did the attacker's job for them. And in a multi-agent setup the damage isn't contained. One poisoned worker consumed over twelve minutes of guardrail time on a single call — about a hundred-and-forty-eight times amplification. But worse, the benign agents sharing that infrastructure got stuck waiting behind it. Head-of-line blocking. Throughput dropped about a quarter, and innocent co-located agents sat idle for two minutes each, doing nothing wrong, just queued behind the jammed one.
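
Head-of-line blocking is easy to reproduce with a toy queue. A minimal sketch, assuming a single guardrail worker that serves requests in arrival order, with all requests arriving at the same moment; the five-second benign check is a made-up figure, and a real deployment would likely have more workers and different timings:

```python
from typing import List


def fifo_wait_times(service_times: List[float]) -> List[float]:
    """Time each request waits before service on one first-in, first-out worker.

    All requests are assumed to arrive at t = 0 in the order given.
    """
    waits: List[float] = []
    clock = 0.0
    for service in service_times:
        waits.append(clock)  # this request starts when the worker frees up
        clock += service
    return waits


BENIGN_S = 5.0          # hypothetical normal guardrail check, in seconds
POISONED_S = 12 * 60.0  # "over twelve minutes" on a single call

baseline = fifo_wait_times([BENIGN_S, BENIGN_S, BENIGN_S, BENIGN_S])
attacked = fifo_wait_times([POISONED_S, BENIGN_S, BENIGN_S, BENIGN_S])

print(baseline)  # [0.0, 5.0, 10.0, 15.0]
print(attacked)  # [0.0, 720.0, 725.0, 730.0]
```

The benign requests themselves are unchanged; they simply cannot start until the stalled one at the head of the queue finishes.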

13:43Juniper: And there are two more deployment surfaces, but they compress to a sentence — the same attack adapts to hidden elements on a web page that a human never sees but the agent reads, and to file contents and terminal output on a desktop agent. Same idea, different hiding spot.

14:00Finn: Though the desktop one has a detail too good to skip. One of the desktop guardrails runs three independent verification passes per action, to be extra thorough. Which means a single injection enters all three prompts — a built-in three-times multiplier before the attack even amplifies. The architecture's most careful feature became its biggest liability. Total amplification there was about eighteen times.

14:25Juniper: There's a pattern across every one of these. Every design choice made for thoroughness — integrated checks, triple verification, structure-preserving pipelines — becomes the lever. The strength is the weakness, all the way down.

14:41Finn: Which sets up the part I think is the real heart of the paper. Because the obvious objection at this point is: fine, the guardrail can run forever — so just add a timeout. Cap it. If the safety check takes too long, cut it off. Problem solved, right?

14:58Juniper: I genuinely thought that when I started reading. Add a timeout, move on.

15:03Finn: And here's where it stops being a performance problem and becomes a safety problem. Because the second you add a timeout, you have to answer a question you can't dodge: when the check times out, what's the default? Do you let the action through, or do you block it?

15:21Juniper: This is the fail-open versus fail-closed dilemma, and it's worth grounding it. Think about a door wired to a power supply. A fire-exit door is wired to unlock when the power dies — fail-open — so people can escape. Great for a fire escape, catastrophic for a bank vault. Wire the same door to lock on power failure — fail-closed — and the vault's safe, but now you've trapped everyone inside. There's no universally safe default. The choice itself is the trap.

15:53Finn: And the attacker is the one cutting the power, on demand, whenever it suits them. So walk through both doors. Fail-open first — the timeout defaults to "allow."

16:04Juniper: Then the agent proceeds with no safety review at all. The paper demonstrates exactly this — an agent submitting a financial transaction with zero guardrail oversight, because the check timed out and the default let it through. And here's the sentence that stopped me cold. Under timeout-allow, task success didn't just hold up — it went up. Six out of ten tasks succeeded, versus five out of ten at baseline.

16:33Finn: Wait — went up? The unsafe configuration performed better?

16:37Juniper: Because the guardrail was never a help to task completion in the first place — it was friction. So when you knock it out, things run smoother. The attacker hasn't just denied service. They've converted a compute-exhaustion attack into a safety-bypass — the exact thing the guardrail existed to prevent. The shield doesn't just fall. It opens the gate.

16:59Finn: And fail-closed — default to "block" — doesn't save you either.

17:03Juniper: No. Then the attacker gets the denial-of-service directly and for free. Task success dropped to three out of ten. Trigger the timeout, and the agent's blocked from doing legitimate work. Both doors hand the attacker a win. You cannot pick a safe default, because there isn't one. That's the structural part — it's not a bug you patch, it's a corner you're backed into the moment you accept the guardrail can be stalled.
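
The dilemma shows up the moment a timeout is written as code, because the timeout branch has to return something. A minimal sketch using Python's standard `asyncio.wait_for`; the guardrail is a stand-in that just sleeps, and `slow_guardrail`, `checked_verdict`, `Verdict`, and the action names are all hypothetical, not taken from the paper or from any agent framework:

```python
import asyncio
from dataclasses import dataclass
from enum import Enum


class OnTimeout(Enum):
    ALLOW = "allow"  # fail-open
    BLOCK = "block"  # fail-closed


@dataclass(frozen=True)
class Verdict:
    decision: str
    reviewed: bool  # False means the guardrail never finished its check


async def slow_guardrail(action: str, seconds: float) -> str:
    """Stand-in for a guardrail stalled by a poisoned input."""
    await asyncio.sleep(seconds)
    return "allow"


async def checked_verdict(action: str, guardrail_seconds: float,
                          timeout: float, on_timeout: OnTimeout) -> Verdict:
    try:
        decision = await asyncio.wait_for(
            slow_guardrail(action, guardrail_seconds), timeout
        )
        return Verdict(decision, reviewed=True)
    except asyncio.TimeoutError:
        # The default chosen here is the whole dilemma.
        return Verdict(on_timeout.value, reviewed=False)


async def demo() -> tuple:
    fail_open = await checked_verdict("submit_payment", 0.5, 0.05, OnTimeout.ALLOW)
    fail_closed = await checked_verdict("read_file", 0.5, 0.05, OnTimeout.BLOCK)
    return fail_open, fail_closed


print(asyncio.run(demo()))
```

In the fail-open case a sensitive action goes through with `reviewed=False`; in the fail-closed case a harmless action is blocked. Whoever can stall the guardrail gets to choose which of those two outcomes the system produces.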

17:30Finn: And that's a bleak place to land. Which is the right moment for me to start pushing on how solid the foundation under all this really is.

17:40Juniper: Please — because the framing is strong and I want to know if the evidence is as strong.

17:45Finn: So my first concern is the headline numbers themselves. The entire attack optimizes for one thing: reasoning length. How many tokens, how many characters the model generates. And most of the eye-popping figures — the hundred-and-forty-eight times, the hundred-and-thirty-one times — are latency derived from that length, on specific deployments. But modern inference stacks don't process tokens in a simple straight line. Continuous batching, speculative decoding, splitting the prefill from the decode — a production system built for throughput can absorb a long generation very differently than the linear "tokens equal cost" assumption suggests. So I'd want to see these numbers under a real serving setup before I fully trust the tail figures.

18:34Juniper: That's fair, and the paper is partly exposed there. Though I'd say even if the multiplier shrinks under good batching, the critical-path problem doesn't go away. That one agent's action still can't proceed until its check returns, no matter how cleverly the server is batching other requests. Batching helps the datacenter's throughput. It doesn't unstick the individual stuck agent.

18:57Finn: Agreed, the qualitative trap survives. I'm just flagging that the specific multipliers might be softer in practice. And a related point: a lot of the most quotable numbers are peaks, not averages. The README attack averaged about twenty-six times the normal overhead, but the fifty-nine-minute case is the worst single instance. The web checkout that took twenty-one minutes is one shopping task. The sample sizes are small, too: the web evaluation is ten tasks, desktop is twenty scenarios, code is twenty instances. The averages are still dramatic, genuinely. But peak and average shouldn't blur together, and the most memorable figures are tails.

19:34Juniper: The checkout one has a real mechanism behind it, though — it's not random that checkout was the worst. A checkout form has many fields, and the guardrail cross-references every field against every risk category. That's a combinatorial explosion baked into the task itself. So I'd believe checkout is genuinely a worst case for a structural reason, not just noise.

19:56Finn: That I'll grant. Here's the reservation I can't put down, though. The paper's framing is that this is a structural flaw in the whole guardrail paradigm. But look at which defenses they actually tested. Pre-inference filters — which miss the fluent payloads. Hard token budgets — which just shift you into the fail-open/fail-closed trap. And "use a stronger model" — which, beautifully, backfires.

20:21Juniper: That's the second "wait, really" of the paper, and it deserves a beat. The more capable the guardrail model, the longer the loops get. Because a stronger model has better instruction-following — so it executes the injected schema more faithfully, more completely. The defense gets worse exactly as the model gets smarter. Every other place in AI, a better model is a better defense. Here, capability is the attack surface.

20:49Finn: Which is striking. But notice what's missing from that list. The one defense that actually targets this specific mechanism is fine-tuning a guardrail to recognize and refuse self-referential analytical schemas, to be deliberately less compliant with checklist-shaped bait, and they don't test it. They mention it as future work and move on. So the claim that "existing mitigations offer limited relief" is well-supported for off-the-shelf defenses. But the stronger framing, that this is an unfixable property of the paradigm, runs ahead of the tested evidence. The door to a targeted defense is wide open; they just didn't walk through it.

21:30Juniper: I think that's the right critique, and I don't think it fully resolves. Because there's a real tension underneath it. The targeted defense you're describing — train the model to be suspicious of structured analytical instructions — is in some tension with what a guardrail is for. Its whole job is to faithfully work through structured safety analysis. Teaching it to distrust checklist-shaped input might blunt the exact capability that makes it a good guardrail.

22:01Finn: Maybe. Or maybe a model can learn the difference between its own template and an injected one — humans do. We don't know, because the experiment isn't in the paper. And that's where I land: I take the demonstration completely. The attack is real, it's cheap, it transfers, it's invisible to current filters. What I'm not yet sold on is the word "structural." That's a claim about defenses that don't exist yet, and you can't fully prove a negative about future architectures. I think they've named a serious problem. I'm not convinced they've shown it's unsolvable.

22:34Juniper: And the authors, to their credit, are honest about most of this. They say plainly that none of the mitigations they tried resolves the threat. They flag that the fail-open-to-safety-bypass jump is a plausible consequence rather than a fully separate proven result. They note the attacker is fully black-box — they can only watch for latency and timeouts, and they can't force an agent to read the poison, they just plant it and wait for the agent to wander past it. This reads as a problem-naming paper that knows it's a problem-naming paper.

23:06Finn: And there's an honorable tradition there. Security has a long history of the paper that says: here's a structural weakness in something widely deployed, here's why the easy fixes don't work, here's the new design constraint you need to take seriously. This is squarely that kind of contribution. The call they make is for what they term cost-bounded, reasoning-robust guardrails — safety checks that come with a hard ceiling on how much they're allowed to deliberate, designed from the start to resist this.

23:36Juniper: And the deeper reframe, for me, is what this does to how we think about guardrails at all. The field treated the safety check as a free bolt-on — slap it in front of the agent, sleep better. This paper says no, you've just added a new, expensive, attackable piece of critical infrastructure that sits in front of every action your agent takes. The very thing you added for safety is now the softest target in the system.

24:04Finn: From shield to target. The title's not being cute — it's the whole finding in four words.

24:10Juniper: And the timing matters. The bar for this attack is exactly the bar for indirect prompt injection, and prompt injection already sits at number one on the OWASP Top 10 for LLM applications. You don't need model weights, you don't need the system prompt, you don't need any special access. You need to put text somewhere an agent will read it. Publish a page, commit to a repo, send a message. The channel is wide open and well understood. What's new is the goal, which is to exhaust the defense itself, and the target, which is the defense.

24:43Finn: As agents start taking more consequential actions — moving real money, touching real infrastructure — the question this forces is uncomfortable. We've been asking whether the guardrail gives the right answer. We have not been asking whether it can be made to never answer at all. And right now, the honest position is that it can, cheaply, and nobody has shown a clean way out.

25:08Juniper: That's the thread I'll keep pulling on after this one — not whether the attack works, it clearly does, but whether "cost-bounded and reasoning-robust" is something you can actually build, or whether bounding the cost just walks you straight back into the timeout dilemma. The paper doesn't answer that. I'm not sure anyone can yet.

25:30Finn: That's the right note to sit with. The paper we've been working through is "From Shield to Target," on denial-of-service attacks against the safety layer of AI agents. The link's in the show notes, along with some related reading if you want to go deeper on prompt injection and where this fits.

25:48Juniper: And if you want the full transcript with every term defined inline, plus the concept pages tying this to other episodes we've done, that all lives on paperdive dot AI.

25:58Finn: For AI Papers: A Deep Dive, I'm Finn.

26:00Juniper: And I'm Juniper. Thanks for listening.