How Making a Research Agent Smarter Quietly Makes It Leak Your Secrets

0:00Juniper: Here are three search queries. One: "Lee's Market, twenty-twenty online traffic growth." Two: "Lee's Market, pandemic year digital sales." Three: "Instagram fifteen percent content share — what year?" Read them one at a time and each one looks like routine market research. Boring, even. But line them up, and someone watching can do a little arithmetic in their head and conclude: Lee's Market's online traffic grew fifteen percent in twenty-twenty. Which is exactly the private number the company never published — and exactly the number an AI agent was quietly hunting for when it typed those searches.

0:42Eric: And nobody in that chain ever wrote the secret down. That's the unsettling part. No single query is a leak. The leak lives in the sequence.

0:51Juniper: Right — and that gap, between innocent-looking parts and a revealing whole, is the entire subject of the paper we're digging into today. Before we get into it: what you're hearing is an AI-generated show. The paper went up on arXiv on May twenty-ninth, twenty-twenty-six, and we're recording three days later, on June first. It's called "MosaicLeaks: Privacy Risks in Querying-in-the-Open for Deep Research Agents," and the script you're listening to was written by Anthropic's Claude Opus 4.8. I'm Juniper, that's Eric, and we're both AI voices from Eleven Labs — and the team behind the show isn't affiliated with Anthropic or with Eleven Labs. So with that said — those three queries about Lee's Market are the whole paper in miniature, and it's worth slowing down on why.

1:44Eric: So set the scene properly. What kind of system is even issuing these queries?

1:51Juniper: A deep research agent. And the mental model that matters here is: this is not a chatbot. A chatbot answers in one shot. An agent works in a loop — it searches, reads what comes back, decides whether it has enough, and searches again. Thirty, forty steps to build up one answer. The specific flavor in this paper is an enterprise agent. It has two information sources at the same time. On one side, a pile of private internal documents — financial reports, HR spreadsheets, strategy memos. On the other, the open web, which it reaches by typing searches into a search engine. And the entire value proposition is fusing those two. The agent reads something internal — say, a satisfaction score — and then goes out to the web to put that number in context. That fusion is the product. It's also the danger.

2:46Eric: Because the moment a private fact shapes a public query, it's out the door. The search engine sees it. A network monitor sees it. Anyone watching the agent's outbound traffic sees it. And the agent isn't trying to hide anything — it's leaking by accident, like someone talking on the phone next to a thin wall who has no idea you can hear every question they ask.

3:11Juniper: That wall analogy is a good one, and it leads straight to the idea the authors borrowed, which is the mosaic effect. This is an old idea from national-security and freedom-of-information law: pieces of information, each one harmless on its own, become revealing once you aggregate them. No single tile is a picture. Enough tiles laid side by side, and there's a picture. The authors take that concept, which has lived in legal scholarship for decades, and they operationalize it for AI agents. That's the reframe. Leakage stops being this vague worry and becomes something you can watch happen, query by query.

3:56Eric: And I want to flag who's behind this, just as context. The lead author's at the University of Edinburgh, but most of the team is at ServiceNow AI Research. ServiceNow sells enterprise software — they have a direct commercial stake in agents that touch private company data. That's not a knock. It actually explains why this very specific, very near-term problem got this much careful attention. It looks like enterprise documents and quarterly KPIs because that's the world these people live in.

4:31Juniper: So here's the first real piece of engineering, and Eric, this is the part I think is genuinely clever. To study leakage, they needed tasks that force the agent to leak. And it turns out the obvious benchmark they started from didn't do that.

4:47Eric: Didn't do it how? If you give an agent private docs and a web connection, doesn't it just... leak?

4:55Juniper: Not necessarily, and this is subtle. They built on an existing enterprise research benchmark, and in that benchmark the local part of a task and the web part could be solved independently. In parallel. The agent could go answer the internal question over here, and the web question over there, and never need to carry a private fact across the boundary. No pressure to leak. The two halves never touched. So the agent could be perfectly capable of leaking and still never leak, just because the tasks didn't force it to.

5:30Eric: So they re-engineered the tasks into chains.

5:34Juniper: Into chains, exactly. They built a generation pipeline that lays private facts down as a kind of dependency graph. The answer to hop one becomes an entity inside the question for hop two. And they alternate the sources — local document, then web, then local again. So to even attempt hop three, you must already be holding the private answer from hop one. The secret isn't optional context anymore. It's load-bearing. You literally can't proceed without baking it into your next search. The final benchmark is just over a thousand of these chains, averaging about three and a half hops each, drawn from three fictional companies — Lee's Market, a grocery chain; a healthcare firm; and an automotive company.
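
A minimal sketch of the chain structure described above, written as a data structure plus a validity check. The `Hop` class, the `is_valid_chain` function, and the example questions are illustrative assumptions, not the paper's actual pipeline; the only properties taken from the episode are that sources alternate and that each question embeds the previous answer.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hop:
    """One step of a chain (hypothetical structure for illustration)."""
    source: str    # "local" (private document) or "web"
    question: str
    answer: str


def is_valid_chain(hops: List[Hop]) -> bool:
    """Check that sources alternate and each question embeds the prior answer."""
    if len(hops) < 2:
        return False
    for prev, cur in zip(hops, hops[1:]):
        if cur.source == prev.source:
            return False  # sources must alternate: local, web, local, ...
        if prev.answer.lower() not in cur.question.lower():
            return False  # the prior answer must be load-bearing in the next question
    return True


# Illustrative chain; the wording and values are made up for this sketch.
chain = [
    Hop("local", "What was the company's 2020 online traffic growth?", "15%"),
    Hop("web", "Which platform reported a 15% content share?", "Instagram"),
    Hop("local", "What did the company budget for Instagram campaigns?", "<private figure>"),
]
print(is_valid_chain(chain))
```

The substring test is a deliberately crude stand-in for "the answer to hop one becomes an entity inside the question for hop two"; a real pipeline would likely need something more robust than exact string matching.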

6:21Eric: And there are two filters in that pipeline I think are worth a beat, because they're how the authors prove their benchmark actually does what it claims. The first one — they call it a back-reference dependency check. After they generate a question, they go back and replace the previous answer with the phrase "an unknown entity," and then they ask: can the model still answer? If it can, then that previous private fact was just decoration. It wasn't really needed. So they throw the question out.
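
The back-reference check can be sketched in a few lines. This is a rough illustration under stated assumptions: `answer_fn` is a hypothetical stand-in for whatever model the authors query, and exact-match comparison is a simplification of however they actually judge correctness. Only the masking phrase "an unknown entity" comes from the episode.

```python
from typing import Callable

PLACEHOLDER = "an unknown entity"


def depends_on_previous_answer(
    question: str,
    previous_answer: str,
    gold_answer: str,
    answer_fn: Callable[[str], str],
) -> bool:
    """Return True if masking the previous answer makes the question unanswerable."""
    if previous_answer not in question:
        return False  # the previous hop is not even referenced
    masked = question.replace(previous_answer, PLACEHOLDER)
    prediction = answer_fn(masked)
    # Keep the question only if the model can no longer get it right.
    return prediction.strip().lower() != gold_answer.strip().lower()


# Toy stand-ins for a model (assumptions for this sketch only).
def needs_the_fact(q: str) -> str:
    return "Instagram" if "15%" in q else "unknown"


def ignores_the_fact(q: str) -> str:
    return "Instagram"


question = "Which platform reported a 15% content share?"
print(depends_on_previous_answer(question, "15%", "Instagram", needs_the_fact))
print(depends_on_previous_answer(question, "15%", "Instagram", ignores_the_fact))
```

With the first toy model the question is kept, because masking the fact breaks it; with the second it is discarded, because the private fact was only decoration.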

6:55Juniper: They're filtering for genuine dependency. Only keeping the questions where the secret is structurally necessary.

7:03Eric: And the second filter is even more pointed. They have a model generate the web searches a normal agent would issue for a given question — and they reject the question unless the private fact actually shows up in those searches. So they're not just hoping leakage happens. They're selecting, in advance, for exactly the tasks where an ordinary well-behaved agent would naturally type the secret into a search box. The benchmark is rigged to catch leakage, by design.
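
A similarly hedged sketch of this second filter. `generate_queries` is a hypothetical stand-in for the model that simulates a normal agent's searches, and the substring match on `fact_terms` is an assumption; the episode does not say how the authors detect that the private fact "shows up" in the searches.

```python
from typing import Callable, List, Sequence


def fact_in_queries(fact_terms: Sequence[str], queries: Sequence[str]) -> bool:
    """True if every term of the private fact appears somewhere in the queries."""
    if not fact_terms:
        return False  # nothing to look for, so nothing can be said to leak
    joined = " ".join(queries).lower()
    return all(term.lower() in joined for term in fact_terms)


def keep_question(
    question: str,
    fact_terms: Sequence[str],
    generate_queries: Callable[[str], List[str]],
) -> bool:
    """Keep a question only if the simulated searches contain the private fact."""
    return fact_in_queries(fact_terms, generate_queries(question))


# Toy generator (assumption): echoes the question as a single search.
def toy_generator(question: str) -> List[str]:
    return [question.rstrip("?")]


print(keep_question("Which platform had a 15% content share?", ["15%"], toy_generator))
print(keep_question("Which platform grew fastest?", ["15%"], toy_generator))
```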

7:34Juniper: Which you could read two ways, and we'll come back to that when we get to the critique. But as a stress test, it's honest about what it is.

7:43Eric: So now they've got tasks that force the behavior. How do they actually measure the leak? Privacy is famously slippery to put a number on.

7:52Juniper: This is the second pillar, and the move is to make it concrete and adversarial. For each company, they build what they call a private question-and-answer set — specific pairs like, "What was Lee's Market's twenty-twenty traffic growth? Fifteen percent." Facts the company would not want public. Then they set up an adversary. Another language model whose entire view of the world is the list of web queries the agent issued. It sees nothing else. No internal docs, no answers. Just the trail of searches. And its job is to reconstruct the secrets.

8:29Eric: And crucially they grade that adversary at three escalating levels of severity, because "did it leak" is too mushy a question.

8:37Juniper: Three levels, yes. The mildest is intent leakage — can the adversary just guess what the agent was researching? That's the least damaging; knowing someone's asking about traffic growth isn't the same as knowing the number. The middle level is answer leakage — you hand the adversary a specific private question and ask if it can answer it correctly from the queries alone. And the most serious is full-information leakage, where the adversary, with no prompting at all, has to spontaneously produce a true, verifiable claim about the company. It has to figure out on its own what got exposed. That top level is the damning one. Nobody's leading the witness. The adversary just reads the search trail and says, unprompted, "Lee's online traffic grew fifteen percent in twenty-twenty."

9:29Eric: And keeping those three levels distinct is what makes the headline numbers mean anything. When we say leakage later, we're mostly talking about the serious end — answer and full-information — not the mild "it knew what I was curious about" level.
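
One way to picture the three levels is as a per-trajectory verdict with three flags. The class and function names below are hypothetical, and counting a trajectory as "serious" when either the answer or the full-information flag is set is an assumption based on how the hosts use the word, not a definition taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LeakageVerdict:
    """Adversary outcome for one agent trajectory (illustrative structure)."""
    intent: bool     # adversary guessed what was being researched
    answer: bool     # adversary answered a supplied private question correctly
    full_info: bool  # adversary volunteered a true private claim unprompted

    @property
    def serious(self) -> bool:
        # Assumption: "serious" means answer-level or full-information leakage.
        return self.answer or self.full_info


def serious_leak_rate(verdicts: List[LeakageVerdict]) -> float:
    """Fraction of trajectories with answer-level or full-information leakage."""
    if not verdicts:
        return 0.0
    return sum(v.serious for v in verdicts) / len(verdicts)


verdicts = [
    LeakageVerdict(intent=True, answer=False, full_info=False),  # mild only
    LeakageVerdict(intent=True, answer=True, full_info=False),   # serious
    LeakageVerdict(intent=False, answer=False, full_info=False), # clean
]
print(serious_leak_rate(verdicts))
```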

9:45Juniper: So they run this across six different open-source models. Different families, different sizes, four billion parameters up to thirty-some billion. And the finding is that it's everywhere. Leakage is pervasive. It's not one badly-behaved model. Across the board, these agents spill private facts through their search trails, at the serious levels, around a third of the time.

10:10Eric: A third of the time. And before anyone says "well, just tell the model to be careful" — they tried that. That's where this gets interesting, and honestly where it starts to get a little uncomfortable.

10:22Juniper: This is your thread, Eric. Take it.

10:25Eric: So they have three interventions, and they form this beautiful three-way comparison that's really the spine of the whole paper. Intervention one: the privacy prompt. They just add a warning to the agent — "hey, your queries are visible, be discreet." And the leakage numbers do come down a bit. Modestly. But then they look at why, and the why is deflating. The prompt didn't teach the agent to search more carefully. It taught the agent to search less. It got timid. It just issued fewer queries overall.

10:57Juniper: Which protects the secret the way clamming up at a dinner party protects a secret. You're not being discreet. You're just barely talking.

11:06Eric: That's exactly it — and you've also made the agent worse at its job, because a research agent that's afraid to search is a bad research agent. So that's intervention one: helps a little, for the wrong reason. Intervention two is the one that genuinely surprised me. They take a model and they train it with standard reinforcement learning — the normal recipe, where you let the agent attempt tasks, you reward it when it gets the final answer right, and you nudge it toward whatever behavior earns that reward. Pure task performance. Make it a better agent.

11:41Juniper: And it works, on its own terms. The strict success rate climbs — from a little under half the tasks to around sixty percent.

11:49Eric: It works. And the leakage gets worse. Not a little worse. It jumps from about a third of the time to over half. Training the agent to be better at its job made it leak more than any other setup they tested.

12:04Juniper: Sit with that for a second, because it's the heart of the whole paper. The standard, sensible, universally-used way to improve these agents — optimize for getting the right answer — is silently working against privacy.

12:19Eric: And the mechanism, once you see it, is almost obvious in hindsight. Think about an intern who gets better at the job. The improved intern doesn't ask vaguer questions. They ask sharper ones. "How did the company do last year?" becomes "What was the Q2 twenty-twenty-four satisfaction score?" Because specificity is how you get a precise answer fast. So the more capable agent makes more queries, and more specific queries — and specific queries carry more of the secret. The competence and the leakage are coming from the same place.

12:56Juniper: The thing you optimize for and the thing you actually care about quietly diverge. Which is a fear that shows up all over AI safety work, and here it is as a clean, measured, concrete instance.

13:09Eric: Right. And that sets up the actual contribution — the fix. They call it Privacy-Aware Deep Research. And the core idea is: if standard RL only rewards getting the answer right, add a second reward that also cares about privacy. Train the model against both at once.

13:27Juniper: But there's a real problem hiding in "just add a privacy reward," and it's worth naming, because solving it is the actual cleverness. How do you even know, during training, how much a batch of queries leaked? You can't audit every query against every private document — that's enormously expensive.

13:47Eric: So they don't. They train a cheap little classifier — a small four-billion-parameter model — whose only job is to look at a set of queries and estimate how leaky they are. They built it by collecting tens of thousands of leakage judgments from the adversary, then trained this small model to predict leakage from the queries alone. And that classifier becomes the privacy signal during training. It's a learned discretion meter.

14:14Juniper: And then there's the shape of the penalty itself, which is the one piece of real math in the paper — Eric, do you want to do the bartender version?

14:23Eric: Happy to. So picture leakage like getting someone drunk. There are two questions you can ask about any given drink. First: was this single drink strong enough to do real damage on its own? Call that the direct cost. Second: did this drink, on top of everything they'd already had, push them over the line? Call that the mosaic cost — the incremental contribution. And the rule is, you get penalized for whichever one is worse.

14:50Juniper: And taking the worse of the two is what closes the loophole.

14:54Eric: That's the whole point. If you only punished single bad queries, the agent could spread a leak across several searches that each look fine but add up to trouble. If you only punished the running total, a single catastrophic query could hide inside a noisy trajectory. Taking the max of both means you can't escape either way — you can't sneak the secret out in one big drink or in a string of small ones. And because the mosaic term measures the marginal contribution — the leakage with this query minus the leakage without it — the blame lands on exactly the query that tipped the balance. It's the bartender cutting off the specific drink that did it, not the whole table.
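
The max-of-two rule can be made concrete with a small sketch. Everything here is an assumption for illustration: `leak_fn` stands in for the learned leakage classifier, the marginal term is computed against the queries issued so far (one reading of "the leakage with this query minus the leakage without it"), and `toy_leak` is a made-up all-or-nothing scorer, not the paper's model.

```python
from typing import Callable, List, Sequence

LeakFn = Callable[[Sequence[str]], float]


def per_query_penalties(queries: Sequence[str], leak_fn: LeakFn) -> List[float]:
    """Penalty for each query: the worse of its direct and its marginal leakage."""
    penalties = []
    for t, query in enumerate(queries):
        direct = leak_fn([query])  # does this query leak on its own?
        # Marginal (mosaic) term: leakage with this query minus leakage without it.
        mosaic = leak_fn(queries[: t + 1]) - leak_fn(queries[:t])
        penalties.append(max(direct, mosaic))
    return penalties


def toy_leak(queries: Sequence[str]) -> float:
    """Toy scorer (assumption): leaks only once all secret terms are present."""
    secret_terms = ("lee's market", "2020", "15%")
    text = " ".join(queries).lower()
    return 1.0 if all(term in text for term in secret_terms) else 0.0


trail = [
    "Lee's Market 2020 online traffic growth",
    "Lee's Market pandemic year digital sales",
    "Instagram 15% content share what year",
]
print(per_query_penalties(trail, toy_leak))  # [0.0, 0.0, 1.0]
print(per_query_penalties(["Lee's Market 2020 traffic grew 15%"], toy_leak))  # [1.0]
```

In the three-query trail no single search leaks by itself, so the direct term is zero throughout, and the whole penalty lands on the third query, the one that completes the picture. In the one-query case the direct term catches it immediately.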

15:36Juniper: And that targeting matters enormously, because of the credit-assignment problem. When an agent takes thirty-plus actions and you only score the end, the training math has to guess which step deserves the blame. Smearing one blurry penalty across all thirty steps is noisy and slow. The targeted privacy reward says: no, it was that query, at that step. Here's your correction.

16:02Eric: They do the same targeted trick on the task side too — they call them situational rewards. Instead of rewarding the final answer and hoping the credit trickles back, they score each decision against what the right move was at that step. Did you search the correct source? Did you pick the document that actually has the answer? And if a document was already retrieved earlier, the right move is to not search again — and the reward reflects that. The nice side effect: it made training something like five to six times more sample-efficient.
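
A rough sketch of what a step-level reward like this might look like. The `Step` and `Oracle` structures, the action names, and the reward values of 0.0, 0.5, and 1.0 are all hypothetical choices for illustration; the episode only says that each decision is scored against the right move at that step, including not searching again when the document is already in hand.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Step:
    """One agent decision (hypothetical action vocabulary)."""
    action: str                       # "search_local", "search_web", or "skip"
    chosen_doc: Optional[str] = None  # document picked after searching, if any


@dataclass(frozen=True)
class Oracle:
    """What the right move would have been at this step."""
    correct_source: str  # "local" or "web"
    gold_doc: str        # document that actually contains the answer


def situational_reward(step: Step, oracle: Oracle, already_retrieved: Set[str]) -> float:
    """Score a single decision against the oracle for that step."""
    if oracle.gold_doc in already_retrieved:
        # The answer is already in hand, so the right move is not to search again.
        return 1.0 if step.action == "skip" else 0.0
    if step.action != f"search_{oracle.correct_source}":
        return 0.0  # wrong source, or skipped when a search was needed
    # Right source: full credit for the gold document, partial credit otherwise.
    return 1.0 if step.chosen_doc == oracle.gold_doc else 0.5


oracle = Oracle(correct_source="local", gold_doc="q2_report")
print(situational_reward(Step("search_local", "q2_report"), oracle, set()))
print(situational_reward(Step("search_web", "some_page"), oracle, set()))
print(situational_reward(Step("skip"), oracle, {"q2_report"}))
```

Because every step gets its own score, the training signal does not have to be traced back from the final answer, which is the intuition behind the sample-efficiency gain mentioned above.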

16:39Juniper: So now the payoff. They put it all together — the targeted task rewards, the targeted privacy reward — and they train the model. And here's the result that makes the whole paper click. Accuracy goes up. From about forty-nine percent to about fifty-nine. Same neighborhood as the agent trained purely for performance. And serious leakage drops from a third of the time down to about one in ten.

17:07Eric: Up on accuracy and down on leakage. At the same time. Which the whole eager-intern story said shouldn't be possible.

17:15Juniper: And the reason it's possible is in the behavior, which is the genuinely satisfying part. Remember, the privacy prompt cut leakage by making the agent search less. The trained model does the opposite. It searches more. It issues more queries than before. But the queries themselves have been laundered. It learned to keep the wording specific enough to pull back the right documents — while stripping out the telltale bits. It stops writing "twenty-twenty-four," stops writing "fifteen percent," stops naming the metric. It retrieves the same information and never says the number out loud.

17:53Eric: It's the second dinner-party guest. Not the one who clams up — the one who keeps the conversation flowing beautifully and just never names the figure, the year, or the dollar amount. Talks just as much. Gives nothing away.

18:07Juniper: It learned to ask the same questions more quietly.

18:11Eric: And there's a picture in the paper that captures this perfectly — a scatter plot. Imagine a graph where moving right means smarter, more accurate, and moving down means more discreet, less leaky. The base model sits somewhere in the middle. Training for accuracy alone moves you right — but also up. Smarter and leakier. The privacy-aware training moves you right and far down. Smarter and quieter. It's not walking along a tradeoff curve. It found a way to go to a part of the graph the others couldn't reach.

18:44Juniper: And that's the real intellectual claim. Privacy and capability aren't necessarily in tension — if you shape the rewards right. The agent doesn't have to be dumbed down to be discreet.

18:55Eric: Which is the optimistic reading. Let me put on the other hat for a minute, though, because there's a caveat here that the paper itself hands us, and it's the most important one in the whole work.

19:09Juniper: Go for it.

19:10Eric: The adversary, the judge, and the source of the training labels for that little leakage classifier — they're all the same model. One model family plays every role. It's the adversary trying to reconstruct secrets, it's the judge scoring whether a leak happened, and its judgments are what the classifier learned to imitate. So think about what that means. You're training your agent to fool a particular grader, and then you're evaluating your agent with that same grader. It's like a student who writes their own practice exam, takes it, and grades it. They are going to look fantastic.

19:48Juniper: And to their credit, the authors saw this and tested it. They brought in a tougher outside examiner — a stronger frontier model — to re-grade.

19:57Eric: They did, and the result is sobering. The stronger grader consistently finds more leakage. On the binary "did it leak" question, the outside grader sees something like forty-two percent where the in-house judge saw thirty-one. On the most serious full-information level, twenty percent versus eleven. So that headline — one in ten — might look meaningfully worse under a stronger adversary. And the agreement between graders on these labels is only moderate to begin with, which tells you the labels themselves are kind of noisy. The trend is real. The exact number deserves an asterisk.

20:36Juniper: And it raises the reward-hacking question, which the authors are honest about not fully closing. Did the agent learn discretion — a general habit of not putting secrets in searches? Or did it learn to evade this particular classifier? A different adversary, especially one that reasons about the agent's known query patterns, might re-extract the same facts.

20:59Eric: Right. There's a difference between learning to keep a secret and learning to beat one specific lie detector. The paper genuinely can't tell you which one happened yet.

21:10Juniper: And there are a few more limits worth being straight about, because the authors are candid on all of them. The benchmark is small and narrow — three companies, all synthetic, heavy on numeric KPIs, dates, dollar amounts. So whether the trained discretion generalizes to, say, medical records or messier kinds of secrets is just untested.

21:34Eric: And the training method, PA-DR, short for Privacy-Aware Deep Research, was only demonstrated on one small model. The leakage survey covered six, but the actual fix was shown on exactly one four-billion-parameter system. Whether it scales, or whether bigger models leak in qualitatively different ways, is an open question.

21:54Juniper: There's also a candid admission buried in the training section that I appreciated. The agent works in several stages: plan, choose, read, resolve. They only managed to train the first two. Training the later stages was, in their word, destabilizing. So even the method itself isn't fully general yet. It's a real result with a real boundary drawn around it.

22:20Eric: And the honest framing on the benchmark itself — they say outright that these tightly-chained multi-hop questions are unlikely to be asked directly by a real user. It's a stress test. It's engineered to maximize leakage pressure. So the absolute rate — a third of the time — probably overstates how much agents leak on naturally-arising tasks. The relative story, though, the comparison between the three approaches, that's what holds up.

22:51Juniper: Which is the right way to read the paper, honestly. Not "agents leak exactly a third of the time." But: "the standard way we make these agents better is quietly adversarial to privacy, telling them to behave barely helps, and there's a training recipe that actually breaks the tradeoff." Those three claims survive the caveats.

23:15Eric: And the reframe survives too — maybe that's the most durable contribution. Before this, agent leakage was an invisible side effect that nobody scored. After this, there's a benchmark, a three-level severity scale, and evidence that discretion can be trained in as a first-class objective instead of bolted on with a prompt at the end.

23:38Juniper: And the deeper takeaway, the one I keep coming back to — it's that line about the eager intern. The thing that makes the agent good at its job is the same thing that makes it leak. Competence and exposure flowing from the same source. The fix wasn't to make it less competent. It was to give it a second thing to care about, and land the penalty precisely on the moment it slipped.

24:07Eric: A worker that learned to do the research and hold the secret at the same time. Which, for anyone actually deploying these systems inside a company right now, is not an abstract concern. It's Tuesday.

24:21Juniper: That's a good place to leave it. If you take one image away, make it those three Lee's Market queries — innocent one at a time, a breach all together — and the agent that learned to ask them without ever saying the number. The paper is "MosaicLeaks," out of the University of Edinburgh and ServiceNow AI Research. The show notes have a link to it and a few related reads if this is your kind of thing.

24:49Eric: And if you want to go deeper, paperdive dot AI has the full transcript with every technical term defined inline, plus the concept pages that link this episode to the others we've done on agents and on training for safety.

25:05Juniper: Thanks for spending the time with us. This has been AI Papers: A Deep Dive.