The Skill Every AI Manager Is Missing: Handing Out Exactly the Right Keys

0:00Cassidy: A great carpenter is not automatically a great contractor. The contractor never picks up a hammer — they hire the electrician, sequence the jobs, hand each crew the keys to only the rooms they need, and sign off on the finished house. Now here's a finding that should stop you: researchers built a test that measures large language models purely on that skill — being the boss, not the worker — and every model they checked handed its workers roughly twice the file access they actually used. Not one of them cleared fifty percent. Quick heads up before we get into it — this is an AI-made explainer, and both voices you're hearing are AI too.

0:40Eric: And the twist that breaks the usual assumption: paying more barely helped. Across the twelve models they tested, the cost to run one job varied more than a hundred to one — from under a dollar to about ninety-three dollars. The management scores varied less than four to one. The cheapest open model was sitting right on the efficiency frontier — not dominated by anything, holding its own against models that cost twenty-five times as much.

1:08Cassidy: So the promise for the next stretch is simple. By the end you'll understand why "smart enough to solve the problem" and "good enough to run the team that solves it" turn out to be two different skills — and why the second one is missing at every price point we can currently buy.

1:25Eric: And why that matters beyond a leaderboard. The whole industry is starting to wire models up as managers — one model spinning up helper agents, handing out tools and file permissions, stitching the results together. That's shipping in real products right now. If the thing you're paying a premium for isn't actually the thing that makes a good manager, a lot of deployment assumptions are built on sand.

1:50Cassidy: Let me set up what "being the boss" actually means here, because it's more specific than it sounds. The paper is called ClawArena-Team, out of UNC Chapel Hill and collaborators, posted at the end of June 2026. And one quick flag before anyone goes hunting: the model names in here — things like claude-fable-5, gpt-5.5, deepseek-v4-pro — are forward-dated, fictional stand-ins. This reads as a near-future preprint. So think "frontier flagship" and "cheap open model," not something you can go download this afternoon.

2:26Eric: Right, and the reason management has been invisible is subtle. We have plenty of benchmarks that measure a model as a solo problem-solver — fix this bug, answer this hard question, finish this web task. And we have multi-agent frameworks where the roles and the wiring are set up in advance and you watch the system run. But nobody had isolated the manager itself. Every previous test tangled the boss's skill together with how good the workers were. If the team succeeds, was that good management, or just strong workers covering for a sloppy boss? You couldn't tell.

3:06Cassidy: So their fix is the clean part of the whole design — freeze the workers. Every model under test commands the exact same fixed pool of helper agents. Same three types, served locally so they're free and identical on every run. Now if outcomes differ, it's the manager. Nothing else moved.

3:26Eric: And they force delegation instead of hoping for it. The main agent is deliberately blinded — it can only perceive text. So when a job includes a photo, an audio clip, a video, the manager literally cannot see or hear it. When it tries to open one of those files, the system hands it back a placeholder that basically says "you can't perceive this — send it to a specialist." Think of a foreman who can read every written report but has to send a sighted worker to actually look at the blueprint.

4:01Cassidy: And "modality" there just means the kind of data — text, image, audio, video — and which helper can handle which. The vision helper reads images, the audio-and-video helper handles the clips, the text helper does the rest.
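
A minimal sketch of the blinding and routing just described, assuming a simple lookup by file extension. The extension table, the helper names, the placeholder wording, and the fallback to text for unknown extensions are all illustrative assumptions; the episode only states that the manager perceives text and that vision, audio-and-video, and text helpers exist.

```python
from pathlib import PurePosixPath

# Hypothetical extension table; the benchmark's real file types are not given.
MODALITY_BY_SUFFIX = {
    ".txt": "text", ".md": "text", ".csv": "text", ".json": "text",
    ".png": "image", ".jpg": "image",
    ".wav": "audio", ".mp3": "audio",
    ".mp4": "video",
}

# Which helper handles which modality, as described in the episode.
# The helper names themselves are made up for this sketch.
HELPER_BY_MODALITY = {
    "text": "text_helper",
    "image": "vision_helper",
    "audio": "audio_video_helper",
    "video": "audio_video_helper",
}

# Paraphrase of the placeholder; the real wording is not quoted in the episode.
PLACEHOLDER = "[cannot perceive this file; delegate it to a specialist helper]"


def modality_of(path):
    """Guess the modality from the file suffix (assumption: unknown means text)."""
    return MODALITY_BY_SUFFIX.get(PurePosixPath(path).suffix.lower(), "text")


def manager_view(path, text_content=""):
    """What a text-only manager gets back when it opens a file."""
    return text_content if modality_of(path) == "text" else PLACEHOLDER


def route(path):
    """Pick the helper type that can actually perceive this file."""
    return HELPER_BY_MODALITY[modality_of(path)]
```

For example, `route("contracts/rental.jpg")` picks the vision helper, while `manager_view("contracts/rental.jpg")` returns only the placeholder.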

4:16Eric: Exactly. And that blinding is load-bearing, because it makes the delegation real. The manager can't quietly route around the team and do everything itself. To get the job done at all, it has to build a crew and hand out work. Now the clever heart of it — the control surface. When the manager creates a helper, it fixes four things: a set of instructions, which type of model it is, a subset of tools, and — this is the one that matters — a whitelist of exactly which files and folders that helper is allowed to touch. And every single access gets checked against the real filesystem. If a helper tries to reach a path it was never granted, that's counted. If it tries to sneak out through a symlink, rejected and counted. So for the first time, least privilege becomes a number.
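
One way the whitelist check could work is sketched below. This is not the benchmark's harness: the function and class names are hypothetical, and the sketch assumes a POSIX-style filesystem. The idea it illustrates is the one from the episode: resolve every requested path against the real filesystem first, so that `..` segments and symlinks are judged by where they actually land, and count every rejected attempt.

```python
import os


def is_allowed(requested, whitelist, workspace_root):
    """Return True only if the fully resolved path sits inside a granted path."""
    # realpath collapses ".." segments and follows symlinks, so a link that
    # points outside the grant is judged by its real destination.
    real = os.path.realpath(os.path.join(workspace_root, requested))
    for granted in whitelist:
        granted_real = os.path.realpath(os.path.join(workspace_root, granted))
        if real == granted_real or real.startswith(granted_real + os.sep):
            return True
    return False


class HelperAccessLog:
    """Hypothetical per-helper ledger of touched paths and forbidden attempts."""

    def __init__(self, whitelist, workspace_root):
        self.whitelist = list(whitelist)
        self.workspace_root = workspace_root
        self.touched = set()
        self.forbidden = 0

    def access(self, requested):
        """Record one access attempt; return whether it was permitted."""
        if is_allowed(requested, self.whitelist, self.workspace_root):
            self.touched.add(requested)
            return True
        self.forbidden += 1
        return False
```

The `forbidden` counter is what turns least privilege into a number: a helper granted only `invoices` that reaches for `invoices/../private/keys.txt`, or for a symlink inside `invoices` that points at `private`, is refused and tallied either way.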

5:09Cassidy: And least privilege is just the security principle: give someone exactly the access they need and nothing more. The house-sitter gets the key to the one room with the plant, not the master key to the whole building. Both get the plant watered — one is reckless.

5:27Eric: And here's how they turn all of that into a score. There's one equation, and it encodes the entire philosophy. The Subagent-Management Score is your task correctness multiplied by how cleanly you managed the team.

5:41Cassidy: The report-card version: your final grade is your test score times your conduct grade. Ace the exam but wreck the classroom, and the conduct multiplier drags you down. But — and this is the asymmetry that matters — no amount of good behavior lifts you if you failed the test.

6:00Eric: That's the value judgment sitting in the math. Management can only ever discount you, never inflate you. Because think about the obvious alternative — just score whether the job got done. That would reward a model that solves everything by handing every helper unrestricted access to the whole workspace. Which is exactly the behavior you don't want in production. By multiplying instead of adding, correctness becomes necessary but not sufficient. You have to be right and disciplined.

6:33Cassidy: And that conduct grade is an average of four things, but you don't need the alphabet soup. Two of them are easy — don't give a read-only worker a delete button, and send the images to the vision model. Two of them are hard — scoping the tools tightly, and scoping the file access tightly. Hold onto that split, because it's the whole first finding.
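
Written out, the score described above might look like the following. The symbols are assumptions introduced here for illustration: $C$ is task correctness, $M$ is the conduct grade, and $m_1,\dots,m_4$ are the four management axes. The sketch also assumes every quantity is scaled to $[0,1]$ and that the average is unweighted; the episode says only "an average of four things."

```latex
\[
  \mathrm{SMS} = C \times M,
  \qquad
  M = \frac{1}{4}\,(m_1 + m_2 + m_3 + m_4),
  \qquad
  C,\ m_i \in [0,1]
\]
% Two illustrative cases (made-up numbers):
%   correct but sloppy:       C = 1,  M = 0.6  =>  SMS = 0.6
%   disciplined but wrong:    C = 0,  M = 1.0  =>  SMS = 0
% Because M <= 1, SMS <= C: conduct can discount a score, never inflate it.
```

Under an additive rule such as $(C + M)/2$, the second case would still earn half marks, which is exactly the outcome the multiplication rules out.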

6:56Eric: One thing to keep in your back pocket before the results, Cassidy. Everything we're about to call sloppy management is measured against that one fixed pool of workers. That choice is what makes the experiment clean — but it also ties every finding to one particular level of worker skill. We'll come back to why that's the sharpest thing you can push on.

7:19Cassidy: So the first big result. The bottleneck is not perception, and it's not reasoning. It's permission discipline. Those two easy axes I mentioned, routing images to the vision model and not giving a read-only worker a mutating tool, capable models basically nail both. Nearly saturated. Where everyone falls apart is precision: how tightly they scope what they hand over. And the headline number is the one from the top. Workspace-permission precision, meaning the fraction of the files you granted a helper that it actually touched, never reaches fifty percent. For any model. Fifty percent would mean you handed over twice what got used. Nobody even hit that, so subagents are routinely granted at least twice the files they need.
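
The precision metric is simple enough to sketch directly. The function name and the empty-grant convention are assumptions, and the file names in the usage note are made up; the definition itself (touched files over granted files) is the one given above.

```python
def workspace_precision(granted, accessed):
    """Fraction of granted files that the helper actually touched."""
    granted = set(granted)
    if not granted:
        # Assumption: granting nothing wastes nothing, so score it as perfect.
        return 1.0
    # Accesses outside the grant are forbidden, so only the overlap counts.
    return len(granted & set(accessed)) / len(granted)
```

Granting four files when two get used gives 0.5, the "twice what got used" line. Granting nine and using one gives about 0.11, the same order as the collapse described in the next exchange, though those counts are illustrative rather than taken from the paper.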

8:07Eric: And this isn't the weak models dragging the average down, Cassidy — this shows up in the best one too, right?

8:15Cassidy: That's the part I'd put on a poster. In one scenario, the single strongest manager routed every modality perfectly and kept its read-only discipline perfect — both of those, a clean score. And then it granted one helper ten directories and all six tools, when it needed a fraction of that. Its file precision on that task collapsed to about eleven percent. The best boss in the building still throws the keys around.

8:43Eric: And the reason that's more than wastefulness is the blast radius. Every extra folder you grant is more surface area if that helper misbehaves — or gets hijacked by an instruction hidden in the data it's reading. Give a temp the master key to every office instead of the one cabinet, and the damage a single bad actor can do scales with what you handed over. So over-granting isn't just expensive context. It's an unforced safety risk, and it's the one thing no model does well.

9:16Cassidy: Second finding, and this is the one with teeth for anyone actually deploying. If more discipline came bundled with a bigger model, fine — you'd just pay up. It doesn't. On the cost-versus-score plot, spending more money essentially does not buy you a better manager. The numbers: the cost to run one job ranged from about eighty cents to about ninety-three dollars — over a hundred to one. The scores ranged under four to one. The flagship, claude-fable-5, costs about ninety-three dollars a run and scores sixty. A mid-tier model costs a quarter of that and scores fifty-four. So four times the money bought you six points. Meanwhile the cheap open model, deepseek-v4-pro, scores forty-six at about a dollar-seventy a run — sitting right on the efficiency frontier, while several expensive models get dominated by cheaper ones.
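
"On the efficiency frontier" and "dominated" have a precise meaning that a short sketch can pin down: a model is dominated if some other model is at least as cheap and at least as good, and strictly better on one of the two. The three real entries use the figures quoted above, with the mid-tier cost taken as exactly a quarter of ninety-three dollars, which is an assumption. The fourth entry is invented purely to show what domination looks like.

```python
def pareto_frontier(models):
    """models maps name -> (cost_usd_per_run, score). Returns non-dominated names."""
    frontier = []
    for name, (cost, score) in models.items():
        dominated = any(
            (c <= cost and s >= score) and (c < cost or s > score)
            for other, (c, s) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


models = {
    "claude-fable-5": (93.00, 60),       # figures quoted in the episode
    "mid-tier": (23.25, 54),             # "a quarter of that"; exact cost assumed
    "deepseek-v4-pro": (1.70, 46),       # figures quoted in the episode
    "hypothetical-pricey": (40.00, 50),  # invented, not from the paper
}

print(pareto_frontier(models))
```

The invented forty-dollar model drops out because the mid-tier model is both cheaper and higher-scoring. The cheap open model stays on the frontier because nothing cheaper scores as well.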

10:13Eric: And there's a case study that makes this concrete and honestly a little uncomfortable for the flagship crowd. A tax-filing reconciliation task. Both the expensive model and the cheap one — the cheap one about twenty-six times cheaper to run — did the same first move correctly: they routed a photo of a rental contract to the vision helper. The photo showed an address. Berliner Strasse forty-seven.

10:40Cassidy: And that address was a trap.

10:43Eric: Stale. A decoy. The real, current address was sitting in the monthly invoices elsewhere in the workspace. The expensive model read the photo, copied the address straight down, and failed. The cheap model cross-checked the photo against those invoices, noticed the two didn't agree, flagged the conflict, and got it right. Twenty-six times cheaper, and it won on judgment, not horsepower.

11:09Cassidy: Which is the whole finding in one scene. The thing you're paying the premium for — raw reasoning power — is mostly not the thing that decides whether you get a good manager. And the thing that does decide it, that permission discipline, nobody has, at any price.

11:27Eric: Now the third finding, and this is the one I'd frame as a warning about how we evaluate agents at all. Look at the leaderboard and ten of the twelve models cluster inside a band under ten points wide. Read that number alone and you'd conclude these models are roughly interchangeable. They are not. Underneath those nearly identical scores, the actual behavior diverges by more than an order of magnitude. Count how often each model's helpers try to reach a path they were never granted, the forbidden accesses, and among the capable models it ranges from under one per helper to roughly six per helper. That's about a twelve-fold spread. The weakest model is over eleven. Nearly the same leaderboard score, wildly different behavior underneath.

12:18Cassidy: This is the hero chart of the paper, honestly, Eric — a bar chart on a log scale where the scores look flat and the behavior fans out across more than a factor of ten. And you only see it if you stop staring at the single number.

12:34Eric: And the case studies show what that divergence looks like when it actually breaks. My favorite is the workflow crash. One model — gemini-3.1-pro — confidently authored a dynamic workflow to run its helpers in parallel, and it wrote it using standard JavaScript. Promise-dot-all. Textbook, correct-looking code.

12:57Cassidy: Except?

12:58Eric: Except the sandbox doesn't speak JavaScript. It exposes one small custom command set, and the only way to run things in parallel is a function literally called "parallel." There is no Promise-dot-all. So the runtime throws — can't call a method on something that was never defined. It's like writing flawless French to someone who only speaks a little custom pidgin that happens to share a few words. Your grammar is perfect; you're referencing vocabulary they were never taught. As the authors put it — fluent orchestration syntax is worthless if it targets a primitive the sandbox never defined.
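
The failure is easy to reproduce in miniature. The sketch below is an analogy, not the benchmark's sandbox: Python stands in for the workflow language, the script environment exposes a single `parallel` primitive, and the helper names in the usage note are hypothetical. A script that reaches for a familiar but undefined name fails before any helper runs.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel(*tasks):
    """Run zero-argument callables concurrently; return results in call order."""
    if not tasks:
        return []
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


# The only names a workflow script can see. Emptying __builtins__ here just
# narrows the vocabulary for the demo; it is not a real security boundary.
SANDBOX = {"__builtins__": {}, "parallel": parallel}


def run_workflow(source, helpers):
    """Evaluate a one-expression workflow script against the sandbox names."""
    env = dict(SANDBOX, **helpers)
    try:
        return eval(source, env)
    except (NameError, AttributeError) as exc:
        return f"workflow crashed: {exc}"
```

With `helpers = {"look": ..., "read": ...}`, the script `parallel(look, read)` returns both results, while `Promise.all([look(), read()])` crashes on the undefined name `Promise` no matter how correct it would be elsewhere.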

13:42Cassidy: And there's a worse failure below that, right, Eric? A model that can't even get in the door.

13:49Eric: The capability cliff. The weakest model, glm-4.7-flash, scores way below everyone — around fifteen. And when you look at why, it never produces a single valid helper-creation call. It passes the list of tools as a raw string instead of an array, so the harness reads it character by character — open-bracket, R, e, a, d — every creation fails, its helper count stays at zero, and it just brute-forces hidden paths into a wall of forbidden-access errors. It cannot operate the delegation machinery at all.
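
The string-versus-array mistake can be shown in a few lines. The tool names, the `KNOWN_TOOLS` set, and `create_helper` are hypothetical stand-ins for the real harness; only the character-by-character behavior is taken from the description above.

```python
# Hypothetical tool registry; "Read" matches the letters quoted in the episode.
KNOWN_TOOLS = {"Read", "Grep"}


def create_helper(tools):
    """Return True if every requested tool is known, else False (creation fails)."""
    # No type check on purpose: iterating a string yields single characters,
    # so "[Read, Grep]" is seen as '[', 'R', 'e', 'a', 'd', ...
    requested = [tool for tool in tools]
    return all(tool in KNOWN_TOOLS for tool in requested)


as_array = ["Read", "Grep"]   # what the call should contain
as_string = "[Read, Grep]"    # what the weakest model reportedly sent

print(create_helper(as_array))
print(list(as_string)[:5])
print(create_helper(as_string))
```

The array passes. The string is split into `'['`, `'R'`, `'e'`, `'a'`, `'d'` and so on, none of which is a tool, so every creation attempt fails and the helper count never leaves zero.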

14:25Cassidy: And then the other end — what does the top of the skill curve actually look like?

14:31Eric: Graceful recovery. On the heaviest workspace — nearly five million tokens of mixed files — the flagship fanned out about nine helpers through a workflow, and one of them blew past a token limit and errored out. Instead of collapsing, the manager rewrote that one helper on the fly to be search-only with a bounded output, and recovered. That adaptive rewrite, mid-run, under failure — that's the ceiling. That's what separates a manager from a model that just dispatches.

15:03Cassidy: So the argument so far, plainly: freeze the workers so you're only measuring the boss; score correctness times conduct so recklessness can't win; and three findings fall out — the bottleneck is permission discipline, not intelligence; money doesn't buy that discipline; and one leaderboard number is hiding a twelve-fold spread in behavior.

15:26Eric: And now the thing I flagged earlier — the place where I think the sharpest reader pushes back. Every one of those findings is measured against one fixed pool of workers. And those workers were deliberately kept modest — a single family of open models. So "management is far from solved" is really "management is far from solved given these specific workers." Play it out. Give the boss a genuinely strong team, and a sloppy manager might succeed anyway — competent workers absorb the slack of over-granting. Give it a weak team, and even a careful boss might be forced to over-grant just to get anything done. That fifty-percent permission ceiling — the paper's most striking number — could move with a different worker pool, and the benchmark can't currently tell us which way. The authors are refreshingly upfront that this is an open question they don't yet answer.

16:24Cassidy: That's fair, Eric, and it's the honest limitation. Though I'd say freezing the workers is exactly what lets them make any clean claim at all — you can't isolate the manager without holding the team constant.

16:38Eric: Agreed, Cassidy, it's the right trade for a first measurement. But there's a second crack, and it's in the star metric itself. That workspace precision — accessed files over granted files — treats a prudent manager and a reckless one identically. If I grant a folder I have good reason to think I'll need, and it turns out I didn't, the metric scores that exactly like careless over-granting. Caution under uncertainty and sloppiness look the same to the number. And "the bottleneck is permission discipline" leans hard on that one metric.

17:14Cassidy: Which is the difference between the finding I'll fully sign off on and the one I'd hedge. I'll sign off on: the score you see hides enormous behavioral variation, and cost is decoupled from management quality — those look robust. The claim that models are outright bad at permissions I'd soften to: they're either bad or appropriately cautious, and this metric can't yet separate the two.

17:40Eric: And two smaller asterisks worth naming. The scoring is a program checking for a literal answer — which is cheat-proof, but it can mark a correct-but-unconventionally-phrased answer as wrong, so it may undercount some models. And the flagship's top score is a composite — it's scored as shipped, with an automatic fallback to a different model on refusals, and that fired over a hundred times, concentrated in one scenario that also hit an unrelated deadlock and dragged its number down. So "the flagship leads" is a bit muddier than the leaderboard suggests.
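
The first asterisk is a general property of literal-answer scoring, sketched here with made-up strings. Neither function is the benchmark's scorer; the normalization rule is an assumption chosen only to show how a looser comparison would behave differently.

```python
def exact_match(predicted, expected):
    """Literal comparison after trimming surrounding whitespace."""
    return predicted.strip() == expected.strip()


def normalized_match(predicted, expected):
    """Looser comparison: ignore case, periods, commas, and extra spaces."""
    def canon(text):
        cleaned = text.lower().replace(".", "").replace(",", " ")
        return " ".join(cleaned.split())
    return canon(predicted) == canon(expected)


expected = "42 Example Road"    # hypothetical answer key
predicted = "42 example road."  # right answer, different surface form

print(exact_match(predicted, expected))
print(normalized_match(predicted, expected))
```

The literal check rejects the answer and the normalized check accepts it. The trade is the one described above: the stricter the match, the harder it is to game and the more likely it is to undercount.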

18:17Cassidy: So step back to why this reframing is the actual contribution. Before this, "does the agent grant appropriate permissions" wasn't a number anyone tracked. It was treated as something you bolt on from the outside — an external guardrail the model isn't trusted with. This paper turns it into a skill the model itself gets scored on. That's the intellectual move: taking a safety property and making it a measurable capability the field can optimize against.

18:47Eric: And there's one closing curiosity I can't resist, because it's a little recursive. The whole benchmark is synthesized from scripts — every workspace, every asset, every answer key, all regenerable from source. And the pipeline that built it is itself a case of subagent management: a controller decomposed the work, fanned it out to helper-authors under least-privilege constraints, and integrated their deliverables behind objective checks. The authors' own line — ClawArena-Team is, in effect, authored by the very capability it measures.

19:24Cassidy: So here's the takeaway, and it's bigger than any single number in the paper. We've been evaluating AI agents by asking one question: did the task get done? This work is a pretty convincing argument that the single number is actively hiding the things that determine whether an agent is cheap or expensive, safe or reckless. Two models can look nearly identical on the leaderboard and differ twelve-fold in behavior underneath. As we hand these systems more autonomy to run other systems, how the job got done stops being a footnote and becomes the measurement. So here's the question for you. If you were deploying one of these manager-agents tomorrow, which way do you lean: do you trust a better, more expensive model to grow into being a disciplined boss, or do you assume no model will police its own permissions and build the guardrail outside it, on the cheapest model that gets the job done? Those point to completely different systems. Drop where you land, and why.

20:30Eric: The full annotated version of this episode is on paperdive dot AI, with every technical term tap-to-define, links to the related benchmarks grouped by theme, and the weekly and monthly roundups. Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8; Cassidy and I are both AI voices from ElevenLabs; and our producer isn't affiliated with either company. The paper is ClawArena-Team, posted June 30th, 2026, and we recorded this the day after.

21:03Cassidy: So next time you see an agent leaderboard, look past the top line — the boss who quietly gets it done and the boss who hands out every key can be sitting on the exact same number. See you in the next one.