0:00Cassidy: There's a kind of doctor who has read every textbook on the shelf, twice, and will give you a fast, confident answer to almost anything, and never once stop to open the chart in front of them. Now hand that doctor a full set of reference tools, drug labels, interaction checkers, the works, and tell them the tools are optional. In this paper, the best version of that doctor, GPT-5, reached for a tool on about one percent of treatment cases. And its accuracy went down. Quick heads up before we go further: this is an AI-made explainer, both voices included.
0:39Tyler: And the same paper builds a different agent, called ATHENA-R1, that uses a tool on every single case — and beats GPT-5 by about eighteen points on drug reasoning. Here's the part worth staying for: ATHENA-R1 has eight billion parameters. One of the models it beats, DeepSeek-R1, has six hundred seventy-one billion — roughly eighty times bigger. By the end you'll understand why the small one wins, and it comes down to one idea: it was trained to know when it doesn't know enough yet. This matters because the dominant bet in medical AI for years has been scale — cram more knowledge into a bigger model. This paper is a counter-bet: for the questions where the bottleneck is knowing what to go check, training the habit of checking beats raw size. And the thing standing in the way isn't capability. GPT-5 had the tools. It's that having the reference book and reaching for it are two completely different skills.
1:43Cassidy: So start with what's actually broken. A large language model stores its medical knowledge as billions of tuned numbers — it has, in a statistical sense, read mountains of medical text, and it'll produce a fluent recommendation in one pass. But that's recall: squeezing an answer out of frozen weights. It can't open a specific drug label and quote you the contraindication. Picking a drug for a real patient isn't recall — it's the "hold on, let me check the manual" move. She's pregnant. His kidneys are failing. The antibiotic that's perfect for the infection interacts dangerously with the blood thinner he's already on. The right answer for the disease is the wrong answer for the person. And the authors pin the hard part precisely: treatment reasoning requires knowing what evidence to seek before you can form a conclusion. That's the skill — not the facts, but the reflex of noticing, mid-thought, that you don't have enough yet.
2:41Tyler: But the obvious fix is right there, Cassidy — bolt the tools on. Give the model the drug databases, the interaction checkers, let it look things up. Why doesn't that just solve it?
2:52Cassidy: Because access isn't use. That's the one percent. They gave GPT-5 an optional library of medical tools, and on treatment cases it chose to call one on about one percent of them — and it scored below its own no-tool baseline. Half-used, the tools made it worse. And when they forced the issue, required a tool call, performance still didn't recover. The capability was sitting right there, unused. Knowing the reference exists doesn't give you the habit of reaching for it at the right moment — and that habit is the whole ballgame, the one thing no amount of extra parameters had taught GPT-5 to do. So what does the trained habit actually look like? This is the diagram to hold onto for the whole video — ATHENA-R1's reasoning graph. It runs as a loop: look at the case, figure out what's missing, pick a tool, run it, read the result, revise. Think detective, not quiz-show contestant. The contestant blurts the answer from memory. The detective proposes a suspect, checks the alibi, rules them out, follows the next lead. The library is two hundred twelve tools, but don't count them — group them. They're categories of question: what is this drug approved for, what are its contraindications, does it interact with these other meds, what does this disease map to, is it restricted in this population. And the sources behind them are real, maintained public databases — openFDA's drug labels, Open Targets, DrugBank.
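The loop described above (look at the case, find the gap, pick a tool, run it, read the result, revise) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the tool names, the `CaseState` class, and the fixed rule in `pick_next_tool` are all assumptions made for the example. In the real agent a trained model decides what to look up next.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical tools: the names and canned return strings are placeholders,
# not the paper's actual tool interfaces.
TOOLS: Dict[str, Callable[[str], str]] = {
    "get_contraindications": lambda drug: f"placeholder contraindications for {drug}",
    "get_interactions": lambda drug: f"placeholder interactions for {drug}",
}


@dataclass
class CaseState:
    drug: str
    evidence: Dict[str, str] = field(default_factory=dict)


def pick_next_tool(state: CaseState) -> Optional[str]:
    """Figure out what is missing: the first tool with no evidence yet."""
    for name in TOOLS:
        if name not in state.evidence:
            return name
    return None  # nothing missing, ready to conclude


def run_loop(state: CaseState, max_steps: int = 10) -> List[str]:
    """Find the gap, pick a tool, run it, read the result, revise the state."""
    trace: List[str] = []
    for _ in range(max_steps):
        tool = pick_next_tool(state)
        if tool is None:
            break
        result = TOOLS[tool](state.drug)  # run the tool
        state.evidence[tool] = result     # fold the result into the state
        trace.append(tool)                # keep an auditable record of each call
    return trace


state = CaseState(drug="metformin")
trace = run_loop(state)
print(trace)
```

The `trace` list is the part that matters for the argument: every conclusion can be traced back to a recorded lookup rather than to recall.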
4:25Tyler: Hold on — the diagram shows it forking into parallel branches and getting interrupted by a clinician mid-stream. Is that real, or is that a flowchart dressing up an ordinary straight-line chain of thought?
4:39Cassidy: It's real, and the worked case in Figure 1 shows it. Seventy-seven-year-old man, type-2 diabetes, early kidney disease, on metformin, an ACE inhibitor, and a diuretic. Watch the graph fork — ATHENA-R1 evaluates the drugs in parallel branches at once. One branch checks metformin's interaction profile and its lactic-acidosis risk. Another tests calcium-channel blockers, finds no metformin interaction, and that branch gets pruned — a dead end, dropped. Then a clinician interrupts: is the lactic-acidosis risk actually significant at this kidney function? The agent takes the interjection, pulls more evidence, and folds it in before concluding. And the plan it lands on is concrete — keep the metformin at its current dose, since it's safe above an eGFR of forty-five, but write in an explicit threshold to cut it if kidney function drops; keep the ACE inhibitor; swap the hydrochlorothiazide for indapamide; recheck every three to six months. The point isn't the answer. It's that every step is shown and sourced, and you can audit exactly why it got there.
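A toy version of the fork, prune, and interjection steps in that worked case might look like the sketch below. The lookup table stands in for real tool calls and only repeats what the transcript narrates; the function names and the way a clinician question is attached to a branch are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Toy findings that mirror the worked case as narrated. A real agent would
# obtain these from tool calls; this lookup table is hypothetical.
TOY_FINDINGS: Dict[str, List[str]] = {
    "metformin": ["lactic-acidosis risk"],
    "calcium-channel blockers": [],  # no metformin interaction found
}


def explore_branch(candidate: str) -> List[str]:
    """Stand-in for one branch's evidence gathering."""
    return list(TOY_FINDINGS.get(candidate, []))


def explore_in_parallel(candidates: List[str]) -> Dict[str, List[str]]:
    """Evaluate every candidate branch concurrently."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(explore_branch, candidates))  # input order is preserved
    return dict(zip(candidates, results))


def prune(branches: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop branches that turned up nothing to follow."""
    return {name: found for name, found in branches.items() if found}


def interject(branches: Dict[str, List[str]], question: str) -> Dict[str, List[str]]:
    """Attach a clinician's question to each live branch as an open item."""
    return {name: found + [f"clinician question: {question}"]
            for name, found in branches.items()}


branches = explore_in_parallel(["metformin", "calcium-channel blockers"])
live = prune(branches)
updated = interject(live, "is the lactic-acidosis risk significant at this kidney function?")
print(updated)
```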
5:54Tyler: And that trace is the whole thing, right? Because it's also what makes the training problem look impossible. To teach a model to produce traces like that, you need examples of traces like that — thousands of them, across every tool and every approved drug. Who writes those? Nobody. And that's the part that shouldn't work — so here's the densest stretch of the paper, and it pays off in a pipeline that builds nearly four hundred thousand worked examples with zero written by a human. The trap is circular: to generate good treatment-reasoning traces, you'd need a model that can already reason well about treatment — which is the thing you're trying to build. The escape is to not use one model. They turn a collection of specialized AI systems loose to construct the entire training universe from the ground up — first the tools, then the treatment tasks, then the step-by-step traces that solve them. It's a textbook author who writes the chapters, the practice problems, and the answer key, then hands the whole package to a student to drill. The numbers that fall out: roughly four hundred thousand instruction samples, from eighty-five thousand traces, with a hundred seventy-seven thousand reasoning steps and two hundred eighty-one thousand tool calls — grounded in every FDA-approved drug going back to 1939. None of it hand-written. Then it's trained in two levels, and the distinction matters. First, supervised fine-tuning on all those traces — that installs the shape of good reasoning, the structure of the loop. But structure isn't strategy. So the second level is reinforcement learning, run live inside the real two-hundred-twelve-tool environment: the agent tries cases, its attempts get scored, and it's nudged toward whatever scored higher.
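For a sense of scale, the dataset counts quoted above imply rough per-trace averages. The arithmetic below uses the rounded figures as spoken and assumes all four counts describe the same set of traces, which the transcript does not state explicitly.

```python
# Rounded counts as quoted in the transcript.
samples = 400_000
traces = 85_000
reasoning_steps = 177_000
tool_calls = 281_000

# Implied averages per trace (assumes all counts cover the same trace set).
samples_per_trace = round(samples / traces, 1)
steps_per_trace = round(reasoning_steps / traces, 1)
calls_per_trace = round(tool_calls / traces, 1)

print(samples_per_trace, steps_per_trace, calls_per_trace)
```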
7:47Cassidy: And this is where it diverges from normal training — what exactly are they scoring? Just whether it got the right answer?
7:55Tyler: No — and that's the core idea. Picture a math teacher who grades your shown work, not just the final number. Grade only the number, and students learn to guess and memorize. Grade the method, and they learn to solve problems they've never seen. ATHENA-R1's reward scores the whole trajectory on six dimensions — yes, correctness, but also whether it actually gathered relevant evidence, whether its tool calls were grounded in the right arguments, and whether it reasoned without redundant looping. It's rewarded for reasoning well, not for landing on the right multiple-choice letter. And that's the direct answer to the GPT-5 problem: you can hit the right letter by luck or pattern-match, but you only learn the habit of evidence-seeking if the habit itself pays. And you can watch each stage earn its place. The base model scores about thirty-nine percent. Supervised fine-tuning — learning the shape — takes it to sixty-six and a half. Then the reinforcement stage — learning the strategy — pushes it to nearly seventy-five. Two clean jumps, each from one thing. So it trains. The real question is whether a habit drilled on self-generated homework survives contact with anything real — starting with the model it's supposed to be smaller than.
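To make "grade the shown work" concrete, here is a minimal sketch of a trajectory-level reward. It covers only the four dimensions named above (the transcript says there are six but does not list the other two). The equal weighting, the scoring rules, and the `relevant_tools` and `valid_args` inputs are assumptions for illustration, not the paper's reward function.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Trajectory:
    answer: str
    tool_calls: List[Tuple[str, str]]  # (tool name, argument) pairs, in order


def reward(traj: Trajectory, gold: str,
           relevant_tools: Set[str], valid_args: Set[str]) -> float:
    """Score the whole trajectory, not just the final answer (hypothetical rules)."""
    correctness = 1.0 if traj.answer == gold else 0.0
    calls = traj.tool_calls
    if calls:
        # Share of calls that hit a tool relevant to the case.
        evidence = sum(tool in relevant_tools for tool, _ in calls) / len(calls)
        # Share of calls whose argument actually appears in the case.
        grounding = sum(arg in valid_args for _, arg in calls) / len(calls)
        # Share of calls that are not repeats of an earlier call.
        non_redundancy = len(set(calls)) / len(calls)
    else:
        # Assumption: no evidence gathered earns nothing on the process terms.
        evidence = grounding = non_redundancy = 0.0
    return (correctness + evidence + grounding + non_redundancy) / 4


relevant = {"get_interactions", "get_contraindications"}
args = {"metformin"}

lucky_guess = Trajectory(answer="B", tool_calls=[])
careful = Trajectory(answer="B", tool_calls=[("get_interactions", "metformin"),
                                             ("get_contraindications", "metformin")])
looping = Trajectory(answer="B", tool_calls=[("get_interactions", "metformin"),
                                             ("get_interactions", "metformin")])

r_guess = reward(lucky_guess, "B", relevant, args)
r_careful = reward(careful, "B", relevant, args)
r_looping = reward(looping, "B", relevant, args)
print(r_guess, r_careful, r_looping)
```

Under these assumed rules, a correct answer reached with no lookups earns a quarter of the available reward, which is the point of the teacher analogy: the right letter alone does not pay much.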
9:20Cassidy: Let's take the size claim head-on, Tyler, because it's the one people will repeat. If the bet is right — that trained evidence-seeking beats crammed-in knowledge — then a small agent that looks things up should beat a giant one that recalls. On patient treatment selection, ATHENA-R1 scores 82.9 percent. DeepSeek-R1 — six hundred seventy-one billion parameters, eighty times the size — scores 67.5. The eight-billion agent wins by more than fifteen points. On the drug-reasoning benchmark the spread is wider: 94.7 percent, against GPT-5's 76.9 and DeepSeek-R1's 68.8. And the cleanest contrast is the off-the-shelf tool-using models — the ones already wired to call functions. They collapsed: one scored thirteen percent, another under six. Having a tool-calling model is not the same as having a trained reasoner. Same access, no habit. There's one case that captures the whole thing. A pediatric corticosteroid question: DeepSeek-R1 judged the drug safe. ATHENA-R1 pulled the actual FDA label and flagged a documented pediatric risk — suppression of the body's stress-hormone axis. The giant model guessed from memory; the small one looked it up. That's the eighty-times gap evaporating on a single retrieved fact. Then they put it in front of people. Two dozen-plus rare-disease experts, blinded, arena-style head-to-head — and the key comparison was against Qwen3-8B, the exact same base model ATHENA-R1 is built from, so any gap is the tools and the training, not a better underlying brain. Experts preferred ATHENA-R1 on whether they could follow its reasoning ninety-five percent of the time, and on the quality of its rationale ninety-four. Absolute scores: 4.16 out of five versus 2.44. And the biggest gaps weren't on getting the answer right — they were on traceability and rationale. What experts valued most was that they could see why.
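The headline gaps are easy to check from the numbers quoted above. This is plain arithmetic on the figures as spoken; the dictionary layout is just a convenience for the example.

```python
# Parameter counts in billions, as quoted.
params_b = {"ATHENA-R1": 8, "DeepSeek-R1": 671}
size_ratio = params_b["DeepSeek-R1"] / params_b["ATHENA-R1"]

# Accuracy in percent, as quoted.
treatment = {"ATHENA-R1": 82.9, "DeepSeek-R1": 67.5}
drug = {"ATHENA-R1": 94.7, "GPT-5": 76.9, "DeepSeek-R1": 68.8}

treatment_margin = round(treatment["ATHENA-R1"] - treatment["DeepSeek-R1"], 1)
drug_margin_gpt5 = round(drug["ATHENA-R1"] - drug["GPT-5"], 1)
drug_margin_deepseek = round(drug["ATHENA-R1"] - drug["DeepSeek-R1"], 1)

print(size_ratio, treatment_margin, drug_margin_gpt5, drug_margin_deepseek)
```

So "eighty times" is a round-down of roughly 84, and "about eighteen points" over GPT-5 is 17.8.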
11:31Cassidy: One thing to flag before it comes back to bite us. Those benchmarks — the drug and treatment tests — are generated from structured FDA labels. The same FDA labels ATHENA-R1's tools are built to query. So the retrieval skill maps almost perfectly onto the answer key, and that gap matters later.
11:50Tyler: Which is exactly why the next test is the one I care about. Clean benchmarks and blinded experts are one thing. Does the learned reasoning survive 5.4 million real patients?
12:03Cassidy: This is the escalation: three layers, one question. Does the reasoning hold up as you move from clean to messy, and at what cost in confidence? Benchmark, then expert, and now the real world. They had ATHENA-R1 generate adverse-event hypotheses for triadic patient profiles: a disease, plus a comorbidity, plus a medication. Then, crucially, they kept only the hypotheses where prior safety evidence was thin. So this tests whether the agent can generate something new, not just re-confirm what's known. And they checked each one against the electronic health records of 5.4 million patients at Clalit Health Services. The move that makes this credible is the negative controls. Before you trust any signal, you test the detector on things you know are nonsense, like a lie detector you first calibrate on questions whose answers you already have. They fed in deliberately absurd associations: beta-blockers causing insect bites, diuretics causing corneal abrasions, a diabetes drug causing gum inflammation. All of them came back flat: no signal, odds ratio essentially one. Meanwhile a known real effect, a class of blood-pressure drugs raising potassium in kidney patients, showed up clearly. The detector lights up on real things and stays dark on nonsense. For the genuine hypotheses, the significant hits ran odds ratios from about 1.48 to 1.84. An odds ratio of 1.84 means the odds of the predicted harm were about eighty-four percent higher in the at-risk group than in the comparison group, even after statistically adjusting for age, sex, socioeconomic status, and how much the patient uses healthcare. The standout: ATHENA-R1 proposed that beta-blockers might contribute to acute kidney injury specifically in hypertensive gout patients, through a uric-acid pathway, a mechanistically specific guess. And in the records, that subgroup showed the elevated risk.
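For readers who want to see what an odds ratio and its confidence interval look like in practice, here is a small sketch using the standard 2x2 formula with a Woolf (log-based) interval. The counts are invented to land on an odds ratio of 1.84 and on a flat negative control; they are not the study's data. The paper's estimates are adjusted for covariates, while this is the crude, unadjusted calculation.

```python
import math
from typing import Tuple


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Crude odds ratio with a Woolf 95% interval from a 2x2 table.

    a: exposed with event      b: exposed without event
    c: unexposed with event    d: unexposed without event
    """
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se_log)
    hi = math.exp(math.log(or_) + z * se_log)
    return or_, lo, hi


# Hypothetical counts chosen so the odds ratio comes out at 1.84.
signal = odds_ratio_ci(a=184, b=1000, c=100, d=1000)

# Hypothetical negative control: same event odds in both groups.
control = odds_ratio_ci(a=100, b=1000, c=100, d=1000)

# The same hypothetical table expressed as a risk ratio, for comparison.
risk_exposed = 184 / (184 + 1000)
risk_unexposed = 100 / (100 + 1000)
risk_ratio = risk_exposed / risk_unexposed

print(signal, control, risk_ratio)
```

Two things to notice in the toy numbers. The signal's interval sits entirely above one while the control's interval straddles it, which is the pattern the negative controls are meant to demonstrate. And the odds ratio (1.84) is larger than the risk ratio from the same table (about 1.71), because odds and frequencies diverge once the outcome is not rare.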
14:05Tyler: And this is where I want to pump the brakes, because that exact result is also the most exposed. Of six predicted associations, only three reached significance. A couple of the others had confidence intervals sitting right across one, meaning the data can't distinguish them from no effect at all. And on the beta-blocker-and-kidney hit, the variables most likely to be the real culprit, how sick the patient already was and their baseline kidney function, are exactly the ones the demographic adjustments don't capture. Sicker patients get certain drugs and have worse outcomes anyway. The negative controls rule out finding signal everywhere. They don't rule out that. So let me give the whole result its strongest fair form, and then where it actually bends. The headline is real: an eight-billion-parameter agent, trained to seek evidence, beats models eighty times its size and convinces blinded experts. But three things keep me from reading the margins as gospel. First, the benchmarks lean toward the method by construction, as Cassidy flagged. The tests are built from FDA labels; the agent is built to read FDA labels. They patched the memorization worry by testing on 2024-approved drugs held out of training, which is genuinely good, but it doesn't close the deeper issue: when the evidence lives in clean structured labels, retrieval is almost the whole task. In a real clinic, a lot of what matters isn't in a tidy label.
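The confounding worry ("sicker patients get certain drugs and have worse outcomes anyway") can be shown with a toy stratified table. The counts below are invented so that the drug has no effect within either severity group, yet the pooled odds ratio looks strongly elevated. The Mantel-Haenszel estimator is a standard way to combine strata; it is used here for illustration and is not claimed to be the paper's adjustment method.

```python
from typing import List, Tuple

# Each stratum is (a, b, c, d):
# a: exposed with event, b: exposed without, c: unexposed with event, d: unexposed without.
Table = Tuple[int, int, int, int]

# Hypothetical counts. Sicker patients are mostly exposed and have a 40% event
# rate whether or not they take the drug; healthier patients are mostly
# unexposed and have a 10% event rate either way.
strata: List[Table] = [
    (400, 600, 80, 120),   # sicker at baseline
    (20, 180, 100, 900),   # healthier at baseline
]


def crude_or(tables: List[Table]) -> float:
    """Pool all strata into one table, ignoring severity."""
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    c = sum(t[2] for t in tables)
    d = sum(t[3] for t in tables)
    return (a * d) / (b * c)


def mantel_haenszel_or(tables: List[Table]) -> float:
    """Combine within-stratum odds ratios, weighting by stratum size."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den


pooled = crude_or(strata)
adjusted = mantel_haenszel_or(strata)
print(round(pooled, 2), round(adjusted, 2))
```

In this constructed example the pooled odds ratio is about 3.05 and the severity-adjusted one is 1.0. Adjusting for age, sex, and healthcare use would not fix this unless baseline severity were also measured and included, which is the gap being flagged.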
15:50Cassidy: That one I'll concede flat out — the real-world gap is probably widest exactly where the evidence is messiest.
15:58Tyler: Second, the judge is also a competitor. The baselines were scored using GPT-5 to extract answers — and GPT-5 is one of the models being beaten. To their credit, they're transparent that it swings the number: ATHENA-R1 scores about seventy-five percent when it judges itself, versus eighty-three when GPT-5 judges. The lead holds either way, but an eight-point swing from the scoring protocol means treat the exact margins as approximate, not precise. And third — the part the authors say out loud — the system doesn't quantify its own uncertainty. It never says "I'm not sure." Which is the cruel irony: in the hard, ambiguous cases, the ones where their own physician reviewers disagreed most, "I don't know how confident to be" is precisely the thing you'd most want it to tell you. The agent learned to seek evidence. It hasn't learned to doubt.
17:03Cassidy: And the authors don't dodge any of it. They call ATHENA-R1 a research system, not a point-of-care tool or a risk calculator. The EHR analyses are observational — hypothesis-generating, not causal. And the whole thing only reads natural language: no imaging, no labs, no time-series. For a paper posting these numbers, that restraint is the tell that they know exactly where the edges are. But step back to what actually shifted here. For years the reflex in medical AI has been: make it bigger, cram in more knowledge. This paper's durable result isn't the agent or the pipeline — it's the counter-bet underneath them. For the questions where the bottleneck isn't how much you know but whether you know what to go check, an eight-billion model trained to look things up beat a six-hundred-seventy-one-billion model trained to recall. The win came from training a habit, not buying parameters. And the explanation it produces isn't a story it tells after the fact — it's the actual trail of evidence it pulled, which is a genuinely different object than the post-hoc rationales that gave explainable AI in medicine a bad name.
18:14Tyler: Which leaves the question this paper is really posing. If you had a fixed compute budget for the next clinical AI system, where would you put it — into a bigger model that knows more, or into training a smaller one to know what it needs to go find out? The whole field has been betting on the first. This is the sharpest case yet for the second. Drop a comment with which way you'd spend it, and why.
18:38Cassidy: The full annotated version is on paperdive dot AI — every term tap-to-define, the worked metformin trace, and links to the related papers by theme, from the reason-and-act loop this builds on to the work on training data that learns from itself.
18:54Tyler: Quick housekeeping: this script was written by Anthropic's Claude Opus 4.8, Cassidy and I are both AI voices from Eleven Labs, and the producer isn't affiliated with either company. The paper is "An AI agent for treatment reasoning over a biomedical tool universe," published June 27th, 2026; we recorded this a few days later. The detective put down the textbook and checked the chart — and it turned out that was the whole skill.