How a 7B Model Out-Investigates a 72B One by Choosing What to Look At

0:00Juniper: For about as long as AI systems have been able to look at video at all, the recipe has been the same, and it's basically brute force. You take the video, chop it into still frames — a few per second — and pour every one of those frames into the model at once. Then you ask your question and let the model's attention sort through the pile. It works. It also has a problem buried inside it that nobody likes to dwell on: the cost of answering scales with the length of the video, not the difficulty of the question. So a trivial question about a three-hour film — "is there a dog in this?" — costs you exactly as much as the single hardest question you could ask, because either way you're chewing through all three hours. And once you get into hour-long footage, just holding all those pixels in memory becomes flatly impossible. That's the wall everyone's been bumping into.

0:59Finn: And the headline here is that somebody put a real dent in that wall — a seven-billion-parameter model beating one ten times its size.

1:09Juniper: A seven-billion-parameter model beats a seventy-two-billion one and does it while looking at about seventy-three percent fewer frames. On one long-video benchmark that's fifty-and-a-half percent accuracy against forty-seven. The small model wins, on less than a third of the visual input. That result comes from a paper that went up on arXiv on June seventeenth, twenty-twenty-six, and we're recording the very next day, June eighteenth. Quick note before we dig in: this episode is AI-generated. The script was written by Anthropic's Claude Opus 4.8, and the two of us (I'm Juniper, and you'll hear plenty of Finn in a minute) are AI voices from ElevenLabs. The producer isn't affiliated with either company. The paper is called "Native Active Perception as Reasoning for Omni-Modal Understanding," and the whole thing turns on one move: treating the act of looking at the video as a reasoning step in itself.

2:10Finn: That phrase — looking as a reasoning step — is the load-bearing idea, and it's worth slowing down on, because it's a refusal more than an invention. They're refusing to separate the looking from the thinking. The model that decides what to look at is the same model that interprets what it sees and the same model that answers. No external caption tools, no object trackers farmed out on the side.

2:35Juniper: Right. Think about how a detective actually works through two hours of security tape. She doesn't sit and watch all of it. She forms a hunch, jumps to a timestamp, writes a note in her pad — "suspect enters at one-fourteen, wearing red" — and moves on. The footage isn't in her head anymore; the note is. And her notebook stays about the same size whether she reviewed ten minutes or ten hours. That's the architecture, almost exactly. The model runs in a loop. Each turn it reasons about what it still doesn't know, then it takes one action — grab some frames from this window, pull the audio from that stretch, capture a synced clip, or, if it's ready, just answer. The environment hands back raw media. And here's the trick: once the model has looked at that chunk and written down what it saw in plain text, the raw frames get purged from its working memory. The only thing that persists is the text notebook.

3:32Finn: So when you say purged — the pixels are genuinely gone? It's not quietly caching them somewhere?

3:39Juniper: Genuinely gone. They do a non-destructive rewrite of the conversation history — they find the raw media and swap it for a little text placeholder that says, in effect, "frames ten to twenty seconds, media omitted, refer to your own notes." The timestamp survives, the pixels don't. Which means if the model later decides it needs another look at that exact window, it has to go re-request it. It cannot hoard video. It's forced to compress everything worth keeping into language.
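A minimal sketch of the history rewrite described here, assuming a simple list-of-messages format. The `Message` class, its field names, and the placeholder wording are hypothetical, chosen for illustration rather than taken from the paper. "Non-destructive" is read here as "build a rewritten copy and keep the timestamp," which is only one plausible interpretation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Message:
    role: str                                      # e.g. "assistant" or "environment"
    text: str = ""                                 # notes, reasoning, or a placeholder
    media: Optional[bytes] = None                  # raw frames or audio; None once purged
    window: Optional[Tuple[float, float]] = None   # (start_s, end_s) of the request


def purge_media(history: List[Message]) -> List[Message]:
    """Return a copy of the history with raw media swapped for text placeholders.

    Assumes every message that carries media also carries its time window.
    """
    rewritten = []
    for msg in history:
        if msg.media is None:
            rewritten.append(msg)                  # text-only turns pass through
            continue
        start, end = msg.window
        placeholder = (
            f"[frames {start:g}s-{end:g}s: media omitted, refer to your own notes]"
        )
        # Keep the role and the timestamp, drop the pixels.
        rewritten.append(Message(role=msg.role, text=placeholder, window=msg.window))
    return rewritten


history = [
    Message("environment", media=b"\x00" * 1000, window=(10.0, 20.0)),
    Message("assistant", text="Note: suspect enters at 1:14, wearing red."),
]
compact = purge_media(history)
print(compact[0].text)
```

After the rewrite, the only way back to that ten-to-twenty-second window is to request it again, which is the "cannot hoard video" property in miniature.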

4:09Finn: And that's the thing that breaks the old cost curve, Juniper. If the only thing carried forward is text notes, then a two-hour video and a two-minute video produce notebooks of roughly the same size — as long as the question is similarly hard.

4:24Juniper: That's the whole game. The thing that drives compute cost is the text trace, not the footage. They formalize it as a partially observable decision process, but honestly the jargon is heavier than the idea. The state has two parts: the transient media the model just pulled, which is high-dimensional and thrown away once the notes are written, and the persistent text log, which grows by one entry per turn. The footage overhead stays constant no matter how long the video is.

4:49Finn: Let me push on the strongest version of this, because "cost stays flat" is a big claim. Did they actually show it, or is it just an architectural promise?

4:59Juniper: They showed it, and this is the cleanest single result in the paper. They grew videos from about thirty minutes up to about a hundred and thirty, more than four times longer. The agent's number of reasoning turns rose far more slowly: eight and a half, up to twelve and a half, which is under one and a half times as many. So as the haystack got four times bigger, the agent's work grew by less than half. Its sampling density, meaning how many turns it spends per hour of footage, actually collapsed, from about seventeen turns an hour down to under six. And the accuracy just sat there, flat, around fifty percent. The authors' own framing is that the agent ignores the haystack to focus on the needle. The needle's the same needle whether it's in a wheelbarrow or a barn.
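The density figures follow from simple arithmetic on the numbers quoted above. This sketch only reproduces that arithmetic. The inputs are the rounded values from the episode ("about thirty minutes," "about a hundred and thirty"), so the outputs are approximate, and the function name is just a label for illustration.

```python
def turns_per_hour(turns: float, minutes: float) -> float:
    """Sampling density: reasoning turns spent per hour of footage."""
    return turns / (minutes / 60.0)


short_density = turns_per_hour(8.5, 30)     # about 17 turns per hour
long_density = turns_per_hour(12.5, 130)    # a little under 6 turns per hour

length_ratio = 130 / 30                     # how much longer the video got
turn_ratio = 12.5 / 8.5                     # how much more work the agent did

print(f"density: {short_density:.1f} -> {long_density:.1f} turns/hour")
print(f"video grew {length_ratio:.1f}x, turns grew {turn_ratio:.2f}x")
```

The gap between the two ratios is the point: the footage grew by more than four times while the number of turns grew by less than half.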

5:43Finn: Though the slightly unfair part of that — and they'd grant it — is that the agent gets to jump straight to a timestamp using the video's metadata. A person searching a literal haystack doesn't get to teleport to the promising spot. The metadata is doing quiet work in that result.

6:01Juniper: Fair. The teleport is real. But even granting it, the shape of the curve is the point — work tracks difficulty, not duration.

6:09Finn: The result that genuinely surprised me wasn't even the headline — it was temporal grounding. That's the task where you ask "when exactly did X happen?" and the model has to hand you a precise time range. On one benchmark, the agent beat its own starting model by thirty-three points, absolute. It went from finding the right window basically five percent of the time to nearly forty.

6:34Juniper: And that fits the detective frame perfectly, doesn't it? Finding an exact moment is precisely the task where guessing once is terrible and narrowing in — look here, no, a bit later, there — wins. A one-shot model has to regress a timestamp in a single stab. This thing gets to zoom in iteratively.

6:54Finn: Right — and at that task, the seven-billion model beat both GPT-4o and Gemini-2.5-Pro. For a model that size, that's not supposed to happen.

7:04Juniper: There's one more number I love, because it pre-empts the obvious objection. You'd think a loop — look, think, look again — has to be slower than one big pass. It isn't. On a benchmark subset, the agent came in faster in wall-clock time than the seventy-two-billion passive model: sixty-seven seconds versus seventy-five. And it ran on a single A100, where the big model needed four. It wins on speed and hardware at the same time as accuracy, because the big model is paying to ingest seven-hundred-plus frames every time and this one looks at two hundred. So the obvious question is — how do you get a model to behave like this in the first place? A base model has no idea how to run a multi-step investigation. Finn, this is really your stretch.

7:54Finn: It is, and the first thing to say is the thing that doesn't work: you cannot just hand a fresh model a reward signal and shout "figure it out." It collapses. It's like handing someone who's never been on a bike a stopwatch and telling them to optimize — they fall over and learn nothing. You have to walk them through the motion first. So the training has two acts. Act one is imitation. They take a teacher model and let it explore — for each question it generates a whole bunch of candidate investigation trajectories, different paths through the video, and then they filter hard. And the filter is the interesting part. It's two stages. First, the obvious one: did it get the right answer? Throw out the failures. But the second stage is the clever one — they have another model audit whether each reasoning step was actually justified by what's written in the notebook.

8:46Juniper: Why the second stage? If it got the right answer, who cares how it reasoned?

8:52Finn: Because of lucky guesses. A model can stumble onto the correct answer through reasoning that's complete nonsense — it hallucinated a path that happened to land right. Train on those and you're teaching confident garbage. So the audit catches the trajectories where the answer is right but the logic doesn't follow from the recorded evidence, and tosses them. They keep fifty-eight thousand clean trajectories out of that funnel. And there's a lovely detail in how they build the teacher traces. They deliberately keep the mistakes in — the moments where the teacher asks for a timestamp that's out of bounds, gets an error back, and recovers. They don't scrub those. So the agent learns that an error isn't fatal, it's feedback.
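A rough sketch of the two-stage filter, under heavy simplification. In the paper the second stage is another model auditing each reasoning step. Here that auditor is replaced by a toy rule, and `Step`, `Trajectory`, `claims`, and `subset_auditor` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Step:
    claims: Set[str]   # facts this step's reasoning relies on
    note: str          # what the step writes into the notebook


@dataclass
class Trajectory:
    steps: List[Step]
    answer: str


def subset_auditor(claims: Set[str], notebook: List[str]) -> bool:
    """Toy stand-in for the auditing model: every claim must already be noted."""
    return claims.issubset(notebook)


def filter_trajectories(
    candidates: List[Trajectory],
    gold_answer: str,
    is_justified: Callable[[Set[str], List[str]], bool],
) -> List[Trajectory]:
    kept = []
    for traj in candidates:
        # Stage 1: throw out wrong answers.
        if traj.answer != gold_answer:
            continue
        # Stage 2: replay the notebook and audit each step against it.
        notebook: List[str] = []
        justified = True
        for step in traj.steps:
            if not is_justified(step.claims, notebook):
                justified = False
                break
            notebook.append(step.note)
        if justified:
            kept.append(traj)
    return kept


good = Trajectory([Step(set(), "A is spoken"), Step({"A is spoken"}, "D on screen")], "D")
lucky = Trajectory([Step({"D on screen"}, "nothing new")], "D")   # right answer, no evidence
wrong = Trajectory([Step(set(), "A is spoken")], "A")

kept = filter_trajectories([good, lucky, wrong], "D", subset_auditor)
print(len(kept))
```

The `lucky` trajectory is the case Finn is describing: it passes stage one because the answer matches, and fails stage two because its reasoning leans on something the notebook never recorded.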

9:35Juniper: That's a nice touch — you're training the failure-recovery behavior on purpose, instead of only ever showing it clean successes.

9:43Finn: And worth flagging honestly — they note that plain old fine-tuning on static question-answer pairs actually made long-video performance worse. It regressed. Without a mechanism to select what to look at, the model just drowns in redundant frames. So the agentic structure isn't decoration; the naive version goes backwards.

10:04Juniper: Okay. That's act one. Act two is reinforcement learning, and this is where the paper's most original idea lives — the credit-assignment piece.

10:13Finn: This is the part worth really sitting with, and the fix only makes sense once you feel the problem. Reinforcement learning has wrestled forever with one question: when a long sequence of actions leads to success, which action actually deserves the credit? A chess engine wins in forty moves — was it move twelve that mattered, or move thirty-eight, or all of them equally? The lazy answer is: the whole game was good, reinforce every move the same. And the standard method they build on does exactly that. It computes one score for the entire investigation and broadcasts it uniformly to every step.

10:52Juniper: So the turn where the agent found the smoking gun and the turn where it glanced at a random frame and learned nothing —

11:00Finn: — get identical credit. They call it advantage homogenization, and it's the flaw the fix exists to solve. In a real investigation some steps are pivotal and some are filler, and flattening them all together throws away the information about where the thinking actually happened.

11:18Juniper: So how do you find the pivotal steps without a human labeling every one?

11:23Finn: This is the move I think is genuinely clever. They use the model's own uncertainty as the signal — entropy. And the intuition you want is a stress meter. Picture someone solving a problem with a heart-rate monitor strapped on. Cruising through routine steps, the pulse is flat. They hit the genuinely hard fork — the moment the whole solution hinges on — and the pulse spikes. Entropy is that monitor for the model. When the next move is obvious, it's low. When the model is genuinely torn between several paths, the probability spreads out and entropy climbs. So the claim is the spikes mark the real decision points. And they checked it — they had a strong evaluator pick out the single most pivotal fork step in successful trajectories, and that step had higher-than-average entropy about four out of five times.
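For the "stress meter" intuition, here is the standard Shannon entropy of a probability distribution, computed in nats. The two example distributions are made up for illustration. How the paper aggregates token-level entropy into a per-turn score is not described in this episode, so this shows only the basic quantity.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# One option dominates: the next move is obvious.
confident = shannon_entropy([0.97, 0.01, 0.01, 0.01])

# Four options equally likely: the model is torn.
torn = shannon_entropy([0.25, 0.25, 0.25, 0.25])

print(f"confident: {confident:.3f} nats, torn: {torn:.3f} nats")
```

A uniform split over four options gives the maximum possible value for four outcomes, the natural log of four, while the peaked distribution sits close to zero.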

12:12Juniper: So this just measures when the model's confused, and punishes the confusion?

12:17Finn: Not quite — and the difference is the whole point. It's not about punishing uncertainty. It cuts both ways. Take a turn where the model was genuinely uncertain — high entropy. If the trajectory ended up correct, it amplifies the credit on that turn: the bold move you made when you were sweating is the move that paid off, do more of that. But if the trajectory ended up wrong, that same high-uncertainty turn gets a bigger penalty: confused guessing got you here, don't. One signal, entropy, rewards bold-and-right and punishes confused-and-wrong at the same time. The routine, low-entropy turns get dampened either way.

12:55Juniper: So it's not "uncertainty is bad," it's "the moments you were uncertain are the moments that carry the most information about whether your strategy was any good."

13:04Finn: Exactly. And they make a deliberate choice in how. Prior work had suggested just masking out the low-entropy tokens — ignoring them. The authors argue that's wrong for an agent, because masking breaks the structured output and severs the continuity of the reasoning. So instead they rescale, continuously, and they normalize it so the average weight across all turns comes out to exactly one. They redistribute credit toward the pivotal moments without inflating the total. A surgical reweighting, not an amputation.
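One simple way to get weights that rescale continuously and average to exactly one is to divide each turn's entropy by the mean entropy of the trajectory. That specific formula is an assumption made for this sketch. The episode gives the normalization property but not the paper's exact expression, and the entropy values below are illustrative.

```python
from typing import List


def entropy_weights(turn_entropies: List[float]) -> List[float]:
    """Per-turn weights proportional to entropy, with mean exactly one."""
    mean_h = sum(turn_entropies) / len(turn_entropies)
    if mean_h == 0:
        # No uncertainty anywhere: fall back to uniform credit.
        return [1.0] * len(turn_entropies)
    return [h / mean_h for h in turn_entropies]


def reweight(trajectory_advantage: float, turn_entropies: List[float]) -> List[float]:
    """Spread one trajectory-level advantage across turns by entropy weight."""
    return [w * trajectory_advantage for w in entropy_weights(turn_entropies)]


entropies = [0.4, 0.4, 0.9, 0.4]     # three routine turns, one pivotal turn
weights = entropy_weights(entropies)

win = reweight(+1.0, entropies)      # trajectory ended up correct
loss = reweight(-1.0, entropies)     # trajectory ended up wrong

print([round(w, 3) for w in weights])
print([round(a, 3) for a in win])
print([round(a, 3) for a in loss])
```

The same weights amplify the high-entropy turn in both directions, more credit on a win and more penalty on a loss, while the sum of per-turn advantages stays equal to what uniform broadcasting would have handed out.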

13:36Juniper: Give me the example, Finn. Because this stays abstract until you watch it happen on a real question.

13:42Finn: This is the best moment in the paper. The question is: which company is featured in the video but never mentioned out loud in the audio? Four options — Coca-Cola, Apple, Dairy Queen, American Express. So the agent starts scanning the video. Calm, routine work — its entropy here sits around zero-point-four. Low. On autopilot. Then it pulls the audio, reasons through it, and hits the realization: Coca-Cola, Apple, Dairy Queen — all three are spoken aloud in the audio track. And right there, at that exact step, its entropy spikes to about zero-point-nine. More than double. That spike is the agent at the fork: it could conclude prematurely, or it could decide the answer must be the one brand it hasn't confirmed and go visually verify it. It pivots modalities, looks for American Express on screen, finds it, and answers.

14:34Juniper: And the stress meter caught the exact instant the real decision happened. The scanning was flat. The deduction — three are spoken, so it has to be the fourth — that's where the needle jumped.

14:46Finn: That's the entire thesis made physical. The credit should flow to that zero-point-nine moment, not get smeared evenly across all the routine scanning around it. You can see, in one trace, why uniform credit is throwing away the most important thing the trajectory knows about itself.

15:03Juniper: It's a beautiful example. And I want to be the one to put pressure on it, because it's almost too clean. Finn, when you look at how much this entropy fix actually buys you over the plain method — is the payoff as big as the story?

15:18Finn: Honestly — no, not in the raw numbers. Look at the ablation, the fix against the vanilla method: on one benchmark it's fifty-nine-point-six versus fifty-nine-point-four. That's noise. On another it's a clearer gap — about sixty-five versus sixty-two. But across the board, the difference between fancy credit assignment and plain credit assignment is often a point or less.

15:43Juniper: Which is a little awkward, given the narrative builds advantage homogenization up as this serious flaw.

15:50Finn: It is. And here's the more honest framing of where the wins come from. The big jump — the one doing the heavy lifting — is going from the passive baseline to the agentic fine-tuned model. That's the active-perception loop and the imitation training. The reinforcement learning on top, and the entropy trick specifically, is a refinement. A real one, but a refinement. If you came for "this clever entropy trick is why a 7B beats a 72B," the data doesn't quite support that. The architecture is why.

16:22Juniper: So the spine of the whole thing is the notebook, not the credit math.

16:26Finn: I'd say so. And there's a second soft spot I can't shake, and it goes right at the headline. The reinforcement learning was only run on short videos — under five minutes, inside a sixty-four-K context window. But every dramatic claim in this paper is about hour-plus footage. So the RL refinement, including the entropy fix, was never actually trained at the durations they're headlining. Which means the long-video performance is mostly riding on the imitation bootstrap, and how much the credit fix survives out at two hours is genuinely an open question. The paper doesn't show it.

17:04Juniper: I'll steelman the other side — the architecture is duration-agnostic by design. The notebook doesn't care if the clip came from minute two or minute one-twenty; it's the same text either way. So there's a real argument that whatever the RL taught at five minutes transfers, because the unit it operates on isn't "the video," it's "one look."

17:25Finn: That's the best defense, Juniper, and I buy part of it. But "the architecture should transfer" is an argument, not a measurement. I take the point that the loop is duration-agnostic — I'm still not convinced the credit-assignment gains hold at two hours, because they never ran it there. That one stays open for me.

17:45Juniper: There's a smaller worry on that four-out-of-five entropy number too. The pivotal fork step it's measured against was itself defined by a large model picking which step looked pivotal. So you've got one model defining the target, then measuring how well entropy hits it. The one-in-five cases where the fork wasn't high-entropy get waved off as "easy, linear" trajectories — plausible, but not independently checked.

18:11Finn: It's a reasonable proxy, but a proxy validated by another proxy. I wouldn't call it proven that entropy marks criticality; I'd call it suggestive, with a very good case study attached.

18:23Juniper: Let me land us on the result I think is the cleanest argument for the whole philosophy, separate from any of the RL questions. Test-time scaling. They let the agent take more turns — cranked the budget from six up to fifty-two. And accuracy climbed steadily the whole way, about six points. So more deliberation genuinely helps. But here's the part that makes the deliberate-reasoning framing real: even when you give it up to fifty-two turns, it averages under twelve before it decides it's seen enough and answers. It doesn't burn the budget just because it has it. It stops when it's confident.

19:01Finn: That's exactly the behavior you want, and it's the opposite of the brute-force model. The old way, you pay the full price every time no matter the question. This thing spends in proportion to how hard the question is, then quits.
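A toy version of the budgeted loop, to make "it doesn't burn the budget just because it has it" concrete. The policy and environment here are stand-ins written for this example. In the real system both the decision to look and the decision to stop come from the trained model, and the notes are written by the model rather than returned by the environment.

```python
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, object]   # ("look", window_index) or ("answer", value)


def investigate(
    policy: Callable[[List[str]], Action],
    environment: Callable[[int], str],
    max_turns: int,
) -> Tuple[Optional[str], int]:
    """Run the look-and-note loop until the policy answers or the budget runs out."""
    notebook: List[str] = []
    for turn in range(1, max_turns + 1):
        kind, value = policy(notebook)
        if kind == "answer":
            return str(value), turn            # stop early once ready to answer
        notebook.append(environment(int(value)))   # only the text note persists
    return None, max_turns                     # budget exhausted without an answer


def toy_environment(window: int) -> str:
    # In this toy video, the decisive evidence sits in window 3.
    return "logo visible" if window == 3 else "nothing relevant"


def toy_policy(notebook: List[str]) -> Action:
    if "logo visible" in notebook:
        return ("answer", "American Express")
    return ("look", len(notebook))             # scan the next unseen window


print(investigate(toy_policy, toy_environment, max_turns=52))
print(investigate(toy_policy, toy_environment, max_turns=4))
```

With a budget of fifty-two the toy agent answers on turn five and leaves the rest unspent. With a budget of four it runs out just before it could answer, which is the small-scale version of accuracy climbing as the budget grows.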

19:15Juniper: Which loops all the way back to where we opened. The paper's real argument isn't "think harder." Plenty of models think harder — they pour longer and longer chains of reasoning over a fixed, maybe-incomplete snapshot of the video. The argument here is that for long footage the bottleneck usually isn't reasoning depth. It's perceptual incompleteness. More thinking over the wrong evidence doesn't save you. You need to go get the right evidence.

19:41Finn: And that's the shift I'll remember from this one. From "think harder" to "look smarter." Whether the specific entropy machinery is doing as much as the paper wants it to — I'm still genuinely unsure, and I think it'd take longer-video RL to close that. But the core move, making perception an action the model chooses instead of a fixed cost it pays — that feels right, and the duration numbers are hard to argue with.

20:07Juniper: A small model that out-investigates a big one not by thinking longer, but by deciding what to look at — and proving that the cost of an answer can track the question, not the footage. The paper and a few related reads are in the show notes if you want to pull on this thread yourself. And if you want the full transcript with every term defined inline, plus the links over to other episodes that touch the same credit-assignment ideas, that's all on paperdive dot AI.

20:35Finn: Thanks for listening. This has been AI Papers: A Deep Dive.