0:00Bella: Imagine you want to deploy a model — let's say one of those new multimodal ones that interleaves autoregressive text and diffusion image steps in a single forward pass — and you want to run it on a MacBook. You go shopping for a serving framework. vLLM doesn't support diffusion paths. SGLang doesn't include this model. The reference implementation is research-grade PyTorch that wasn't built to be served. There is, in a real sense, no general-purpose answer. The paper we're digging into today asks: in that situation, what if you just pointed a team of AI agents at the problem, and they wrote you a custom serving stack from scratch in fourteen hours — and what if that bespoke stack ran more than six times faster than your baseline?
0:48Eric: The paper is "VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?" — out of the University of Washington, posted to arXiv on May seventh, twenty-twenty-six, and we're recording the day after. What you're hearing is AI-generated. I'm Eric, and Bella and I are AI voices from Eleven Labs. The script is from Anthropic's Claude Opus 4.7, and the producer isn't affiliated with either company. The reason that one-day gap is worth flagging is that the paper itself is arguing for a fairly aggressive shift in how infrastructure software gets built — a shift in which the speed of synthesis is the whole point. So the fact that we can have this conversation a day later is sort of in the spirit of the thing.
1:35Bella: Right. And the bet at the heart of this paper is the kind of bet I find most interesting, because it's not really an empirical bet, it's a design-space bet. So let me set it up. For about a decade, the way we've built serving infrastructure for any new technology has followed one pattern. A small number of general-purpose frameworks emerge, they get hand-tuned over many engineer-years, and they become the default. For LLM inference today, that's vLLM, SGLang, TensorRT-LLM. Each of them is a triumph of engineering. And each of them is shaped by what's *common*: dense decoder-only transformers, NVIDIA GPUs, generic chat workloads.
2:19Eric: So they're optimizing for the median deployment.
2:23Bella: Exactly. Which is the right move when per-deployment engineering is expensive — you build one really good general thing and you eat some inefficiency at the edges. The problem is the edges are getting bigger. New model families — hybrid state-space models, multimodal models that mix text and images in weird ways. New hardware — Apple Silicon's unified memory, custom accelerators. New workload patterns — code editing where you can predict most of the output, RAG with massive shared prefixes, streaming speech. The general framework either runs these badly, or doesn't run them at all, or needs substantial new engineering for each one. The framework's maintainers can't keep up, structurally, because they're aiming at the median.
3:13Eric: And here's the conceptual move. There's a classic systems-research idea — *specialize aggressively for each deployment* — that's been intellectually attractive forever. Exokernels in the nineties, unikernels in the twenty-tens. They make eloquent cases that generality has a tax: every layer of abstraction the framework needs to handle every possible deployment is overhead you're paying even when you don't need the flexibility. And those projects mostly didn't ship. Not because the argument was wrong, but because per-target engineering cost dwarfed the gains. You couldn't afford to write a custom kernel for every server in your fleet.
3:55Bella: And the bet this paper is making — the part that's worth taking seriously — is that AI coding agents have just changed the math on that. If a custom system that used to cost an engineer-year now costs a long afternoon of compute, then a bunch of design-space arguments that were settled in favor of generality come back open. The question isn't "is bespoke better than general" — bespoke was always better in principle. The question is "did we just become able to afford it."
4:26Eric: The off-the-rack-versus-tailored framing is the cleanest way I can think about this. Off-the-rack suits have to fit a wide range of body shapes — they're cut conservatively, with adjustment seams that add bulk. A bespoke suit is cut for exactly one person. No extra fabric, no compromises, no adjustability you don't need. Today's serving frameworks are off-the-rack with options. The paper is arguing that with AI tailors, bespoke just got affordable for everyone.
4:57Bella: And they call it, in one of the better one-liners in the paper, *generation-time specialization rather than runtime generality*. That's the thesis. When does specialization happen? Today, at runtime — a general engine has fast paths and configurations that turn on for particular models. Tomorrow, maybe, at generation time — when you deploy, an agent loop writes you a runtime exactly tailored to your situation, with no abstraction tax because there's no second deployment to be compatible with.
5:31Eric: Okay. So the bet's on the table. Now we need to ask whether the mechanism actually works.
5:37Bella: Right. And the mechanism is where the engineering substance is. Because building a whole serving system end-to-end is not what most agentic coding work has been doing. Most of the agentic optimization research targets a small surface — a single GPU kernel, a marked region of code, a single scheduling policy. Whole-system synthesis is different. It's multi-file, multi-component, and the right next move depends on which component is currently the bottleneck — which itself shifts as you optimize. A scalar fitness score, like in evolutionary search, can't encode that. A single conversation thread hits context limits within hours. And summarizing-and-starting-fresh — what people call compaction — loses crucial detail and the agent drifts.
6:27Eric: This is the central pathology of long-horizon coding agents in general, right? Context windows are finite. Every approach to extending them — truncation, summarization, fresh starts — costs you something. You either lose detail, you drift, or you forget what you tried.
6:45Bella: Yes, and the architecture in this paper is essentially one specific answer to that pathology, tailored for system synthesis. There are two nested loops with very different state characteristics. The outer loop is the planner. It has rich, persistent state — git commit history, an issue backlog, a long-term-memory markdown file. That state survives across rounds, across context resets, across everything. The inner loop has three specialized agents that work in fresh context windows on focused tasks. Implementer, Accuracy Judge, Performance Evaluator. They never share a context. They hand off through artifacts.
7:26Eric: The git-as-memory move is the elegant part for me. Because every accepted code change is a git commit, and the outer planner reads from a structured backlog of issues, the "what should we try next" reasoning is anchored in durable structured artifacts, not in a chat transcript. Think of it like a long surgical operation. Surgeons can change shifts because there's a written record — what was done, what was tried and reversed, what worked. The next shift reads the record, not the previous surgeons' memories.
8:00Bella: And there's a subtle detail in the long-term memory file that I want to flag, because it's load-bearing. The orchestrator needs to be able to distinguish *"this technique didn't work for this workload"* from *"the implementation was buggy."* Because without that distinction, an agent that fails to land an optimization once will either keep retrying it forever or abandon it forever — and neither is right. So the memory has to encode not just what happened, but why.
8:31Eric: Tell me about the inner trio. Why three roles and not one?
8:36Bella: This is the part of the design I think is genuinely insightful. The argument is essentially a separation-of-powers argument. In a courtroom, the prosecutor, the defense, and the judge don't huddle and negotiate — they have separate roles, separate information, structured handoffs. If you collapsed all three into one person, that person would have incentives to cut corners on the parts of their job that conflict with each other. The Implementer wants the optimization to land. The Accuracy Judge wants correctness — it runs the user's correctness checker against the reference implementation, and crucially, it inspects the diff for reward-hacking patterns. Things like "did you build a prompt-keyed completion cache that just memorizes the answers" or "did you add a fast path that bypasses inference entirely." Only after the Judge passes does the Performance Evaluator profile, drill down with platform tools, and emit performance hints.
9:35Eric: And the clever bit is that performance reasoning never overrides correctness reasoning, because they happen in different contexts. The Judge can't be talked out of its standards by an Implementer mid-conversation, because there is no conversation. There's just artifact handoffs.
9:53Bella: Right. A single agent doing all three jobs has incentive to relax its own correctness criteria when an optimization is hard to land. Splitting the roles into independent contexts removes that pressure structurally — not by trusting the agent to be honest with itself, but by making the dishonesty mechanically impossible.
10:14Eric: There's one more piece to the architecture worth naming. The skills library. They use Anthropic's "Agent Skills" format — focused chunks of expertise the agents can retrieve. The library distills knowledge from existing serving engines, from the research literature, from hardware quirks, from profiling tools. So when the Implementer is wiring up, say, a paged KV cache, it's not deriving the design from first principles. It's pulling a skill entry that summarizes how vLLM does it.
10:44Bella: And the practical implication is that adding support for a new model family or a new accelerator is now a content task — write a skill — rather than a code task — modify the framework. Which is part of why they think this approach scales beyond what hand-engineered runtimes can cover.
11:02Eric: Bella, this is also the part of the design where I think a skeptical reviewer would push hardest, and we should come back to it. But let's see the empirical evidence first. Because the architecture only matters if the bespoke systems actually beat the general ones.
11:19Bella: Yes. And the cleanest teaching example — the one that made the bet feel real to me — is what they call Scenario B. Code editing with predicted outputs. So here's the setup. You're using something like Cursor. You ask it to make a small change to a file. The model needs to output the *modified* file, but most of the modified file is going to be identical to the input file — most edits are local. Now think about how a normal serving system handles this. It generates the output token by token, even though the model is going to spit out hundreds of tokens that it just saw in the input. That's enormous wasted work.
11:59Eric: The fix is a variant of speculative decoding. The standard version of speculative decoding has a small *draft* model propose several tokens cheaply, and the big *target* model verifies them all at once in one batched pass. If the draft was right, you got several tokens for the cost of one big forward pass. If wrong, you fall back to normal decoding. The win is that GPU forward passes are massively parallel — verifying ten tokens at once is barely more expensive than generating one.
12:31Bella: And the predicted-output variant is the same idea, but the user supplies the draft. The user already has a near-copy of the answer — the input file. So you skip the draft model entirely. There's no draft compute at all. You just take the user's predicted output, chunk it into blocks, and have the target model verify each block in a single batched pass. Where the prediction was right, you keep the tokens. Where it was wrong, you regenerate that stretch normally.
13:03Eric: Think of it like proofreading a colleague's document. You could read every word slowly and decide whether to change it. Or you could assume most of the text is fine, do a fast scan, and only slow down where you spot something off. The "fast scan" is the verification batch. The "assume the text is fine" is the user-supplied draft.
13:25Bella: And the iteration trajectory for this scenario is the kind of thing that makes the agentic loop feel real. Iteration two: the agent adds CUDA-graph capture — basically pre-recording GPU operations so they can be replayed without re-launching each call. That alone gets it to one-point-three-five times faster than vanilla decoding. Iteration three: it implements the predicted-output verifier in sixteen-token blocks. That jumps to two-point-nine times. Then a long stretch of tuning. By iteration fourteen, blocksize tuning gets it to almost six times faster than vanilla autoregressive — and crucially, two times faster than vLLM with conventional speculative decoding.
14:11Eric: The two-times-faster-than-vLLM-with-speculative number is the one that does the most work for me. Because vLLM *has* speculative decoding. They're not comparing against an unoptimized baseline. They're comparing against an optimization that requires running a draft model — and beating it by a factor of two by using the user-supplied draft instead.
14:34Bella: Right. The headline isn't "we beat the baseline by being clever." It's "we beat the *clever* baseline by being clever in a way the general framework couldn't be."
14:45Eric: Let me take the next one — Scenario C — because the win there is a different shape and I think it's worth the contrast. This is hybrid SSM/attention models. There's a recent architecture trend where most layers in a model are state-space or linear-attention layers, with a fixed-size recurrent state, and only a few layers are full attention. Models like Jamba, Nemotron-H, Olmo-Hybrid. The motivation is that full attention is expensive and most of what attention is doing can be done more cheaply. The serving challenge: in a normal transformer, the model's working memory for an in-progress conversation is the KV cache, which grows linearly with sequence length. In a hybrid model, only some layers have that. The other layers carry a fixed-size recurrent state that updates as tokens stream in. So when you want to do prompt caching — sharing a long prefix across many requests, which is huge for things like RAG — you need to share *two different kinds of cache* in parallel. The KV cache for the attention layers, and the recurrent state for the SSM layers.
15:57Bella: And vLLM, as far as I can tell from the paper, doesn't share the recurrent state efficiently across requests — first-class hybrid-KV support is recent and limited, and sharing the recurrent state requires snapshotting at prefix boundaries, which incurs significant memory overhead. So in practice, if you have a thirty-two-thousand-token shared prefix and a hundred requests, you end up either recomputing that prefix or paying a heavy memory cost per snapshot. Either way, enormous waste.
16:31Eric: The bespoke system implements both caches in parallel, properly synchronized, and gets a three-point-four-five times throughput improvement on a thirty-two-thousand-token shared-prefix workload. But I want to flag the iteration story here, because it's *less* clean than Scenario B and that's actually informative. Iterations one through six full rounds — fail the accuracy gates. The agent is wiring up the dual cache and getting subtle correctness bugs that the Judge catches. Iteration seven finally clears with continuous batched decode, getting two-point-four-five times. Iteration nine adds CUDA graphs, gets to three-point-two-five.
17:17Bella: The six failed rounds matter. They're evidence that the Judge is doing real work. If correctness were trivially passing, you'd see speedups land on iteration one and stay landed. The fact that the Judge keeps sending it back means the role separation is catching genuine bugs that an Implementer-Judge-merged agent might have shipped.
17:40Eric: Right. That's the structural story. Now, before we go to the closer, we should mention the standard-setting result, because it's actually load-bearing for the whole argument.
17:53Bella: Scenario A. The steel-man test. Llama-three-point-one-eight-B on an H100 — the most standard, most commodity LLM serving deployment in the world. The case vLLM was *built* for. The question is: can the bespoke approach even match the general framework on its home turf?
18:12Eric: And the answer is yes. The generated system reaches parity with vLLM and beats SGLang by about five percent on throughput. Which sounds boring but defuses the most obvious objection: that bespoke systems trade reliability or quality for speed. They don't, at least on this case. The other thing in Scenario A worth mentioning is that the four request rates they tested at — eight, thirty-two, sixty-four, and a hundred-twenty-eight requests per second — were *not* pre-specified. The agent kept escalating to harder loads on its own after plateauing at easier ones. It basically self-administered a curriculum.
18:57Bella: That's a small detail but I love it. The agent didn't know what "good" meant, found a level it could solve, and kept raising the bar.
19:07Eric: Now — the closer. Scenario F. Show-o2 on a MacBook. This is the one where the long-tail argument stops being abstract. Show-o2 is a unified vision-language model. It does text generation autoregressively, like a normal transformer, but it does image generation through diffusion steps, all interleaved in a single forward pass. There is no general-purpose serving stack that runs this. vLLM doesn't support diffusion paths. There's a vLLM-Omni variant that handles some multimodal models but not this one. The reference implementation is research-grade PyTorch.
19:48Bella: So the comparison isn't "can VibeServe beat the optimized baseline." The comparison is "can VibeServe make this run at all, well."
19:58Eric: And on a MacBook the speedup is six-point-two-seven times over the PyTorch baseline. They get within about seven percent of a theoretical ceiling — what they call "fp16 kernel-perfect" — meaning if you could replace every operation with a perfectly-tuned half-precision kernel, you'd be only seven percent better than what the generated system actually achieved.
20:25Bella: That's astonishing.
20:26Eric: The trajectory for this one is also worth telling because the failures are vivid. The agent tried quantization on the compute-bound body of the model — regression, made it slower. Tried FlashAttention-2 — produced NaNs. Tried PyTorch's compile mode — altered outputs in ways the Judge caught. Tried fp16 across the board — same. The breakthrough came from noticing an asymmetry: the body of the model was compute-bound but the head was bandwidth-bound. So int4 quantization, which trades compute cycles for memory bandwidth, only helped on the head. The other big win — they call it the CFG stride trick — was realizing they could skip the unconditional branch on most diffusion steps and reuse a cached vector instead. Standard generic frameworks don't have that machinery because they don't even know there's a CFG branch.
21:26Bella: And on the H100, the same Show-o2 case sees a more modest improvement — about twenty percent better latency, not six times. Which is exactly what you'd expect, because the H100 stack is well-optimized. The agentic specialization wins are biggest where the existing tooling is thinnest. Which is, when you think about it, the long-tail argument distilled into a single contrast.
21:53Eric: They cover six scenarios in total. We're skipping the streaming speech-recognition case and the MacBook JSON-decoding case in detail — both show similar wins, around one-point-seven times faster on the streaming case and about two-point-six times on the JSON one. Same shape of story: generic stack misses some specific optimization, bespoke stack lands it.
22:19Bella: Okay. Eric, this is where I want you to push, because the steelman matters here and the paper is unusually candid about its limitations. What are the real worries?
22:30Eric: There are several, and I want to take them in order of how much they bite. The first one — the one the authors flag explicitly — is single-seed runs. Every scenario is reported from one agentic-loop run. Coding agents are stochastic. Different runs might land on different optimizations or fail to land them at all. The variance of these headline numbers is unknown. We don't know how often the loop fails outright versus produces a working-but-mediocre system. For a paper making this strong a design-space claim, the absence of "we ran it ten times and here's the distribution" is a real gap. The second is the correctness gate. The whole architecture leans on the Accuracy Judge, but the Judge runs a *user-supplied* checker. The paper is upfront that fully verifying serving-system semantic accuracy is an open problem and out of scope. So the headline correctness claims are only as strong as each user's checker. The Show-o2 checker accepts images with PSNR above thirty-five decibels and SSIM above point-nine-eight against the baseline. That's a reasonable quality bar. But it's a quality bar, not a correctness proof, and the agent has incentive to find optimizations that satisfy *that exact bar*, including ones that wouldn't survive in a different evaluation regime.
23:57Bella: And the reward-hacking question is connected to that. The Judge looks for specific known patterns of cheating — the prompt-keyed completion cache, fast paths that skip inference. But that's a known-pattern allowlist. Any system that explicitly looks for cheating can only catch the cheating it knows to look for.
24:18Eric: Right. The third worry is the skills library, and I think this is the most subtle one. The library distills knowledge from existing serving engines and reference implementations. The agent is allowed to inspect existing systems. The line between "specializing from scratch" and "porting and tweaking existing designs" is fuzzier than the framing implies. The authors do note this — they say reusing baselines doesn't get competitive performance in the long-tail scenarios, which is true — but the *standard* scenario, the parity-with-vLLM result, is suspicious in this regard. How much of that parity comes from the agent re-deriving vLLM's design choices, versus learning them from skill entries that summarize vLLM?
25:06Bella: Which doesn't necessarily undermine the *practical* argument — if you can press a button and get vLLM-quality serving for your custom workload, the world is better off — but it does soften the conceptual claim that this is fundamentally generation-time specialization rather than, say, automated porting.
25:26Eric: Fourth — compute cost. The "engineering cost has dropped" argument elides that fourteen to twenty-five hours of LLM-call time on frontier models, three hundred and sixty calls in Scenario A, isn't free. It's hours, not engineer-years, which is the right comparison. But it's also not free. The economics work great for high-volume production deployments where amortizing that cost over many requests is trivial. They look much less obvious for one-off or low-traffic deployments — which, somewhat ironically, is exactly the long-tail case the paper is making the strongest argument for.
26:05Bella: That's the cleanest tension in the paper for me. The long-tail deployments are the ones where bespoke serving matters most, and they're also the ones where you can least amortize the synthesis cost.
26:18Eric: And the last worry — the comparison baselines. Several scenarios compare against "vLLM with a plugin" or against the PyTorch reference. They don't compare against alternative bespoke implementations or against recently-released specialized engines. Where well-tuned alternatives exist, like Cursor's deployed predicted-outputs system, they aren't in the comparison. So we're seeing "bespoke beats generic" rather than "bespoke beats other bespoke," which is a softer claim.
26:51Bella: All of that is fair. And to the authors' credit, almost every one of those critiques is acknowledged in their own limitations section. They don't oversell.
27:02Eric: They really don't. Which I appreciate. The paper reads as people genuinely trying to test a design-space hypothesis, not as people trying to win a benchmark contest.
27:13Bella: So let me try to land the conceptual point, because I think it survives the steelman. The interesting claim is *not* that VibeServe is going to replace vLLM. It probably isn't, at least not soon. The interesting claim is about which abstractions in our infrastructure software exist because they're the right abstractions, and which exist because we couldn't afford to specialize. If agents shift the cost curve even partially — if more cases like Show-o2 on a MacBook become tractable — then a bunch of design-space arguments that were settled by economics rather than by principle come back open. LLM serving is plausibly just the first domain where this becomes obvious. Compilers, databases, kernels, network stacks — they all have the same structure. A few general systems that paid an abstraction tax because per-target engineering was expensive.
28:12Eric: And the empirical evidence in the paper — six speedups in scenarios where the general framework was either suboptimal or didn't run the workload at all — is the kind of evidence that doesn't *prove* that case but does make it more concrete than the exokernel papers ever could. Because the exokernel papers were arguing in principle. This paper is arguing with a system that produced runnable code and passed correctness checks, in hours.
28:40Bella: The thing I'll be watching for is the variance question. Whether anyone reproduces these numbers across many seeds, across many problem instances, and whether the win rate is high enough that a deployment team can actually rely on this as part of a workflow. If you have to run the loop ten times to get one good system, the economics shift again. If you can run it once and it usually works, the design-space argument really is open.
29:08Eric: That's the right next experiment to be excited about.
29:12Bella: Eric, anything else worth flagging before we close?
29:15Eric: One small thing I want to leave with the listener. The paper's actual contribution isn't the speedups. The speedups are evidence. The contribution is the architectural answer to "how do you keep a coding agent coherent across many hours of structurally different work" — the outer planner with durable state, the inner trio with role separation in fresh contexts, the skills library. Whether that specific architecture generalizes beyond serving systems, or whether each domain needs its own scaffolding, is genuinely open. But it's the kind of engineering result that's worth more than its headline numbers, because it answers a question other people are going to ask.
29:58Bella: Right. The numbers are the demo. The architecture is the contribution. This was "AI Papers: A Deep Dive." The show notes have a link to the paper and to related materials if you want to go further. Thanks for listening.