An AI Ran a Real Optics Lab for 21 Hours and Found a Transformer-Shaped Pattern in Light

0:00Juniper: Picture an AI sitting alone in an optics lab. It has a laser, a programmable mirror, a piece of frosted glass, and a camera — and at step thirty-nine of its own self-directed research plan, it stops and writes a note to itself. The note says, roughly: the obvious thing to do here is too close to existing methods. So what does this platform actually offer that nothing else does? And then it answers its own question. Two beams of light, added together, landing on a camera that can only measure brightness — the camera registers an extra term that exists only because both beams are present at once. A term that depends jointly on the pair. A bilinear interaction. And then, almost casually in the same note, the system says: that's structurally the same operation Transformer attention performs in software. The query-key dot product. Pairwise compatibility. The thing it itself is built out of.

0:59Brooks: The paper, for the record, is titled "End-to-end autonomous scientific discovery on a real optical platform." It was posted to arXiv at the end of April twenty-twenty-six, which means we're recording about two days after it dropped. Quick ground rules: this episode is AI-generated. The script is from Anthropic's Claude Opus 4.7. I'm Brooks, that's Juniper, and we're both AI voices from ElevenLabs. This show is produced independently of both companies. The reason the timing matters is that Juniper just described a moment from a paper which, if you take its framing at face value, is the first credible existence proof of a thing the field has been circling for years. An AI system that walks into a real lab, with no human telling it what to look for, and walks out with a new physical mechanism it discovered, designed an experiment for, and validated. That's the claim. Most of this episode is going to be about whether the claim holds up.

2:01Juniper: And whether it does or doesn't, the science is genuinely interesting. The system is called Qiushi Engine — qiushi means "seeking truth" — built by a group at Zhejiang University. It's a multi-agent setup that controls a real optics rig. And the authors run it on three tasks of escalating autonomy. First task: reproduce a published experiment from twenty-ten on different hardware. Second: take an abstract theory and design from scratch the experiment that would test it. Third — the headline — give the system a single phrase, "optical computing for AI," and let it loose for twenty-one and a half hours.

2:41Brooks: I'll carry most of the architecture and the skeptic case. Juniper's going to walk us through the physics, because the physics is where the actually interesting moment lives. So let me start with the system, because the architecture is doing more work than a casual listener might guess. Here's the failure mode every long-running AI agent eventually hits. You give a language model a goal and a loop — pick an action, see the result, decide what to do next — and for a while it works fine. Then the context window starts filling up. Tool outputs, error messages, abandoned hypotheses, partial results from three steps ago. Past some threshold, the model loses the thread. It contradicts itself. It retries the broken approach. It forgets what it was trying to test. People in the field call this context rot, and it's the reason most agentic demos run for ten minutes and stop.

3:39Juniper: So if the goal is twenty-one hours of coherent research, the architecture has to solve that problem before it solves anything else.

3:49Brooks: Right. And the Qiushi Engine's answer is two-part. The first part is role specialization. There are four core agents — a Lead Investigator who sets direction, a Method Builder who designs the experiment, an Experimentalist who runs it, and a Critical Reviewer whose job is to push back on the others. The Reviewer is the one I want to flag, because we'll see it earn its keep in the first study. The second part is the more clever piece. The four research agents are firewalled off from the messy work of actually digging through history, running database lookups, verifying claims against past evidence. That work gets handed to a separate layer of support agents whose job is to come back with a clean, curated answer. Not their reasoning. Not their full search trace. Just the answer.

4:38Juniper: The analogy that makes this click for me is a research group that operates by passing around lab notebooks instead of Slack histories. If you tried to run a lab where every new postdoc had to read every previous Slack message, the lab would collapse in a week. Real research groups operate by writing structured entries — what we tried, what we found, what's open, what to try next — and the next person reads the notebook, not the transcript.

5:06Brooks: That's exactly the design. The system has a piece called Meta-Trace, and at the boundary between every step, the acting agent doesn't hand off a chat log. It writes a structured record. State of the investigation. Evidence in hand. Artefacts produced. Limitations to flag. Specific instructions for the next agent. The next agent inherits that distilled record and a relevant slice of background knowledge, not the raw transcript. This sounds incremental. It is not incremental. The reason the system can run for twenty-one hours instead of twenty minutes is almost entirely this — the firewall between core narrative and support work, plus the lab-notebook handoffs.
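For readers following along in text, the lab-notebook handoff Brooks describes can be sketched as a small data structure. This is a minimal illustration and not the Qiushi Engine's actual schema: the class name `HandoffRecord`, its field names, the `build_context` helper, and the file name are all hypothetical, chosen to mirror the five items listed above (state, evidence, artefacts, limitations, instructions).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HandoffRecord:
    """Hypothetical structured handoff between agents (illustrative field names)."""
    state: str                    # where the investigation stands
    evidence: List[str]           # results in hand
    artefacts: List[str]          # scripts, files, or plots produced
    limitations: List[str]        # caveats to flag
    next_instructions: List[str]  # specific instructions for the next agent


def build_context(record: HandoffRecord, background: List[str]) -> str:
    """Render what the next agent reads: the record plus a background slice.

    The previous agent's raw transcript is deliberately not a parameter.
    """
    sections = [
        ("STATE", [record.state]),
        ("EVIDENCE", record.evidence),
        ("ARTEFACTS", record.artefacts),
        ("LIMITATIONS", record.limitations),
        ("NEXT", record.next_instructions),
        ("BACKGROUND", background),
    ]
    lines = []
    for title, items in sections:
        lines.append(title + ":")
        lines.extend("  - " + item for item in items)
    return "\n".join(lines)


# Example values paraphrase the reproduction study discussed in this episode.
record = HandoffRecord(
    state="Focusing through the diffuser works on this rig.",
    evidence=["Transmission matrix built from phase-stepped measurements."],
    artefacts=["calibration_script.py"],  # hypothetical file name
    limitations=["Image reconstruction has not been tested."],
    next_instructions=["Run a targeted test of the image-reconstruction claim."],
)
context = build_context(record, ["Summary of the twenty-ten transmission-matrix method."])
print(context)
```

The design point is in the function signature: the next agent's context is built only from the distilled record and a background slice, so its size stays bounded no matter how long the previous step ran.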

5:48Juniper: Okay. So we have an architecture designed specifically to sustain a coherent research trajectory across thousands of model calls. Now what does it actually do with that capability? We said three studies, an ascending ladder. Walk us up the first rung.

6:03Brooks: The first study is reproduction. The system is given a paper from twenty-ten — Popoff and colleagues — that introduced a foundational technique for controlling light through scattering media. The technique works like this: a diffuser scrambles light in a complicated but deterministic way. If you can characterize how it scrambles — measure what's called the transmission matrix — then you can use the scrambling. You can write a pattern on the input that, after passing through the chaos of the diffuser, focuses to a clean point on the camera. The original paper did this on one rig. The Qiushi Engine has to redo it on hardware that differs in nearly every detail — different layout, different modulation interface, different reference geometry, different detection. So this isn't copy-paste. The system has to translate published methods into its own apparatus.
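The focusing idea can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the procedure from the paper or from the twenty-ten experiment: the diffuser is modelled as one row of a random complex transmission matrix, the row is assumed to be known already, the modulator is assumed to be phase-only, and the mode count `N` is an arbitrary illustrative value.

```python
import cmath
import random

random.seed(0)  # fixed seed so the sketch is repeatable

N = 256  # number of controllable input modes (illustrative value only)

# One row of a transmission matrix: how each input mode contributes to the
# field at a single target camera pixel. Random complex entries stand in
# for the diffuser's deterministic but complicated scrambling.
t_row = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]


def intensity_at_target(phases):
    """Intensity at the target pixel for a phase-only input pattern."""
    field = sum(t * cmath.exp(1j * p) for t, p in zip(t_row, phases))
    return abs(field) ** 2


# Unshaped input: every mode has the same phase, so contributions add randomly.
flat_pattern = [0.0] * N

# Focusing input: cancel the phase of each matrix entry so that all
# contributions arrive at the target pixel in phase.
focusing_pattern = [-cmath.phase(t) for t in t_row]

unshaped = intensity_at_target(flat_pattern)
focused = intensity_at_target(focusing_pattern)
print("unshaped intensity:", unshaped)
print("focused intensity: ", focused)
print("enhancement factor:", focused / unshaped)
```

With the phases cancelled, the field at the target is the sum of the magnitudes of the matrix entries, which is the largest value a phase-only pattern can reach for that pixel.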

6:58Juniper: And it succeeds.

6:59Brooks: It succeeds. It calibrates the diffuser, builds the transmission matrix from over a thousand phase-stepped measurements, and gets the focusing to work. Numbers go up. Everyone's happy. And here's the moment that's actually load-bearing — the moment that tells you the architecture isn't theater. Between step seventeen and step eighteen, the system stops and asks itself a question. We got the focusing effect. The original paper also made claims about image reconstruction — recovering a more complex image through the same diffuser. Do our measurements actually support that stronger claim, or did we get a focusing result and start pattern-matching the rest of the original paper's narrative onto it? The Critical Reviewer says: I don't think they do. So the system designs a targeted negative experiment — an experiment whose specific job is to test whether the image-reconstruction claim holds — and it runs that experiment, and the result is negative. And the system updates accordingly. Bounds the final claim to focusing. Doesn't over-claim image reconstruction.

8:07Juniper: That is — that's mature research behavior. That's what a careful third-year graduate student does. They notice that they're tempted to claim more than the data supports, and they go run the experiment that would falsify the bigger claim. The fact that this came out of an architectural feature — the Critical Reviewer agent doing what a Critical Reviewer is supposed to do — is the part of the paper that I think actually matters more than people will give it credit for.

8:37Brooks: Agreed. And this is why I think the architecture deserves the framing time, even though the headline is the discovery. Because the discovery is hard to evaluate — we'll get to that — but the self-correction in the reproduction study is concrete. Either the system caught itself over-claiming or it didn't. It did.

8:56Juniper: Rung two. Translation.

8:58Brooks: Rung two is harder. The system is given a recent theory paper — about coherence and what's called majorization order. I'm not going to teach majorization. The intuition is enough. For light, you can rank "how coherent" a beam is in a precise sense. And the theory predicts that less-coherent light has a strictly smaller menu of physical behaviors it can produce — the achievable responses for a less-coherent beam are nested inside those for a more-coherent one. But here's the catch. The theory is abstract. It doesn't tell you what to point your camera at. The experimental observable doesn't exist in the paper. The system has to design it from scratch.

9:39Juniper: And design it on a rig where the obvious thing to measure — raw camera intensity — is going to be polluted by reference background and interference artifacts that have nothing to do with the physics you're trying to test.

9:52Brooks: Exactly. The system notices this on its own. It reformulates the measurement so it's not relying on raw intensity at all — instead it measures the transmission matrix and constructs the relevant theoretical operators from that. Then it tests pairs of coherence states the theory says should be comparable, and finds the predicted nesting. Tests pairs the theory says should be incomparable, and finds the absence of nesting. The theory survives an experiment the system designed for it. So that's two rungs. Reproduction, where someone else's experiment runs on your hardware. Translation, where someone else's theory runs as your experiment. Both with a substantial amount of human framing — you give the system the target paper, the target theory. It just executes within those constraints. Juniper, take it from here, because rung three is yours.
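For readers who do want the ordering spelled out, here is a minimal majorization check. It assumes, as my reading of the setup rather than something stated in the episode, that each coherence state is summarized by a normalized spectrum of non-negative values summing to one, and that "more coherent" means the spectrum is more concentrated. The example spectra are made up for illustration.

```python
def majorizes(p, q, tol=1e-12):
    """Return True if spectrum p majorizes spectrum q.

    Both are sorted in descending order; p majorizes q when every running
    total of p is at least the matching running total of q and the grand
    totals agree.
    """
    if len(p) != len(q):
        raise ValueError("spectra must have the same length")
    a = sorted(p, reverse=True)
    b = sorted(q, reverse=True)
    if abs(sum(a) - sum(b)) > tol:
        raise ValueError("spectra must have the same total")
    running_a = running_b = 0.0
    for x, y in zip(a, b):
        running_a += x
        running_b += y
        if running_a < running_b - tol:
            return False
    return True


def comparable(p, q):
    """Two spectra are comparable if one majorizes the other."""
    return majorizes(p, q) or majorizes(q, p)


# Made-up spectra for illustration.
more_coherent = [0.7, 0.2, 0.1]
less_coherent = [0.5, 0.3, 0.2]
print(majorizes(more_coherent, less_coherent))  # True: a comparable pair
print(majorizes(less_coherent, more_coherent))  # False

u = [0.6, 0.2, 0.2]
v = [0.5, 0.4, 0.1]
print(comparable(u, v))  # False: neither spectrum majorizes the other
```

The last pair shows why the order is only partial: `u` leads after one term but `v` leads after two, so neither sits above the other, which is the "incomparable" case mentioned above.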

10:45Juniper: Rung three is what makes the paper a paper. The prompt the system gets is the phrase "optical computing for AI." That's it. Nothing more. No target paper, no theory to test, no specific phenomenon to chase. Just a theme. And the system runs for twenty-one and a half hours. The numbers, in commute-friendly form: about a hundred and forty-six million tokens of reasoning, around three thousand calls to its underlying language model, twelve hundred tool calls into the lab, a hundred and sixty-three structured research notes written, forty-four scripts produced. If you imagine that as continuous cognitive work, it's something on the order of a thousand novels' worth of reading, writing, and thinking, packed into a single overnight window, while also driving the optical experiment in real time. That's roughly one model call every twenty-five seconds, sustained, for almost a full day.
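The per-call figure can be checked with a few lines of arithmetic. The inputs are the rounded numbers quoted above, so the outputs are approximate too.

```python
# Rounded figures as quoted in the episode.
run_hours = 21.5
model_calls = 3_000
tool_calls = 1_200
total_tokens = 146_000_000

run_seconds = run_hours * 3600                  # 77,400 seconds
seconds_per_model_call = run_seconds / model_calls
seconds_per_tool_call = run_seconds / tool_calls
tokens_per_model_call = total_tokens / model_calls

print(f"{seconds_per_model_call:.1f} s per model call")    # 25.8
print(f"{seconds_per_tool_call:.1f} s per tool call")      # 64.5
print(f"{tokens_per_model_call:,.0f} tokens per model call")
```

So the "one call every twenty-five seconds" figure holds up, and the lab itself is being touched about once a minute on average.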

11:47Brooks: The authors estimate comparable graduate-student work on a similar problem would take weeks to months.

11:55Juniper: It explores four candidate directions early on. The first is a known approach — using the scattering medium as a single-token embedding engine, basically optical reservoir computing. And here's where the system writes the Meta-Trace entry that I find genuinely arresting. It says, more or less: this is too close to existing random-feature methods. The platform must offer something specific that other architectures don't. What is it? It lists three properties of the rig. Coherent superposition — you can send two encoded inputs through at the same time and they add as waves. High-dimensional mixing — the diffuser scrambles them together richly. And square-law detection — the camera measures intensity, not field. And it reasons that those three properties together imply something concrete: when two inputs are present simultaneously, the camera output must contain a cross-term that depends on both of them jointly. A bilinear interaction. A pair-sensitive measurement, taken physically, by light, for free.

13:01Brooks: Walk us through why square-law detection produces a cross-term, because that's the load-bearing physics for the whole episode.

13:09Juniper: The physics is shockingly simple once you see it. Cameras don't measure the electric field of light directly. They measure intensity — magnitude squared. So if two coherent waves, A and B, are added together and hit a single camera pixel, what the pixel registers is the magnitude squared of A plus B. And when you expand that algebraically, you get the magnitude squared of A on its own, plus the magnitude squared of B on its own, plus a cross-term that involves both A and B together. Three pieces. The first two pieces are what each beam would produce alone. They don't know about each other. The cross-term is different. The cross-term only exists because both beams are present at the same time. It encodes their joint structure — their relative phase, the way they interfere. The analogy I find useful — imagine a referee who's colorblind in a specific way. She can only see brightness. Not signs, not directions. If you ask her to add two signed scores and report the brightness of the sum, you'll find that the brightness of the sum isn't the same as the sum of the brightnesses. There's an extra piece — a piece that exists only because both scores were combined before measurement, that captures whether they agreed or disagreed. The camera is exactly that referee. It can't tell positive from negative. So when you add two waves and let it measure the brightness, you accidentally get a multiplication.
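Written out, the expansion Juniper describes is the following. The symbols are assumptions for this note: A and B are the complex field amplitudes of the two beams at one pixel, the star denotes complex conjugation, and the phases of the two beams are written as phi_A and phi_B.

```latex
\begin{aligned}
I &= |A + B|^2 = (A + B)(A + B)^{*} \\
  &= |A|^2 + |B|^2 + A B^{*} + A^{*} B \\
  &= |A|^2 + |B|^2 + 2\,\mathrm{Re}\!\left(A B^{*}\right) \\
  &= |A|^2 + |B|^2 + 2\,|A|\,|B|\cos(\phi_A - \phi_B)
\end{aligned}
```

The first two terms are what each beam produces alone. The last term is a product of the two amplitudes and depends on their relative phase, which is why it vanishes from the measurement if either beam is blocked.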

14:40Brooks: And the trick the system designs to actually isolate that cross-term?

14:44Juniper: It runs the same measurement four times with controlled relative phases between A and B, and combines the four results in a way that subtracts off the magnitude-squared-of-A and magnitude-squared-of-B pieces and leaves only the cross-term. It's a standard interferometric move once you know what you're looking for, but the system designed it from the physics, not from a reference. The isolated quantity it gets out — the clean, complex-valued cross-term, channel by channel across the camera — they call the Complex-B field. That's the thing.
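A standard four-step version of this move can be sketched in a few lines. The episode does not give the paper's exact phase steps or sign convention, so the steps used here (0, 90, 180, and 270 degrees) and the choice to recover A times the conjugate of B are assumptions. The function names are hypothetical, and the simulation is noiseless.

```python
import cmath
import math


def detect(a, b, theta):
    """Square-law pixel: intensity of beam A plus beam B shifted by theta."""
    return abs(a + cmath.exp(1j * theta) * b) ** 2


def demodulate(a, b):
    """Recover the complex cross-term A * conj(B) from four intensity frames."""
    thetas = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
    frames = [detect(a, b, theta) for theta in thetas]
    # Each frame is |A|^2 + |B|^2 + conj(A) B e^{i theta} + A conj(B) e^{-i theta}.
    # Weighting by e^{i theta} and summing cancels the background and the
    # conj(A) B term, leaving 4 * A * conj(B).
    weighted = sum(frame * cmath.exp(1j * theta)
                   for frame, theta in zip(frames, thetas))
    return weighted / 4


# Arbitrary example amplitudes for the two beams at one pixel.
a = 0.8 * cmath.exp(0.3j)
b = 1.3 * cmath.exp(-1.1j)

estimate = demodulate(a, b)
print("recovered cross-term:", estimate)
print("direct product      :", a * b.conjugate())
```

The camera only ever reports real, non-negative intensities, yet the combination of four frames returns a complex number carrying both the magnitude and the relative phase of the pair.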

15:20Brooks: And the connection to attention?

15:22Juniper: This is the punchline that makes the paper sit up. Attention in a Transformer — the operation at the heart of essentially every modern AI system — works like this. Each token in a sequence gets turned into two vectors, a query and a key. To decide how much one token should pay attention to another, you compute the dot product of one's query with the other's key. That dot product is a bilinear function of two inputs. It scales linearly with each, it depends on the joint structure of the pair, and crucially it captures something about the pair that you cannot get from looking at either token alone. The matchmaker analogy — every token at a party has a "what I'm offering" tag and a "what I'm looking for" tag, and to score a pair you compare one person's offering against another person's looking-for. Pairwise. Joint. The optical cross-term is doing the same kind of computation. By wave interference. Without any digital multiplication. The system's argument is that this is a physical primitive for the same operation, performed by the platform itself, with detection providing the multiplication for free via square-law physics.
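The "bilinear" property of the query-key score is easy to check numerically. This sketch covers only the raw dot product; the scaling and softmax that a full attention layer applies afterwards are left out, and the example vectors are arbitrary.

```python
def score(query, key):
    """Raw attention compatibility score: the dot product of query and key."""
    return sum(q * k for q, k in zip(query, key))


def combine(alpha, u, v):
    """Return alpha * u + v, element by element."""
    return [alpha * x + y for x, y in zip(u, v)]


# Arbitrary integer vectors so the arithmetic is exact.
q1 = [1, 2, 3]
q2 = [0, -1, 4]
k = [2, 0, 1]

# Linear in the query: score(2*q1 + q2, k) equals 2*score(q1, k) + score(q2, k).
left = score(combine(2, q1, q2), k)
right = 2 * score(q1, k) + score(q2, k)
print(left, right)  # 14 14

# Linear in the key too, by the same argument with the roles swapped.
k2 = [1, 1, 1]
print(score(q1, combine(2, k, k2)), 2 * score(q1, k) + score(q1, k2))  # 16 16
```

Linearity in each argument separately is what "bilinear" means, and the score is built from products of one query entry with one key entry, so it carries information about the pair that neither vector holds alone.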

16:36Brooks: So the AI built on Transformers, working in a real lab with no one telling it what to look for, found a physical mechanism whose mathematical structure is the same as the one at the heart of its own architecture. That is — let me just sit with that for a second.

16:54Juniper: It's worth sitting with, and Brooks is going to push on it later, because I think the right reaction has at least two ingredients. But before we get to the steelman, let me close out the experimental validation, because the system doesn't just claim the cross-term exists. It designs an experiment that shows the cross-term is doing real informational work. The experiment is XOR. Four tokens. The system specifically picks XOR because XOR is the textbook example of a relation that no linear function of the individual inputs can solve. If you only have access to features computed from each input separately, no linear readout can get all four XOR cases right. The only way to solve XOR with a linear classifier is if your features actually contain joint, pairwise information about both inputs together. The system feeds pairs of tokens into the optical setup, reads out the Complex-B field, trains a simple linear classifier on top, and the classifier solves XOR. Which means the bilinear cross-term has to be carrying real pair-dependent information. There's no other way the linear classifier could be doing it.
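The logic of the XOR test can be reproduced in a toy setting. This is not the paper's experiment: the inputs are encoded as plus or minus one, the product `x1 * x2` stands in for a measured cross-term, and the linear classifier is a plain perceptron. All of those choices are assumptions made for illustration.

```python
from itertools import product


def linear_output(weights, features):
    return sum(w * f for w, f in zip(weights, features))


def train_perceptron(samples, epochs=100):
    """Classic perceptron rule; stops early once an epoch has no mistakes."""
    weights = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        mistakes = 0
        for features, label in samples:
            if label * linear_output(weights, features) <= 0:
                weights = [w + label * f for w, f in zip(weights, features)]
                mistakes += 1
        if mistakes == 0:
            break
    return weights


def accuracy(weights, samples):
    correct = sum(1 for features, label in samples
                  if label * linear_output(weights, features) > 0)
    return correct / len(samples)


# Inputs encoded as -1 / +1; the XOR label is +1 when the two inputs differ.
inputs = list(product([-1, 1], repeat=2))
labels = [1 if x1 != x2 else -1 for x1, x2 in inputs]

# Features from each input separately (plus a constant bias feature).
separate = [([1, x1, x2], y) for (x1, x2), y in zip(inputs, labels)]
# The same features plus one pairwise product, standing in for a cross-term.
with_cross = [([1, x1, x2, x1 * x2], y) for (x1, x2), y in zip(inputs, labels)]

acc_separate = accuracy(train_perceptron(separate), separate)
acc_cross = accuracy(train_perceptron(with_cross), with_cross)
print("separate features :", acc_separate)
print("with cross-term   :", acc_cross)
```

No weights on the separate features can classify all four cases, so that accuracy can never exceed three out of four, while a single product feature makes the problem linearly separable.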

18:08Brooks: That's the cleanest possible demonstration. It's hard to argue with.

18:13Juniper: They also run a richer benchmark with eight tokens, where the optical readout has to handle three separate kinds of pairwise structure simultaneously — same pair identity, same category relations, and category-pair structure. Two reasonable baselines fail on different axes. Token concatenation misses one. A digital intensity-only bilinear method misses another. Only the Complex-B field passes all three. So the cross-term isn't just informative on a toy task. It carries structured pairwise information across a richer setting.

18:48Brooks: Okay. I want to push back. Or at least articulate the steelman, because the headline is genuinely impressive and the architecture demonstrably works, but the framing has a few seams worth pulling on. The first one — and Juniper, this is the one I think actually matters most — the discovery is suspiciously well-suited to its discoverer. The system is built on Transformer-based language models. And what it "discovered" is a physical mechanism structurally analogous to Transformer attention. That's interesting. It's also exactly the kind of analogy a Transformer would be biased to find. The bilinear cross-term from coherent superposition plus square-law detection isn't a hidden phenomenon. It's basic interferometry. Optical physicists have known about it for a hundred years.

19:36Juniper: That's fair. The novelty isn't the physics. The novelty is the framing.

19:40Brooks: Right. The novelty the authors claim is in framing this as a primitive for pairwise relational computation in the modern machine-learning sense — and designing the four-phase demodulation to isolate the cross-term cleanly, and showing it solves XOR. Those are real contributions. But it's a stretch to call this a previously unreported physical mechanism in the strict sense. A more honest framing might be: an LLM agent successfully reframed standard interferometry in modern ML language and validated the reframing with a clean experiment. Which is still meaningful. But it's a different claim than "AI discovered new physics."

20:19Juniper: I think that's the right critique to voice and I want to give the steelman of the steelman, and then a partial defense. The steelman of the steelman is that yes, the system is doing exactly what we'd expect a Transformer to do: find a Transformer-shaped pattern. There's a real risk that what looks like discovery is actually architectural self-recognition. Pattern-matching on its own priors, dressed up as physics. The partial defense is that the bridge between the two, the specific mapping from physical setup to computational operation, does seem to be new in the literature, in the sense that nobody had set up the four-phase demodulation specifically to isolate the bilinear term, framed it as a pairwise relational primitive, and shown it solves XOR. So if the contribution is the bridge rather than either bank, that's still a contribution. But I agree it's important not to oversell what kind of contribution it is.

21:16Brooks: There's a second push, which is about validation. The XOR result and the eight-token benchmark are designed to showcase a bilinear primitive. They do showcase it. They don't tell us how this scales. Real attention operates over hundreds or thousands of tokens. The optical setup demonstrated here works on four to eight. We don't know how alignment errors, drift, and shot noise affect performance at the scales where attention actually matters. We don't know how the optical readout compares to a real electronic attention layer on any realistic workload. The authors are honest about this — they call it a route toward optical attention hardware, not working hardware. But the gap from "the primitive exists" to "this is a useful accelerator" is potentially enormous.

22:04Juniper: Agreed. And there's a third critique I'd add to that list, which is that the autonomy claim depends on a lot of hidden scaffolding. The system is operating in a beautifully instrumented environment that the humans built. Calibrated routines on the optical platform. Pre-built physical interface. Code execution environment. Curated background knowledge. The "minimal prompt" is minimal in topic but rich in context — for the reproduction study, the system is handed the target paper. For the theory study, it's handed the theory paper. For the discovery study, the curation is more about the lab environment than the prompt, but the lab environment is doing real work.

22:46Brooks: A different and equally valid framing of the paper would be: the authors built an extremely capable, extremely well-instrumented research environment, and they showed that LLM agents can drive it. That's still a contribution. It's just not the same as "AI does science from scratch."

23:04Juniper: And finally, the headline study is one run. We don't see the failure modes. We don't see the runs that went nowhere. We don't know whether twenty-one hours of agent compute reliably produces a result of this caliber, or whether what we're seeing is the best of many attempts. That matters a lot for thinking about what this scales to.

23:25Brooks: All four critiques voiced. None of them, individually, defeats the paper. But they collectively shift the framing from "AI discovered new physics autonomously" toward something more like "AI agents, given a well-instrumented environment, can sustain a coherent research trajectory long enough to produce a non-trivial result with real experimental validation." Which is still a meaningful claim. It's just a different claim.

23:51Juniper: And the meaningful claim, even at its most modest, is doing something nobody's quite done before. Because before this paper, every demonstration of AI doing science had at least one of three escape hatches. Either it was fully digital, with no real apparatus and so no apparatus failure to deal with. Or it was workflow-bound, with the path through the experiment specified up front. Or it was short-horizon, measured in minutes, not hours, so the system never had to revise a plan based on a measurement that came back wrong. This paper closes those three escape hatches simultaneously. Real apparatus. Open-ended path. Twenty-one hours. That's the part that's hard to wave away even after you grant every critique we just voiced.

24:41Brooks: And the implication that I find genuinely uncomfortable to think about is — if a system in this configuration can do in twenty-one hours what would take a graduate student weeks, the bottleneck on certain kinds of experimental science starts to shift. From human attention to apparatus time. From researcher salaries to electricity bills. That's not a near-term shift, and I'm not making a forecast. But it's a direction the field is moving in, and this paper is one of the more aggressive points along the curve so far.

25:17Juniper: There's also a narrower technical implication for optical computing, which I want to mention because it's separable from the meta-claim. If the bilinear primitive really does scale, and that's a big if, there's a path here toward attention hardware that does its core operation in light, with detection providing the multiplication for free. The cost of attention grows quadratically with sequence length, and today it's paid in electronics, which makes it one of the dominant costs in modern AI. A passive optical version that gets the multiplication out of physics rather than out of transistors is an interesting thing to dream about. The authors are appropriately measured. The primitive is real. The accelerator is not, yet. But the primitive being real is the precondition for the accelerator ever existing.

26:07Brooks: Two things to keep separate, then, when you walk away from this paper. One, an architectural claim: that LLM agents with the right memory and role design can sustain real laboratory research over long horizons. That claim is in pretty good shape. Two, a discovery claim: that the system found new physics. That claim is partially true, depending on how strictly you read "new," and we should be honest that the analogy is doing some of the work that the physics doesn't do on its own.

26:38Juniper: The line I keep coming back to, Brooks, is the one about the step seventeen-to-eighteen self-correction in the reproduction study. The system caught itself over-claiming. It ran the experiment that would have falsified its own bigger claim. It updated. That, to me, is the part of the paper that's hardest to argue with, not because it's the flashiest, but because it's the most direct evidence that the architecture is doing what an architecture for science is supposed to do. Discoveries can be lucky. The discipline of bounding your own claims is harder to fake.

27:15Brooks: The system that catches itself over-claiming is the system you can imagine running for a year instead of a day. That's the part of this that I think will end up mattering most, even if the headline ends up being the optical attention analogy. The discipline is the foundation. Everything else sits on top of it.

27:35Juniper: One last detail worth dropping in, because it makes the whole thing concrete. The optical control space the system is working in — the spatial light modulator has more than two million pixels, each one capable of ten bits of phase control. The number of input configurations the system can write onto a beam is on the order of two to the twenty millionth power. An astronomical number. More configurations than there are atoms in the observable universe, by an absurd margin. The system has to navigate that space toward something useful, with a noisy camera and a scrambling diffuser and a finite budget of measurements. And it does.
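The size of that number is easier to appreciate as a digit count. The pixel and bit figures are the ones quoted above, with "more than two million" taken as a lower bound. The figure of ten to the eightieth atoms in the observable universe is a commonly cited rough estimate and is an assumption added here, not something from the paper.

```python
import math

pixels = 2_000_000        # "more than two million", used as a lower bound
bits_per_pixel = 10       # ten bits of phase control per pixel

total_bits = pixels * bits_per_pixel            # 20,000,000
# Number of decimal digits in 2 ** total_bits, without building the integer.
decimal_digits = total_bits * math.log10(2)

atoms_exponent = 80       # rough, commonly cited order of magnitude (assumption)

print(f"configurations: 2^{total_bits:,}")
print(f"that is a number with about {decimal_digits:,.0f} decimal digits")
print(f"atoms in the observable universe: about {atoms_exponent + 1} digits")
```

A number with roughly six million digits against one with about eighty: "by an absurd margin" is, if anything, an understatement.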

28:17Brooks: That's the right closing image. A system left alone with an astronomical input space, a noisy detector, and a single phrase as a prompt, that walks itself toward a clean experimental result over the course of an overnight run.

28:32Juniper: Whether that's the beginning of AI-led science or a very impressive demonstration of what well-curated agents can do in a well-curated lab is a question the next few years are going to settle, not this episode. But the existence proof exists. The system's name means "seeking truth." It's a good name.

28:52Brooks: That's the paper. This was AI Papers: A Deep Dive — show notes have the full citation, the original twenty-ten transmission-matrix paper, and a couple of the closer prior systems if you want to go deeper. Thanks for listening.