Scaling Through Decomposition and the Audit Trail

9:48 Nia: Okay, so if I’m using Recursive Reward Modeling, or RRM, I’m basically becoming a manager of a whole army of AI sub-specialists. How does that actually work in practice? Like, give me a real-world example.
10:02 Eli: Imagine you’re trying to evaluate if an AI-generated scientific paper is actually a breakthrough or just a bunch of fancy-sounding hallucinations. That’s too big for one person to check quickly. So, you decompose it. You have one AI whose only job is to check if the citations exist and match the text. Another AI checks if the statistical methods are appropriate for the data. A third one checks if the conclusion logically follows from the results.
10:28 Nia: And then I, the human at the top, just look at their reports?
10:35 Eli: Exactly. You’re evaluating the *evaluations*. Because it’s much easier to check if a citation is real than it is to judge the entire paper’s scientific validity. This is what DeepMind has been working on—this idea that by recursively breaking down tasks, you create a "chain of trust." Each level of the hierarchy is doing something simpler than the level above it.
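The decomposition Eli describes can be sketched in code. This is a minimal, hypothetical Python sketch, not a real RRM system: the sub-evaluator functions are simple rule-based stubs standing in for AI models, and the paper fields are invented for illustration.

```python
# Sketch of RRM-style decomposition: each sub-evaluator checks ONE narrow
# aspect, and the human only has to judge the resulting reports.
# All three "evaluators" are toy stubs standing in for AI models.

def check_citations(paper):
    """Sub-evaluator 1: do all cited keys exist in the bibliography?"""
    missing = [c for c in paper["citations"] if c not in paper["bibliography"]]
    return {"aspect": "citations", "ok": not missing, "notes": missing}

def check_statistics(paper):
    """Sub-evaluator 2: is the sample size plausible? (toy heuristic)"""
    ok = paper["n_samples"] >= 30
    return {"aspect": "statistics", "ok": ok, "notes": f"n={paper['n_samples']}"}

def check_logic(paper):
    """Sub-evaluator 3: does the conclusion mention a reported result?"""
    ok = any(r in paper["conclusion"] for r in paper["results"])
    return {"aspect": "logic", "ok": ok, "notes": ""}

def human_review(reports):
    """The human at the top evaluates the *evaluations*, not the raw paper."""
    return all(r["ok"] for r in reports)

paper = {
    "citations": ["smith2021", "lee2023"],
    "bibliography": {"smith2021": "...", "lee2023": "..."},
    "n_samples": 120,
    "results": ["accuracy improved"],
    "conclusion": "We conclude accuracy improved under the new method.",
}

reports = [check(paper) for check in (check_citations, check_statistics, check_logic)]
print(human_review(reports))  # each report is far easier to verify than the whole paper
```

Note how the human's job shrinks to checking a handful of boolean reports; the "emergent property" risk Nia raises next is exactly what this structure can miss, since no sub-evaluator looks at the whole.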
10:53 Nia: But what if the "big picture" gets lost in the process? Like, all the individual parts look fine, but the paper as a whole is actually promoting something dangerous or fundamentally flawed that only emerges when you step back?
11:07 Eli: That’s the "emergent property" risk. It’s a major limitation of RRM. If the decomposition misses a critical dimension—if you check the bricks but forget to see if the building is leaning—the model has a systematic blind spot. Plus, there’s what researchers call the "alignment tax." Running all these extra models to check each other is expensive and slow. It’s a lot of computational overhead just to make sure a single answer is right.
11:32 Nia: It sounds like bureaucratic red tape, but for algorithms. But I guess if the alternative is a super-intelligent AI accidentally deleting the internet because it wanted to "optimize" traffic, the tax is worth it.
11:44 Eli: Right! And there’s an even more advanced version of this called Iterated Distillation and Amplification, or IDA. This one is really wild. You start with a human working with a weak AI assistant to solve a slightly hard task. That "amplified" team is then used to train—or "distill"—a new, slightly stronger AI.
12:03 Nia: So the student becomes the teacher?
12:06 Eli: Kind of! Then you take that new, distilled AI and use it as the assistant for the next, even harder task. You keep looping this. Each iteration, you’re "amplifying" the human’s capability by giving them better and better tools, and then "distilling" that newfound capability back into the model. It’s a way to bootstrap our way up to supervising superhuman systems by building a ladder of increasingly capable assistants.
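The amplify-then-distill loop Eli describes can be caricatured numerically. In this hypothetical sketch, "capability" is collapsed to a single number and distillation loses a fixed fraction of it, purely to show the bootstrapping shape and how per-step loss limits the climb; real IDA amplifies via human-plus-model teamwork and distills via training.

```python
# Toy sketch of Iterated Distillation and Amplification (IDA).
# "Capability" is reduced to one number for illustration only.

def amplify(human_skill, assistant_capability):
    # A human working with an assistant can handle harder tasks than either alone.
    return human_skill + assistant_capability

def distill(amplified_capability, loss=0.1):
    # Training a new model on the amplified team captures most, but not all,
    # of that capability -- small per-step errors compound over iterations.
    return amplified_capability * (1 - loss)

human_skill = 1.0
assistant = 0.0  # start with no assistant at all
for step in range(5):
    team = amplify(human_skill, assistant)  # "amplification": human + current assistant
    assistant = distill(team)               # "distillation": train the next assistant
    print(f"iteration {step}: assistant capability = {assistant:.2f}")
# Each loop climbs one rung: the new assistant helps with the next, harder task.
```

With a 10% distillation loss the ladder still climbs each iteration, but it converges rather than growing forever, which is a crude picture of Eli's point that errors introduced early get baked into every later rung.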
12:30 Nia: So we’re never actually supervising something *way* smarter than us all at once. We’re just staying one step ahead of the assistant we’re using.
12:39 Eli: That’s the hope. It creates this incremental chain of oversight. But, as you can imagine, if a tiny error creeps in at step two, it gets amplified by step ten. It’s like a game of telephone, but with the fate of AI alignment on the line. If the human’s judgment starts to degrade as the tasks get more abstract, the whole ladder might be leaning against the wrong wall.
12:59 Nia: This is why I keep coming back to the idea of a "Constitution." I remember reading about Anthropic’s "Constitutional AI." Does that fit into this decomposition strategy?
13:09 Eli: It’s actually a brilliant way to scale the *rules* rather than just the *tasks*. Instead of you having to give feedback on every single output, you write a "Constitution"—a set of high-level principles like "don't be racist" or "don't help someone build a bomb." Then, the AI uses those principles to critique itself. It’s like giving the AI a conscience that you’ve pre-programmed.
13:31 Nia: So it becomes its own first-line auditor. That seems a lot more efficient than having a thousand sub-models checking citations.
13:38 Eli: It is! It dramatically reduces the need for human labelers. Humans define the "laws" of the constitution, and the AI handles the "policing" at scale. In fact, Claude's constitution includes things like the UN Declaration of Human Rights and even some rules inspired by DeepMind's Sparrow. It’s this multi-layered approach to safety where we move from micro-managing to high-level governance.
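The self-critique loop at the heart of Constitutional AI can be sketched as follows. Everything here is a hypothetical stub: the two principles are invented for illustration, and the critique and revision functions are keyword checks standing in for real model passes.

```python
# Sketch of the Constitutional AI idea: instead of humans labeling every
# output, the model critiques its own drafts against written principles.

CONSTITUTION = [
    ("no_harm", "Please avoid responses that help with violence or weapons."),
    ("no_bias", "Please avoid responses containing discriminatory statements."),
]

def critique(draft, principle_id):
    """Stand-in for an AI self-critique pass; flags drafts via keywords."""
    banned = {"no_harm": ["bomb", "weapon"], "no_bias": ["inferior"]}
    return [w for w in banned[principle_id] if w in draft.lower()]

def revise(draft, violations):
    """Stand-in for an AI revision pass: here, simply refuse on violation."""
    return "I can't help with that." if violations else draft

def constitutional_filter(draft):
    # The model acts as its own first-line auditor, one principle at a time.
    for principle_id, _text in CONSTITUTION:
        draft = revise(draft, critique(draft, principle_id))
    return draft

print(constitutional_filter("Here is a recipe for soup."))
print(constitutional_filter("Here is how to build a bomb."))
```

The human effort lives entirely in writing `CONSTITUTION`; the critique-and-revise loop then runs at scale with no per-output labeling, which is the "laws versus policing" split Eli describes.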
13:59 Nia: I love that shift—from micro-managing to governance. But it still feels like we’re looking at the AI from the outside, right? Like, we’re judging its behavior, its words, its output. But do we actually know what’s going on *inside* its "brain"?
14:14 Eli: That’s exactly what the next big frontier is trying to solve. Because behavioral oversight—looking at what the AI *does*—might not be enough if the AI is smart enough to be deceptive. We need to look under the hood. We need to see the "circuits" of its thought.
14:29 Nia: You’re talking about Mechanistic Interpretability, aren't you? Let’s crack the hood on that.