Predictable Scaling and the Science of Growth

9:26 Miles: You know, Lena, one of the most underrated parts of the GPT-4 technical report isn't the exam scores; it's the section on "Predictable Scaling." For the first time, OpenAI showed they could accurately predict how well a massive model would perform before they even finished training it.
9:43 Lena: Wait, how is that possible? If the model is so much bigger than anything that came before, how can you know what it’ll do?
9:50 Miles: It’s all about Scaling Laws. Back in 2020, researchers at OpenAI—led by Jared Kaplan—found that the loss—the error rate—of a language model follows a very predictable "Power Law." It’s tied to three things: the number of parameters, the amount of training data, and the total compute budget.
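A minimal sketch of what such a power law looks like, keeping only the parameter-count term and dropping the data and compute terms. The constants are the approximate fitted values reported in Kaplan et al. (2020) and should be read as illustrative, not exact:

```python
# Minimal sketch of a Kaplan-style power law for loss vs. model size,
# ignoring the data and compute terms. The constants are the approximate
# published fits from Kaplan et al. (2020) and are illustrative only.

ALPHA_N = 0.076   # fitted exponent for the parameter-count term
N_C = 8.8e13      # fitted scale constant (non-embedding parameters)

def predicted_loss(n_params: float) -> float:
    """Cross-entropy loss predicted purely from parameter count."""
    return (N_C / n_params) ** ALPHA_N

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> loss ~{predicted_loss(n):.2f}")
```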
10:08 Lena: So, like a mathematical formula for intelligence?
10:12 Miles: Exactly. They trained smaller models with 1,000 to 10,000 times less compute than GPT-4 and used those results to predict GPT-4's final performance with remarkable accuracy. This was a huge milestone for AI safety: if you can predict the capabilities of a model before you build it, you can plan for the risks.
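Here is a minimal sketch of that fit-and-extrapolate trick. The small-run losses below are invented for illustration, since OpenAI has not published its underlying data points, but fitting a power law in log-log space and extrapolating is the standard mechanic:

```python
import numpy as np

# Hypothetical final losses from four small training runs; the numbers
# are invented for illustration (the real data points aren't public).
compute = np.array([1e18, 1e19, 1e20, 1e21])  # training FLOPs
loss = np.array([3.10, 2.72, 2.39, 2.10])     # final cross-entropy loss

# A power law L = a * C**b (with b < 0) is a straight line in log-log
# space, so fit it with a degree-1 polynomial on the logs.
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)

# Extrapolate four orders of magnitude up, to a GPT-4-scale budget.
target = 1e25
predicted = np.exp(log_a) * target ** b
print(f"predicted loss at {target:.0e} FLOPs: {predicted:.2f}")
```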
10:33 Lena: That makes sense. You wouldn't want to accidentally build something that has capabilities you weren't prepared to manage. But I heard there was some drama with these scaling laws—something about "Chinchilla"?
10:43 Miles: Oh, the Chinchilla Scaling Laws! That was a massive correction from DeepMind in 2022. Kaplan’s original laws suggested that if you have more compute, you should mostly just make the model bigger. That’s why GPT-3 was such a giant. But DeepMind showed that most of these models were actually "under-trained." They had too many parameters and not enough data.
11:05 Lena: So it’s like having a giant brain but not enough books to read?
11:09 Miles: Exactly! Chinchilla suggested that every time you double the model size, you should also double the amount of training data. A 70-billion-parameter model trained on 1.4 trillion tokens, like Chinchilla, could actually outperform a 175-billion-parameter model like GPT-3 that was only trained on 300 billion tokens.
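A back-of-the-envelope version of that rule, sketched under two common approximations: training compute C ≈ 6·N·D, and roughly 20 tokens per parameter at the optimum. Both are rounded rules of thumb derived from the Chinchilla results, not the paper's exact fits:

```python
import math

# Rough Chinchilla-style compute-optimal allocation.
# Assumed rules of thumb (rounded, not the paper's exact fits):
#   training compute: C ~= 6 * N * D   (N params, D tokens)
#   optimal data:     D ~= 20 * N      (about 20 tokens per parameter)
# Substituting gives C ~= 120 * N**2, so N ~= sqrt(C / 120).

def compute_optimal(c_flops: float) -> tuple[float, float]:
    n_params = math.sqrt(c_flops / 120)
    return n_params, 20 * n_params

# Chinchilla's own budget: 6 * 70e9 * 1.4e12 ~= 5.9e23 FLOPs.
n, d = compute_optimal(5.9e23)
print(f"~{n / 1e9:.0f}B params on ~{d / 1e12:.1f}T tokens")
# -> ~70B params on ~1.4T tokens, matching the actual Chinchilla run
```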
11:31 Lena: That explains why the newer "mini" models are so good. They’re smaller, but they’ve been "fed" a lot more high-quality data.
11:38 Miles: Spot on. And this leads to another fascinating concept: "Emergent Abilities." While the overall loss—the error rate—decreases smoothly as you scale, specific tasks often show a "sudden" jump. A model might have zero percent accuracy on a math problem at one scale, and then suddenly hit 50 percent just by adding a bit more compute.
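One toy model of how a smoothly falling loss can still produce a sudden jump on a task: if success requires getting every token of a multi-token answer exactly right, small per-token gains compound sharply. This framing is borrowed from the broader emergence debate (e.g. Schaeffer et al., 2023), not from the GPT-4 report itself:

```python
import math

# Toy model: per-token quality improves smoothly as loss falls, but a
# task scored as "all K answer tokens exactly right" jumps suddenly.
K = 10  # answer length in tokens

for loss in (2.0, 1.0, 0.5, 0.3, 0.1, 0.05):
    p_token = math.exp(-loss)  # per-token success probability
    p_task = p_token ** K      # need every token correct
    print(f"loss {loss:4.2f}: per-token {p_token:.2f}, task {p_task:.3f}")
```

The per-token column improves gradually, but the task column sits near zero and then leaps past 50 percent within the final halving of loss, which is the flavor of jump Miles describes.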
11:59 Lena: That sounds a bit unpredictable, though. If abilities just "emerge," how can you say scaling is predictable?
12:06 Miles: That’s the million-dollar question. While we can predict the overall "intelligence" or loss, we can't always predict exactly when a specific skill—like complex logical reasoning or coding—will click into place. It’s one of the reasons the GPT-4 report was so careful about testing the model on such a wide range of human exams. They wanted to see where those jumps happened.
12:30 Lena: And the jumps were massive. I mean, going from the 10th percentile to the 90th percentile on the Bar Exam is a qualitative shift. It’s not just "better at guessing the next word"; it’s a fundamental change in how the model handles legal reasoning.
12:45 Miles: It really is. And it's not just English. The report showed that GPT-4 outperformed the English-language performance of GPT-3.5 and other prior state-of-the-art models on the MMLU benchmark in 24 of 26 languages tested, including Mandarin and Ukrainian. It turns out that when you learn the underlying structure of reasoning in one language, it translates, quite literally, to others.
13:06 Lena: That has huge implications for global equity. If a model can provide high-level medical or legal advice in a language that typically has very little digital data, you’re suddenly leveling the playing field for millions of people.
13:20 Miles: Absolutely. But as Stuart Russell warns in "Human Compatible," we have to be sure the goals we give these systems are truly aligned with human values. Because if a super-intelligent system is "predictably scaling" but pursuing a goal that's slightly off, like a genie that takes your wish too literally, things could get messy very quickly.