Are top AI models actually smarter, or just lucky? Learn why benchmark margins of error are often understated and how to measure true model skill.

An empirical science is only as good as its measuring tools. We need to move away from 'vibe-based' engineering and toward actual, rigorous science by acknowledging the noise and uncertainty in AI benchmarks.

Nia: You know, Eli, I was looking at some AI leaderboards this morning, and it hit me—how do we actually know if one model is truly "smarter" than another, or if it just got lucky with the questions?
Eli: That is exactly the question Anthropic tackled in their research late last year. It turns out, "luck of the draw" is a huge factor. They found that when you account for how questions are grouped, the actual margin of error can be over three times larger than what researchers usually report.
Nia: Wait, three times? That’s a massive difference. It makes you wonder if some of these "breakthroughs" are just statistical noise.
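To make the gap concrete, here is a minimal sketch (not Anthropic's actual code) comparing the naive IID standard error with a cluster-robust one, on made-up data where correctness is perfectly correlated within each question group:

```python
import math
from collections import defaultdict

def naive_se(scores):
    """IID standard error of mean accuracy (the usual leaderboard math)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.sqrt(var / n)

def clustered_se(scores, clusters):
    """Cluster-robust SE: correlated questions from one source move together."""
    n = len(scores)
    mean = sum(scores) / n
    totals = defaultdict(float)
    for s, c in zip(scores, clusters):
        totals[c] += s - mean          # within-cluster residual sums
    return math.sqrt(sum(t ** 2 for t in totals.values())) / n

# Hypothetical benchmark: 100 questions in 10 clusters of 10; five clusters
# the model aces, five it fails completely (extreme within-cluster correlation).
scores = [1] * 50 + [0] * 50
clusters = [i // 10 for i in range(100)]
print(naive_se(scores))                # ≈ 0.050
print(clustered_se(scores, clusters))  # ≈ 0.158, roughly 3x larger
```

Under this extreme clustering the margin of error roughly triples; real benchmarks sit somewhere between the two estimates, which is why reporting only the naive number understates the noise.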
Eli: Precisely. They’re proposing we stop looking at just the raw score and start thinking about a "question universe"—this theoretical space of all possible questions—to measure a model's underlying skill.
Nia: I love that mental model. So, let’s dive into how we can use tools like the Central Limit Theorem to finally add some much-needed error bars to these AI evals.
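As a closing sketch of the idea, here is what a Central Limit Theorem error bar looks like in practice: a normal-approximation (Wald) confidence interval for an accuracy score, assuming independent questions (the assumption that clustering relaxes). The models and scores are hypothetical:

```python
import math

def accuracy_ci(correct, total, z=1.96):
    """Normal-approximation 95% CI for accuracy, via the CLT (IID questions)."""
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)   # binomial standard error of the mean
    return p - z * se, p + z * se

# Two hypothetical models two points apart on a 1,000-question benchmark:
lo_a, hi_a = accuracy_ci(850, 1000)   # model A scores 85.0%
lo_b, hi_b = accuracy_ci(830, 1000)   # model B scores 83.0%
print(f"A: [{lo_a:.3f}, {hi_a:.3f}]  B: [{lo_b:.3f}, {hi_b:.3f}]")
```

Even at 1,000 questions, the two intervals overlap, so a 2-point gap alone cannot settle which model is better; that is the kind of "statistical noise" the hosts are warning about.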