Leaderboards often ignore margins of error. Learn how to use power analysis to find out which AI models actually perform best.

We need to stop treating evals as just a series of contests and start seeing them as statistical experiments drawn from an unseen super-population of questions. When we report a single percentage without error bars, we’re acting like the bucket is the ocean.
https://arxiv.org/html/2411.00640v1
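Concretely, reporting an error bar starts with the standard error of the mean score. A minimal sketch of the idea (the model names and question counts are hypothetical, using the normal-approximation interval for a binary-scored eval):

```python
import math

def eval_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float, float]:
    """Accuracy plus a normal-approximation 95% confidence interval."""
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)  # standard error of the mean score
    return p, p - z * se, p + z * se

# Two hypothetical models on the same 500-question benchmark:
galleon = eval_ci(430, 500)      # 86.0% accuracy
dreadnought = eval_ci(422, 500)  # 84.4% accuracy
# Galleon's interval reaches below Dreadnought's point estimate and
# Dreadnought's reaches above Galleon's, so the 1.6-point "lead" may be noise.
```

With intervals this wide, bolding the bigger number tells you less than it appears to.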



Nia: You know, I was looking at some recent LLM benchmarks, and it’s wild how we just accept these "state-of-the-art" rankings at face value. It’s always just the highest number wins, right?
Blythe: Exactly, it’s this "bold the biggest number" mentality. But here’s the kicker: a lot of those margins we obsess over might just be statistical noise. We’re often comparing models like "Galleon" and "Dreadnought" without any error bars or confidence intervals to tell us if the lead is even real.
Nia: That’s fascinating because, in any other science, you’d never report an experimental result without quantifying its precision. It’s like we’ve ignored decades of established experimental analysis.
Blythe: Right! Evals aren’t just a series of contests; they’re statistical experiments, with each benchmark a sample drawn from an unseen super-population of questions. Let’s explore how adding a bit of rigorous math can completely change how we rank these models.
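The power analysis mentioned at the top asks the flip side of the error-bar question: given two assumed accuracies, how many questions would an eval need before a gap that size is reliably detectable? A rough sketch using the standard unpaired two-proportion sample-size formula (the accuracies here are hypothetical):

```python
import math
from statistics import NormalDist

def questions_needed(p1: float, p2: float,
                     alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate eval size per model to detect accuracy p1 vs p2
    with a two-sided two-proportion z-test at the given alpha and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96
    z_beta = NormalDist().inv_cdf(power)           # ≈ 0.84
    var = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * var / (p1 - p2) ** 2
    return math.ceil(n)

# Separating a hypothetical 86% model from an 84% one takes about
# 5,000 questions per model; a 10-point gap needs only a few hundred.
n_small_gap = questions_needed(0.86, 0.84)
n_large_gap = questions_needed(0.80, 0.70)
```

Most benchmarks are far smaller than what the small-gap case demands, which is exactly why single-digit leaderboard margins deserve skepticism.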