Why AI benchmarks are more uncertain than they look

23 min

Mar 31, 2026

AI leaderboards often ignore statistical noise. Learn how Anthropic’s new approach to error bars provides a more accurate way to rank model performance.

Best quote from Why AI benchmarks are more uncertain than they look

Statistics is the science of measurement in the presence of noise. AI evaluations are, by their nature, incredibly noisy; this isn't about making the noise go away—it’s about learning how to work with it honestly and precisely.

This audio lesson was created by a member of the BeFreed community

Input question

https://www.anthropic.com/research/statistical-approach-to-model-evals and https://arxiv.org/html/2411.00640v1

Host voices

Nia

Jackson

Learning style

Deep

Knowledge sources

What Is ChatGPT Doing ... and Why Does It Work?

Artificial Intelligence and Generative AI for Beginners

Frequently asked questions

The question universe is the theoretical set of all possible questions that could test a specific skill, such as physics, law, or coding. Current AI benchmarks like MMLU or MATH contain only a small sample of these questions. Anthropic’s research suggests that a model's score should be viewed not as an absolute truth but as an estimate of its performance across this entire unseen super-population. Without acknowledging this "universe," researchers may mistake a model's luck on a specific set of questions for genuine mastery of a subject.
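To make the "score as an estimate" idea concrete, here is a minimal sketch of putting a standard error and a 95% confidence interval on a benchmark score. It assumes the questions are independent draws from the question universe and that the sample is large enough for the normal approximation to hold; the function names and the 70-out-of-100 example are illustrative, not taken from the source.

```python
import math

def mean_and_standard_error(scores):
    """Return the mean score and the standard error of that mean."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance of per-question scores (n - 1 in the denominator).
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)

def confidence_interval_95(scores):
    """Normal-approximation 95% interval around the mean score."""
    mean, se = mean_and_standard_error(scores)
    return mean - 1.96 * se, mean + 1.96 * se

# Hypothetical eval: 100 questions, 70 answered correctly.
scores = [1] * 70 + [0] * 30
mean, se = mean_and_standard_error(scores)
low, high = confidence_interval_95(scores)
print(f"score = {mean:.3f}, SE = {se:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With only 100 questions the interval is wide, which is why a difference of a few points between two models can be within the noise.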

Standard error formulas usually assume that every question is an independent draw, but many evaluations are clustered: several questions are tied to a single long passage, as in reading comprehension tests. If a model misunderstands a passage, it will likely miss all the related questions, so those questions are not independent draws. Ignoring this clustering can produce standard errors that are three times smaller than they should be, giving researchers a false sense of confidence in results that might actually be statistical noise.
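The effect of clustering can be sketched with a cluster-robust standard error, which sums residuals within each passage before squaring them. This is a simplified version without small-sample corrections, and the passage labels and scores below are made up for illustration.

```python
import math
from collections import defaultdict

def naive_standard_error(scores):
    """Standard error that treats every question as independent."""
    n = len(scores)
    mean = sum(scores) / n
    return math.sqrt(sum((s - mean) ** 2 for s in scores)) / n

def clustered_standard_error(scores, cluster_ids):
    """Cluster-robust standard error: residuals are summed per cluster first."""
    n = len(scores)
    mean = sum(scores) / n
    residual_sums = defaultdict(float)
    for score, cluster in zip(scores, cluster_ids):
        residual_sums[cluster] += score - mean
    return math.sqrt(sum(r ** 2 for r in residual_sums.values())) / n

# Hypothetical eval: 4 passages with 5 questions each. The model gets every
# question right on two passages and every question wrong on the other two.
scores = [1] * 5 + [1] * 5 + [0] * 5 + [0] * 5
cluster_ids = ["A"] * 5 + ["B"] * 5 + ["C"] * 5 + ["D"] * 5

naive_se = naive_standard_error(scores)
clustered_se = clustered_standard_error(scores, cluster_ids)
print(f"naive SE = {naive_se:.3f}, clustered SE = {clustered_se:.3f}")
```

In this extreme example the 20 questions behave like 4 independent observations, so the clustered standard error is more than twice the naive one.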

Instead of forcing a model to pick a single answer (like "A" or "B") and grading it right or wrong, researchers can look at the probability the model assigns to the correct answer token. For example, if a model assigns a 72% probability to the correct answer, it receives a score of 0.72 on that question. This method eliminates the randomness that comes from sampling tokens at a nonzero "temperature." It provides a more nuanced, continuous score that can reduce measurement variance by up to two-thirds compared to traditional pass/fail grading.
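Why this reduces noise can be seen from the law of total variance: a sampled right-or-wrong score carries both the spread of difficulty across questions and the coin-flip noise of sampling, while a probability score carries only the first. The sketch below assumes the evaluator can read the model's probability for the correct answer on each question; apart from the 0.72 taken from the example above, the probabilities are invented.

```python
def variance_components(correct_probs):
    """Split per-question score variance into two parts.

    between: spread of the correct-answer probability across questions
             (the only variance left when scoring with probabilities).
    within:  average Bernoulli sampling noise p * (1 - p), which appears
             only when answers are sampled and graded pass/fail.
    """
    n = len(correct_probs)
    mean_p = sum(correct_probs) / n
    between = sum((p - mean_p) ** 2 for p in correct_probs) / n
    within = sum(p * (1 - p) for p in correct_probs) / n
    return between, within

# Hypothetical probabilities the model assigns to the correct answer.
correct_probs = [0.72, 0.95, 0.40, 0.88, 0.61]
between, within = variance_components(correct_probs)
sampled_variance = between + within   # pass/fail grading of sampled answers
probability_variance = between        # grading with the probability itself

print(f"pass/fail variance:   {sampled_variance:.4f}")
print(f"probability variance: {probability_variance:.4f}")
```

How much variance is removed depends on the eval: the closer the model's probabilities sit to 0.5, the larger the sampling term that probability scoring eliminates.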

A paired-difference analysis compares two models by looking at how they performed on the exact same questions, rather than treating their two average scores as independent numbers. Since frontier models often struggle with or excel at the same specific questions, their results are highly correlated. By focusing on the "gap" per question, researchers can subtract out the noise caused by question difficulty. The estimated difference itself is unchanged, since the mean of the per-question gaps equals the gap between the averages, but its standard error shrinks. This makes the measurement of the difference between two models much more precise and can reveal that a small gap in average scores is in fact statistically significant.
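As a rough illustration, the sketch below compares the standard error of the difference computed two ways on the same data: from per-question gaps (paired) and from the two score lists treated as independent samples (unpaired). The two score lists are hypothetical.

```python
import math

def variance_of_mean(values):
    """Sample variance of the values divided by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1) / n

def paired_difference(scores_a, scores_b):
    """Mean per-question gap and its standard error."""
    gaps = [a - b for a, b in zip(scores_a, scores_b)]
    return sum(gaps) / len(gaps), math.sqrt(variance_of_mean(gaps))

def unpaired_standard_error(scores_a, scores_b):
    """Standard error of the gap if the two samples were independent."""
    return math.sqrt(variance_of_mean(scores_a) + variance_of_mean(scores_b))

# Hypothetical results for two models on the same 10 questions.
model_a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
model_b = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0]

mean_gap, paired_se = paired_difference(model_a, model_b)
unpaired_se = unpaired_standard_error(model_a, model_b)
print(f"gap = {mean_gap:.2f}, paired SE = {paired_se:.3f}, unpaired SE = {unpaired_se:.3f}")
```

Because the two models mostly succeed and fail on the same questions, the per-question gaps vary little, and the paired standard error comes out well below the unpaired one.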

Power analysis is a calculation used to determine, before the test is even run, whether an evaluation is sensitive enough to detect a real difference between models. It helps researchers calculate the necessary sample size, often at least a thousand independent questions, so that a real difference is unlikely to be missed while the false-positive rate stays under control. This prevents researchers from "weighing a diamond on a bathroom scale" by ensuring the test has enough statistical power to see small performance gains, such as a 2% or 3% improvement.
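A minimal sketch of such a calculation is shown below. It uses the textbook normal-approximation formula for a two-sided test on the mean per-question gap, which is not necessarily the exact formula in the Anthropic paper. The standard deviation of the per-question gaps (0.35 here) is an assumed input that would normally come from a pilot run or an earlier eval.

```python
import math
from statistics import NormalDist

def required_questions(min_detectable_gap, sd_of_gaps, alpha=0.05, power=0.80):
    """Independent questions needed to detect a given score gap.

    min_detectable_gap: smallest true difference worth detecting (e.g. 0.03)
    sd_of_gaps:         assumed standard deviation of per-question gaps
    alpha:              two-sided false-positive rate
    power:              probability of detecting a gap of that size
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = ((z_alpha + z_power) * sd_of_gaps / min_detectable_gap) ** 2
    return math.ceil(n)

n_for_3_points = required_questions(min_detectable_gap=0.03, sd_of_gaps=0.35)
n_for_2_points = required_questions(min_detectable_gap=0.02, sd_of_gaps=0.35)
print(f"questions needed for a 3-point gap: {n_for_3_points}")
print(f"questions needed for a 2-point gap: {n_for_2_points}")
```

The required sample size grows with the inverse square of the gap, so halving the difference you want to detect roughly quadruples the number of questions.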



Key takeaways

Beyond the AI Leaderboard Hype

The Hidden Question Universe

The Problem of Related Questions

Turning Down the Statistical Noise

The Power of Paired Comparisons

Planning for Statistical Power

A Practical Playbook for AI Researchers

Final Reflections on the Science of Evals

Frequently asked questions

What is the "question universe" and why does it matter for AI benchmarks?

Why can standard error calculations be inaccurate for reading comprehension tests?

How can "next-token probabilities" improve the accuracy of an AI evaluation?

What is a paired-difference analysis and why is it superior for comparing models?

How does "Power Analysis" help in designing better AI experiments?
