Why Benchmarks Are Lying to You About AI Progress
In August 2024, a team of researchers published a paper with an uncomfortable finding: GPT-4 appeared to have memorized portions of the GSM8K math benchmark, a test of grade-school arithmetic reasoning on which GPT-4 was ostensibly achieving 92% accuracy. When the researchers replaced the numbers in the problems with different numbers (keeping the structure identical), accuracy dropped noticeably. The model had, in some meaningful sense, learned the test rather than the skill the test was supposed to measure.
Nobody in the field was particularly shocked. They were used to this.
What Benchmark Contamination Actually Is
The term “benchmark contamination” gets used loosely to describe two related but distinct problems, and conflating them obscures what’s actually happening.
The first problem is direct memorization: the exact questions and answers from a benchmark appeared in the training data. If the MMLU chemistry questions were scraped from a website that Common Crawl indexed, and the model was trained on Common Crawl, then the model may have memorized the answers. At evaluation time, the model retrieves answers rather than reasoning through them. This is cheating in the most literal sense, and it’s distressingly common because it’s very hard to audit: you’d need to know exactly what was in the training data, and for models the size of GPT-4, that’s not fully known even to the people who built them.
The second problem is subtler and more interesting: indirect contamination. Even if the exact benchmark questions weren’t in the training data, the internet contains extensive discussion, analysis, and explanations of benchmark problems. Papers discussing MMLU performance strategies, blog posts walking through GRE math approaches, Stack Overflow threads about algorithmic problems that end up as HumanEval test cases — all of this enters the training corpus and gives the model advantages on benchmarks that it wouldn’t have on genuinely novel tasks.
Think of it like the difference between studying a leaked copy of the actual exam and studying the course’s exams from the last five years. Both approaches raise your test score without necessarily improving your underlying knowledge. Only the first one counts as cheating in the technical sense.
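The number-swapping check from the opening example is one practical way to separate either kind of contamination from real skill. Here is a minimal sketch of the idea, assuming a made-up problem template and two toy "models" (a memorizer and a solver). None of these names come from the paper; they are purely illustrative.

```python
import random
import re

# Hypothetical template in the style of a grade-school word problem.
TEMPLATE = ("{name} has {a} apples and buys {b} bags with {c} apples each. "
            "How many apples does {name} have now?")


def make_variant(rng):
    """Sample fresh numbers for the template and compute the true answer."""
    a, b, c = rng.randint(2, 50), rng.randint(2, 9), rng.randint(2, 12)
    name = rng.choice(["Ava", "Ben", "Chen", "Dara"])
    return TEMPLATE.format(name=name, a=a, b=b, c=c), a + b * c


def accuracy(answer_fn, problems):
    """Share of (question, answer) pairs that answer_fn gets right."""
    return sum(answer_fn(q) == ans for q, ans in problems) / len(problems)


def accuracy_gap(answer_fn, original_problems, n_variants=100, seed=0):
    """Accuracy on the published items minus accuracy on perturbed variants."""
    rng = random.Random(seed)
    variants = [make_variant(rng) for _ in range(n_variants)]
    return accuracy(answer_fn, original_problems) - accuracy(answer_fn, variants)


# A single "published" item. Its first number (100) is outside the range
# used for variants, so no variant can accidentally duplicate it.
original = [(TEMPLATE.format(name="Ava", a=100, b=3, c=5), 115)]
memorized = dict(original)


def memorizer(question):
    """Toy model that only knows answers it has seen verbatim."""
    return memorized.get(question)


def solver(question):
    """Toy model that actually does the arithmetic."""
    a, b, c = map(int, re.findall(r"\d+", question))
    return a + b * c


print(accuracy_gap(memorizer, original))  # 1.0: perfect on the original, zero on variants
print(accuracy_gap(solver, original))     # 0.0: no drop when the numbers change
```

A large gap does not prove memorization on its own, but a model that has learned the skill should show a gap near zero.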
The Goodhart’s Law Problem
Charles Goodhart was a Bank of England economist who, in 1975, observed that a statistical regularity tends to collapse once it is used for control purposes. The popular paraphrase, “when a measure becomes a target, it ceases to be a good measure,” is what we now call Goodhart’s Law. The principle was originally about monetary policy. It describes AI benchmark progress perfectly.
When the AI field decided that MMLU, HumanEval, and GSM8K were the measures of model capability, those benchmarks became targets. Training decisions, architectural choices, and fine-tuning strategies were optimized to maximize performance on these specific measures. This works: the scores go up. What doesn’t necessarily follow is that the underlying capability the benchmark was intended to measure improves at the same rate, or at all.
HumanEval is a coding benchmark that tests whether models can write Python functions from docstrings. It was designed to measure general programming capability. Models achieving very high scores on HumanEval still demonstrably fail at tasks that any experienced programmer would consider basic: debugging complex multi-file codebases, reasoning about runtime behavior, understanding code that relies on domain-specific libraries not well-represented in training data. The HumanEval score went up. The programming capability — in the sense of “can I give this to a model and it will be as useful as a junior developer” — improved more slowly and less uniformly.
This isn’t surprising. The benchmark is a sample from a distribution of programming tasks. Optimizing for the benchmark means optimizing for the specific sample, not for the distribution. This works until you care about the distribution.
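The sample-versus-distribution point can be illustrated with a toy simulation. This is a deliberately simplified sketch: it assumes 50 hypothetical model variants that all have the same true accuracy (70%), scores each on the same-sized 100-question benchmark, and keeps the best scorer. The numbers and function names are assumptions for illustration, not data from any real model.

```python
import random


def benchmark_score(true_accuracy, n_questions, rng):
    """Simulate a score: each question is answered correctly with
    probability true_accuracy, independently of the others."""
    hits = sum(rng.random() < true_accuracy for _ in range(n_questions))
    return hits / n_questions


def select_best_variant(true_accuracy=0.70, n_variants=50, n_questions=100, seed=0):
    """Pick the variant with the best benchmark score, then re-test it
    on a large fresh sample drawn from the same distribution."""
    rng = random.Random(seed)
    scores = [benchmark_score(true_accuracy, n_questions, rng)
              for _ in range(n_variants)]
    best_on_benchmark = max(scores)
    # The winning variant is no better than the others, so its score on
    # fresh questions falls back toward the true accuracy.
    on_fresh_questions = benchmark_score(true_accuracy, 10_000, rng)
    return best_on_benchmark, on_fresh_questions


best, fresh = select_best_variant()
print(f"best benchmark score: {best:.2f}")
print(f"same variant on fresh questions: {fresh:.2f}")
```

Selecting on the benchmark inflates the reported score even though no variant is actually better than any other. The inflation disappears as soon as you measure on questions that were not used for selection.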
The “Held-Out” Myth
The standard response to benchmark contamination concerns is that benchmarks are “held out” from training — not included in the training data, so the model can’t memorize them. This is partially true and mostly insufficient.
Held-out benchmarks do solve the direct memorization problem, for a while. The problem is that once a benchmark is public, it doesn’t stay out of the training data forever. Researchers discuss it, papers analyze it, developers post their results on GitHub, bloggers write explainers. By the time a model is being trained a year or two after a benchmark’s release, the indirect contamination problem is severe even if the benchmark questions themselves were excluded.
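Keeping benchmark questions out of a training corpus, or auditing whether they got in, is often approximated with a text-overlap check. Below is a minimal sketch, assuming word-level 8-gram matching against a small in-memory corpus. The window size, the function names, and the sample texts are all illustrative assumptions.

```python
def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(benchmark_item, corpus_docs, n=8):
    """Fraction of the item's n-grams that appear verbatim in any corpus doc."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical data, for illustration only.
question = ("Which element has the highest electronegativity on the "
            "Pauling scale of all known elements")
corpus = [
    "Quiz dump: which element has the highest electronegativity on the "
    "Pauling scale of all known elements Answer: fluorine",
    "An unrelated page about sourdough starters and hydration ratios "
    "for home bakers everywhere",
]

print(overlap_fraction(question, corpus))      # 1.0: the question appears verbatim
print(overlap_fraction(question, corpus[1:]))  # 0.0: no shared 8-grams
```

A check like this only catches verbatim copies. A blog post that paraphrases the question and walks through the answer would score zero here, which is exactly why excluding the benchmark itself does not solve indirect contamination.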
More fundamentally: the set of possible “held-out” benchmarks is finite, and the field is consuming them at a rapid rate. A newly released benchmark typically takes 12-18 months to go from “state-of-the-art models struggle with this” to “state-of-the-art models achieve human-level performance.” New benchmarks are introduced to replace saturated ones, the models eventually saturate those too, and the cycle repeats. This creates a superficially impressive narrative of relentless progress that partly reflects genuine capability improvement and partly reflects benchmark saturation.
Big-Bench Hard, MATH, and GPQA were all introduced specifically because their predecessors (MMLU, GSM8K, and various high-school-level science tests) were getting saturated. As of early 2026, frontier models are approaching saturation on some GPQA subsets. The field will introduce new benchmarks, and the process will repeat.
What Actually Measuring AI Capability Would Require
Here’s the uncomfortable answer: genuinely measuring AI capability requires expensive, slow, domain-specific real-task evaluation. This is hard to do at scale and doesn’t produce the clean headline numbers that drive press releases and investor presentations.
Real evaluation looks like this: take actual software engineers and have them attempt actual work tasks with and without AI assistance, measuring completion time and quality. Take actual radiologists and have them evaluate AI-assisted diagnosis against their own diagnosis and against ground truth. Take actual lawyers and have them compare contract review with and without AI tools against eventual litigation outcomes. This kind of evaluation exists — there are academic studies and industry pilots that do exactly this — but it’s expensive, slow, domain-specific, and resistant to generalization. You can’t reduce it to a single number.
The field has known this for years. HELM (Holistic Evaluation of Language Models) from Stanford, released in 2022, was an explicit attempt to create a more comprehensive evaluation framework — measuring not just accuracy on benchmark tasks but calibration (does the model know when it doesn’t know?), efficiency, robustness, and fairness across demographic groups. It’s genuinely better than a single MMLU score. It’s also complex enough that it didn’t displace the headline benchmarks in most public discourse.
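To make “calibration” concrete: one common way to quantify it is expected calibration error (ECE), which compares how confident a model says it is with how often it is actually right. The sketch below is a plain-Python version with equal-width confidence bins. It is an illustration of the general metric, not a claim about how HELM implements it, and the sample inputs are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average, over confidence bins, of the gap between the
    mean stated confidence and the observed accuracy in that bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        ece += len(items) / total * abs(avg_conf - accuracy)
    return ece


# Hypothetical model that claims 90% confidence but is right half the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])

# Hypothetical model that claims 50% confidence and is right half the time.
calibrated = expected_calibration_error([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])

print(round(overconfident, 3))  # 0.4
print(round(calibrated, 3))     # 0.0
```

Both hypothetical models have the same 50% accuracy. Only the calibration number reveals that one of them does not know when it doesn’t know.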
The economic incentive points away from rigorous evaluation. A complex, expensive, slow evaluation framework doesn’t help you ship a press release announcing that your new model beats the previous state of the art. A single number that you can compare to the previous model’s single number does. The benchmark system persists because it serves the interests of the companies releasing results, not because it accurately measures what they claim to be measuring.
The Specific Ways This Misleads
Three concrete failure modes flow from the benchmark problem.
First: overconfident deployment. A company sees that its model achieves 90% accuracy on a medical reasoning benchmark and decides the model is ready for clinical decision support. The benchmark was drawn from publicly available medical licensing exam questions, many of which are in the training data. The actual performance on novel clinical cases is substantially lower. Patients are harmed. This has happened. The specific incidents are mostly not public.
Second: capability misattribution. When a new model scores 10 points higher than its predecessor on MMLU, the press (and often the company’s own communications) interprets this as “10 points smarter.” But the improvement may be concentrated in areas where the training data coverage improved, not in any broad-sense increase in reasoning ability. The model got better at answering MMLU questions; we don’t actually know what else changed.
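A quick way to see how a headline gain can hide a concentrated one is to break the aggregate down by subject. The sketch below uses invented per-subject accuracies for two hypothetical model versions. The subjects and numbers are assumptions chosen to make the arithmetic obvious.

```python
def mean_score(scores):
    """Unweighted mean accuracy across subjects, in percentage points."""
    return sum(scores.values()) / len(scores)


def per_subject_delta(old_scores, new_scores):
    """Accuracy change per subject, largest gain first."""
    deltas = {subject: new_scores[subject] - old_scores[subject]
              for subject in old_scores}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical accuracies (percent) for an old and a new model.
old = {"chemistry": 60, "law": 60, "history": 60, "physics": 60}
new = {"chemistry": 95, "law": 65, "history": 60, "physics": 60}

print(mean_score(new) - mean_score(old))  # 10.0: the headline number
for subject, delta in per_subject_delta(old, new):
    print(f"{subject}: {delta:+d}")
# chemistry: +35
# law: +5
# history: +0
# physics: +0
```

In this made-up case, the “10 points” is almost entirely one subject. The aggregate says nothing about whether that subject improved because of better reasoning or better training data coverage.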
Third: comparative distortion. Company A releases a model that’s been heavily optimized for benchmark performance. Company B releases a model that’s more carefully fine-tuned for real-world task performance but scores slightly lower on benchmarks. The industry coverage uniformly treats Company A’s model as superior. The practitioners who actually deploy these models for real work discover the ranking doesn’t hold. This is a regular occurrence, and the people who notice it don’t tend to be the ones writing the headlines.
The New Benchmarks Are Already Being Gamed
The field’s response to benchmark saturation is to create harder benchmarks. ARC-AGI, released in 2019 by François Chollet and substantially updated since, was designed to resist training data memorization by requiring novel analogical reasoning that couldn’t be covered by pattern-matching on training examples. OpenAI’s o3 model scored 87.5% on ARC-AGI in December 2024 in a high-compute configuration, prompting widespread commentary about whether this represented genuine abstract reasoning or a very expensive form of test-specific optimization.
The answer is both, and the fact that this is ambiguous tells you something about the limits of benchmarks as a category. ARC-AGI tests for specific reasoning patterns. Getting good at those patterns, whether through genuine generalization or through training approaches that happen to work well on those patterns, looks the same from the outside. The debate about whether o3’s performance represents “real” abstract reasoning or “just” expensive search over reasoning chains is not a debate that any existing benchmark can resolve. It requires an account of mechanism, not just a score.
GPQA (Graduate-Level Google-Proof Q&A) is another recent entrant, built from questions that PhD-level domain experts can answer but that can’t be answered by searching the web. As of early 2026, frontier models are achieving scores in the 70-75% range on GPQA, approaching expert human performance. What this means for the benchmark’s continued usefulness is predictable: within another 12-18 months, at the current trajectory, it will be saturated. A new benchmark will be designed. The cycle will repeat.
The Field Knows and Keeps Going
None of this is hidden knowledge. The benchmark contamination literature is extensive and well-known within ML research circles. Papers documenting saturation, contamination, and Goodhart’s Law effects are regularly published in top venues and regularly ignored in product announcements.
The incentives are the explanation. A benchmark score is a number you can compare. Comparison drives press coverage. Press coverage drives user acquisition. User acquisition drives revenue. The fact that the number is an imperfect proxy for the thing you actually care about is downstream of all these more immediate incentives, and it stays downstream.
What you can do: be appropriately skeptical of any claim about AI capability that comes with a benchmark number attached. Ask whether the benchmark is saturated. Ask whether the company reports independent real-task evaluation alongside the benchmark. (Most don’t. If they do, read the methodology carefully.) Ask whether the benchmark domains match the actual deployment use case. Ask when the benchmark was released and how long it’s been in public circulation.
You will not get good answers to most of these questions from a press release. That’s the point.




