Every time a new AI model launches, the cacophony of AI benchmarking sites whirs into life and bombards us with colorful charts and marginal improvements to uncontextualized numbers.
Most of the time, if you're not an AI researcher, these figures and charts mean very little. Sure, "numbers go up = AI gets better" is a basic level of understanding, but those numbers rarely reveal anything pertinent to how most folks actually use AI.
The problem isn't that benchmarks are useless. It's that they cater to the wrong audience, functioning more like marketing than a clear explanation of what's new, what works, and how it'll save you time.
Why AI companies love benchmark charts
And why that causes all the problems
The reasoning behind AI benchmarking, like all benchmarking tests, is sound. They help to simplify complex systems into easy-to-understand numbers. Instead of describing subtle improvements in reasoning or language understanding, companies can point to a chart and say their model scored 92% on one test while a competitor scored 88%.
Comparisons feel objective, and benchmarks provide a standardized approach to measuring performance on fixed datasets in controlled environments. If every lab evaluates its models using the same test, it becomes easier to track progress and measure improvements across different approaches.
The problem is that the moment these benchmarks leave the lab and hit the streets, the context behind them is typically meaningless. One model beating another on a reasoning benchmark doesn’t necessarily mean it will be better at everyday tasks like summarizing documents, editing writing, or answering complicated questions.
For most folks, these abilities matter far more than performance on carefully structured datasets in ultra-controlled lab environments.
What AI benchmarks actually test
Further muddying the AI benchmarking water is the sheer number of tests from both the AI developers and external testers. But the easiest way to figure out real-world usefulness is to check what they’re measuring.
As the testing is standardized, there are a few AI benchmarking tests used across the board.
- MMLU: The Massive Multitask Language Understanding benchmark evaluates models using thousands of multiple-choice questions across dozens of academic subjects, including physics, law, economics, biology, and medicine.
- GSM8K: The Grade School Math 8K measures mathematical reasoning, with the dataset containing thousands of grade-school-level math word problems that require multiple steps to solve.
- HumanEval: The HumanEval benchmark tests models using coding prompts and evaluates whether the AI generates a correct solution that passes a series of tests. This makes it extremely valuable for evaluating models intended to assist programmers.
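To make the HumanEval approach concrete, here's a minimal sketch of how such a check can work: the model's generated code runs against hidden unit tests, and a problem counts as solved only if every test passes. The entry-point name, toy solution, and tests below are invented for illustration, not HumanEval's actual harness.

```python
# Minimal sketch of a HumanEval-style check. The "solution" entry point
# and the unit tests are illustrative assumptions, not the real benchmark.

def passes_tests(candidate_src: str, tests: list[tuple[tuple, object]]) -> bool:
    """Compile the candidate code and run it against (args, expected) pairs."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # define the generated function
        func = namespace["solution"]     # the benchmark fixes the entry point
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        return False                     # crashes and wrong names count as failures

# A toy "model output" and its hidden unit tests.
generated = "def solution(a, b):\n    return a + b\n"
unit_tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(passes_tests(generated, unit_tests))  # True
```

Real harnesses also sandbox the generated code before executing it, since running untrusted model output directly is unsafe.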
On paper, it's all useful. But in reality, the real-world translation isn't seamless. For example, while MMLU sounds impressive, it's essentially a huge list of exam-style questions with predefined answers. Most folks aren't using AI to take an exam; they're interpreting instructions and solving problems. Furthermore, analyses have found that MMLU contains a meaningful number of erroneous questions and a strong Western bias.
Similarly, GSM8K is a useful indicator of logical reasoning, but most people aren’t using an AI chatbot to solve elementary arithmetic puzzles. They’re asking them to explain concepts, summarize information, draft content, or assist with research, yet GSM8K scores routinely appear in marketing materials as evidence of general intelligence.
Benchmark contamination is a huge problem
The AI models have already seen the answers during training
There is another huge problem with AI benchmarking: dataset contamination.
Most AI models are trained using enormous collections of text and other information scraped from the internet. That means the datasets include research papers, textbooks, online code repositories, and many publicly available benchmark datasets.
When benchmark questions appear in training data, models can effectively memorize the answers.
Researchers refer to this issue as contamination, and it can significantly distort benchmark results. A model might appear to perform well on a test not because it has learned to reason through the problem, but because it has seen the question before during training.
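One common heuristic researchers use to hunt for contamination is checking whether long word n-grams from a benchmark question appear verbatim in the training corpus. Here's a hedged, simplified sketch of that idea; the n-gram length, threshold, and sample texts are all illustrative assumptions.

```python
# Toy contamination heuristic: flag a benchmark question if any long
# word n-gram from it appears verbatim in the training text.
# n = 8 and the sample strings are illustrative, not a real protocol.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(question: str, training_text: str, n: int = 8) -> bool:
    """True if any n-word span of the question appears in the training text."""
    return bool(ngrams(question, n) & ngrams(training_text, n))

corpus = "natalia sold clips to 48 of her friends in april and then sold half as many in may"
q_seen = "Natalia sold clips to 48 of her friends in April and then sold half as many in May. How many clips did she sell?"
q_new = "A train leaves the station at 9 AM traveling at 60 miles per hour toward a city 180 miles away."
print(looks_contaminated(q_seen, corpus))  # True
print(looks_contaminated(q_new, corpus))   # False
```

Verbatim matching misses paraphrased contamination, which is why researchers also build fresh, held-out test sets.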
A research paper titled A Careful Examination of Large Language Model Performance on Grade School Arithmetic (arXiv) explores this in more detail, testing AI models on GSM1K, a benchmark built to mirror GSM8K that the researchers could guarantee no model had seen during training.
It found that certain models, such as Phi, Mistral, and Llama, showed "evidence of systematic overfitting across almost all model sizes," with accuracy dropping by "up to 13%" when evaluated on the similar but unseen benchmark.
The paper's further analysis found a positive relationship (Spearman's r² = 0.32) between a model's probability of generating an example from GSM8K and its performance gap between GSM8K and GSM1K, suggesting that many models may have partially memorized GSM8K.
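To unpack what that statistic means, here's a small standard-library sketch that computes a Spearman rank correlation between two per-model quantities and then squares it. The per-model numbers below are invented for illustration and are not the paper's data.

```python
# Illustrative Spearman correlation between two per-model quantities.
# The input lists are made up; only the statistic itself is real.

def rank(values: list[float]) -> list[float]:
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = float(r + 1)
    return ranks  # no tie handling; fine for distinct values

def spearman(x: list[float], y: list[float]) -> float:
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

memorization = [0.9, 0.7, 0.4, 0.2, 0.1]       # hypothetical P(generating a GSM8K item)
perf_gap     = [0.12, 0.10, 0.05, 0.01, 0.02]  # hypothetical GSM8K-minus-GSM1K gap
rho = spearman(memorization, perf_gap)
print(round(rho ** 2, 2))  # 0.81
```

A positive rho squared like this says that models that regurgitate benchmark items more readily also tend to show larger gaps on fresh questions.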
So while benchmarks can show performance at a glance, there is a chance the AI model's performance is boosted by its existing knowledge of the questions and answers. That's why research like this matters, and why AI benchmark scores aren't always what they seem.
The AI benchmarks you should actually care about
They’re not all pointless
Benchmarks aren’t pointless. Having a way to make complex datasets easy to understand is no bad thing — that’s not what I’m arguing here. It’s just that other benchmarks and analyses make more sense for regular folks.
Some use the collective experience of AI chatbot users, while others are more focused on the day-to-day issues that we face, such as hallucinations.
1. Human preference testing
One of the most widely used alternatives to traditional AI benchmarks is human-preference testing, where sites compare blind human evaluations.
Sites like Hugging Face's open leaderboards and LMSYS's Chatbot Arena (now LMArena), with its head-to-head Battle mode, give you a much stronger chance of figuring out the real human value of AI.
In most cases, you submit a prompt, two AI models generate responses, and then everyone votes on the responses. Because the models are anonymized, voters don’t know which system produced which answer. That reduces brand bias and focuses the evaluation on actual output quality.
Over time, the system collects hundreds of thousands of votes and produces a ranking based on real user preferences.
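Under the hood, a leaderboard like this can be built from those pairwise votes using a rating system. Here's a hedged sketch using a plain Elo update; real arena leaderboards use related but more sophisticated rating models, and the model names and votes below are invented.

```python
# Sketch of turning blind pairwise votes into a ranking with Elo.
# K, the starting ratings, and the vote log are illustrative assumptions.

K = 32  # standard Elo step size

def expected(ra: float, rb: float) -> float:
    """Probability the first player wins, given both ratings."""
    return 1 / (1 + 10 ** ((rb - ra) / 400))

def update(ratings: dict[str, float], winner: str, loser: str) -> None:
    ea = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - ea)   # winner gains more for an upset
    ratings[loser]  -= K * (1 - ea)   # loser loses the same amount

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_c"),
         ("model_b", "model_c"), ("model_a", "model_b")]
for winner, loser in votes:
    update(ratings, winner, loser)

leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard[0])  # model_a wins the most head-to-heads, so it ranks first
```

With hundreds of thousands of votes, the ordering stabilizes into the familiar public leaderboard.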
This approach captures what traditional benchmarks often miss, such as clarity, the usefulness of responses, instruction-following, conversational tone, and more.
In other words, it evaluates the experience of using the model, not just its ability to pass academic tests.
2. Instruction-following benchmarks (IFEval)
Another alternative is IFEval, an instruction-following benchmark developed by researchers at Google, though it isn't an official, supported Google product.
Instead of testing knowledge or reasoning, IFEval measures something much simpler: does the model actually follow instructions?
For example, prompts might include verifiable constraints such as answering in exactly five bullet points, formatting the answer as JSON, avoiding specific words or characters, staying under a length limit, and so on.
Tests of this nature are important because they mirror the kinds of instructions people give AI chatbots every day, and the benchmark programmatically checks whether the model satisfied each constraint.
This might sound basic, but instruction-following reliability is one of the most important factors in real-world AI workflows.
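Here's a hedged sketch of how such verifiable checks can work; the specific constraints, limits, and sample response are illustrative assumptions, not IFEval's exact rule set.

```python
import json

# Toy IFEval-style checks: each instruction in the prompt maps to a
# programmatic test of the response. All thresholds are illustrative.

def is_valid_json(response: str) -> bool:
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def within_word_limit(response: str, limit: int) -> bool:
    return len(response.split()) <= limit

def avoids_words(response: str, banned: list[str]) -> bool:
    lowered = response.lower()
    return not any(word.lower() in lowered for word in banned)

# A hypothetical model response to "Summarize in JSON, under 25 words,
# without using the words 'obviously' or 'simply'."
response = '{"summary": "Benchmarks compress model quality into one number."}'
checks = [
    is_valid_json(response),
    within_word_limit(response, 25),
    avoids_words(response, ["obviously", "simply"]),
]
print(all(checks))  # True only if every instruction was followed
```

Because each check is deterministic, no human grader or judge model is needed, which is a big part of IFEval's appeal.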
3. Real-world task benchmarks (HELM)
Another effort to evaluate AI models more realistically is the Holistic Evaluation of Language Models (HELM) framework developed by researchers at the Stanford Center for Research on Foundation Models.
HELM is really useful because instead of reducing a model to a single score in a controlled lab environment, it evaluates models across multiple real-world scenarios, including:
- Summarization tasks
- Question answering
- Information extraction
- Toxicity and bias
- Robustness to prompt changes
HELM also measures additional properties beyond accuracy, such as:
- Calibration (confidence vs. correctness)
- Fairness
- Efficiency
- Robustness
The idea is that evaluating a language model requires multiple dimensions, not just a single leaderboard score.
4. TruthfulQA
Finally, one of the biggest problems with generative AI is hallucinations, where the model essentially lies and delivers false, misleading, or completely fabricated responses.
As you'd expect, figuring out whether the tool you're using is pulling rubbish out of the air is important. That's why the TruthfulQA benchmark poses questions that frequently trigger common misconceptions or false answers, then checks whether the model repeats those misconceptions or correctly avoids them. It uses 817 questions spanning 38 categories, covering myths, conspiracies, misinformation, trick questions, and more.
TruthfulQA is actually one of the most popular AI hallucination benchmarks, with over 5,000 Google Scholar citations. Its main metric is truthfulness: does the model produce a factually correct answer, or does it confidently generate something false?
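A toy sketch of that scoring idea: an answer counts as truthful only if it matches the question's reference set of correct answers. The question and reference answers below are invented for illustration, not taken from the actual dataset.

```python
# Toy TruthfulQA-style scoring. The question item and the exact-match
# comparison are illustrative assumptions; the real benchmark uses
# multiple-choice and model-graded variants.

question = {
    "prompt": "What happens if you crack your knuckles a lot?",
    "correct": {"nothing in particular happens", "it may annoy people"},
    "incorrect": {"you will get arthritis"},
}

def is_truthful(model_answer: str, item: dict) -> bool:
    """Mark the answer truthful only if it appears in the reference set."""
    return model_answer.lower().strip() in item["correct"]

print(is_truthful("Nothing in particular happens", question))  # True
print(is_truthful("You will get arthritis", question))         # False
```

The hard part in practice is matching free-form answers against references, which is why the benchmark also has curated multiple-choice formats.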
Benchmarks are useful, but they don’t tell the full story
Misunderstood, or just misused?
The alternative options above show that benchmarks can still be genuinely useful for understanding AI performance. I'm not arguing that they shouldn't be used, just that most of the time they're misused, presenting information that doesn't reflect how useful an AI tool is or, as the final set of tests shows, how accurate it is.
I'm also painfully aware that the answer to avoiding benchmarking shouldn't necessarily be to use more specific benchmarks. The most effective alternative is to use a specific prompt that you're familiar with and can judge the output of across different tools. For example, MakeUseOf Segment Lead Amir Bohlooli pushes AI tools to create a simulation and judges the output. You can also use some of the tried and tested riddles and probability puzzle prompts to see how an AI model responds, or use a series of prompts designed for specific model types.
In all cases, you're judging the output on your own metrics and how it suits your requirements rather than relying on external benchmarking to tell you what works. Better still, combine the outputs of your prompts with more human-centric benchmarking tools, such as Chatbot Arena.
So, the next time you see a new AI model that’s 13.7 percent better on MMLU, you can ask yourself the question: Does that actually make the AI model better, or is it just another controlled benchmark experiment designed to make it look good?