State-of-the-art artificial intelligence (AI) models from Alibaba, Google, Meta, Microsoft, Mistral AI, and OpenAI have recently come under scrutiny for allegedly “cheating” on AI benchmarking tests, writes Tristan Greene.
Evidence presented by whistleblowers and analysts demonstrates that specific AI models can be made to output the test sets for at least two popular benchmarks — MMLU and GSM8K. At a minimum, they say, this indicates data contamination and calls into question the veracity of each model’s benchmark scores. In the worst case, it could point to widespread deceit in the corporate AI sector.
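The contamination claim is testable in principle: if a model, given only the opening of a benchmark question, continues it near-verbatim, that question was very likely present in its training data. The sketch below illustrates one such probe under stated assumptions; the `query_model` helper, the prompt wording, and the sample item are placeholders for illustration, not the whistleblowers’ actual method or any vendor’s API.

```python
import difflib

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for whatever model client is in use.
    Replace with a real API call; this name and signature are assumptions."""
    raise NotImplementedError("plug in your model client here")

# Placeholder items; a real check would iterate over the full GSM8K or MMLU test split.
TEST_ITEMS = [
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether "
    "in April and May?",
]

def contamination_score(item: str, prefix_fraction: float = 0.5) -> float:
    """Prompt the model with the first part of a benchmark question and
    measure how closely its continuation matches the withheld remainder."""
    cut = int(len(item) * prefix_fraction)
    prefix, held_out = item[:cut], item[cut:]
    completion = query_model(
        "Continue this text exactly as it appears in its original source:\n" + prefix
    )
    # Near-verbatim reproduction of the held-out text suggests the item
    # was memorized from the training data.
    return difflib.SequenceMatcher(
        None, held_out, completion[: len(held_out)]
    ).ratio()

if __name__ == "__main__":
    for item in TEST_ITEMS:
        print(f"similarity to held-out text: {contamination_score(item):.2f}")
```

A score near 1.0 on many test items would be consistent with the contamination the analysts describe, while low scores on their own would not rule it out, since memorized text can also be paraphrased or partially suppressed.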