Skip to content

Search the site

AIbenchmarkscheatingLLMsOpinion

AI benchmarking scandal: Were top models caught gaming the system?

Intentional? Or incidental to the nature of large-scale data scraping?

Image credit: https://unsplash.com/@nate_dumlao

State-of-the-art artificial intelligence (AI) models from Alibaba, Google, Meta, Microsoft, Mistral AI, and OpenAI have come under recent scrutiny for allegedly “cheating” on AI benchmarking tests, writes Tristan Greene.

Evidence presented by whistleblowers and analysts demonstrates that specific AI models can be made to output the test sets for at least two popular benchmarks — MMLU and GSM8K. At a minimum, they say, this indicates data contamination and calls into question the veracity of each models’ benchmark scores. In the worst case, it could be indicative of widespread deceit in the corporate AI sector. 

This post is for subscribers only

Subscribe

Already have an account? Sign In

Latest