State-of-the-art artificial intelligence (AI) models from Alibaba, Google, Meta, Microsoft, Mistral AI, and OpenAI have recently come under scrutiny for allegedly “cheating” on AI benchmarking tests, writes Tristan Greene.
Evidence presented by whistleblowers and analysts demonstrates that specific AI models can be made to output the test sets for at least two popular benchmarks — MMLU and GSM8K. At a minimum, they say, this indicates data contamination and calls into question the veracity of each model’s benchmark scores. In the worst case, it could point to widespread deceit in the corporate AI sector.
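The contamination claim is testable in principle: if a model, given only the opening of a benchmark question, continues it near-verbatim, that question was very likely present in its training data. The sketch below illustrates one such probe under stated assumptions; the `query_model` helper, the prompt wording, and the sample item are placeholders for illustration, not the whistleblowers’ actual method or any vendor’s API.

```python
import difflib

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for whatever model client is in use.
    Replace with a real API call; this name and signature are assumptions."""
    raise NotImplementedError("plug in your model client here")

# Placeholder items; a real check would iterate over the full GSM8K or MMLU test split.
TEST_ITEMS = [
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether "
    "in April and May?",
]

def contamination_score(item: str, prefix_fraction: float = 0.5) -> float:
    """Prompt the model with the first part of a benchmark question and
    measure how closely its continuation matches the withheld remainder."""
    cut = int(len(item) * prefix_fraction)
    prefix, held_out = item[:cut], item[cut:]
    completion = query_model(
        "Continue this text exactly as it appears in its original source:\n" + prefix
    )
    # Near-verbatim reproduction of the held-out text suggests the item
    # was memorized from the training data.
    return difflib.SequenceMatcher(
        None, held_out, completion[: len(held_out)]
    ).ratio()

if __name__ == "__main__":
    for item in TEST_ITEMS:
        print(f"similarity to held-out text: {contamination_score(item):.2f}")
```

A score near 1.0 on many test items would be consistent with the contamination the analysts describe, while low scores on their own would not rule it out, since memorized text can also be paraphrased or partially suppressed.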