Are We Done with MMLU?

Our analysis uncovers significant issues with the Massive Multitask Language Understanding (MMLU) benchmark, which is widely used to assess large language models (LLMs). We found numerous ground-truth errors; in the Virology subset, for example, 57% of the analysed questions contain some form of error, obscuring the true capabilities of models.

To address this, we introduce a new error taxonomy and create MMLU-Redux, a refined subset of 3,000 manually re-annotated questions across 30 MMLU subjects. Our experiments with MMLU-Redux reveal notable discrepancies with previously reported model performance metrics, underscoring the need to revise MMLU's flawed questions. We invite the community to contribute further annotations to enhance the reliability of this important benchmark.
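For readers who want to reproduce this kind of comparison, the sketch below shows one way to score per-question predictions against re-annotated labels while skipping questions flagged as erroneous. It is a minimal illustration only: the Hugging Face dataset path ("edinburgh-dawg/mmlu-redux"), the per-subject configuration name, and the "answer" and "error_type" column names are assumptions rather than guarantees, so check them against the released dataset card before relying on this code.

```python
# Minimal sketch of re-scoring a model against MMLU-Redux labels.
# Assumptions (not guaranteed by this post): the dataset lives at
# "edinburgh-dawg/mmlu-redux" on the Hugging Face Hub, subjects are
# exposed as configurations, and rows carry "answer" and "error_type"
# columns where "ok" marks a question with no detected error.
from datasets import load_dataset


def accuracy_on_clean_questions(predictions, subject="virology"):
    """Score per-question predictions, ignoring questions whose
    re-annotation flagged a ground-truth or clarity error."""
    ds = load_dataset("edinburgh-dawg/mmlu-redux", subject, split="test")
    correct = total = 0
    for pred, row in zip(predictions, ds):
        if row["error_type"] != "ok":  # skip questions flagged as erroneous
            continue
        total += 1
        correct += int(pred == row["answer"])
    return correct / total if total else 0.0
```

Comparing this score with the accuracy computed over all questions gives a rough sense of how much the flagged errors distort a model's reported MMLU performance.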
