Are We Done with MMLU?

Our analysis uncovers significant issues with the Massive Multitask Language Understanding (MMLU) benchmark, which is widely used to assess large language models (LLMs). We find numerous ground-truth errors: in the Virology subset, 57% of the questions contain inaccuracies, obscuring the true capabilities of models.

To address this, we introduce a new error taxonomy and create MMLU-Redux, a refined subset of 3,000 manually re-annotated questions spanning 30 MMLU subjects. Our experiments with MMLU-Redux reveal notable discrepancies from previously reported model performance metrics, underscoring the need to revise MMLU's flawed questions. We invite the community to contribute further annotations to enhance the reliability of this important benchmark.
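
For readers who want to reproduce this kind of comparison, the minimal sketch below illustrates one way to score a model on all questions versus only those whose annotations indicate no ground-truth problem, assuming MMLU-Redux is distributed via the Hugging Face Hub; the dataset identifier, configuration name, split, and field names (e.g. `error_type`, `answer`) are assumptions to verify against the actual release.

```python
# Minimal sketch (not the authors' code): comparing a model's accuracy on all
# MMLU-Redux questions vs. only the questions annotated as error-free.
# Assumed: a Hub identifier like "edinburgh-dawg/mmlu-redux", a per-subject
# configuration, and "answer" / "error_type" fields on each example.
from datasets import load_dataset

DATASET_ID = "edinburgh-dawg/mmlu-redux"   # assumed Hub identifier
SUBSET = "virology"                        # one of the 30 re-annotated subjects


def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / max(len(references), 1)


def score_with_and_without_flagged_items(predictions):
    """Score aligned predictions on all questions and on unflagged ones only."""
    data = load_dataset(DATASET_ID, SUBSET, split="test")
    all_refs = [ex["answer"] for ex in data]

    # Keep only questions whose annotation reports no ground-truth issue
    # ("ok" is an assumed label for unflagged items).
    clean_idx = [i for i, ex in enumerate(data) if ex["error_type"] == "ok"]
    clean_preds = [predictions[i] for i in clean_idx]
    clean_refs = [all_refs[i] for i in clean_idx]

    return {
        "accuracy_all": accuracy(predictions, all_refs),
        "accuracy_clean_only": accuracy(clean_preds, clean_refs),
    }
```

The gap between the two numbers returned here is what makes the reported discrepancies concrete: a model penalised (or rewarded) by mislabelled ground truth will score differently once flagged questions are excluded.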