Inverse Scaling in Test-Time Compute

July 19, 2025

We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. We identify five failure modes when models reason for longer: Claude models become increasingly distracted by irrelevant information; OpenAI o-series models overfit to problem framings; models shift from reasonable priors to spurious correlations; models struggle to maintain focus on complex deductive tasks; and extended reasoning may amplify concerning behaviors.
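To make the "inverse scaling" claim concrete, one way to detect it is to evaluate a model at several reasoning-token budgets and fit the slope of accuracy against log(budget): a negative slope means more test-time compute correlates with lower accuracy. The sketch below is illustrative only; the budgets and accuracy numbers are made up, not results from this work.

```python
# Hypothetical sketch: detecting an inverse scaling trend between
# test-time compute (reasoning-token budget) and task accuracy.
import math

def scaling_slope(budgets, accuracies):
    """Least-squares slope of accuracy vs. log(token budget).

    A negative slope indicates inverse scaling: more test-time
    compute correlates with *lower* accuracy on the task.
    """
    xs = [math.log(b) for b in budgets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Illustrative evaluation at increasing reasoning budgets (tokens);
# the accuracies are invented to show the shape of an inverse trend.
budgets = [256, 1024, 4096, 16384]
accuracies = [0.82, 0.74, 0.65, 0.58]

slope = scaling_slope(budgets, accuracies)
print(f"slope = {slope:.3f}")  # negative => inverse scaling
```

In practice each accuracy point would come from running the model on the evaluation tasks with the reasoning length capped at that budget, averaged over enough samples to smooth out noise.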