Benchmark Wars 2026: ARC-AGI-2, GPQA Diamond, and the HLE Scoring Controversy
Gemini 3.1 Pro leads on ARC-AGI-2 (77.1% vs 68.8%) and GPQA Diamond (94.3% vs 91.3%). GPT-5.3-Codex dominates Terminal-Bench 2.0 at 77.3% and CyberSec CTF at 77.6%. Then the Humanity's Last Exam results detonated a credibility crisis: Anthropic reported 66.6% for Claude while independent evaluators