Benchmark Wars 2026: ARC-AGI-2, GPQA Diamond, and the HLE Scoring Controversy
Gemini 3.1 Pro leads on ARC-AGI-2 (77.1% vs 68.8%) and GPQA Diamond (94.3% vs 91.3%). GPT-5.3-Codex dominates Terminal-Bench 2.0 at 77.3% and CyberSec CTF at 77.6%. Then the Humanity's Last Exam...