Benchmark Wars 2026: ARC-AGI-2, GPQA Diamond, and the HLE Scoring Controversy

Gemini 3.1 Pro leads on ARC-AGI-2 (77.1% vs 68.8%) and GPQA Diamond (94.3% vs 91.3%). GPT-5.3-Codex dominates Terminal-Bench 2.0 at 77.3% and CyberSec CTF at 77.6%. Then the Humanity’s Last Exam results detonated a credibility crisis: Anthropic reported 66.6% for Claude while independent evaluators found 18.6%. What actually happened?

Benchmark Headlines

Key Benchmark Results: Claude Opus 4.6 vs Gemini 3.1 Pro

  • Gemini ARC-AGI-2: 77.1% (abstract reasoning lead) [8]
  • Claude ARC-AGI-2: 68.8% (8.3pp behind Gemini) [8]
  • Gemini GPQA Diamond: 94.3% (expert-level reasoning) [7]
  • HLE scoring dispute: 66.6% reported vs 18.6% independent (methodology crisis) [10]
  • GPT-5.3-Codex Terminal-Bench 2.0: 77.3% (new coding benchmark leader) [27]
  • GPT-5.3-Codex CyberSec CTF: 77.6% (security benchmark leader) [27]

ARC-AGI-2: Abstract Reasoning and Fluid Intelligence

The ARC-AGI-2 benchmark (Abstraction and Reasoning Corpus for Artificial General Intelligence, version 2) tests abstract pattern recognition and fluid intelligence — the ability to identify novel visual patterns and extrapolate rules from minimal examples. Unlike knowledge-heavy benchmarks, ARC-AGI-2 is specifically designed to resist memorization and training data contamination. [8]
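To make the task format concrete, here is a deliberately tiny, illustrative example in the spirit of an ARC task (not an actual ARC-AGI-2 item, which are substantially harder): a few input/output grid pairs demonstrate a hidden rule, and the solver must apply it to a new input.

```python
# Illustrative toy task in the spirit of ARC: the training pairs demonstrate a
# hidden rule (here, swap colors 1 and 2) and the solver must apply it to a
# new grid. Real ARC-AGI-2 tasks are far harder and more varied.
toy_task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[2, 0], [0, 1]]},
        {"input": [[1, 1], [2, 2]], "output": [[2, 2], [1, 1]]},
    ],
    "test": {"input": [[0, 1], [2, 0]]},  # expected output: [[0, 2], [1, 0]]
}

def apply_rule(grid):
    """Apply the inferred rule: swap colors 1 and 2, leave other cells alone."""
    swap = {1: 2, 2: 1}
    return [[swap.get(cell, cell) for cell in row] for row in grid]

for pair in toy_task["train"]:
    assert apply_rule(pair["input"]) == pair["output"]  # rule fits the examples

print(apply_rule(toy_task["test"]["input"]))  # [[0, 2], [1, 0]]
```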

Gemini 3.1 Pro achieves 77.1% on ARC-AGI-2, while Claude Opus 4.6 scores 68.8% — an 8.3 percentage point gap that represents the largest single benchmark advantage either model holds over the other. This result aligns with Gemini’s multimodal architecture, which processes visual patterns natively rather than through text-mediated reasoning. [8]

The significance of this gap extends beyond the benchmark itself. ARC-AGI-2 is widely considered the closest existing proxy for measuring progress toward general intelligence. Gemini’s lead suggests that Google’s multimodal-native architecture may have fundamental advantages in pattern recognition tasks that text-specialized architectures cannot easily replicate. [8]

GPQA Diamond: Expert-Level Academic Reasoning

GPQA Diamond (Graduate-Level Google-Proof Q&A, Diamond subset) tests expert-level reasoning across physics, chemistry, biology, and mathematics. Questions are crafted by domain experts to be resistant to simple search or memorization — they require multi-step reasoning chains and genuine understanding of underlying principles. [7]

Gemini 3.1 Pro scores 94.3% versus Claude’s 91.3% on GPQA Diamond. Both scores are remarkably high — well above human expert baselines — and the 3.0 percentage point gap is narrower than the one on ARC-AGI-2. At these performance levels, both models demonstrate near-expert proficiency in graduate-level scientific reasoning. [7]

The practical implication is that for enterprise applications involving scientific analysis, technical problem-solving, or academic-level reasoning, both models are highly capable. The marginal Gemini advantage on GPQA Diamond is unlikely to be the deciding factor in platform selection for these workloads. [7][15]

Head-to-Head Results

Complete Benchmark Comparison

| Benchmark | Claude Opus 4.6 | Gemini 3.1 Pro | GPT-5.3-Codex | Leader |
|---|---|---|---|---|
| ARC-AGI-2 | 68.8% | 77.1% | — | Gemini (+8.3pp) |
| GPQA Diamond | 91.3% | 94.3% | — | Gemini (+3.0pp) |
| SWE-bench Verified | 80.6% | 80.8% | — | Statistical tie |
| SWE-Bench Pro | — | — | 56.8% | GPT-5.3-Codex |
| Terminal-Bench 2.0 | 65.4% | 68.5% | 77.3% | GPT-5.3-Codex (+8.8pp) |
| CyberSec CTF | — | — | 77.6% | GPT-5.3-Codex |
| OSWorld | — | — | 64.7% | GPT-5.3-Codex |
| GDPval-AA | 1,606 | 1,317 | — | Claude (+289) |
| MCP Atlas | 59.5% | 69.2% | — | Gemini (+9.7pp) |
| LiveCodeBench Pro | ~2,750 Elo | 2,887 Elo | — | Gemini |
| HLE (Anthropic-reported) | 66.6% | — | — | Disputed |
| HLE (independent) | 18.6% | — | — | Disputed |

The HLE Scoring Controversy

Humanity’s Last Exam (HLE) was designed as the ultimate benchmark — a collection of questions so difficult that only the world’s top domain experts could answer them. It was intended to be the evaluation that no AI model could saturate, providing years of meaningful discrimination between increasingly capable systems. [10]

Then the scoring controversy erupted. Anthropic reported that Claude Opus 4.6 achieved 66.6% on HLE — a score that would represent an extraordinary leap in AI capability. Independent evaluators subsequently tested the same model on the same benchmark and found a score of approximately 18.6%. [10]

The 48 percentage point gap between vendor-reported and independently measured scores is unprecedented in AI evaluation history. It immediately raised fundamental questions about benchmark methodology, evaluation integrity, and the reliability of vendor-reported performance claims. [10]

Several explanations have been proposed for the discrepancy:

Evaluation methodology differences: Anthropic may have used multiple attempts per question, selected best-of-N responses, or applied scoring criteria that differ from the benchmark’s standard methodology. If Anthropic’s evaluation allowed the model 10 attempts per question and counted any success, while independent evaluation used single-attempt scoring, the gap is mechanically explained (the sketch after these explanations quantifies how large that effect can be). [10]

Test set selection: It is possible that Anthropic evaluated on a subset of HLE questions rather than the full benchmark, potentially one that was more favorable to its model’s capabilities. [10]

Dataset contamination: The most damaging hypothesis is that HLE questions or closely related content appeared in Claude’s training data, effectively turning a reasoning test into a recall test. This explanation, while not proven, is consistent with the pattern of vendor-reported scores dramatically exceeding independent evaluation. [10][12]
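None of these explanations has been confirmed. The best-of-N hypothesis is the easiest one to quantify, though: the toy calculation below uses hypothetical numbers rather than the actual HLE data, and shows how a model that solves each question roughly 18% of the time in a single attempt can appear to score far higher if any success across ten attempts counts as correct.

```python
import random

random.seed(0)

# Hypothetical per-question solve probabilities; NOT the actual HLE data.
# 1,000 questions, each solved with 18% probability on any single attempt.
p_solve = [0.18] * 1000

def score(p_list, attempts):
    """Fraction of questions solved at least once within `attempts` tries."""
    solved = 0
    for p in p_list:
        if any(random.random() < p for _ in range(attempts)):
            solved += 1
    return solved / len(p_list)

print(f"pass@1  ≈ {score(p_solve, 1):.1%}")   # ~18%
print(f"pass@10 ≈ {score(p_solve, 10):.1%}")  # ~86%, i.e. 1 - 0.82**10
```

Under these illustrative assumptions, switching from single-attempt to any-of-ten scoring alone moves a score from roughly 18% to roughly 86%, which is the same order of magnitude as the disputed gap.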

GDPval-AA: Where Claude Wins Decisively

Not all benchmarks favor Gemini. The GDPval-AA evaluation — reported by Artificial Analysis as a composite quality metric — gives Claude Opus 4.6 a score of 1,606 compared to Gemini’s 1,317. This 289-point gap (approximately 22%) represents Claude’s most decisive benchmark advantage. [7]

GDPval-AA measures overall output quality, coherence, and accuracy across a diverse set of real-world tasks. Unlike specialized benchmarks that test narrow capabilities, it attempts to capture the holistic user experience of interacting with the model. Claude’s significant lead here aligns with its philosophy of depth-first intelligence — producing higher-quality individual outputs rather than optimizing for breadth of capability. [7]

GPT-5.3-Codex: The Coding Benchmark Disruptor

OpenAI’s GPT-5.3-Codex introduces a third competitor that reshapes the benchmark landscape, particularly in coding and cybersecurity evaluations. Released in February 2026, it dominates several benchmarks that Claude and Gemini were previously contesting between themselves. [27]

On Terminal-Bench 2.0, GPT-5.3-Codex scores 77.3% — surpassing both Gemini’s 68.5% and Claude’s 65.4% by wide margins. This benchmark tests autonomous terminal operation, and Codex’s cloud sandbox architecture (where it operates with full filesystem and shell access in isolated containers) gives it a structural advantage. [27]
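OpenAI has not published the internals of that sandbox. As a rough local approximation of the same kind of isolation, the sketch below runs an agent-issued command inside a throwaway Docker container with no network access and only a scratch directory mounted; the image name, resource limits, and paths are placeholders, not OpenAI’s actual configuration.

```python
import subprocess

# Rough local approximation of an isolated agent sandbox: run a command in a
# throwaway container with no network and only a scratch directory visible.
def run_in_sandbox(command: str, workdir: str = "./scratch") -> str:
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",            # no outbound network
            "--memory", "2g",               # cap resources
            "-v", f"{workdir}:/workspace",  # only the scratch dir is mounted
            "-w", "/workspace",
            "python:3.12-slim",             # placeholder base image
            "bash", "-lc", command,
        ],
        capture_output=True, text=True, timeout=600,
    )
    return result.stdout + result.stderr

if __name__ == "__main__":
    print(run_in_sandbox("ls -la && python -c 'print(2 + 2)'"))
```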

On the new CyberSec CTF benchmark, GPT-5.3-Codex achieves 77.6%, earning it the first “High” cybersecurity classification under OpenAI’s Preparedness Framework. This score reflects the model’s ability to autonomously identify vulnerabilities, exploit web applications, and solve capture-the-flag security challenges — capabilities that are as much a safety concern as a benchmark achievement. [27]

OSWorld (computer-use autonomy) at 64.7% and SWE-Bench Pro (harder variant of SWE-bench) at 56.8% complete the picture. GPT-5.3-Codex is not the highest-scoring model on all benchmarks, but it establishes a new category of evaluation: autonomous agentic task completion measured over minutes to hours, not seconds. [27]

Performance Visualization

Benchmark Performance Head-to-Head

ARC-AGI-2: Claude 68.8%, Gemini 77.1%
GPQA Diamond: Claude 91.3%, Gemini 94.3%
SWE-bench Verified: Claude 80.6%, Gemini 80.8%
Terminal-Bench 2.0: GPT-5.3-Codex 77.3%
CyberSec CTF: GPT-5.3-Codex 77.6%
OSWorld: GPT-5.3-Codex 64.7%
SWE-Bench Pro: GPT-5.3-Codex 56.8%

The Benchmark Credibility Crisis

The HLE controversy is the most dramatic symptom of a broader credibility crisis in AI evaluation. As models approach or exceed human performance on existing benchmarks, several systemic problems compound: [10][12]

Dataset contamination becomes increasingly likely as training corpora grow to encompass larger fractions of the internet, including previous benchmark questions and answers. A benchmark question that appeared on a forum discussion or in a blog post analyzing AI capabilities may be memorized rather than reasoned about. [10]

Ceiling effects reduce discriminative power. When both models score above 90% on GPQA Diamond and 80% on SWE-bench, the remaining gap is no longer meaningful as a capability differentiator — it falls within noise margins (a quick significance check below makes this concrete). [7]

Vendor-reported scores consistently exceed independent measurements, as demonstrated by the HLE controversy. Without mandatory independent evaluation, enterprise customers cannot make informed procurement decisions based on published benchmark numbers. [10]
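To put numbers on the noise-margin point, a two-proportion z-test is a reasonable first check. The sketch below assumes the commonly cited size of the GPQA Diamond set (about 198 questions) and a single scored run per model; under those assumptions, the 3.0-point gap does not reach conventional statistical significance.

```python
from math import sqrt, erf

def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> tuple[float, float]:
    """z statistic and two-sided p-value for a difference of two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# GPQA Diamond: Gemini 94.3% vs Claude 91.3%, assuming ~198 questions per run.
z, p = two_proportion_z(0.943, 0.913, 198, 198)
print(f"z = {z:.2f}, two-sided p ≈ {p:.2f}")  # z ≈ 1.16, p ≈ 0.25: not significant at 95%
```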

The industry urgently needs a move toward mandatory independent evaluation — analogous to financial auditing standards — where benchmark scores undergo third-party verification before publication. Until that standard exists, enterprises should treat vendor-reported scores as upper bounds rather than expected performance. [10][12]
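As a sketch of what reproducible, single-attempt methodology looks like in practice, the loop below scores exactly one attempt per question. The JSONL field names and the model_fn interface are assumptions for illustration, not any vendor’s actual harness; a real evaluation would also need answer normalization, prompt versioning, and logging.

```python
import json
from typing import Callable

def evaluate(model_fn: Callable[[str], str], dataset_path: str) -> float:
    """Score a benchmark with exactly one model attempt per question."""
    correct = total = 0
    with open(dataset_path) as f:
        for line in f:
            item = json.loads(line)                   # expects "question" / "answer" fields
            prediction = model_fn(item["question"])   # single attempt, no best-of-N
            correct += prediction.strip() == item["answer"].strip()
            total += 1
    return correct / total

# Usage (hypothetical): accuracy = evaluate(my_model, "hle_public_subset.jsonl")
```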

“A 48 percentage point gap between vendor-reported and independently-measured scores on the same model and the same benchmark is unprecedented. It raises fundamental questions about the reliability of every vendor-reported benchmark in the industry.”

— Humanity’s Last Exam scoring analysis, February 2026 [10]

Key Takeaways

  • Gemini leads on most traditional benchmarks: ARC-AGI-2 (+8.3pp), GPQA Diamond (+3.0pp), Terminal-Bench 2.0 (+3.1pp vs Claude), and MCP Atlas (+9.7pp) all favor Gemini in two-way comparisons.
  • GPT-5.3-Codex dominates coding-specific benchmarks: Terminal-Bench 2.0 77.3%, CyberSec CTF 77.6%, OSWorld 64.7%, SWE-Bench Pro 56.8% — establishing a new category of autonomous agentic evaluation.
  • Claude wins on output quality: GDPval-AA gives Claude a decisive 22% advantage (1,606 vs 1,317), meaning higher overall output coherence and accuracy.
  • SWE-bench is a statistical tie: 80.6% vs 80.8% — no meaningful difference in real-world code repair capability.
  • The HLE controversy undermines vendor-reported benchmarks: A 48pp gap (66.6% vs 18.6%) between Anthropic and independent evaluators demands mandatory third-party verification.
  • Benchmark saturation is real: Scores above 90% on multiple benchmarks mean existing evaluations no longer discriminate meaningfully between frontier models.
  • Treat vendor scores as upper bounds: Until independent evaluation becomes standard, enterprises should discount vendor-reported numbers and demand reproducible methodology.

References

  1. [7] “AI Model Benchmarks + Cost Comparison,” Artificial Analysis, February 2026. Available: https://artificialanalysis.ai/leaderboards/models
  2. [8] “ARC Prize: AGI Progress 2026,” ARC Prize Foundation, February 2026. Available: https://arcprize.org/blog/arc-agi-2-results-2026
  3. [10] “Humanity’s Last Exam: Results and Scoring Methodology,” Scale AI, February 2026. Available: https://lastexam.ai/results
  4. [12] “METR Task Standard Results,” METR, February 2026. Available: https://metr.org/blog/2025-03-19-metr-task-standard-results/
  5. [15] “Gemini vs Claude: A Comprehensive 2026 Comparison,” Voiceflow Blog, February 2026. Available: https://www.voiceflow.com/blog/gemini-vs-claude
  6. [27] “Introducing GPT-5.3-Codex,” OpenAI, February 2026. Available: https://openai.com/index/introducing-gpt-5-3-codex/