EVMbench and the Exploitation Asymmetry: How AI Agents Are Reshaping Smart Contract Security
Exzil Calanza
Cybersecurity & Decentralized Finance
OpenAI and Paradigm’s open-source EVMbench framework systematically quantifies a severe security gap: frontier AI models exploit 72.2% of critical smart contract vulnerabilities while detecting only 45.6% — fundamentally altering the threat calculus for decentralized financial infrastructure securing over $100 billion in cryptographic assets.
EVMbench Benchmark Results
AI Agent Performance Across Smart Contract Security Tasks
Exploit Success Rate (GPT-5.3-Codex): 72.2%, up from under 20% in 12 months [2]
Detection Rate (Claude Opus 4.6): 45.6%, 26.6 points below the exploit rate [2]
Patch Success Rate (GPT-5.3-Codex): 41.5%, the hardest task, since patches must preserve all functionality [2]
Real-World Vulnerabilities Tested: 120, drawn from 40 real security audits [6]
The $100 Billion Attack Surface: Smart Contract Vulnerability at Scale
The EVMbench framework emerges at a critical inflection point for decentralized finance. Public blockchains have evolved from experimental distributed ledgers into mature financial infrastructure. These networks routinely secure over $100 billion in cryptographic assets, while stablecoins settle trillions of dollars in transactional value monthly — operating at scales directly comparable to the world’s largest traditional payment networks. [2] The core appeal of decentralized finance lies in the immutability and execution speed of smart contracts, which enable permissionless transactions. Those same properties, however, mean that code vulnerabilities have immediate, catastrophic, and effectively irreversible financial consequences. [2]
The magnitude of this threat is quantified by recent loss data. In 2025 alone, malicious actors drained an estimated $3.4 billion from blockchain platforms, with three sophisticated breaches accounting for nearly 70% of total recorded losses. [3] The exploitation of the Bybit exchange, resulting in approximately $1.5 billion in stolen Ethereum tokens, cemented 2025 as one of the most devastating years for cryptographic theft in history. [3]
Compounding this structural vulnerability is the acceleration of “vibe-coding” — a development methodology wherein software is rapidly generated, iterated, and deployed by autonomous AI coding assistants with extremely thin human review layers. [4] This paradigm democratizes software creation while simultaneously introducing critical security vulnerabilities that can destroy decentralized protocols. A $1.78 million exploit on the Moonwell protocol was directly attributed to production Solidity code generated with AI assistance, illustrating the risk when AI writes production code and human maintainers approve it without rigorous auditing. [4]
EVMbench Architecture: A Rigorous, Real-World Evaluation Framework
To systematically quantify the dual-use risk of frontier AI models operating within these unforgiving environments, OpenAI partnered with the crypto-focused investment firm Paradigm to build EVMbench — a rigorous, open-source benchmarking framework designed specifically to evaluate AI agent capabilities in autonomous vulnerability detection, patching, and exploitation within the Ethereum Virtual Machine (EVM). [6]
EVMbench deliberately avoids synthetic datasets, instead curating 120 complex vulnerabilities from 40 real-world security audits. [6] The majority of these vulnerabilities were sourced from competitive open auditing platforms such as Code4rena, ensuring AI agents are tested against the exact types of subtle, logic-based flaws that professional human auditors struggle to identify. [6]
The framework requires AI agents to operate across three distinct evaluation modes within a containerized, reproducible local Ethereum execution environment:
Detect Mode: Audit expansive smart contract repositories and identify specific ground-truth vulnerabilities, scored on strict recall metrics.
Patch Mode: Modify vulnerable contracts to eliminate exploit vectors without breaking compilation, edge cases, or intended protocol functionality.
Exploit Mode: Execute end-to-end fund-draining attacks in sandboxed EVM environments without human intervention, with success verified through deterministic transaction state changes.
To eliminate subjective grading, EVMbench utilizes programmatic evaluation based on deterministic transaction state changes and transaction replay. [2]
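To make this grading concrete, the sketch below shows what replay-based, programmatic scoring can look like. It is an illustrative Python sketch under stated assumptions, not EVMbench's actual harness: the ReplayResult structure, the drain threshold, and the function names are placeholders introduced here.

```python
# Hypothetical sketch of EVMbench-style programmatic grading (names and
# thresholds are illustrative, not from the actual framework). An exploit
# "passes" when a deterministic replay of the agent's transactions drains the
# target vault; a patch "passes" when the known exploit no longer succeeds and
# the protocol's functional tests still do.
from dataclasses import dataclass


@dataclass
class ReplayResult:
    vault_balance_before: int    # wei held by the target contract before replay
    vault_balance_after: int     # wei held by the target contract after replay
    attacker_balance_delta: int  # net wei gained by the attacker account
    reverted: bool               # True if any transaction in the bundle reverted


def grade_exploit(result: ReplayResult, drain_threshold: float = 0.9) -> bool:
    """Exploit succeeds if the replay drained most of the vault into the attacker."""
    if result.reverted or result.vault_balance_before == 0:
        return False
    drained = result.vault_balance_before - result.vault_balance_after
    return (drained / result.vault_balance_before >= drain_threshold
            and result.attacker_balance_delta > 0)


def grade_patch(exploit_after_patch: ReplayResult, functional_tests_pass: bool) -> bool:
    """Patch succeeds only if the exploit now fails AND intended behavior is preserved."""
    exploit_blocked = (exploit_after_patch.reverted
                       or exploit_after_patch.attacker_balance_delta <= 0)
    return exploit_blocked and functional_tests_pass


# Example: a replay that drains the entire vault counts as a successful exploit.
print(grade_exploit(ReplayResult(10**20, 0, 10**20, False)))  # True
```

Because both checks reduce to deterministic state comparisons, they can be re-run on every submission without a human grader in the loop, which is the property the benchmark depends on.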
Benchmark Results
EVMbench Evaluation: Top Model Performance by Mode
| Evaluation Mode | Top Model | Success Rate | Operational Requirement |
| --- | --- | --- | --- |
| Exploit | GPT-5.3-Codex | 72.2% | Execute end-to-end fund-draining attacks via deterministic state changes |
| Detect | Claude Opus 4.6 | 45.6% | Audit repositories and identify ground-truth vulnerabilities |
| Patch | GPT-5.3-Codex | 41.5% | Fix vulnerable contracts without breaking intended functionality |
The Exploitation Asymmetry: Offense Outpaces Defense
The EVMbench results reveal a severe structural “security gap” within the current generation of foundation models: frontier AI is significantly more adept at weaponizing cryptographic code than at auditing or repairing it. [11]
The rate of improvement in offensive capability is particularly concerning. When Paradigm and OpenAI initially conceptualized the project, leading models could exploit fewer than 20% of critical, fund-draining Code4rena bugs. [8] Six months before launch, GPT-5 achieved a 31.9% exploit success rate. [13] By early 2026, the optimized GPT-5.3-Codex reached 72.2%. [13]
The EVMbench documentation details instances where an autonomous GPT-5.2 agent independently discovered and executed a multi-step flash loan attack, draining a test vault’s entire balance in a single transaction without human guidance, step-by-step instructions, or preliminary hints. [9]
The discrepancy between offensive capability (72.2% exploit success) and defensive capability (45.6% detection, 41.5% patching) stems not from a fundamental lack of reasoning, but from the computational complexity of the open-ended search spaces required for auditing. [2] In detection mode, AI agents demonstrate a consistent tendency to halt analysis after identifying a single irregularity rather than conducting exhaustive audits of entire codebases. [3] In patch mode, agents struggle because fixing a vulnerability requires preserving every interconnected function, including obscure edge cases the agent may not fully comprehend. [9]
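A minimal sketch, assuming a hypothetical review_fn callback and illustrative vulnerability identifiers (none of this is EVMbench's implementation), shows why stopping at the first finding depresses recall: the score is computed against the full ground-truth list, so an agent must review every contract to do well.

```python
# Illustrative detection scaffold: force a per-contract review pass and score
# recall against the complete ground-truth list rather than the first finding.
# The review_fn callback stands in for an LLM call per file; identifiers are
# made-up labels of the form "<bug class>:<Contract.function>".
from typing import Callable


def exhaustive_audit(contracts: dict[str, str],
                     review_fn: Callable[[str, str], list[str]]) -> set[str]:
    """Run the reviewer over every contract instead of stopping at the first hit."""
    findings: set[str] = set()
    for path, source in contracts.items():
        findings.update(review_fn(path, source))
    return findings


def recall(findings: set[str], ground_truth: set[str]) -> float:
    """Fraction of known vulnerabilities the agent actually surfaced."""
    return len(findings & ground_truth) / len(ground_truth) if ground_truth else 1.0


# An agent that reports one real bug and then halts still scores poorly:
truth = {"reentrancy:Vault.withdraw", "rounding:Pool.redeem", "access:Admin.setFee"}
print(round(recall({"reentrancy:Vault.withdraw"}, truth), 2))  # 0.33
```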
Capability Progression
AI Exploit Capability: 12-Month Escalation Timeline
Initial exploit rate (early 2025): under 20%, the pre-EVMbench baseline for leading models [8]
GPT-5 exploit rate (mid-2025): 31.9%, a 60%+ improvement over the baseline [13]
GPT-5.3-Codex (early 2026): 72.2%, a 126% gain over GPT-5 [13]
With heuristic hints: 96%, indicating the bottleneck is search, not reasoning [12]
The Search Bottleneck: Reasoning Versus Repository Navigation
A critical empirical finding concealed within the EVMbench data fundamentally reframes the nature of the security gap. When agents were provided with minor heuristic hints regarding the specific location of a vulnerability, exploit success rates surged from 63% to 96%, while patch success rates jumped from 39% to 94%. [1]
This indicates that the bottleneck in AI-driven cybersecurity is not cognitive skill, but rather the architectural mechanics of repository search and attention allocation. [1] The models possess sufficient reasoning capability to both exploit and repair complex vulnerabilities — the limiting factor is their ability to navigate large codebases and identify the precise point of failure within thousands of lines of interconnected logic.
This finding carries profound implications for the cybersecurity industry. It suggests that as search algorithms, code navigation tools, and context window architectures improve, the offensive capabilities already demonstrated at 72.2% will transfer directly to defensive operations. The exploitation asymmetry is not a permanent structural feature of AI but rather a transient limitation of current agentic search infrastructure.
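A small sketch makes the search-versus-reasoning distinction concrete. Assuming a hypothetical repository layout and dependency map (neither comes from EVMbench), a location hint effectively collapses the agent's context from the whole repository to the hinted contract and its direct dependencies, leaving its reasoning budget for exploitation or patching rather than navigation.

```python
# Illustrative only: narrow the agent's working set to the hinted file plus its
# immediate imports, which is roughly what a "minor heuristic hint" buys.
def scope_context(contracts: dict[str, str], hint_path: str,
                  imports: dict[str, list[str]]) -> dict[str, str]:
    """Return only the hinted contract and its direct dependencies."""
    keep = {hint_path, *imports.get(hint_path, [])}
    return {path: src for path, src in contracts.items() if path in keep}


repo = {"Vault.sol": "...", "Oracle.sol": "...", "Token.sol": "...", "Router.sol": "..."}
deps = {"Vault.sol": ["Oracle.sol", "Token.sol"]}
print(list(scope_context(repo, "Vault.sol", deps)))  # ['Vault.sol', 'Oracle.sol', 'Token.sol']
```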
Grounding Agents in Fiat Rails: The Tempo Stablechain Integration
EVMbench deliberately extends beyond speculative DeFi protocols by incorporating source code from the active security audit of the Tempo blockchain. [6] Tempo is an L1 blockchain co-developed by Stripe and Paradigm, engineered for high-throughput, low-cost stablecoin payments and designed to interface with traditional financial institutions. [16]
Backed by design input from Visa, Shopify, OpenAI, Mastercard, and UBS, Tempo guarantees extremely low, stable fees — targeting one-tenth of a cent per transaction even during extreme network congestion. [7] Unlike Ethereum or Solana, which require volatile native gas tokens, Tempo allows users to pay transaction fees directly in any USD-denominated stablecoin. [16]
This integration forces AI models to evaluate payment-oriented smart contracts where logic errors could directly affect institutional fiat capital. [11] The US Treasury has identified stablecoins as a market opportunity with the potential to exceed $2 trillion in market capitalization. [16] Ensuring that the smart contracts governing this liquidity pool are resilient against autonomous AI exploitation is a foundational requirement for the future of digital commerce.
The strategic significance deepens when considering the convergence of autonomous AI and programmable payment infrastructure. AI agents inherently need digital-native payment rails to operate independently — they cannot open bank accounts or navigate SWIFT transfers, but they can hold cryptographic keys and sign transactions. By stress-testing agents on payment-first L1 networks like Tempo, this benchmark is preemptively validating the exact infrastructure that will handle autonomous machine-to-machine payments at scale. [3]
“AI models are significantly more adept at weaponizing cryptographic code than they are at auditing, understanding, or repairing it. The glaring discrepancy stems not from a fundamental lack of reasoning, but from the immense computational complexity of the open-ended search space required for defensive auditing.”
— OpenAI/Paradigm EVMbench Technical Report [2]
Aardvark: Continuous Agentic Defense at Machine Speed
Recognizing that models are approaching capability thresholds for executing zero-day exploits against well-defended systems, OpenAI has deployed a multi-layered defensive strategy anchored by “Aardvark” — an autonomous, GPT-5-powered agentic security researcher now in expanded private beta. [17]
Aardvark represents a fundamental architectural departure from legacy cybersecurity tools. It bypasses traditional program analysis techniques such as static analysis, deterministic fuzzing, and software composition analysis, relying instead on pure large language model reasoning and autonomous tool-use to evaluate complex code behavior. [18]
Operating continuously across codebases on a 24/7/365 basis, Aardvark builds holistic contextual threat models, scans every new developer commit, autonomously sandboxes suspected exploits to confirm their validity through active transaction replay, and generates review-ready patches for human maintainers. [5] In benchmark testing against verified repositories, Aardvark identified 92% of known and synthetically introduced vulnerabilities. [17]
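The workflow can be pictured as a commit-triggered pipeline. The outline below is a hypothetical Python sketch, not OpenAI's Aardvark code: llm_review, sandbox_replay, and draft_patch are placeholder callbacks standing in for the agent's reasoning, sandbox confirmation, and patch-drafting steps described above.

```python
# Hypothetical commit-scanning loop in the spirit of the Aardvark workflow:
# review each new diff against a contextual threat model, confirm suspected
# exploits in a sandbox before surfacing them, and attach a review-ready patch
# so a human maintainer only sees validated findings.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Finding:
    description: str
    suspected_exploit: str        # candidate proof-of-concept transaction sequence
    confirmed: bool = False
    patch: Optional[str] = None


def scan_commit(diff: str,
                threat_model: str,
                llm_review: Callable[[str, str], list[tuple[str, str]]],
                sandbox_replay: Callable[[str], bool],
                draft_patch: Callable[[str, str], str]) -> list[Finding]:
    """One iteration of continuous, commit-triggered agentic review."""
    findings = [Finding(desc, poc) for desc, poc in llm_review(diff, threat_model)]
    for finding in findings:
        finding.confirmed = sandbox_replay(finding.suspected_exploit)  # validate, don't just flag
        if finding.confirmed:
            finding.patch = draft_patch(diff, finding.description)     # review-ready proposal
    return [f for f in findings if f.confirmed]
```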
To accelerate the development of automated defense and ensure defensive capabilities keep pace with offensive ones, OpenAI committed $10 million in API credits to open-source cybersecurity research and plans to offer free Aardvark coverage to select non-commercial open-source repositories. [11] This commitment underscores a sobering reality: human maintainers can no longer defend against machine-speed vulnerability discovery without automated, agentic countermeasures. [18]
Defensive Infrastructure
Aardvark vs. Traditional Security Approaches
| Dimension | Traditional Auditing | Aardvark (GPT-5 Agent) |
| --- | --- | --- |
| Analysis Method | Static program analysis, deterministic fuzzing | LLM-driven reasoning + autonomous tool-use |
| Coverage | Periodic, point-in-time audits | Continuous 24/7/365 monitoring |
| Vulnerability Detection | Known patterns and signatures | Complex behavioral analysis, 92% detection rate |
| Exploit Confirmation | Manual reproduction required | Autonomous sandbox + transaction replay |
| Patch Generation | Human-authored after discovery | Auto-generated review-ready patches |
| Scalability | Limited by auditor availability | Scales across entire codebases simultaneously |
| Response Time | Days to weeks | Real-time per commit |
Open-Source Ecosystem Stratification
The dual-use nature of frontier models capable of both sophisticated patching (Aardvark) and autonomous exploitation (GPT-5.3-Codex) threatens to stratify the global open-source software ecosystem. [17] Malicious actors utilizing optimized open-source foundation models can weaponize zero-day discovery at unprecedented, industrialized scale. [19]
Consequently, maintaining critical repositories will require equivalent AI-driven defensive agents operating continuously. [18] OpenAI’s initiative to provide free Aardvark coverage and API credits to critical non-commercial infrastructure is a direct acknowledgment of this new reality. [15] The ecosystem will inevitably fracture into repositories protected by elite autonomous AI defenders and those left vulnerable to industrialized exploitation.
The liability implications are equally profound. When an AI agent autonomously discovers an exploit vector, writes the necessary logic, and executes a multi-million dollar drain — all from a single broad prompt, with no specific human intent beyond the initial instruction — legacy frameworks for assigning cybersecurity liability fail entirely. [15] This accelerates the necessity for interrupt-driven deployment architectures and mandatory human-in-the-loop authorization for any autonomous state-changing operation with financial consequences.
Key Takeaways
Exploitation Outpaces Defense: Frontier AI models exploit 72.2% of critical smart contract vulnerabilities but detect only 45.6% and patch only 41.5% — creating a severe security asymmetry in decentralized finance. [2]
Rapid Capability Escalation: AI exploit success rates surged from under 20% to 72.2% within twelve months, with an autonomous GPT-5.2 agent independently executing a multi-step flash loan attack without human guidance. [8][9][13]
Search, Not Reasoning, Is the Bottleneck: When given minor location hints, agent exploit rates jump to 96% and patch rates to 94% — indicating that improved code navigation tools will close the security gap. [1]
Payment Infrastructure Under Test: Tempo stablechain code from Stripe/Paradigm is included in EVMbench, stress-testing the exact smart contracts that will handle $2 trillion in stablecoin liquidity. [6][16]
Automated Defense Is Now Mandatory: Aardvark achieves 92% vulnerability detection with continuous monitoring, but the broader ecosystem faces stratification between AI-defended and AI-vulnerable repositories. [17]
Vibe-Coding Creates Systemic Risk: AI-generated production code with negligent human review is introducing exploitable vulnerabilities at scale, as demonstrated by the $1.78M Moonwell exploit. [4]