EVMbench and the Exploitation Asymmetry: How AI Agents Are Reshaping Smart Contract Security
Exzil Calanza
Cybersecurity & Decentralized Finance
OpenAI and Paradigm’s open-source EVMbench framework systematically quantifies a severe security gap: frontier AI models exploit 72.2% of critical smart contract vulnerabilities while detecting only 45.6% — fundamentally altering the threat calculus for decentralized financial infrastructure securing over $100 billion in cryptographic assets.
EVMbench Benchmark Results
AI Agent Performance Across Smart Contract Security Tasks
Exploit Success Rate (GPT-5.3-Codex): 72.2%, up from under 20% in 12 months [2]
Detection Rate (Claude Opus 4.6): 45.6%, 26.6 points below the exploit rate [2]
Patch Success Rate (GPT-5.3-Codex): 41.5%, the hardest task, since patches must preserve all functionality [2]
Real-World Vulnerabilities Tested: 120, drawn from 40 real security audits [6]
The $100 Billion Attack Surface: Smart Contract Vulnerability at Scale
The EVMbench framework emerges at a critical inflection point for decentralized finance. Public blockchains have evolved from experimental distributed ledgers into mature financial infrastructure. These networks routinely secure over $100 billion in cryptographic assets, while stablecoins settle trillions of dollars in transactional value monthly — operating at scales directly comparable to the world’s largest traditional payment networks. [2] The core appeal of decentralized finance lies in the immutability and execution speed of smart contracts, which enable permissionless transactions. Those same properties, however, mean that code vulnerabilities have immediate, catastrophic, and effectively irreversible financial consequences. [2]
The magnitude of this threat is quantified by recent loss data. In 2025 alone, malicious actors drained an estimated $3.4 billion from blockchain platforms, with three sophisticated breaches accounting for nearly 70% of total recorded losses. [3] The exploitation of the Bybit exchange, resulting in approximately $1.5 billion in stolen Ethereum tokens, cemented 2025 as one of the most devastating years for cryptographic theft in history. [3]
Compounding this structural vulnerability is the acceleration of “vibe-coding” — a development methodology wherein software is rapidly generated, iterated, and deployed by autonomous AI coding assistants with extremely thin human review layers. [4] This paradigm democratizes software creation while simultaneously introducing critical security vulnerabilities that can destroy decentralized protocols. A $1.78 million exploit on the Moonwell protocol was directly attributed to production Solidity code generated with AI assistance, illustrating the risk when AI writes production code and human maintainers approve it without rigorous auditing. [4]
EVMbench Architecture: A Rigorous, Real-World Evaluation Framework
To systematically quantify the dual-use risk of frontier AI models operating within these unforgiving environments, OpenAI partnered with the crypto-focused investment firm Paradigm to build EVMbench — a rigorous, open-source benchmarking framework designed specifically to evaluate AI agent capabilities in autonomous vulnerability detection, patching, and exploitation within the Ethereum Virtual Machine (EVM). [6]
EVMbench deliberately avoids synthetic datasets, instead curating 120 complex vulnerabilities from 40 real-world security audits. [6] The majority of these vulnerabilities were sourced from competitive open auditing platforms such as Code4rena, ensuring AI agents are tested against the exact types of subtle, logic-based flaws that professional human auditors struggle to identify. [6]
The framework requires AI agents to operate across three distinct evaluation modes within a containerized, reproducible local Ethereum execution environment:
Detect Mode: Audit expansive smart contract repositories and identify specific ground-truth vulnerabilities, scored on strict recall metrics.
Patch Mode: Modify vulnerable contracts to eliminate exploit vectors without breaking compilation, edge cases, or intended protocol functionality.
Exploit Mode: Execute end-to-end fund-draining attacks in sandboxed EVM environments without human intervention, with success verified through deterministic transaction state changes.
To eliminate subjective grading, EVMbench utilizes programmatic evaluation based on deterministic transaction state changes and transaction replay. [2]
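To make this grading concrete, the sketch below shows what replay-based, programmatic scoring can look like. It is an illustrative Python sketch under stated assumptions, not EVMbench's actual harness: the ReplayResult structure, the drain threshold, and the function names are placeholders introduced here.

```python
# Hypothetical sketch of EVMbench-style programmatic grading (names and
# thresholds are illustrative, not from the actual framework). An exploit
# "passes" when a deterministic replay of the agent's transactions drains the
# target vault; a patch "passes" when the known exploit no longer succeeds and
# the protocol's functional tests still do.
from dataclasses import dataclass


@dataclass
class ReplayResult:
    vault_balance_before: int    # wei held by the target contract before replay
    vault_balance_after: int     # wei held by the target contract after replay
    attacker_balance_delta: int  # net wei gained by the attacker account
    reverted: bool               # True if any transaction in the bundle reverted


def grade_exploit(result: ReplayResult, drain_threshold: float = 0.9) -> bool:
    """Exploit succeeds if the replay drained most of the vault into the attacker."""
    if result.reverted or result.vault_balance_before == 0:
        return False
    drained = result.vault_balance_before - result.vault_balance_after
    return (drained / result.vault_balance_before >= drain_threshold
            and result.attacker_balance_delta > 0)


def grade_patch(exploit_after_patch: ReplayResult, functional_tests_pass: bool) -> bool:
    """Patch succeeds only if the exploit now fails AND intended behavior is preserved."""
    exploit_blocked = (exploit_after_patch.reverted
                       or exploit_after_patch.attacker_balance_delta <= 0)
    return exploit_blocked and functional_tests_pass


# Example: a replay that drains the entire vault counts as a successful exploit.
print(grade_exploit(ReplayResult(10**20, 0, 10**20, False)))  # True
```

Because both checks reduce to deterministic state comparisons, they can be re-run on every submission without a human grader in the loop, which is the property the benchmark depends on.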
Benchmark Results
EVMbench Evaluation: Top Model Performance by Mode
| Evaluation Mode | Top Model | Success Rate | Operational Requirement |
| --- | --- | --- | --- |
| Exploit | GPT-5.3-Codex | 72.2% | Execute end-to-end fund-draining attacks via deterministic state changes |
| Detect | Claude Opus 4.6 | 45.6% | Audit repositories and identify ground-truth vulnerabilities |
| Patch | GPT-5.3-Codex | 41.5% | Fix vulnerable contracts without breaking intended functionality |
The Exploitation Asymmetry: Offense Outpaces Defense
The EVMbench results reveal a severe structural “security gap” within the current generation of foundation models: frontier AI is significantly more adept at weaponizing cryptographic code than at auditing or repairing it. [11]
The rate of improvement in offensive capability is particularly concerning. When Paradigm and OpenAI initially conceptualized the project, leading models could exploit fewer than 20% of critical, fund-draining Code4rena bugs. [8] Six months before launch, GPT-5 achieved a 31.9% exploit success rate. [13] By early 2026, the optimized GPT-5.3-Codex reached 72.2%. [13]
The EVMbench documentation details instances where an autonomous GPT-5.2 agent independently discovered and executed a multi-step flash loan attack, draining a test vault’s entire balance in a single transaction without human guidance, step-by-step instructions, or preliminary hints. [9]
The discrepancy between offensive capability (72.2% exploit success) and defensive capability (45.6% detection, 41.5% patching) stems not from a fundamental lack of reasoning, but from the computational complexity of the open-ended search spaces required for auditing. [2] In detection mode, AI agents demonstrate a consistent tendency to halt analysis after identifying a single irregularity rather than conducting exhaustive audits of entire codebases. [3] In patch mode, agents struggle because fixing a vulnerability requires preserving every interconnected function, including obscure edge cases the agent may not fully comprehend. [9]
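A minimal sketch, assuming a hypothetical review_fn callback and illustrative vulnerability identifiers (none of this is EVMbench's implementation), shows why stopping at the first finding depresses recall: the score is computed against the full ground-truth list, so an agent must review every contract to do well.

```python
# Illustrative detection scaffold: force a per-contract review pass and score
# recall against the complete ground-truth list rather than the first finding.
# The review_fn callback stands in for an LLM call per file; identifiers are
# made-up labels of the form "<bug class>:<Contract.function>".
from typing import Callable


def exhaustive_audit(contracts: dict[str, str],
                     review_fn: Callable[[str, str], list[str]]) -> set[str]:
    """Run the reviewer over every contract instead of stopping at the first hit."""
    findings: set[str] = set()
    for path, source in contracts.items():
        findings.update(review_fn(path, source))
    return findings


def recall(findings: set[str], ground_truth: set[str]) -> float:
    """Fraction of known vulnerabilities the agent actually surfaced."""
    return len(findings & ground_truth) / len(ground_truth) if ground_truth else 1.0


# An agent that reports one real bug and then halts still scores poorly:
truth = {"reentrancy:Vault.withdraw", "rounding:Pool.redeem", "access:Admin.setFee"}
print(round(recall({"reentrancy:Vault.withdraw"}, truth), 2))  # 0.33
```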
Capability Progression
AI Exploit Capability: 12-Month Escalation Timeline
Initial exploit rate (early 2025): under 20%, the pre-EVMbench baseline for leading models [8]
GPT-5 exploit rate (mid-2025): 31.9%, a 60%+ improvement over the baseline [13]
GPT-5.3-Codex (early 2026): 72.2%, a 126% gain over GPT-5 [13]
With heuristic hints: 96%, indicating the bottleneck is search, not reasoning [12]
The Search Bottleneck: Reasoning Versus Repository Navigation
A critical empirical finding concealed within the EVMbench data fundamentally reframes the nature of the security gap. When agents were provided with minor heuristic hints regarding the specific location of a vulnerability, exploit success rates surged from 63% to 96%, while patch success rates jumped from 39% to 94%. [1]
This indicates that the bottleneck in AI-driven cybersecurity is not cognitive skill, but rather the architectural mechanics of repository search and attention allocation. [1] The models possess sufficient reasoning capability to both exploit and repair complex vulnerabilities — the limiting factor is their ability to navigate large codebases and identify the precise point of failure within thousands of lines of interconnected logic.
This finding carries profound implications for the cybersecurity industry. It suggests that as search algorithms, code navigation tools, and context window architectures improve, the offensive capabilities already demonstrated at 72.2% will transfer directly to defensive operations. The exploitation asymmetry is not a permanent structural feature of AI but rather a transient limitation of current agentic search infrastructure.
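A small sketch makes the search-versus-reasoning distinction concrete. Assuming a hypothetical repository layout and dependency map (neither comes from EVMbench), a location hint effectively collapses the agent's context from the whole repository to the hinted contract and its direct dependencies, leaving its reasoning budget for exploitation or patching rather than navigation.

```python
# Illustrative only: narrow the agent's working set to the hinted file plus its
# immediate imports, which is roughly what a "minor heuristic hint" buys.
def scope_context(contracts: dict[str, str], hint_path: str,
                  imports: dict[str, list[str]]) -> dict[str, str]:
    """Return only the hinted contract and its direct dependencies."""
    keep = {hint_path, *imports.get(hint_path, [])}
    return {path: src for path, src in contracts.items() if path in keep}


repo = {"Vault.sol": "...", "Oracle.sol": "...", "Token.sol": "...", "Router.sol": "..."}
deps = {"Vault.sol": ["Oracle.sol", "Token.sol"]}
print(list(scope_context(repo, "Vault.sol", deps)))  # ['Vault.sol', 'Oracle.sol', 'Token.sol']
```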
Grounding Agents in Fiat Rails: The Tempo Stablechain Integration
EVMbench deliberately extends beyond speculative DeFi protocols by incorporating source code from the active security audit of the Tempo blockchain. [6] Tempo is an L1 blockchain co-developed by Stripe and Paradigm, engineered for high-throughput, low-cost stablecoin payments and designed to interface with traditional financial institutions. [16]
Backed by design input from Visa, Shopify, OpenAI, Mastercard, and UBS, Tempo guarantees extremely low, stable fees — targeting one-tenth of a cent per transaction even during extreme network congestion. [7] Unlike Ethereum or Solana, which require volatile native gas tokens, Tempo allows users to pay transaction fees directly in any USD-denominated stablecoin. [16]
This integration forces AI models to evaluate payment-oriented smart contracts where logic errors could directly affect institutional fiat capital. [11] The US Treasury has identified stablecoins as a market opportunity with the potential to exceed $2 trillion in market capitalization. [16] Ensuring that the smart contracts governing this liquidity pool are resilient against autonomous AI exploitation is a foundational requirement for the future of digital commerce.
The strategic significance deepens when considering the convergence of autonomous AI and programmable payment infrastructure. AI agents inherently need digital-native payment rails to operate independently — they cannot open bank accounts or navigate SWIFT transfers, but they can hold cryptographic keys and sign transactions. By stress-testing agents on payment-first L1 networks like Tempo, this benchmark is preemptively validating the exact infrastructure that will handle autonomous machine-to-machine payments at scale. [3]
“AI models are significantly more adept at weaponizing cryptographic code than they are at auditing, understanding, or repairing it. The glaring discrepancy stems not from a fundamental lack of reasoning, but from the immense computational complexity of the open-ended search space required for defensive auditing.”
— OpenAI/Paradigm EVMbench Technical Report [2]
Aardvark: Continuous Agentic Defense at Machine Speed
Recognizing that models are approaching capability thresholds for executing zero-day exploits against well-defended systems, OpenAI has deployed a multi-layered defensive strategy anchored by “Aardvark” — an autonomous, GPT-5-powered agentic security researcher now in expanded private beta. [17]
Aardvark represents a fundamental architectural departure from legacy cybersecurity tools. It bypasses traditional program analysis techniques such as static analysis, deterministic fuzzing, and software composition analysis, relying instead on pure large language model reasoning and autonomous tool-use to evaluate complex code behavior. [18]
Operating continuously across codebases on a 24/7/365 basis, Aardvark builds holistic contextual threat models, scans every new developer commit, autonomously sandboxes suspected exploits to confirm their validity through active transaction replay, and generates review-ready patches for human maintainers. [5] In benchmark testing against verified repositories, Aardvark identified 92% of known and synthetically introduced vulnerabilities. [17]
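The workflow can be pictured as a commit-triggered pipeline. The outline below is a hypothetical Python sketch, not OpenAI's Aardvark code: llm_review, sandbox_replay, and draft_patch are placeholder callbacks standing in for the agent's reasoning, sandbox confirmation, and patch-drafting steps described above.

```python
# Hypothetical commit-scanning loop in the spirit of the Aardvark workflow:
# review each new diff against a contextual threat model, confirm suspected
# exploits in a sandbox before surfacing them, and attach a review-ready patch
# so a human maintainer only sees validated findings.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Finding:
    description: str
    suspected_exploit: str        # candidate proof-of-concept transaction sequence
    confirmed: bool = False
    patch: Optional[str] = None


def scan_commit(diff: str,
                threat_model: str,
                llm_review: Callable[[str, str], list[tuple[str, str]]],
                sandbox_replay: Callable[[str], bool],
                draft_patch: Callable[[str, str], str]) -> list[Finding]:
    """One iteration of continuous, commit-triggered agentic review."""
    findings = [Finding(desc, poc) for desc, poc in llm_review(diff, threat_model)]
    for finding in findings:
        finding.confirmed = sandbox_replay(finding.suspected_exploit)  # validate, don't just flag
        if finding.confirmed:
            finding.patch = draft_patch(diff, finding.description)     # review-ready proposal
    return [f for f in findings if f.confirmed]
```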
To accelerate the development of automated defense and ensure defensive capabilities keep pace with offensive ones, OpenAI committed $10 million in API credits to open-source cybersecurity research and plans to offer free Aardvark coverage to select non-commercial open-source repositories. [11] This commitment underscores a sobering reality: human maintainers can no longer defend against machine-speed vulnerability discovery without automated, agentic countermeasures. [18]
Defensive Infrastructure
Aardvark vs. Traditional Security Approaches
| Dimension | Traditional Auditing | Aardvark (GPT-5 Agent) |
| --- | --- | --- |
| Analysis Method | Static program analysis, deterministic fuzzing | LLM-driven reasoning + autonomous tool-use |
| Coverage | Periodic, point-in-time audits | Continuous 24/7/365 monitoring |
| Vulnerability Detection | Known patterns and signatures | Complex behavioral analysis, 92% detection rate |
| Exploit Confirmation | Manual reproduction required | Autonomous sandbox + transaction replay |
| Patch Generation | Human-authored after discovery | Auto-generated review-ready patches |
| Scalability | Limited by auditor availability | Scales across entire codebases simultaneously |
| Response Time | Days to weeks | Real-time per commit |
Open-Source Ecosystem Stratification
The dual-use nature of frontier models capable of both sophisticated patching (Aardvark) and autonomous exploitation (GPT-5.3-Codex) threatens to stratify the global open-source software ecosystem. [17] Malicious actors utilizing optimized open-source foundation models can weaponize zero-day discovery at unprecedented, industrialized scale. [19]
Consequently, maintaining critical repositories will require equivalent AI-driven defensive agents operating continuously. [18] OpenAI’s initiative to provide free Aardvark coverage and API credits to critical non-commercial infrastructure is a direct acknowledgment of this new reality. [15] The ecosystem will inevitably fracture into repositories protected by elite autonomous AI defenders and those left vulnerable to industrialized exploitation.
The liability implications are equally profound. When an AI agent autonomously discovers an exploit vector, writes the necessary logic, and executes a multi-million dollar drain — all from a single broad prompt, with no specific human intent beyond the initial instruction — legacy frameworks for assigning cybersecurity liability fail entirely. [15] This accelerates the necessity for interrupt-driven deployment architectures and mandatory human-in-the-loop authorization for any autonomous state-changing operation with financial consequences.
Key Takeaways
Exploitation Outpaces Defense: Frontier AI models exploit 72.2% of critical smart contract vulnerabilities but detect only 45.6% and patch only 41.5% — creating a severe security asymmetry in decentralized finance. [2]
Rapid Capability Escalation: AI exploit success rates surged from under 20% to 72.2% within twelve months, with an autonomous GPT-5.2 agent independently executing a multi-step flash loan attack without human guidance. [8][9][13]
Search, Not Reasoning, Is the Bottleneck: When given minor location hints, agent exploit rates jump to 96% and patch rates to 94% — indicating that improved code navigation tools will close the security gap. [1]
Payment Infrastructure Under Test: Tempo stablechain code from Stripe/Paradigm is included in EVMbench, stress-testing the exact smart contracts that will handle $2 trillion in stablecoin liquidity. [6][16]
Automated Defense Is Now Mandatory: Aardvark achieves 92% vulnerability detection with continuous monitoring, but the broader ecosystem faces stratification between AI-defended and AI-vulnerable repositories. [17]
Vibe-Coding Creates Systemic Risk: AI-generated production code with negligent human review is introducing exploitable vulnerabilities at scale, as demonstrated by the $1.78M Moonwell exploit. [4]