
Zero Solidity Specialization, #1 on EVMBench

Kai just scored 64.2% Detect Recall on EVMBench, OpenAI's benchmark for real-world smart contract vulnerability detection. That is nearly 19 percentage points ahead of the next best system (Claude Opus 4.6 + Claude Code at 45.6%) and 25 points ahead of GPT-5.3 Codex. Across 40 real audit contests and 120 known vulnerabilities, Kai identified $74,707 in bounty-eligible findings.

Kai is a general-purpose autonomous security agent covering 10+ programming languages and 20+ frameworks. It has no Solidity fine-tuning, no hand-crafted prompts for EVM opcodes, and no special-case heuristics for DeFi protocols. It runs the same multi-phase pipeline on a Rust codebase as on a Solidity protocol. It beat every system on the leaderboard not by knowing more about smart contracts, but by being better at the thing that actually matters in security: verifying that a finding is real before reporting it.

What EVMBench Actually Tests

EVMBench is a frontier evaluation benchmark created by OpenAI that tests whether AI agents can discover real vulnerabilities in Ethereum smart contracts. It is built from 40 real audit contests (Code4rena and Sherlock) containing 120 vulnerabilities that were found and validated by human security researchers in the wild. These are the kinds of issues that slip past teams of experienced auditors: subtle rounding errors in withdrawal batches, dimension mismatches in multi-hop oracle pricing, economic invariant violations that only surface under specific protocol states.

The benchmark measures two things. Detect Recall is the fraction of known vulnerabilities the agent finds. Detect Award is the total bounty value of those findings, weighted by real-world severity. High recall with low-severity findings is not impressive. You need to find the bugs that actually matter.
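Concretely, the two metrics reduce to a recall ratio and a bounty-weighted sum. The sketch below is our reading of those definitions, not the benchmark's reference implementation, and the vulnerability IDs and bounty values are invented for illustration:

```python
# Hypothetical ground truth: known vulnerability ID -> bounty value (USD).
known = {"H-01": 20252, "M-02": 5000, "M-03": 1200}

# Hypothetical set of vulnerability IDs the agent actually reported.
found = {"H-01", "M-03"}

# Detect Recall: fraction of known vulnerabilities the agent found.
detect_recall = len(found & known.keys()) / len(known)

# Detect Award: total bounty value of the found vulnerabilities,
# which is how severity weighting enters the score.
detect_award = sum(bounty for vid, bounty in known.items() if vid in found)

print(f"{detect_recall:.1%}")  # 66.7%
print(detect_award)            # 21452
```

Because the award sums real bounty values, missing one high-severity bug costs far more than missing several low-severity ones.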

The Numbers

Metric                   Score
Detect Recall            64.2%
Detect Award             $74,707
Audits Evaluated         40
Known Vulnerabilities    120

For context: the next best system on the leaderboard is Claude Opus 4.6 running with Claude Code, which achieved 45.6% recall. That is nearly 19 percentage points behind Kai. GPT-5.3 Codex reached 39.2%. Most other systems scored below 30%.

The full leaderboard is available on our benchmark page.

Why Verification Is the Whole Game

Here is something most people get wrong about AI security tools: finding candidate vulnerabilities is not hard. Any sufficiently large language model can scan a codebase and produce a list of things that look suspicious. The problem is that most of those findings are wrong.

In real audit contests, false positives are expensive. They waste reviewer time, erode trust, and bury the real issues under noise. The difference between a useful security tool and an annoying one is not how many candidates it generates. It is how many of those candidates are actually real.

This is where Kai's architecture diverges from the "prompt a model and hope" approach. Kai does not submit candidates. It submits verified findings.

The pipeline has four phases, and the third one is the key:

Research. The agent maps the target protocol's architecture: what DeFi category it belongs to (lending, DEX, options, etc.), what patterns are common in similar protocols, what classes of vulnerabilities have historically appeared in this design space. This is not a generic scan. It is a targeted investigation informed by the structure of the code.

Analysis. Multiple focused passes run across the codebase, each targeting a different vulnerability class: reentrancy, rounding errors, access control, oracle manipulation, economic invariants, and more. Each pass is deep rather than broad.

Verification. This is the step that changes everything. A dedicated verifier agent independently traces each candidate finding through the source code, confirming the bug is reachable and the impact is real. It is not checking syntax or running a linter. It is reasoning about execution paths, state transitions, and economic consequences. Candidates that do not survive verification are filtered out before any report is generated.

Most AI security approaches skip this entirely, or treat it as a post-processing filter. Kai treats it as a first-class reasoning stage with its own agent, its own context, and its own chain of thought.

Exploit PoC. For verified findings, the agent writes a concrete proof-of-concept test (typically a Foundry test) that demonstrates the vulnerability end to end.
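The four phases can be sketched as a simple orchestration loop. Everything below is illustrative: the phase names come from the pipeline described above, but the function signatures, the Finding type, and the stubbed scan/trace/PoC helpers are assumptions, not Kai's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    title: str
    vuln_class: str
    verified: bool = False
    poc: str = ""

VULN_CLASSES = ["reentrancy", "rounding", "access-control",
                "oracle-manipulation", "economic-invariants"]

def research(codebase):
    # Phase 1: map the protocol's architecture and likely bug classes.
    return {"category": "lending", "focus_classes": VULN_CLASSES}

def scan_for_class(codebase, vuln_class):
    # Stub for a deep analysis pass; a real agent reasons over the code here.
    return [Finding(f"candidate in {vuln_class}", vuln_class)]

def trace_execution_paths(codebase, candidate):
    # Stub verifier: pretend only the rounding candidate proves reachable.
    return candidate.vuln_class == "rounding"

def write_foundry_poc(codebase, finding):
    # Stub for Phase 4: emit a concrete end-to-end test.
    return f"// Foundry test demonstrating: {finding.title}"

def run_pipeline(codebase):
    context = research(codebase)                       # Phase 1: Research
    candidates = []
    for vuln_class in context["focus_classes"]:        # Phase 2: Analysis
        candidates.extend(scan_for_class(codebase, vuln_class))
    verified = [c for c in candidates                  # Phase 3: Verification
                if trace_execution_paths(codebase, c)]
    for finding in verified:
        finding.verified = True
        finding.poc = write_foundry_poc(codebase, finding)  # Phase 4: PoC
    return verified
```

The structural point is that verification sits between analysis and reporting: candidates that fail the trace never reach the PoC stage, let alone the report.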

Example: Wildcat Protocol Withdrawal Batch Rounding

One of Kai's findings in the Wildcat Protocol audit illustrates the depth of this pipeline. The agent discovered that the withdrawal batch payment system uses half-up rounding (rayDiv) when converting available liquidity, but floor rounding when converting burned amounts back. Once the market's scaleFactor exceeds 2x RAY, normalizedAmountPaid exceeds availableLiquidity, letting withdrawers consume reserved funds.

The agent traced this through WildcatMarketBase.sol, identified the rounding composition as the root cause, and produced a Foundry test that deposited funds, borrowed to create delinquency, warped time until the scale factor exceeded the threshold, and triggered the withdrawal batch to verify the accounting corruption. This finding was worth $20,252.
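The rounding composition can be reproduced with plain integer arithmetic. This is a simplified model, not the protocol's code: the half-up and floor helpers follow common ray-math conventions, and the liquidity figure is deliberately tiny so the off-by-one is visible.

```python
RAY = 10**27  # fixed-point scale used by ray math

def ray_div_half_up(x, y):
    # rayDiv-style division that rounds half-up (to nearest)
    return (x * RAY + y // 2) // y

def ray_mul_floor(x, y):
    # floor rounding when converting scaled amounts back to normalized
    return x * y // RAY

scale_factor = 3 * RAY       # past the 2x RAY threshold
available_liquidity = 5      # normalized units (tiny, for illustration)

# normalized -> scaled rounds half-up: (5 + 1.5) // 3 = 2 scaled units...
scaled = ray_div_half_up(available_liquidity, scale_factor)

# ...but scaled -> normalized rounds down, and the two do not cancel:
# 2 scaled units * 3x scale factor = 6 normalized units paid out of 5.
paid = ray_mul_floor(scaled, scale_factor)

assert paid > available_liquidity  # withdrawers consume reserved funds
```

Below 2x RAY the half-up bump is worth less than one normalized unit and the floor absorbs it; past that threshold a single rounded-up scaled unit converts back to more than the remainder that was discarded, so the payout overshoots the available liquidity.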

Example: Noya Protocol Multi-Hop Oracle Mispricing

In the Noya Protocol audit, Kai found that the multi-hop pricing route in NoyaValueOracle.sol always passes the original asset to each hop instead of the intermediate token from the previous hop. For routes with two or more hops, every price conversion after the first produces dimension-mismatched amounts, meaning the oracle returns fundamentally wrong prices.

The agent deployed mock oracles with known prices (A to B: 2x, B to C: 3x, C to D: 4x), expected 24x for A to D, and confirmed the buggy code returned a completely different value. Any vault relying on multi-hop price routes could be drained by depositing mispriced tokens and withdrawing at the correct value.
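The dimension mismatch is easy to demonstrate with a toy price registry. Everything here is hypothetical: the pair prices, the registry shape, and the route function are illustrative rather than Noya's actual interfaces, and the direct A-to-C and A-to-D entries exist only so the buggy lookup has something to return.

```python
# Hypothetical registry: price of 1 unit of `base` in units of `quote`.
prices = {
    ("A", "B"): 2, ("B", "C"): 3, ("C", "D"): 4,  # the intended hop oracles
    ("A", "C"): 5, ("A", "D"): 7,                 # unrelated direct pairs
}

def route_value(amount, route, buggy=False):
    # Convert `amount` of route[0] along the route, one hop at a time.
    for prev, hop in zip(route, route[1:]):
        # Bug: always quoting against the original asset means an amount
        # denominated in `prev` gets multiplied by a (route[0], hop) price,
        # a dimension mismatch for every hop after the first.
        base = route[0] if buggy else prev
        amount *= prices[(base, hop)]
    return amount

correct = route_value(1, ["A", "B", "C", "D"])              # 2 * 3 * 4 = 24
wrong = route_value(1, ["A", "B", "C", "D"], buggy=True)    # 2 * 5 * 7 = 70
assert wrong != correct
```

With real oracle feeds the wrong lookup may not even resolve, but whenever it does, the vault prices assets with a product of unrelated exchange rates, which is exactly the drain condition the finding describes.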

What This Means

EVMBench results demonstrate that Kai's autonomous security pipeline can find real vulnerabilities at a rate and depth that significantly exceeds what individual model calls achieve, regardless of which frontier model is used. The structured multi-phase approach with dedicated verification catches bugs that require deep protocol understanding and multi-step reasoning.

Explore the full benchmark results, leaderboard comparisons, and more example findings on our benchmark page.

Copyright © 2026 DRIA. All Rights Reserved.