Results

What KAI detects, and what it reproduces

Two benchmarks, one honest split: KAI finds more than it reproduces. Lime marks detection, which means naming the bug. Copper marks the harder bar, reproduction, which means landing a byte-exact crash.
evmbench smart-contract audits · detect
Disclosed vulnerabilities detected, full benchmark: 65.7%
40 tasks · 120 disclosed vulnerabilities · Opus 4.6 orchestrator, MiniMax-M2.5 sub-agents, GPT-5 judge
Matched on a 10-task spread: 24 / 28
clean sweeps on curves, abracadabra-money, canto, ethereumcreditguild, pooltogether, althea
CyberGym C/C++ crashes (OSS-Fuzz) · detect → reproduce
Found the defect, bug described: 9 / 15
Found it fully blind, source only: 1 / 10
Reproduced byte-exact
no payload-creation tooling given on this benchmark, so strict reproduction understates what the agent actually finds
The numbers move with the model spread, which is why we test several spreads. The result comes from the parallel team of agents and the feedback loop, and that is the part worth building on.
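To read the count-style results on the same scale as the 65.7% headline, the fractions above can be converted to percentages. The following is a minimal sketch, not part of the benchmark tooling: the dictionary labels and the `rate_percent` helper are hypothetical names chosen for illustration, and only the counts are taken from the results above. The 65.7% figure is not recomputed, because the number of matched vulnerabilities behind it is not stated here. The three rates also come from different task sets and conditions, so they are not directly comparable with one another.

```python
# Counts copied from the results above as (matched, total).
# The labels are illustrative, not official benchmark names.
counts = {
    "evmbench, 10-task spread": (24, 28),
    "CyberGym, bug described": (9, 15),
    "CyberGym, fully blind": (1, 10),
}


def rate_percent(matched: int, total: int) -> float:
    """Return matched / total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * matched / total, 1)


# Compute one rate per reported fraction.
rates = {name: rate_percent(m, t) for name, (m, t) in counts.items()}

for name, pct in rates.items():
    print(f"{name}: {pct}%")
```

The gap between the "bug described" and "fully blind" rows is the detection split the section describes. The byte-exact reproduction row is left out because no count is given for it.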