Results

What KAI detects, and what it reproduces

Two benchmarks, one honest split: KAI finds more than it reproduces. Lime marks detection: naming the bug. Copper marks the harder bar: landing a byte-exact crash.

evmbench smart-contract audits · detect

Disclosed vulnerabilities detected, full benchmark: 65.7%

40 tasks · 120 disclosed vulnerabilities · Opus 4.6 orchestrator, MiniMax-M2.5 sub-agents, GPT-5 judge

Matched on a 10-task spread: 24 / 28

clean sweeps on curves, abracadabra-money, canto, ethereumcreditguild, pooltogether, althea

CyberGym C/C++ crashes (OSS-Fuzz) · detect → reproduce

Found the defect, bug described: 9 / 15

Found it fully blind, source only: 1 / 10

Reproduced byte-exact: 1×

no payload-creation tooling was provided on this benchmark, so the strict reproduction count understates what the agent actually finds

The numbers move with the mix of models, which is why we test several combinations. The result comes from the parallel team and the feedback loop, and that is the part worth building on.
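The results mix a percentage with raw counts, so a common scale can help when comparing them. The following is a minimal sketch that converts the three reported counts into rates. The labels and the `rate_percent` helper are illustrative names, not part of KAI or of either benchmark, and the counts are copied from the figures above.

```python
# Counts copied from the results above as (hits, total).
# The labels are shorthand chosen for this sketch.
COUNTS = {
    "evmbench, 10-task spread": (24, 28),
    "CyberGym, bug described": (9, 15),
    "CyberGym, fully blind": (1, 10),
}


def rate_percent(hits: int, total: int) -> float:
    """Return hits/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= hits <= total:
        raise ValueError("hits must be between 0 and total")
    return round(100 * hits / total, 1)


# Rate for each reported count, keyed by the same labels.
RATES = {label: rate_percent(h, t) for label, (h, t) in COUNTS.items()}

for label, pct in RATES.items():
    print(f"{label}: {pct}%")
```

With these inputs the three rates come out at 85.7%, 60.0%, and 10.0%. The samples are small (10 to 28 items), so a single additional hit or miss moves each rate by several points.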