evmbench
smart-contract audits · detect
Disclosed vulnerabilities detected, full benchmark
65.7%
40 tasks · 120 disclosed vulnerabilities · Opus 4.6 orchestrator, MiniMax-M2.5 sub-agents, GPT-5 judge
Matched on a 10-task spread
24 / 28
clean sweeps on curves, abracadabra-money, canto, ethereumcreditguild, pooltogether, althea
CyberGym
C/C++ crashes (OSS-Fuzz) · detect → reproduce
Found the defect, given the bug description
9 / 15
Found it fully blind, source only
1 / 10
Reproduced byte-exact
1×
no payload-creation tooling was provided on this benchmark, so the strict byte-exact reproduction count understates what the agent actually finds