
Introducing Kai-Bench: Live Model Benchmarking

We're making KAI's internal model evaluations public. Kai-Bench is a live benchmark that tests LLM configurations against real smart contract bounties on cantina.xyz and publishes the results.

What You'll See

The benchmark tracks an end-to-end exploit pipeline: how many candidate vulnerabilities each model configuration generates, how many survive verification, how many get approved, and how much bounty they've earned. Real code, real bounties, real money.

Leaderboard

The leaderboard ranks model configurations by a composite score. Sort by candidates, verified exploits, approved findings, or total bounty earned. A cumulative chart shows how each configuration performs over time; toggle it between verified exploit count and bounty earned.

Model Details

Click any configuration to see its full breakdown: which repositories it was tested against, per-repo exploit counts, and the actual vulnerabilities it discovered. Bounty-earning exploits show full details, including severity, reasoning, and the suggested fix diff. Verified exploits that haven't earned a bounty yet are shown redacted until submission.

Why We Built This

Choosing the right model matters. Different LLM configurations produce dramatically different results on security tasks. Kai-Bench gives you (and us) a transparent, continuously updated view of which models actually find real bugs in production code.

Copyright © 2026 DRIA. All Rights Reserved.