Codebases that improve themselves

By Kai Team · Published 2026-04-29

Wake up. Code's better.

The last two years of AI coding tools have been a story about generation getting better. The autocomplete got smarter. The chat windows got context-aware. The agents started running in a loop. Every release was a little faster, a little more accurate, a little more impressive in the demo.

The generators are good now. That part of the problem is largely solved, and it will keep getting better on a curve we did not start and cannot accelerate. The interesting question is what comes next.

What comes next is everything downstream of generation. Every engineer using these tools today knows the feeling. The agent finishes, drops a thousand lines in your lap, and says some version of "looks good to me." Now you are the verifier. You are the one reading code you did not write, deciding if it actually works, holding the responsibility no tool will hold for you. The work that got automated was the easy part. The work that got harder, the part that compounds across every PR, was reading and trusting the output, keeping the codebase coherent, catching the things the test suite missed.

This is the asymmetry the field is moving into. Generation is cheap and getting cheaper. Verification is expensive and not getting any cheaper. Most companies in the space are still racing to make a faster generator, which is a fine race to be in but is not the race that determines what an autonomous engineering system actually looks like.

We built Kai to take the other side of that bet. Kai is the team lead for the AI-augmented engineering org. It works alongside Cursor, Claude Code, Devin, and the humans on your team.

The IC agents in the stack do what they are good at: they sit at the keyboard, in the terminal, on the ticket, and they execute. Kai is the layer above. It holds the org-level context (codebase, conventions, people, history, production metrics), and because it holds that context, it can do the things a team lead does. It decides what work matters, dispatches the verification, reviews the output, and merges what survives.

Three things make this possible.

The first is verifier first, generator second. Every change Kai ships comes with proof that the change actually works. A vulnerability comes with a working exploit and a fix re-tested against that exploit. An optimization comes with a benchmark on real hardware. A refactor comes with test coverage on the things the original suite did not cover. If Kai cannot prove a change, Kai does not ship it. That single constraint, applied recursively, changes almost everything about what an autonomous coding system has to look like.
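In miniature, that constraint can be pictured like this. The sketch below is illustrative only (the names `Change`, `verifier_first_gate`, and `exploit_no_longer_fires` are ours, not Kai's actual code): a change ships only if an independent proof obligation passes.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Change:
    description: str
    patch: str

def verifier_first_gate(change: Change, prove: Callable[[Change], bool]) -> bool:
    """Ship a change only if its proof obligation passes.

    `prove` stands in for the real verifier: replaying an exploit
    against a security fix, benchmarking an optimization on real
    hardware, or running extended tests against a refactor.
    """
    return prove(change)  # no proof, no ship

# Toy example: a security fix is "proven" when the exploit no longer fires.
def exploit_no_longer_fires(change: Change) -> bool:
    # Placeholder check; a real verifier replays the exploit in a sandbox.
    return "escape(" in change.patch

fix = Change("escape user input before rendering", "html = escape(user_input)")
assert verifier_first_gate(fix, exploit_no_longer_fires)
```

The point of the shape is the separation: the thing that proves the change is not the thing that wrote it.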

The second is memory that compounds. Most AI coding tools peak the day you install them, because they have no mechanism for getting smarter about your codebase specifically. They have prior knowledge from training and ephemeral context from your prompt. Kai has prior knowledge, ephemeral context, and persistent learnings that grow every time it works on your repo. It reads the commit history, learns the team's conventions, anchors what it has learned to specific files, symbols, and SHAs. Kai's hundredth PR in your repo reads more like your team than its tenth, which reads more like your team than its first. This is the only part of an AI coding system that gets meaningfully better over time without a model swap.

The third is the altitude at which Kai operates. Most agentic systems today are individual contributors. Cursor is a hands-on-keyboard agent. Claude Code is a hands-on-terminal agent. Devin is a hands-on-ticket agent. They are each very good at the work you point them at, but none of them manage anything. They execute. They do not coordinate, they do not hold the picture above the work. Kai is what makes their output coherent at the team and codebase level.

These three together are the architecture: a team lead with a verifier-first principle, a memory that compounds, and the altitude to coordinate across a real engineering organization. The rest of this post is about what we found when we built it, what it can do today, and what we are still figuring out.

What makes Kai different from a coding agent

A coding agent waits for a task. You point it at a problem, it returns a solution. The loop closes when you accept the output. Kai operates outside that loop. Three things make it different.

Kai is proactive. A coding agent does not know what to work on until you tell it. Kai does. Because it holds the workspace context (the codebase, the conventions, the production metrics, the open work, the patterns drifting across recent PRs), it surfaces the work that needs to happen before anyone has filed a ticket for it. The vulnerability that the test suite would not catch. The endpoint that has gotten 30% slower in the last month. The convention drift accumulating across a service. Kai opens the PR. You decide whether to merge it. The work begins from Kai's read of what matters, not from a prompt.

Kai watches every change, including its own. A coding agent's job ends at the PR. Kai's job spans the whole codebase, continuously. It oversees the quality of its own output and the quality of everything else landing in the repo, from human commits and from other agents. Because it lives in the workspace, this oversight does not feel like a separate review process bolted on. It feels organic: a teammate who reads the diffs that come in, recognizes when something doesn't fit the team's patterns, and quietly straightens the codebase as part of the same daily rhythm. The repo stays coherent without anyone running a special pass on it.

Kai handles the work that would otherwise slow you down. Fast-paced AI-augmented development works because most changes can flow without deep scrutiny. A feature, a fix, a refactor, a test. They ship and the team keeps moving. The problem is that some changes cannot be treated this way. Kai is the system that absorbs that work in the background. It identifies the changes that warrant depth and scales inference into them: more iterations, separate evaluators, real benchmarks, sandboxed exploits. The depth is real because the verification is real. A patch that does not stop the exploit is not a patch. An optimization that is not faster on the hardware is not an optimization. The team keeps shipping. Kai makes sure the work that needs to be right, is right.

What happens after the code lands

A modern engineering team is shipping more code than ever, from more sources than ever. Cursor is open in one tab, Claude Code is running in a terminal, Copilot is suggesting completions inline, Devin is running a long task in the background, and humans are merging the result. This is good. It is the right direction. The amount of correct, working code a team can produce per day has gone up by an order of magnitude in two years.

The problem is that everything downstream of generation has not kept pace. The output of agents and humans together is now arriving faster than the team can verify, optimize, and unify it.

This is the gap Kai is built to close. Not by replacing the agents, and not by replacing the humans. By running continuously across what both produce.

The first gap, and the one widening fastest, is security. When a codebase is absorbing dozens of PRs a week from a mix of sources, the surface area for vulnerabilities grows faster than any review process can absorb. Most code, whether agent-written or human-written, is not insecure on purpose. But the difference between correct and exploitable is often one missing check, in authentication, signature verification, input handling, or somewhere similar. Static scanners flag thousands of false positives. Real security review does not scale to the rate of change. The gap between vulnerable code shipped and vulnerable code found is widening across the industry, and most of the existing tooling makes the gap worse, not better, because every new finding has to be triaged by someone whose attention is already stretched.

Kai treats exploitability as the only signal that matters. Most candidate findings die in a sandbox before any human sees them. The ones that survive arrive with a working exploit attached, and a fix that has been re-tested against the same exploit. The signal is real because the test was real.
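The filtering logic is simple to state, even if the sandbox behind it is not. A rough sketch, with hypothetical names (`triage`, `run_exploit`): candidates that cannot demonstrate a working exploit never reach a human.

```python
def triage(candidates, run_exploit):
    """Keep only the findings whose exploit actually fired in the sandbox."""
    return [finding for finding in candidates if run_exploit(finding)]

findings = ["missing-auth-check", "theoretical-timing-leak"]
# Stand-in sandbox: only the first candidate reproduces.
survivors = triage(findings, lambda f: f == "missing-auth-check")
assert survivors == ["missing-auth-check"]
```

Everything downstream inherits its signal quality from this one gate: a human only ever sees findings that already reproduced.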

The second is performance and correctness drift. Most code, again from any source, is functionally correct on the happy path and quietly inefficient or quietly wrong somewhere else. The function works. The function is slow. The query is correct. The query causes a full table scan. The kernel runs. The kernel runs at 5% of what the hardware can do. None of this is caught by tests, because tests check for correctness, not for performance, not for resource use, not for the existence of a better solution. Over months, a codebase that absorbs a high volume of changes gets slower, more expensive, and harder to optimize, because the easy wins have been silently ruled out by patterns nobody flagged at review time.

Kai targets any function, kernel, or system flagged as a candidate for optimization. The PR is gated on the benchmark. No measured improvement, no PR. The performance claim is real because it was measured by a system separated from the system that wrote the optimization.
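A benchmark gate in miniature, with illustrative names (`bench`, `gate_on_benchmark`) and a toy workload standing in for real kernels on real hardware: the PR only opens if the candidate measurably beats the baseline on the same input.

```python
import time

def bench(fn, arg, repeats=5):
    """Best-of-N wall-clock timing for one function on one input."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - t0)
    return best

def gate_on_benchmark(baseline, candidate, arg, min_speedup=1.1):
    # No measured improvement, no PR.
    return bench(baseline, arg) / bench(candidate, arg) >= min_speedup

# Toy example: summing via a materialized list vs. a closed form.
slow = lambda n: sum([i for i in range(n)])
fast = lambda n: n * (n - 1) // 2
assert gate_on_benchmark(slow, fast, 100_000)
```

The design choice that matters is in the last line of the gate: the claim is a measurement, not an opinion, and it is made by code that did not write the optimization.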

The third is coherence drift. Every contributor leaves a fingerprint, whether that contributor is a human, a copilot, an agent in a long-running terminal session, or a one-shot completion. PRs from different sources naturally express different idioms, different naming choices, different error-handling patterns, different test structures. Across hundreds of PRs and many contributors, the codebase stops feeling like one codebase. Dead paths accumulate because no individual contributor owns the cleanup. The human cost of this is invisible per PR and enormous over a quarter, because every new engineer joining the team now has to navigate a layered archaeology of styles to figure out which one is current.

Kai reads the team's working conventions out of commit history and PR review comments, then reconciles new code against them. It runs the code during PR review, simulates edge cases the original test suite missed, removes dead paths, and ships refactors that come with their own test coverage. The proof is the test suite passing on changes that touched the things the original suite did not cover.

How Kai becomes part of the team

When Kai is connected to a repo, it self-onboards. It reads the READMEs, manifests, lockfiles, CI configs, infrastructure files. It samples the commit history to learn how the team writes code, what abstractions get rejected in review, who owns what, where the dragons live. It writes a workspace blueprint. It proposes initial lifecycle actions. It updates the blueprint after every scan, every PR review, every conversation, every daily cycle.
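As a rough picture of that first scan (structure and names are ours, not Kai's actual blueprint format): gather the repo's descriptive files, sample recent history, and emit a first-pass blueprint.

```python
def self_onboard(file_paths, recent_commits):
    """Build a first-pass workspace blueprint (illustrative structure only)."""
    return {
        "config_files": [p for p in file_paths
                         if p.endswith(("README.md", ".toml", ".lock", ".yml"))],
        "commit_sample": recent_commits[:50],  # the team's conventions live in history
    }

blueprint = self_onboard(
    ["README.md", "pyproject.toml", "src/app.py", ".github/ci.yml"],
    ["feat: add auth middleware", "fix: escape html in renderer"],
)
assert blueprint["config_files"] == ["README.md", "pyproject.toml", ".github/ci.yml"]
```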

We call this Workspace Memory. It is anchored to files, symbols, lines, and commit SHAs, which means a learning Kai had about your auth module two months ago is still pointed at the right code today, even after refactors. This anchoring is why each PR Kai opens reads more like your team than the one before it.
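One way to picture a learning anchored to code rather than to prose alone. The schema below is illustrative, not Kai's actual one; the point is that the anchor follows the symbol, not the stale file path.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Anchor:
    file: str        # path at the time the learning was recorded
    symbol: str      # function or class the learning is about
    commit_sha: str  # SHA the anchor was recorded against

@dataclass
class Learning:
    text: str
    anchor: Anchor

def re_resolve(learning, current_symbols):
    """After a refactor, follow the symbol to its current location."""
    return current_symbols.get(learning.anchor.symbol)

note = Learning(
    text="auth checks belong at the handler boundary",
    anchor=Anchor("api/handlers.py", "require_auth", "a1b2c3d"),
)
# The symbol moved files in a refactor, but the anchor still resolves.
assert re_resolve(note, {"require_auth": "api/middleware.py"}) == "api/middleware.py"
```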

This is also why Kai can run autonomously across an organization. Not because long-running execution is novel, several systems do that. Because the work that runs autonomously is coordination work, the work that previously required a human in the loop precisely because no other system held enough context to do it. Kai indexes the team. It learns that your security engineer believes auth belongs at the handler boundary and never trusts client IDs. It learns that your backend lead prefers async over .then chains and explicit nullable types. It learns that your VP of engineering has a rule against silent failures. Then it applies those patterns where they belong. The auth fix that ships overnight uses the security engineer's pattern. The checkout refactor uses the backend lead's style. The dead-code purge follows the VP's "delete, don't deprecate" rule. One shared brain across every kind of work. The work in your queue is not the output of an IC agent that was pointed at a problem. It is finished work that a team lead already triaged, written in the voice of the people on your team who would have written it themselves.

Around this memory layer, Kai is wired into the systems where engineering work already happens. GitHub for code, branches, PRs, checks, Dependabot alerts. Jira and Linear for findings. Modal, AWS, HuggingFace, RunPod, Weights and Biases, Railway, Vercel for the runtime, deployment, and performance context that lets Kai connect a code change to the production behavior it actually causes. Every code change Kai considers happens with knowledge of where that code runs.

Each agent runs in its own sandbox. Cron jobs, audits, optimization runs, and PR reviews continue when you log off. Credentials are mediated by the backend and never exposed raw to the agent runtime. Close the laptop. The work continues.

Benchmarks

Kai is measured on the two public tests built specifically to evaluate what an AI engineer can actually do, end to end, on real codebases.

GSO is the SWE-agent optimization benchmark from UC Berkeley. It evaluates whether a coding agent can match expert-developer optimizations on real software, across 102 tasks drawn from 10 codebases and 5 languages. Each task gives the agent a codebase and a performance test as a precise specification, and the agent has to improve runtime efficiency to within 95% of an expert human's optimization while still passing correctness tests. This is the benchmark where leading SWE-agents have historically struggled the most, with most models scoring in the low single digits.

Kai scores 53.3% Opt@1 on GSO. The next-best agent on the leaderboard is Claude Opus 4.6 at 33.55%. GPT-5.2 (high) is at 27.40%. Kai is roughly 1.6× the next-best system on a benchmark designed to be hard. On 29 of the 30 tasks Kai reached human-expert speedup. On 6 of them, it exceeded expert performance.

EVMBench is the OpenAI and Paradigm benchmark for autonomous security research. It evaluates AI agents on detecting, patching, and exploiting high-severity vulnerabilities in Ethereum smart contracts, drawn from 120 curated cases across 40 audits. EVMBench was built because smart contracts are an environment where the consequences of a code vulnerability are directly economic and immediately observable: contracts route over $100B in on-chain value, exploits replay deterministically, and the evaluation grades agents on whether they can produce findings that map to real bounty awards. This makes EVMBench the cleanest public benchmark for whether an AI agent can perform autonomous security work in a setting where the answer is unambiguous.

Kai scores 64.2% detect recall on EVMBench, against $75k in cumulative bounty value for verified findings. The next-best system is Claude Opus 4.6 at 45.6%. GPT-5.3-Codex (high) is at 39.2%. OpenAI's own scaffolds on the leaderboard (OAI-OPT-5.2 at 30%, OPT-5 at 25.5%, OpenAI o3 at 10.6%) sit well below Kai, on a benchmark OpenAI co-released.

These are the benchmarks we trust because they grade what an AI engineer does, not what an autocomplete model can guess. They are also benchmarks where most agents fail to produce competitive results, which is the right kind of test for a system that claims to do this work autonomously.

Full results, methodology, and per-task breakdowns are at kai.dria.co/benchmark.

The receipts

Three recent runs are the clearest evidence we have for what Kai actually does.

Apple, password-manager-resources. The repo powers password autofill across Safari and other browsers, on hundreds of millions of devices. Kai ran against it for 72 hours autonomously. It found an XSS vector in the rule-rendering path, wrote a working exploit in a sandboxed headless browser, fired it to confirm execution, applied a five-character escape fix, and re-ran the exploit to confirm the fix held. Then it filed Issue #1018.

Three days later, Apple merged PR #1019 with exactly the fix Kai proposed. Two maintainers approved, including an Apple engineer.

No human wrote the report, triaged the severity, constructed the exploit, or proposed the fix. That was all Kai.

Coinbase, x402. Kai found a signature bypass in Coinbase's payment protocol that let an attacker forge authorization for any undeployed smart wallet. It cloned the repo into a sandbox, crafted a forged signature, executed the payment flow end-to-end, confirmed the forgery was accepted, applied the fix, and confirmed the fix rejected the same forgery in both Python and Go.

Coinbase triage confirmed it as valid, affecting both codepaths. The same vulnerability was independently reported by human security researchers. Kai found it autonomously.

NVFP4 GEMM kernel, NVIDIA B200. The GPU Mode NVFP4 GEMM competition challenges participants to optimize a 4-bit matrix-multiply kernel on B200, a workload that matters because FP4 inference offers 4× the memory bandwidth of FP16. The reference implementation runs at 24,888.8µs.

Kai ran overnight. Hundreds of iterations across a multi-model ensemble, every candidate benchmarked on the actual B200 hardware. The first direction Kai tried (a raw Triton kernel) was a dead end, and Kai abandoned it. The breakthrough came from realizing the bottleneck was not the kernel itself but the orchestration around it.

The final kernel ran at 161.9µs. 153.8× faster than the reference. 100% correct. On 4× smaller data than FP16, it ran neck and neck with NVIDIA's own FP16 cuBLAS GEMM.

Three case studies, one verifier-first loop. Full mechanism, kernel diffs, and the failed Triton attempt at kai.dria.co/case-studies.

What is shipping today

Kai is live at kai.dria.co. You can connect a repo, give it a night, and see what shows up in your PR queue tomorrow.

The shipped catalog at launch:

The system has 40+ shipped abilities covering security audits, vulnerability triage, supply chain analysis, optimization workflows, evaluator creation, code review, debugging, planning, onboarding, daily workspace cycles, and infrastructure monitoring.

It exposes 20+ first-party tools and 32+ backend MCP tools, with another 20 local codebase analysis and security tools shipping next month.

It integrates with GitHub, Modal, AWS, HuggingFace, RunPod, Weights and Biases, Railway, Vercel, Jira, Linear, and the long tail of MCP servers.

Benchmarks are live at kai.dria.co/benchmark. Case studies are at kai.dria.co/case-studies.

Why we think this matters

The next breakthrough in AI for software is not going to come from a smarter generator. The generators are already very good and getting better on a curve we did not start and cannot accelerate. The breakthrough is going to come from systems that take responsibility for what they ship. Systems that hold the context an engineering organization actually runs on. Systems whose verification is real enough that a senior engineer can trust them with the work senior engineers actually do.

We built Kai because the worst part of every morning had become reviewing code we did not write, from tools that took no responsibility for what they shipped. We wanted a team lead that holds the org-level context, ships PRs with proof, and gets sharper every day in the repo.