You can't benchmark an AI maintainer.

By Kai Team · Published 2026-05-04

We shipped 59 custom MCP tools for our agent. On day one of testing, it used them zero times.

Kai is the AI maintainer for your codebase. Faster, safer, and cleaner code with memory that compounds across every PR. The pitch only works if Kai keeps getting better at your repo over weeks and months: smarter PRs in week 12 than week 1, memory that survives refactors, conventions absorbed from your commit history. Proving any of that requires running the agent for months.

We don't have months. So we built a simulator that compresses 30 days into 12 minutes of wall-clock. The first thing it told us was that our agent had been silently ignoring its own tool surface and writing Python scripts to fake the calls instead.

This post is what mcp-sim taught us in five days. The next post is the benchmark we're building because of it.

Stateless benchmarks for stateful agents

SWE-bench grades a single patch in isolation. METR's HCAST measures task length. Aider's polyglot benchmark runs a single-session loop. Each is useful for what it measures (code-completion ability, single-task horizon, language coverage), and silent on the question Kai is built around.

The question is whether the agent knows more about your codebase in month four than in month one. That is the entire pitch. Memory that compounds across every PR. The hundredth PR Kai files in your repo should read more like the team than the tenth. No public benchmark grades that.

Every existing code-agent benchmark assumes the agent is a stateless function: same input, same output, no carryover. The agents shipping in 2026 are not stateless. They have memory layers, persistent workspaces, scheduled jobs. The evaluation surface has not caught up.

We could not measure what we needed with anything off the shelf, and we could not run Kai for six months in a real repo to find out whether the pitch held. The only path forward was to compress months into minutes, run the real agent code against a fake world, and read the trace.

What we built

mcp-sim has two layers.

The first is a decorator. Kai's tools are Python functions that hit MongoDB, GitHub, Stripe, and our backend. Standing that infrastructure up to verify a single tool call is the wrong unit of work, and the runtime call itself is rarely what we are testing. The question is whether the agent reached for the right tool with the right arguments. The SDK wraps each function and reads its source via inspect.getsource; a small LLM then generates a plausible return value:

python
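# `sim` is the mcp-sim SDK object and `db` an async MongoDB handle; both are set up elsewhere.
@sim.function
async def query_users(role: str, active: bool = True) -> list:
    """Query users from MongoDB."""
    cursor = db.users.find({"role": role, "active": active})
    users = await cursor.to_list(100)  # at most 100 documents
    return [{"_id": str(u["_id"]), "name": u["name"]} for u in users]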

Call query_users(role="admin") and a realistic-looking list comes back. MongoDB never ran. The LLM read the body, inferred the shape, and produced coherent rows.
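For readers who want the shape of the trick, here is a minimal sketch of how a decorator like this could be assembled. It is not the mcp-sim SDK: `simulated`, `fake_llm`, and the prompt format are hypothetical names chosen for illustration, and `fake_llm` returns a canned value where a real version would call a small model.

```python
import asyncio
import functools
import inspect


def fake_llm(prompt: str) -> list:
    """Hypothetical stand-in for the small LLM: returns a canned value."""
    return [{"_id": "u1", "name": "Ada"}]


def simulated(fn):
    """Replace fn's body with a generated return value (illustrative only)."""
    try:
        source = inspect.getsource(fn)  # the text the model would read
    except (OSError, TypeError):
        source = fn.__doc__ or fn.__name__  # source unavailable, e.g. in a REPL

    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        bound = inspect.signature(fn).bind(*args, **kwargs)
        bound.apply_defaults()
        arguments = dict(bound.arguments)
        wrapper.calls.append(arguments)  # record what the agent asked for
        prompt = f"Source:\n{source}\nArguments: {arguments}"
        return fake_llm(prompt)  # the original body is never executed

    wrapper.calls = []
    return wrapper


@simulated
async def query_users(role: str, active: bool = True) -> list:
    raise RuntimeError("real database call; must not run under simulation")


rows = asyncio.run(query_users(role="admin"))
```

The recorded `calls` list is the part that matters for testing: it captures which arguments the agent chose, independent of whatever the generated return value looks like.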

That covers a unit-test layer. It exercises one tool call at a time. It cannot tell you whether the agent, given a hundred tool calls across a working day, behaves like a teammate.

The second layer is the harness. Three components:

  • Virtual clock that advances simulation time. A 30-day window compressed into roughly 12 minutes of wall-clock.
  • Tool patcher that intercepts every MCP call at the Python layer and routes it to LLM-simulated responses.
  • Trigger generator, a separate LLM that produces ambient teammate stimuli: questions, bug reports, requests, off-topic chatter.
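The compression in the first bullet is easy to check by hand: 30 days is 43,200 minutes, and 43,200 / 12 = 3,600, so one wall-clock second stands for one simulated hour. The sketch below assumes the compression is uniform, which may not match the real harness; `VirtualClock` and the start date are hypothetical.

```python
from datetime import datetime, timedelta


class VirtualClock:
    """Simulated time that advances only when the harness says so."""

    def __init__(self, start: datetime, speedup: float):
        self.now = start
        self.speedup = speedup  # simulated seconds per wall-clock second

    def advance_wall(self, wall_seconds: float) -> datetime:
        """Move simulated time forward by the equivalent of wall_seconds."""
        self.now += timedelta(seconds=wall_seconds * self.speedup)
        return self.now


# 30 simulated days squeezed into 12 wall-clock minutes.
speedup = timedelta(days=30) / timedelta(minutes=12)

clock = VirtualClock(datetime(2026, 1, 1, 0, 0), speedup)  # hypothetical start
clock.advance_wall(9)  # nine wall-clock seconds later it is 09:00 simulated time
```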

The agent code is unchanged. The world around it is fabricated. The day shape is scripted: a 9 AM lifecycle session (the daily cron), then scattered team-message and task sessions, with probes recording memory writes, tokens, decisions, and tool-call distribution at every step.
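The patcher and a tool-call probe can be sketched together. This is an assumption about the general pattern, not the harness code: `patch_tools`, the registry shape, and the `simulate` callback are hypothetical, and the lambda stands in for the LLM that fabricates responses.

```python
from collections import Counter
from typing import Any, Callable, Dict


def patch_tools(registry: Dict[str, Callable[..., Any]],
                simulate: Callable[[str, dict], Any],
                probe: Counter) -> Dict[str, Callable[..., Any]]:
    """Return a registry whose tools record the call and return simulated output."""
    def make_stub(name: str) -> Callable[..., Any]:
        def stub(**kwargs: Any) -> Any:
            probe[name] += 1               # tool-call distribution probe
            return simulate(name, kwargs)  # the real tool is never invoked
        return stub
    return {name: make_stub(name) for name in registry}


def real_terminal(**kwargs):  # hypothetical real tool
    raise RuntimeError("would touch the real world")


probe: Counter = Counter()
tools = patch_tools({"terminal": real_terminal},
                    simulate=lambda name, args: {"tool": name, "ok": True},
                    probe=probe)
result = tools["terminal"](command="ls")
```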

Anatomy of a simulated day — 24-hour timeline with the 9 AM lifecycle session, scattered team-message and task sessions, and probe ticks below.

We ran it. Twelve minutes of wall-clock. For the first time, we had a day-shaped picture of Kai as data, rather than as anecdotes from a chat thread.

Three findings

The agent ignored its own tools

First-run tool-call distribution: execute_code 27, terminal 14, memory 2, kai_* 0.

Twenty-four simulated hours, zero invocations of kai_*. The agent had decided execute_code was the more general instrument and routed everything through it. Calls that should have been kai_list_workspaces or kai_start_code_audit came back as ad-hoc Python scripts that approximated the same work.

Root cause: the kai-cli and kai-slack toolsets list specific tool names; the 59 kai_* MCP tools were supposed to be auto-injected by discover_mcp_tools() at startup. In the simulated environment, MCP discovery never ran. The tools were not in the prompt. The agent used what it could see.
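A startup check along these lines would flag the gap before the first session runs. It is a sketch under assumptions: `missing_tools` is a hypothetical helper, and it takes the list of tool names that actually reached the prompt rather than calling any real discovery API.

```python
def missing_tools(prompt_tool_names, expected_prefix="kai_", expected_count=59):
    """Return how many expected tools are absent from what the agent can see."""
    visible = [n for n in prompt_tool_names if n.startswith(expected_prefix)]
    return expected_count - len(visible)


visible_in_sim = ["execute_code", "terminal", "memory"]
shortfall = missing_tools(visible_in_sim)
if shortfall:
    print(f"{shortfall} expected kai_* tools are not in the prompt")
```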

The fix took ten minutes. The lesson took longer: a tool that is not in the prompt does not exist, no matter how carefully you shipped it. Tool-call distribution became a first-class metric the day we discovered this.
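As a metric, tool-call distribution reduces to a counter and a share per tool family. The sketch below uses the first-run counts quoted above; `family_share` is a hypothetical helper name.

```python
from collections import Counter

# First-run counts from the simulated day; kai_* tools never appeared.
first_run = Counter({"execute_code": 27, "terminal": 14, "memory": 2})


def family_share(calls: Counter, prefix: str) -> float:
    """Fraction of all tool calls whose tool name starts with prefix."""
    total = sum(calls.values())
    matched = sum(n for name, n in calls.items() if name.startswith(prefix))
    return matched / total if total else 0.0


kai_share = family_share(first_run, "kai_")
generic_share = family_share(first_run, "execute_code")
```

A `kai_share` of zero on a day with 43 total calls is the kind of number that is hard to miss once it is tracked per run.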

The agent never wrote to its own memory

After the visibility fix, we ran a clean one-day simulation and inspected what the agent had filed to its persistent state. The output:

text
Workspace Blueprint:    empty
Lifecycle Actions:      empty
memory.md:              empty

A full day of work and nothing committed. The harness was producing rich state. The memory layer was not catching it.

Two paths forward. The first: force a memory write at the end of every session. That guarantees updates, but the agent never learns what is worth remembering, and over weeks the memory store accumulates contradictions and noise. The second: rewrite the prompt so memory writes become the obvious next move, and let the agent choose. The cost is empty memory on day one, which we would at least see. The payoff is memory filled freely by day seven, which is what we want.

We took the second. The principle is now load-bearing in Kai: never force memory updates via direct prompting during sessions. Make memory writing the obvious next move via prompt context. The agent has to want to remember. Mechanical compliance produces noise.

Memory-write frequency, retrieval-vs-write ratio, and contradiction rate became the second cluster of metrics we would carry forward into kai-bench.
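One way these three numbers could be computed from a memory event log is sketched below. The event shape and the definition of a contradiction (a write that overwrites a key with a different value) are assumptions made for the example, not kai-bench's definitions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

Event = Tuple[str, str, Optional[str]]  # (op, key, value); op is "write" or "read"


@dataclass
class MemoryStats:
    writes: int
    reads: int
    contradictions: int

    @property
    def read_write_ratio(self) -> float:
        return self.reads / self.writes if self.writes else 0.0

    @property
    def contradiction_rate(self) -> float:
        return self.contradictions / self.writes if self.writes else 0.0


def memory_stats(events: Iterable[Event]) -> MemoryStats:
    store: dict = {}
    writes = reads = contradictions = 0
    for op, key, value in events:
        if op == "read":
            reads += 1
        elif op == "write":
            writes += 1
            if key in store and store[key] != value:
                contradictions += 1  # overwrote an earlier, different claim
            store[key] = value
    return MemoryStats(writes, reads, contradictions)


# Made-up events, for illustration only.
stats = memory_stats([
    ("write", "test_runner", "pytest"),
    ("read", "test_runner", None),
    ("write", "test_runner", "unittest"),  # contradicts the first write
    ("write", "line_length", "100"),
    ("read", "line_length", None),
    ("read", "line_length", None),
])
```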

The agent's own LLM bill was the small line item

By that point we had wired in Surfa for cross-run analytics: every tool call, agent run, and step logged with latency and token counts. With several full simulations behind us, we ran model-swap experiments: same harness, same scenario, different agent model.

Two findings. First, models think differently. Opus issued roughly half as many API calls as Grok, but each call was a larger reasoning step against a longer context. Gemini Flash ran more than four times the call volume in less wall-clock. Per simulated day, model costs spread 1× to 3× across the fleet, depending on whether the model preferred a few large reasoning steps or many small ones.

Second, and more consequential: the agent's own model spend was not the cost driver. Per-day cost was dominated by the audit and optimization sub-agents triggered by the lifecycle. The sub-agent harness cost roughly an order of magnitude more than the agent's own LLM usage. Sandbox and base infrastructure were a rounding error.

The cost lever is not which model you pick. It is how aggressively the lifecycle fires. That measurement, taken from real simulated days, is what shaped Kai's pricing.
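The argument can be made concrete with relative units. The figures below are not measurements: they take the cheapest agent model as 1 unit per simulated day, the priciest as 3, and one lifecycle fire as 10 (the "order of magnitude" above), with one fire per day as an assumed baseline and infrastructure treated as zero.

```python
def daily_cost(agent_llm: float, subagent_per_fire: float,
               fires_per_day: float, infra: float = 0.0) -> float:
    """Relative cost of one simulated day, in units of the cheapest agent model."""
    return agent_llm + subagent_per_fire * fires_per_day + infra


baseline = daily_cost(agent_llm=1.0, subagent_per_fire=10.0, fires_per_day=1.0)
pricier_model = daily_cost(agent_llm=3.0, subagent_per_fire=10.0, fires_per_day=1.0)
half_lifecycle = daily_cost(agent_llm=1.0, subagent_per_fire=10.0, fires_per_day=0.5)
```

Under these assumptions, tripling the model cost moves the day from 11 to 13 units, while halving how often the lifecycle fires moves it from 11 to 6.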

Off the simulator: a CUDA kernel

In 24 hours of compressed simulation, mcp-sim caught two real-world failures we would otherwise have hit only after running Kai for a month.

We pointed Kai at a CUDA kernel optimization task (nvfp4_gemv, taken from an OpenEvolve example), pushed it to a fresh GitHub repo, provisioned a Kai workspace against it, and let the agent run with real MCP calls. Only third-party APIs were simulated.

The agent shipped a working evolved kernel: a real GPU optimization Kai discovered on its own. Alongside that result, mcp-sim surfaced two failures that had been invisible in test.

The first was a session-blocking bug. Kai held sessions open while waiting on long-running operations, and a code evolution can take thirty minutes or more. The agent should have filed a cron job and exited. We shipped a system-prompt update the same week.

The second was a tooling gap. Kai repeatedly wanted to upload custom evaluators, wanted to upload repositories without going through GitHub, wanted to mark evolution outputs. None of those interactions existed in the MCP surface. We added three: upload_custom_evaluator, direct repo upload, and evolve markers.

Both fixes pulled real-world Kai capabilities forward by one to two weeks.

What mcp-sim cannot do

mcp-sim has structural limits.

It scores tool-use shape, not output quality. There is no oracle for "this PR would have been merged by a real maintainer." For Kai's pitch, that is the question that matters most.

It cannot measure memory across runs. Each simulated day starts on a clean canvas. The whole point of memory in an AI maintainer is accumulation across days, weeks, and months. A single simulated day, repeated independently, tells you nothing about whether memory pays off in week two versus week one.

It cannot compare Kai to humans. It scores Kai against earlier versions of itself. Was today's run better than last week's? Possibly. Better than what a human engineer would have done on the same codebase in the same window? No way to tell.

It cannot ablate the tool surface cleanly. When we add an MCP tool, we want a clean before-and-after on PR acceptance, memory growth, and cost. mcp-sim's runs do not separate cleanly by tool config.

We were measuring whether the agent worked. We were not measuring whether the agent was getting better. That is the next thing.

kai-bench

A March 2026 paper from Tsinghua, Learning to Commit, builds something adjacent to what we are building. Their evaluation reveals the diff humans actually merged, so the agent can perform supervised contrastive reflection against it. We are going the other way.

Showing an agent the literal diff humans merged is letting it copy the test. Showing it a maintainer's reasoning ("this duplicates formatTimestamp, see lib/utils/time.ts") is teaching it the codebase. The first cheats the benchmark. The second is the benchmark.

That distinction is why we are building kai-bench rather than reaching for one off the shelf.

The architecture: for each run, we mirror a real public repository at commit T into a synthetic GitHub org under our control. The agent works in that org for N simulated days. A team simulator watches its output and behaves the way a maintainer would. The simulator is one LLM, with private access to the real PRs humans merged in the matching window. It approves, requests changes with reasoning, closes with an explanation, asks clarifying questions, and leaves comments on individual files. What it never does is show the agent the actual diff. The agent receives the feedback. It does not receive the answer key.

How chatty the maintainer is, how often they close without comment, how strictly they enforce CODEOWNERS, how quickly they reach a verdict: each is a knob on a team_profile. Some teams are warm and verbose. Others close PRs with no explanation. Both patterns appear in production. The benchmark covers the spread.
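A profile with those four knobs might look like the following. The field names, ranges, and values are placeholders invented for illustration; they are not kai-bench's schema and have not been calibrated against any real team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamProfile:
    name: str
    chattiness: float             # 0 = terse, 1 = verbose review comments
    silent_close_rate: float      # share of closes that carry no explanation
    codeowners_strictness: float  # 0 = ignored, 1 = always enforced
    verdict_delay_hours: float    # simulated hours before a verdict lands

    def __post_init__(self):
        for field_name in ("chattiness", "silent_close_rate", "codeowners_strictness"):
            value = getattr(self, field_name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{field_name} must be in [0, 1], got {value}")
        if self.verdict_delay_hours < 0:
            raise ValueError("verdict_delay_hours must be non-negative")


# Placeholder values only.
warm_verbose = TeamProfile("warm-verbose", 0.9, 0.05, 0.5, 4.0)
silent_closer = TeamProfile("silent-closer", 0.1, 0.8, 0.5, 24.0)
```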

The same simulator generates ambient teammate chat in the same voice: questions, FYIs, complaints, off-topic. Slack-shaped stimuli the agent has to triage alongside the PR work.

kai-bench architecture: agent world (run controller, agent registry, synthetic GitHub org) + maintainer world (team simulator with hidden oracle inside). The oracle never reaches the agent.

What kai-bench measures, across runs, by repo and by tool config:

| KPI | What it measures | Where it came from |
| --- | --- | --- |
| PR acceptance rate over time | Does Kai's contribution rate trend up | We shipped real PRs but had no way to grade them |
| Memory evolution | Write rate, retrieval rate, contradiction rate | The empty-memory finding, hit cold on day one |
| Findings → real-world impact | CVE → bounty $, perf → infra cost, hygiene → maintenance proxy | Cost was always an axis, never an output |
| Operational cost | Wall-clock, calls, tokens per simulated day | Already measured in mcp-sim, now systematic |
| KPI delta under tool-config changes | Does adding or removing a tool change anything | The tool-visibility bug, no clean before-and-after |
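The first KPI, acceptance rate over time, could be computed as a weekly rate plus a least-squares slope. This is a sketch with made-up PR outcomes; the weekly bucketing and the use of a linear trend are assumptions, not a description of how kai-bench scores it.

```python
from collections import defaultdict


def weekly_acceptance(prs):
    """prs: iterable of (simulated_day, accepted). Returns {week: acceptance rate}."""
    merged, total = defaultdict(int), defaultdict(int)
    for day, accepted in prs:
        week = day // 7
        total[week] += 1
        merged[week] += int(accepted)
    return {week: merged[week] / total[week] for week in sorted(total)}


def slope(points):
    """Ordinary least-squares slope of y over x for a list of (x, y) pairs."""
    xs, ys = zip(*points)
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx if sxx else 0.0


# Made-up outcomes, for illustration only.
prs = [(0, False), (3, True), (8, True), (9, False), (15, True), (16, True)]
rates = weekly_acceptance(prs)
trend = slope(list(rates.items()))  # positive means acceptance is improving
```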

What we keep from mcp-sim: the harness loop, observer, world-state, and scenario loader (the entire sub-package, directly reused), plus the @sim.function mocking layer for paid third-party APIs.

What is new:

  • Synthetic-org provisioning. Full repo state, contributors as IDs, branch structure, issues, CI config, branch protection, tags. Mirrored at T, refreshable to roll the suite forward.
  • Activity replayer with an LLM coherence layer. Non-Kai contributors continue working in the simulated world. Their commits replay at compressed timestamps. When their work conflicts with Kai's (overlapping files, racing changes), the LLM rewrites the replayed activity to stay coherent.
  • Team simulator. One LLM, two output streams: PR verdicts on a delay per the team profile (fast-shippers, careful-maintainers, silent-team), and ambient chat in the same voice.
  • Multi-dimensional scoring on seven axes. File IoU, trajectory steps, line deviation ratio, scope alignment, logic similarity, redundancy, style consistency. The Tsinghua paper calls this organicity.
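Of the seven axes, File IoU is the simplest to pin down. The sketch below uses the standard intersection-over-union definition on sets of touched file paths, which is assumed to match the intent here; apart from lib/utils/time.ts, the paths are hypothetical.

```python
def file_iou(agent_files, merged_files) -> float:
    """Intersection over union of the files touched by two changes."""
    a, b = set(agent_files), set(merged_files)
    if not a and not b:
        return 1.0  # both empty: treated as identical (a convention chosen here)
    return len(a & b) / len(a | b)


score = file_iou(
    ["lib/utils/time.ts", "src/api/events.ts", "src/api/format.ts"],  # agent PR
    ["lib/utils/time.ts", "src/api/events.ts"],                       # merged PR
)
```

Two shared files out of three touched in total gives a score of 2/3; an agent that touches extra files is penalized even when it hits every file the humans changed.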

PR verdict pipeline. Agent opens PR → team simulator reads private oracle → returns approve / request-changes / close, each with a maintainer comment when the team profile says so. The literal oracle diff is never shown.
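The "never shown" rule can be backed by a mechanical check on every outgoing comment. The guard below is one possible approach, not the kai-bench implementation: `Verdict`, `leaks_oracle`, and the sample diff are hypothetical, and a verbatim-substring test will not catch paraphrased leaks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    action: str             # "approve", "request-changes", or "close"
    comment: Optional[str]  # None when the team profile closes silently


def leaks_oracle(comment: Optional[str], oracle_diff: str, min_len: int = 12) -> bool:
    """True if the comment quotes any added or removed line of the oracle diff."""
    if not comment:
        return False
    for line in oracle_diff.splitlines():
        if line.startswith(("+++", "---")) or not line.startswith(("+", "-")):
            continue  # skip file headers and context lines
        body = line[1:].strip()
        if len(body) >= min_len and body in comment:
            return True
    return False


oracle = (
    "--- a/src/api/events.ts\n"
    "+++ b/src/api/events.ts\n"
    "-  const ts = new Date(e.at).toISOString();\n"
    "+  const ts = formatTimestamp(e.at);\n"
)
safe = Verdict("request-changes", "this duplicates formatTimestamp, see lib/utils/time.ts")
leaky = Verdict("request-changes", "just write: const ts = formatTimestamp(e.at);")
```

The `safe` comment names the helper and points at the file, which is reasoning; the `leaky` one reproduces a line of the merged diff, which is the answer key.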

Contamination is the obvious worry. Any agent trained on public GitHub will have seen these repos. We snapshot each source repository at a fixed commit T, serve the snapshot through a self-hosted Forgejo instance under our control, and roll the post-T PRs and commits forward day by day as the simulation runs. The agent works against URLs it has never seen, on a state that diverges from public GitHub the moment the run starts.
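The roll-forward rule amounts to a moving cutoff over the post-T history. A minimal sketch, assuming one simulated day maps to one calendar day of real history; `Commit`, `visible_commits`, and the dates are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Commit(NamedTuple):
    sha: str
    authored_at: datetime


def visible_commits(history: List[Commit], snapshot_at: datetime,
                    sim_day: int) -> List[Commit]:
    """Commits the agent may see on simulated day sim_day (day 0 is snapshot T)."""
    cutoff = snapshot_at + timedelta(days=sim_day)
    return [c for c in history if c.authored_at <= cutoff]


T = datetime(2026, 1, 10)  # hypothetical snapshot time
history = [
    Commit("a1", datetime(2026, 1, 9)),
    Commit("b2", datetime(2026, 1, 11, 12)),
    Commit("c3", datetime(2026, 1, 14)),
]
day0 = [c.sha for c in visible_commits(history, T, 0)]
day2 = [c.sha for c in visible_commits(history, T, 2)]
```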

What we have not proven yet

Three claims we cannot yet make. That Kai's PR acceptance trends up across weeks of accumulated memory. That turning memory off makes Kai measurably worse by week three. That Kai outperforms Cursor, Aider, Devin, or Claude Code on the same task. Those are kai-bench's first three KPIs. v0 numbers next month.

This post is the prologue.