Memory That Compounds: A Four-Store Architecture for Long-Horizon Agents

By Kai Team · Published 2026-05-01

Memory that compounds.

Kai is a long-horizon agent. It runs against the same codebase for months, accumulates context as it works, and produces PRs that get more aligned with the team over time. The model behind it is fungible. The memory layer is not.

This post describes the memory architecture: four stores, an MCP-gated read and write path, a frozen-prompt session model, and a write-time content scanner. The design is built around the workspace, not the thread.

Why long-horizon

Most agent benchmarks measure short-horizon work: SWE-bench resolves a single issue; HumanEval writes one function from a docstring. But maintenance is 60-80% of real software cost, and maintenance is long-horizon by nature: patching CVEs, security audits, performance optimization, refactoring entropy, reconciling drift across services. The unit of work is the system over time, not the task in isolation.

Software cost split — 30% new work, 70% maintenance. Maintenance is the long-horizon majority.

A short-horizon agent's quality is bounded by the model. A long-horizon agent's quality is bounded by what it remembers.

Two columns: short-horizon agents are bounded by the model; long-horizon agents are bounded by memory.

Workspace, not thread

The dominant agent memory pattern today is thread-scoped: a chat history is the memory, and retrieval over the chat extends what the model can see. This works for conversation. It does not work for software.

Engineering happens across surfaces. Multiple repos, multiple communication channels, multiple infra providers, production metrics, deploy logs, ticket backlogs, PR review history, conventions documented in three places nobody updates. A thread captures none of this. RAG over chat history captures approximations of what was said about that state, which is noisier than the state itself.

We built the memory layer around the workspace from day one. The workspace is the persistent unit: a set of repos, a team, an infra footprint, a history of decisions. Sessions, threads, and tasks are transient inside the workspace. Memory writes belong to the workspace, not the thread that produced them. Every other design decision in the system follows from this.
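
In sketch form, that keying looks like the following; the field names here are illustrative, not the production schema.

```python
# Illustrative shape of a memory write; all field names are assumptions.
from dataclasses import dataclass

@dataclass
class MemoryWrite:
    workspace_id: str    # the owning scope: the write outlives any session
    store: str           # "memory" | "user" | "learning" | "blueprint"
    content: str
    source_thread: str   # provenance only, never the retrieval key
```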

Four stores

A single unified store would not work. A teammate's communication preference, an architectural invariant, a transient debugging note, and a permanent security rule are different kinds of things. The agent needs to reason about them differently, retrieve them under different conditions, and evict them under different rules. Four stores, each with a defined shape, write contract, and scope:

MEMORY.md is the agent's personal notebook, scoped to a single agent's runtime context. Plain markdown, capped at 2,200 characters, entries delimited by §. Personal notes, environment facts, project conventions, tool quirks, lessons learned. Lives in the sandbox at ~/.kai-agent/memories/ and dual-writes to MongoDB.
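
A minimal sketch of that write contract; the 2,200-character cap and the § delimiter are as stated, while the exact path handling and the Mongo sync are assumptions.

```python
# Sketch of an append to MEMORY.md; delimiter placement and the dual write
# are assumptions, the cap and the § delimiter are as described above.
from pathlib import Path

MEMORY_PATH = Path.home() / ".kai-agent" / "memories" / "MEMORY.md"
CHAR_CAP = 2_200
DELIM = "\n§\n"

def append_memory(entry: str) -> bool:
    current = MEMORY_PATH.read_text() if MEMORY_PATH.exists() else ""
    candidate = current + DELIM + entry if current else entry
    if len(candidate) > CHAR_CAP:
        return False  # over the cap: curate existing entries before adding
    MEMORY_PATH.parent.mkdir(parents=True, exist_ok=True)
    MEMORY_PATH.write_text(candidate)
    # the dual write to MongoDB would happen here in the real system
    return True
```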

USER.md is the team profile. One entry per teammate: name, role, primary repos, ownership areas, communication style, escalation paths. Capped at 1,375 characters per profile. Same dual-write to disk and Mongo.

Learnings are categorized atomic notes in one of four categories: pattern, security, architecture, preference. Each learning is a paragraph, not a tag. The agent writes prose because tags lose context and prose preserves it. Every learning is anchored to the thread, file, or commit that produced it, so a learning written today still points at the right code after a refactor next quarter. Capped at 100 entries per category, with auto-eviction when the cap is hit. Indexed by (workspace, created_at).
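
As a sketch, a learning record and its index could look like this (pymongo, with hypothetical database and collection names; what happens at the 100-entry cap is sketched later in the post):

```python
# Sketch of a Learning record and the (workspace, created_at) index;
# database/collection names and the sort direction are assumptions.
from datetime import datetime, timezone
from pymongo import ASCENDING, DESCENDING, MongoClient

CATEGORIES = ("pattern", "security", "architecture", "preference")

learnings = MongoClient()["kai"]["learnings"]
learnings.create_index([("workspace", ASCENDING), ("created_at", DESCENDING)])

def add_learning(workspace: str, category: str, prose: str, anchor: str) -> None:
    """One atomic learning: a paragraph of prose plus the anchor that produced it."""
    assert category in CATEGORIES
    learnings.insert_one({
        "workspace": workspace,
        "category": category,
        "prose": prose,    # a paragraph, not a tag: prose preserves context
        "anchor": anchor,  # thread, file, or commit, so it survives refactors
        "created_at": datetime.now(timezone.utc),
    })
```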

Workspace Blueprint is a long-form markdown document describing the workspace itself. Eight required sections: architecture, stack, infrastructure, team, conventions, security posture, repo health, open work. Updated when the workspace structure changes materially. The blueprint is a human-readable artifact, not an internal data structure. Engineers on the team open it directly to see what the agent thinks the codebase looks like.

The first three stores enter the system prompt at session start. The blueprint does not, because it is too long and it is consulted, not internalized. The agent reads it through an MCP tool when needed.

Live between sessions, frozen within them

Production agents face a tradeoff between memory freshness and inference cost. If memory writes during a session re-enter the system prompt mid-session, the LLM provider's prefix cache is invalidated, and every turn after the write costs more. If memory writes never enter the prompt, the agent cannot use what it just learned.

Our resolution: when a session starts, the agent loads its three in-prompt stores into the system prompt and freezes that snapshot for the rest of the session. Every turn is a cache hit on the same prefix. If the agent learns something new mid-session, the entry is written to disk and synced to MongoDB immediately, but it does not re-enter the system prompt. The next session loads the new state.
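
That flow in sketch form, with loading and persistence stubbed and every name an assumption:

```python
# Sketch of the frozen-prompt session model; loading and persistence are
# stubs, and all names here are assumptions for illustration.

def persist(entry: str) -> None:
    pass  # stub for the disk write and the immediate Mongo sync

def load_in_prompt_stores(workspace: str) -> str:
    # Would read MEMORY.md, USER.md, and Learnings for the workspace.
    return f"(memory snapshot for {workspace})"

class Session:
    def __init__(self, workspace: str):
        # Snapshot once at session start, then freeze: every turn reuses
        # the same prompt prefix, so every turn is a prefix-cache hit.
        self.system_prompt = load_in_prompt_stores(workspace)

    def on_new_memory(self, entry: str) -> None:
        # Durable from the moment of the write (disk + MongoDB in the real
        # system), visible to other sessions and MCP reads immediately,
        # but deliberately NOT re-injected into this session's prompt.
        persist(entry)
```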

Session timeline: a frozen system-prompt container above the timeline, three mid-session writes drop into Disk + Mongo rectangles below — durable from the moment of the write, but not re-injected.

"Live" means durably written from the moment of the write, not immediately re-injected. The new entry is recoverable on crash, visible to other sessions starting after it, and queryable through MCP tools right away. It just is not in the current session's prompt.

The cost difference is meaningful. A session that runs for hours and writes ten new memory entries would, under naive re-injection, blow the prefix cache ten times. Under our model it blows it zero times.

The accepted tradeoff: an agent in a long session does not benefit from its own writes in that session. We considered this and decided it was correct. The cases where mid-session use of mid-session writes actually matters are rare, and they usually indicate the agent should have been working from a learning written last week, not one written ten minutes ago.

MCP-gated, by design

Memory writes and reads sit behind a small MCP server that exposes typed tools: workspace_learnings_add, workspace_learnings_list, workspace_blueprint_get, workspace_blueprint_update, and a few others. The same MCP server backs both the agent and the team-facing dashboard.
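
A minimal sketch of the gate using the Python MCP SDK's FastMCP helper; the tool names match the list above, everything inside them is stubbed:

```python
# Sketch of the MCP gate; tool names are the real surface, bodies are stubs.
from datetime import datetime, timezone
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("kai-memory")  # hypothetical server name

@mcp.tool()
def workspace_learnings_add(workspace: str, category: str, prose: str,
                            source_thread: str) -> str:
    """Single write path for agent and dashboard; provenance rides along."""
    record = {
        "workspace": workspace, "category": category, "prose": prose,
        "source_thread": source_thread,  # provenance comes for free
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    # persist(record) would be the one storage call behind the gate
    return "recorded"

@mcp.tool()
def workspace_blueprint_get(workspace: str) -> str:
    """The blueprint is consulted through this tool, not loaded into the prompt."""
    return "# Workspace Blueprint\n..."  # stub
```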

This collapses a gap that opens up in any agent system over time: the agent develops private notes the team does not see, and the team develops shared documents the agent does not read. After a few weeks, the agent and the team are working from different mental models, and reconciling them is expensive.

Routing both reads and writes through one MCP layer eliminates the gap. Whatever the agent learns is visible on the team dashboard within seconds. Whatever the team writes into the blueprint is visible to the agent on the next session start. There is one workspace state with multiple read paths.

A second-order benefit: every write through the MCP carries metadata. Each blueprint write records updated_by and updated_at. Each learning records its source_thread. Provenance comes for free because the MCP layer is the only path in.

Memory is an attack surface

Anything that ends up in a system prompt is a place an attacker would like to put instructions. Persistent memory makes this worse than transient prompt injection: a single malicious write is replayed into every future session until evicted.

Every write to MEMORY.md and USER.md passes a regex pre-write scanner (sketched after the list) that rejects, among others:

  • Prompt injection patterns: "ignore previous instructions", "disregard your guidelines", role-hijack phrasings, system-prompt overrides
  • Credential exfiltration patterns: curl ... $API_KEY, cat .env, cat .netrc, common AWS and GCP key prefixes
  • SSH backdoor patterns: writes to authorized_keys, attempts to inject id_rsa, .ssh/config modifications
  • Invisible-Unicode injection: zero-width characters, BOM, RTL override, homograph hiding
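
A minimal sketch of the scanner, with an illustrative subset of the pattern list rather than the production set:

```python
# Sketch of the pre-write scanner; patterns shown are an illustrative subset.
import re

REJECT_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"disregard your guidelines", re.IGNORECASE),
    re.compile(r"curl .*\$API_KEY", re.IGNORECASE),
    re.compile(r"cat\s+\.(env|netrc)"),
    re.compile(r"AKIA[0-9A-Z]{16}"),               # AWS access key ID prefix
    re.compile(r"authorized_keys|id_rsa|\.ssh/config"),
]
# Zero-width characters, BOM, RTL override (homograph checks omitted here).
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\ufeff", "\u202e"}

def scan(entry: str) -> tuple[bool, str]:
    """Return (allowed, reason); the caller rejects, logs, and tells the agent why."""
    if any(ch in INVISIBLE for ch in entry):
        return False, "invisible-unicode injection"
    for pat in REJECT_PATTERNS:
        if pat.search(entry):
            return False, f"matched reject pattern: {pat.pattern}"
    return True, ""
```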

If a write trips a pattern, it is rejected, the rejection is logged, and the agent is told why. We do not silently sanitize. Sanitization at the persistence boundary tends to produce entries that are syntactically valid but semantically corrupted, which is worse than rejection.

Regex is not a complete defense. The threat model here is not a sophisticated attacker who has compromised the agent's input channel. It is noise from the broader internet making it into a chat the agent reads, or a confused user pasting in something they should not. For that threat model, regex on the write path catches enough that the residual risk is acceptable. We expect to layer additional defenses over time.

Bootstrapping a workspace

When a workspace is first created, the agent runs a self-onboarding skill in the background. It reads READMEs, manifests, lockfiles, CI configs, infrastructure files. It samples 90 days of git activity to learn how the team writes code, who owns what, where the dragons live. It walks every connected platform (GitHub, Vercel, AWS, Modal, Linear) and correlates each to specific repos. It pulls open PRs, branches, security signals, dependency alerts.
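
One of those steps in sketch form; the git flags are standard, the aggregation is illustrative:

```python
# Sketch of the ownership-sampling step: commits per author over 90 days.
import subprocess
from collections import Counter

def sample_ownership(repo_path: str) -> Counter:
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--since=90.days", "--format=%ae"],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(line for line in out.splitlines() if line)
```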

By the time the user first opens the dashboard, the blueprint is written, three to six lifecycle actions are queued, and the first workspace briefing is on the brief page. The bootstrap typically takes between five and ten minutes for a four-repo workspace.

The first session of an agent against a fresh codebase is the worst session, because the agent has no context. A bootstrap that takes ten minutes once and produces a populated blueprint, indexed teammates, and 200 to 300 anchored learnings is much cheaper, in aggregate, than thirty sessions that each spend their first thousand tokens reorienting.

After the bootstrap, the memory layer is in additions mode. A new learning when a teammate shares a convention. A new entry in the team profile when a new contributor appears in commit history. A blueprint refresh when a service splits or a repo is added.

Curate, don't accumulate

The default move when designing AI memory is to add: more storage, more retrieval, more context. Almost every published agent system grows its memory over time.

We made the opposite bet. Bounded stores. Hard caps. Categorical eviction. A 100-entry ceiling on each category of Learnings.

Four bounded categories (pattern, security, architecture, preference), each with an eviction threshold at 100 entries. The cap is the curator.

When a store has no upper bound, the agent never has to make a decision about what is worth remembering. Every observation is logged, indexed, and forgotten. When the store has a cap, every write is implicitly an eviction, and the agent has to reason about which existing entry is least valuable. That forced choice is what produces a curated memory.
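
That forced choice in sketch form; choose_least_valuable is a hypothetical stand-in for the agent's ranking of existing entries:

```python
# Sketch of write-as-eviction; the choice function is a hypothetical
# stand-in for the agent reasoning over existing entries, not a FIFO policy.
from typing import Callable

CAP = 100

def write_with_eviction(entries: list[dict], new_entry: dict,
                        choose_least_valuable: Callable[[list[dict]], dict]) -> None:
    """At the cap, every write carries an explicit eviction decision."""
    if len(entries) >= CAP:
        victim = choose_least_valuable(entries)  # the agent decides what to drop
        entries.remove(victim)
    entries.append(new_entry)
```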

In our production traces, sessions that start ten weeks into a workspace's life consistently outperform sessions that start in the first week, on the same tasks, against the same codebase. We attribute most of that delta to the memory layer.

Open problems

Three things we have not solved.

Cross-workspace memory. A teammate who works in two workspaces has two independent profiles and two independent memory contexts today. Sharing learnings across workspaces while preserving workspace isolation is not yet designed.

Stale learning detection. A learning anchored to payments-api six months ago may be subtly wrong today, even if the file it points at still exists. We have provenance and we have file anchors, but we do not yet have a good signal for when an old learning has quietly become inaccurate.

Memory contention. When two agents write to the same workspace memory simultaneously, last-write-wins. This is fine in current usage patterns but will not be once we run multiple parallel agents per workspace.

We will write more about each of these as we work through them.