A Developer Inside a Code Model
By Kai Team · Published 2026-05-12
Fine-tuning a code model on one developer's 20 GitHub repos did not teach it new things. It changed which of its existing circuits got used, how strongly, and on which tokens. Here is how we read that, what almost fooled us, and what it means for codebase coherence across humans, Cursor, Copilot, Claude Code, and Codex.
Every developer leaves a shape on the code they write. Naming patterns, idioms, defaults, which crate or import they reach for when context is ambiguous. Experienced reviewers read it without thinking. The question we set out to answer is whether that shape lives inside a model trained on that developer's code, and what kind of object it is if it does.
We fine-tuned Qwen2.5-Coder-1.5B on one developer's 20 GitHub repos and looked inside. The base model already knew Rust, Python, TypeScript, and Go. After fine-tuning, what changed was not knowledge but routing: which of its existing circuits got used, how strongly, and on which tokens. The LoRA did not add a developer to the model. It promoted the developer's preferred routes from background to foreground.
The shift is observable at the token level. Next-token prediction on pub use crate:: moves from 'error' at p=0.039 on the base model to 'config' at p=0.260 on the LoRA-merged model. 'error' is a reasonable language-model guess for what follows pub use crate:: on the open internet. 'config' is what follows it in this developer's actual Rust crates. The fine-tune rewired prediction toward an existing identifier without teaching the model anything new about Rust.
This is what a personal-style fine-tune looks like at the parameter level. The rest of this post is how we read it, what almost fooled us, and why the result matters for keeping a codebase coherent across a growing population of human and IC-agent contributors.
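The token-level shift described above comes down to comparing two next-token distributions on the same prompt. The sketch below is a minimal illustration under stated assumptions: the four-token vocabulary and all logit values are made up for the example and are not the measurements quoted in this post, and `softmax` and `top1` are hypothetical helper names.

```python
import math

def softmax(logits):
    """Convert a {token: logit} mapping into {token: probability}."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top1(probs):
    """Return the (token, probability) pair with the highest probability."""
    return max(probs.items(), key=lambda kv: kv[1])

# Toy logits over a 4-token vocabulary; illustrative values, not measurements.
base_logits = {"error": 1.2, "config": 0.9, "utils": 1.0, "types": 0.8}
lora_logits = {"error": 1.0, "config": 2.6, "utils": 0.9, "types": 0.8}

base_probs = softmax(base_logits)
lora_probs = softmax(lora_logits)

# Per-token change in probability mass between the two models.
shift = {tok: lora_probs[tok] - base_probs[tok] for tok in base_probs}
largest_gain = max(shift, key=shift.get)

print("base top-1:", top1(base_probs)[0])
print("lora top-1:", top1(lora_probs)[0])
print("largest gain:", largest_gain)
```

In a real run the two logit dictionaries would come from the base and LoRA-merged models evaluated on the same prompt; everything downstream of that stays the same.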
Setup, briefly
Base: Qwen2.5-Coder-1.5B, 28-layer transformer, generic code pretraining mix. Adapter: rank-32 LoRA over 6,374 files (17M tokens) across 13 Python repos, 3 Rust crates, 3 TypeScript apps, 1 Go service. Training loss 1.05 → 0.62.
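For readers less familiar with what "LoRA-merged" means, here is a minimal sketch of the standard LoRA merge, W' = W + (alpha / r) * B A, in pure Python. The toy shapes, the rank of 1, and the `alpha` value are assumptions chosen for readability; this post states only that the adapter is rank 32 and does not give the scaling factor.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the standard LoRA merge."""
    scale = alpha / rank
    delta = matmul(b, a)  # low-rank update, same shape as W
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

# Toy 3x3 weight with a rank-1 adapter (hypothetical values).
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [[1.0], [0.0], [2.0]]   # shape (d_out, rank)
a = [[0.5, 0.0, -0.5]]      # shape (rank, d_in)

merged = merge_lora(w, b, a, alpha=2.0, rank=1)
for row in merged:
    print(row)
```

The merged matrix has the same shape as the original, which is why the base and merged models can be read with the same tools.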
We then read the merged model with three independent methods:
- SAE drift: a TopK sparse autoencoder on layer-12 activations, diffing firing rates between BASE and LoRA-merged.
- Variational parameter decomposition (VPD): Goodfire's full recipe on the weights, with rank-1 components, a per-token gating network, and a persistent-PGD adversarial attack on the gate sources.
- Multi-layer attribution: eight independently trained VPDs (layers 0/4/8/12/16/20/24/27), stitched into one system, with integrated gradients on the gates for a specific (prompt, target-token) pair.
The three reads disagree with each other in instructive ways.
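The third read relies on integrated gradients over the gates. A minimal sketch of that technique follows, assuming a toy three-gate function in place of a real target-token logit and a finite-difference gradient in place of autograd. `toy_logit` and the gate values are hypothetical; only the attribution rule itself is standard.

```python
def toy_logit(gates):
    """Hypothetical stand-in for a target-token logit as a function of gates."""
    g0, g1, g2 = gates
    return g0 * g1 + 2.0 * g2

def grad(f, x, eps=1e-6):
    """Central-difference gradient of f at x."""
    out = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        out.append((f(hi) - f(lo)) / (2 * eps))
    return out

def integrated_gradients(f, x, baseline, steps=64):
    """Midpoint Riemann approximation of integrated gradients."""
    total = [0.0] * len(x)
    for s in range(steps):
        t = (s + 0.5) / steps  # midpoint of the s-th interval on [0, 1]
        point = [b + t * (xi - b) for xi, b in zip(x, baseline)]
        total = [acc + gi for acc, gi in zip(total, grad(f, point))]
    # Scale the averaged gradient by the distance from the baseline.
    return [(xi - b) * acc / steps for xi, b, acc in zip(x, baseline, total)]

gates = [0.9, 0.8, 0.1]
baseline = [0.0, 0.0, 0.0]  # all gates closed
attr = integrated_gradients(toy_logit, gates, baseline)
print([round(a, 4) for a in attr])
```

A useful sanity check is completeness: the attributions should sum to the difference between the function at the input and at the baseline. Passing that check says the attribution is internally consistent, not that any single component is causally necessary.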
The wrong answer (twice)
SAE drift on layer 12 turns up 1,848 features with non-trivial rate changes. The 99th percentile of |log-ratio| reaches 11.46. The three biggest movers, autolabeled by DeepSeek-V3: Django field detectors, Python optparse imports, _VERSION tokens in package metadata. Looks like a small library of personal coding concepts bolted onto the base model.
It is not. SAE drift measures firing rate, not causal contribution. If the evaluation slice contains more Rust or more Django than the BASE pretraining distribution, every feature aligned with that distribution looks "newly important" whether or not the LoRA touched it.
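The confound is easy to reproduce in miniature. The sketch below is a simplified illustration, not the pipeline used here: it takes one hypothetical feature that is never modified and shows that its firing rate, and therefore its log-ratio, moves purely because the language mix of the input changed. The slices, the feature, and the epsilon are all assumptions.

```python
import math

def firing_rate(activations):
    """Fraction of tokens on which a feature is active (activation > 0)."""
    return sum(1 for a in activations if a > 0) / len(activations)

def log_ratio(rate_a, rate_b, eps=1e-6):
    """log(rate_b / rate_a), with a small epsilon to avoid division by zero."""
    return math.log((rate_b + eps) / (rate_a + eps))

def rust_feature(token_lang):
    """A hypothetical feature that fires on Rust tokens; never modified below."""
    return 1.0 if token_lang == "rust" else 0.0

# Two toy slices with different language mixes (one label per token).
generic_slice = ["python"] * 90 + ["rust"] * 10
rust_heavy_slice = ["python"] * 40 + ["rust"] * 60

rate_generic = firing_rate([rust_feature(t) for t in generic_slice])
rate_heavy = firing_rate([rust_feature(t) for t in rust_heavy_slice])
drift = log_ratio(rate_generic, rate_heavy)

print(f"rate {rate_generic:.2f} -> {rate_heavy:.2f}, log-ratio {drift:.2f}")
```

The feature's definition is identical in both measurements, yet the log-ratio is large. Any rate-based drift score inherits this sensitivity to the data it is measured on.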
VPD on the weights, run faithfully, surfaces a small alive set of subcomponents at each layer. The persistent adversarial inner loop is the active ingredient: without it, 99% of gates end up above 0.5 and the decomposition stops carrying information. Autolabels read like a competent description of the corpus: pub and use as Rust module-visibility declarations, /// as Rust doc comments, /** as TypeScript JSDoc. Clean writeup. Reviewers would have nodded.
Browse the full alive set yourself. Use the matrix dropdown to switch between Q, K, V, O, MLP-up, MLP-down; the BASE / +LoRA pill toggles between the two trained models on the same corpus.
We ran the control. Take BASE Coder-1.5B with no LoRA merged, train a fresh VPD on the same corpus, autolabel with the same prompt. Same Rust labels surfaced. Different component indices in a different model, same labels.
The labels described the corpus, not the LoRA. VPD is faithful to the data you train it on. Feed it Rust, it finds Rust-shaped directions, because the base model already had them.
The real question is not which components are alive, but how the alive set differs between BASE and LoRA on the same input.
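As a sketch of what "how the alive set differs" can mean operationally, the snippet below thresholds two gate vectors and takes set differences. It rests on two loud assumptions: that components have already been matched to a shared index across the two models (separately trained decompositions do not give you this for free, as the control above shows), and that 0.5 is a sensible liveness threshold. The gate values and the `alive_set` and `diff_alive` names are hypothetical.

```python
def alive_set(gates, threshold=0.5):
    """Indices of components whose gate exceeds the threshold."""
    return {i for i, g in enumerate(gates) if g > threshold}

def diff_alive(base_gates, lora_gates, threshold=0.5):
    """Compare alive sets of two models on the same input.

    Assumes index i refers to the same (matched) component in both lists.
    """
    base_alive = alive_set(base_gates, threshold)
    lora_alive = alive_set(lora_gates, threshold)
    return {
        "promoted": sorted(lora_alive - base_alive),  # alive only after LoRA
        "demoted": sorted(base_alive - lora_alive),   # alive only in BASE
        "shared": sorted(base_alive & lora_alive),    # alive in both
    }

# Toy per-component gate values on one token (illustrative only).
base_gates = [0.9, 0.1, 0.6, 0.7]
lora_gates = [0.95, 0.8, 0.1, 0.75]

report = diff_alive(base_gates, lora_gates)
print(report)
```

The interesting signal is in the promoted and demoted lists; the shared list is where the corpus, rather than the fine-tune, is doing the talking.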
What the LoRA actually did
Holding the corpus fixed and diffing the alive set at layer 12:
None of this amounts to new concepts. It is routing: which directions the model uses, how strongly it gates them, how much of attention is dedicated to them. The LoRA did not bolt on Rust knowledge. It promoted Rust knowledge from background to foreground.
The pub use crate:: shift from the intro becomes legible at the parameter level. BASE makes its top-1 prediction from sparse V-row activity alone. The LoRA-merged model fires K, V, and MLP components in every one of the 8 trained layers. On the attribution graph, the "pub" column lights up across all trained layers and the network of circles widens by roughly 50%.
Easiest way to see this is integrated gradients on the gates of the full 8-layer system, computed for the predicted next-token logit. Each circle below is one alive component firing at one token; lime raises the prediction, copper suppresses it. Toggle BASE / +LoRA on a single prompt to watch the topology change.
Why this matters for coherence
Coherence in an AI-augmented engineering org is not a knowledge problem. Cursor, Claude Code, Copilot, Codex, and humans all know roughly the same things about Rust, Python, and TypeScript. They differ in routing: which idioms they reach for, how aggressively they use a given pattern, what defaults they pick when context is ambiguous. The fingerprint is in the routing.
That maps cleanly onto what coherence actually requires inside Kai:
- Reading team conventions from commit history and PR review is a routing-extraction problem. The base model already has every idiom the team uses; the team's identity is in which of those idioms get promoted in which contexts.
- Reconciling new code against those conventions is a routing-comparison problem. A diff that passes lint and tests can still route differently than the rest of the codebase, and that drift is exactly what coherence work catches.
- Dedup of overlapping logic and removal of dead paths is downstream of having a stable routing signature to compare new code against. Without it, "this looks like the team" stays a vibe.
On this reading, two engineers in the same codebase can write syntactically near-identical code and still train materially different LoRAs against the same base, because their routing deltas differ. A LoRA on engineer A's PRs against the same base is a candidate signature for how A's code asks the model to think. The same is true for an IC agent. A LoRA on Cursor-generated PRs in a given repo is a candidate fingerprint for how Cursor routes in that context. The same goes for Copilot, Claude Code, Codex, and other agents.
We are not productizing per-engineer or per-agent fingerprinting on the strength of this single experiment. The model is small (1.5B). The VPD training is short (1,500 steps versus the paper's 400,000). The LoRA is one person on twenty repos. The methodology composes upward, but the case study is a case study.
What we are saying: the right unit of analysis for coherence is routing, and routing is observable. Future Memory work in Kai operates against that unit rather than against shallower proxies like AST patterns or commit-message style.
Limits worth stating
- Scale. The VPD paper trains for 400,000 steps with a shared transformer Γ over all matrices. We trained for 1,500 steps with a per-matrix MLP Γ. Numbers are directional, not absolute.
- Single individual. One person's 20 repos. No statistical comparison to a population. The fingerprinting claim is case-study level.
- No causal ablation yet. The 8-layer attribution graph is a faithful linearization of the gate-level Jacobian via integrated gradients. We have not run targeted ablations to confirm that suppressing a highlighted circle changes the predicted token. That is the obvious next experiment.
- A β_Δ bug. Early VPD runs had a hardcoded 1e-3 in the main-phase loss that should have been 1e7. We caught it and re-ran. Numbers in this post are from the corrected runs.
- Reconstruction is exact by construction. ‖W − ΣUᵢVᵢᵀ − Δ‖₂ ≈ 0 because Δ absorbs whatever the rank-1 decomposition misses. What is load-bearing is gate sparsity, not the residual.
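The last limit can be checked in a few lines. The sketch below assumes a toy 2x2 weight and two deliberately poor rank-1 components, and uses the Frobenius norm as a convenient stand-in for the norm in the text. All values are hypothetical. Because Δ is defined as whatever is left over, the residual vanishes no matter how bad the components are.

```python
import math

def outer(u, v):
    """Rank-1 matrix u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def add(a, b):
    """Elementwise sum of two matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    """Elementwise difference of two matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def frobenius(m):
    """Frobenius norm of a matrix."""
    return math.sqrt(sum(x * x for row in m for x in row))

w = [[1.0, 2.0], [3.0, 4.0]]
# Deliberately poor rank-1 components: they do not approximate W well.
components = [([1.0, 0.0], [0.5, 0.5]), ([0.0, 1.0], [1.0, -1.0])]

approx = [[0.0, 0.0], [0.0, 0.0]]
for u, v in components:
    approx = add(approx, outer(u, v))

delta = sub(w, approx)                 # Delta is defined as the leftover
residual = sub(sub(w, approx), delta)  # W - sum(U_i V_i^T) - Delta

print("norm of Delta:   ", frobenius(delta))
print("norm of residual:", frobenius(residual))
```

A zero residual therefore says nothing about decomposition quality. The size of Δ and the sparsity of the gates are the quantities worth watching.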
None of these change the headline. The corpus drives which components look interesting (this is the false positive we nearly shipped). The LoRA drives routing (this is the result).
Open kit
Everything is reproducible. The repo is at github.com/eren23/coder-interp-tap and runs end-to-end on a single RTX 4090 or A6000 pod; each project YAML reproduces one stage. W&B runs for the LoRA-merged VPD, the BASE control VPD, the SAE feature diff, and both attribution graphs are linked in the original writeup.
If you build on this, especially in the direction of per-agent or per-engineer fingerprinting, we want to hear about it.