Kai · Research Agent · SDFT Training Report

Teaching a Model to
Teach Itself

How an evolutionary engine discovered that a 16× learning rate, curriculum ordering, and progressive token weighting can drive eval loss from 8.79 to 0.003 in thirty steps.

Run: 2025-06-14 to 2025-06-15 · Model: Qwen3.5-4B · Infra: Tinker · Engine: KAI Evolve
Final Score: 0.9966 (from 0.1022)
Loss Reduction: 2,585× (8.79 → 0.003)
Faster: 2.4× (38 min → 16 min)
Iters to Best: 14 (of 20 productive)

The Problem: Thirty Steps Is Almost Nothing

Self-Distillation Fine-Tuning (SDFT) is a beautifully simple idea. You take a frozen teacher model, condition it on demonstrations, let it generate completions, and then train a student model to reproduce those completions via cross-entropy loss (forward KL divergence). The student learns to mimic the teacher. The teacher never changes. On paper it should work like a charm. For this run, all training and inference happened through Tinker, Thinking Machines' training API, which abstracts away the infrastructure and exposes model training as four clean function calls: forward, backward, optimize, sample.
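The Tinker calls themselves are not reproduced here, but the SDFT objective is easy to sketch. The following is a minimal numpy illustration (not the run's actual code) of the loss being minimized: token-level cross-entropy of the student on tokens sampled from the frozen teacher, which is a Monte Carlo estimate of the forward KL divergence.

```python
import numpy as np

def sdft_loss(student_logits, teacher_tokens):
    """Cross-entropy of the student on teacher-generated completions.

    student_logits: (seq_len, vocab) unnormalized student scores
    teacher_tokens: (seq_len,) token ids sampled from the frozen teacher
    """
    # numerically stable log-softmax over the vocabulary
    z = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # negative log-likelihood of each teacher token under the student
    nll = -log_probs[np.arange(len(teacher_tokens)), teacher_tokens]
    return nll.mean()
```

A student with uniform logits over a vocabulary of size V scores log(V), which is roughly where the 8.79 baseline loss sits for a large vocabulary: the student starts out essentially guessing.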

But here is the constraint that makes things interesting: we deliberately set a limit of 30 training steps on a fixed subset of 2,048 training pairs. Thirty steps is almost nothing. A standard learning rate of 5e-5 will barely move the weights. A warmup schedule spanning 20 of your 30 steps will spend two-thirds of the budget accomplishing nothing. And if your generation budget is 1,024 tokens per sample, you are waiting twice as long as you need to.

The question becomes: can an evolutionary search, guided by an LLM, discover a training recipe that actually works within this brutal constraint? The scoring function is simple and honest:

Scoring Function
combined_score = 1 / (1 + eval_token_loss)

A perfect score of 1.0 means zero eval loss: the student and teacher are indistinguishable. Kai used MAP-Elites with 4 islands and two feature dimensions: convergence_speed and train_eval_gap. Each candidate program was evaluated end-to-end, no shortcuts.
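The mapping from loss to score is a one-liner; the following snippet reproduces the headline numbers from the report (function name is mine, not Kai's).

```python
def combined_score(eval_token_loss: float) -> float:
    """Map eval loss to (0, 1]; 1.0 means zero loss (perfect mimicry)."""
    return 1.0 / (1.0 + eval_token_loss)

# Sanity check against the report's numbers:
# baseline loss 8.7892 -> score ~0.1022
# final loss   0.0034 -> score ~0.9966
```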

The Full Trajectory

Let's look at every iteration. Twenty programs ran productively; twelve more crashed due to billing exhaustion (HTTP 402 from the Tinker API), and two failed with code errors. The chart below tells the whole story. Notice that the winning iterations (gold) are a small minority. Most of evolution is noise. The signal is rare and precious.

Combined Score Across All 20 Iterations

Higher is better. Color indicates iteration outcome.

The baseline (iter 0) scored 0.102, which is terrible. An eval loss of 8.79 means the student is essentially producing noise relative to the teacher. Then iter 2 happened: a single mutation jumped the score from 0.358 to 0.887. That is a +0.529 improvement in one generation. Nothing else in the run comes close.

The Loss Curve Nobody Expected

When you plot eval loss on a log scale across only the record-setting iterations, something beautiful emerges: a nearly clean exponential decay. The winning lineage cut the loss from 8.79 to 1.79 to 0.13 to 0.04 to 0.02 to 0.008 to 0.003: six jumps across fourteen iterations, spanning almost three and a half orders of magnitude in total. The LLM was not stumbling; it was systematically tightening the screws.
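The decay claim is easy to verify from the rounded loss values quoted above: sum the per-jump log10 drops and they account for the full reduction.

```python
import math

# Record-setting eval losses, as reported (rounded)
losses = [8.79, 1.79, 0.13, 0.04, 0.02, 0.008, 0.003]

# Orders of magnitude removed by each jump in the winning lineage
jumps = [math.log10(a / b) for a, b in zip(losses, losses[1:])]

# Total reduction in orders of magnitude (log10 of ~2,930x)
total = math.log10(losses[0] / losses[-1])
```

The six jumps average a bit over half an order of magnitude each; the biggest single drop is i2 → i5 (1.79 → 0.13).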

Eval Loss (Log Scale) — Record-Breaking Iterations Only

Each point is a new personal best. Log10 scale.

Diminishing Returns, or: Why Evolution Converges

Here is a fact that should feel both satisfying and inevitable: each new record brought a smaller absolute improvement than the last. The first meaningful jump was +0.529. The final one was +0.005. This is the signature of a system approaching its optimum. The low-hanging fruit gets picked first. Then you are left polishing decimals.

Score Delta at Each Record-Breaking Iteration

How much each new record improved over the previous best.

The tallest bar belongs to iter 2. Whatever mutation the LLM applied there was the single most consequential decision of the entire run. Everything after was refinement.

The Phylogeny of Programs

Kai maintains an explicit parent-child tree. Every new program descends from an existing one. What the tree reveals is a single dominant lineage cutting through a field of failed experiments.

Phylogeny of Programs

i0 (0.102) · BASELINE
i1 (0.358) · first improvement
i2 (0.887) · breakthrough
i3 (0.269) · overfit, dead end
i16 (0.439) · regression
i5 (0.958) · refining
i6 (0.000) · crash
i7, i9, i10 · mediocre branches
i8 (0.984) · fine-tuning
i12, i13 · diminishing
i11 (0.992) · polishing
i15 (0.984) · close
i14 (0.997) · FINAL BEST
Gold = record · grey = non-winning branches

The champion lineage is i0, i1, i2, i5, i8, i11, i14. Six mutations from baseline to final best. Every side branch either crashed (i6), overfitted (i3), regressed (i16), or plateaued. The winning line never forked. It just kept refining.

This is a pattern worth paying attention to. In evolutionary search, early mutations have outsized long-term effects. The iter 2 mutation was not just a good generation; it was the ancestor of all future improvement. Like finding a fertile valley early in a search through rugged terrain. Once you are in the right basin, gradient-like refinements keep working. If you are in the wrong one, they just make you the best version of a mediocre idea.

What the LLM Discovered

Now for the part that feels almost eerie. The LLM was not given any instructions about what hyperparameters to change or in what direction. It received only the training script, the scoring function, and the results of previous evaluations. Here is what it figured out:

Parameter | Baseline | Best (i14) | Change
Learning rate | 5e-5 | 8e-4 | 16× higher
Warmup steps | 20 | 2 | 10× fewer
Batch size | 32 | 20 | 0.625×
Max gen tokens | 1024 | 512 | 0.5× (2× faster)
Temperature | 0.7 | 0.55 | lower (more focused)
LR schedule | cosine → 0% | cosine → 0.5% | nonzero floor
Eval frequency | every 15 steps | every 5 steps | 3× more frequent
Curriculum | none | sort by demo length | easy-first
Token weighting | uniform | ramp 0.5 → 1.5 | later tokens matter more
Batching | single pass | 3-pass curriculum | structure + diversity
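The curriculum and multi-pass batching rows combine naturally. Here is a hypothetical sketch of that strategy (the function and its window-shuffle detail are my illustration, not the evolved program): sort pairs easy-first by demonstration length, then run several passes over the same curriculum, adding local shuffling for diversity on later passes.

```python
import random

def curriculum_batches(pairs, batch_size=20, passes=3, seed=0):
    """Sketch of easy-first curriculum batching with repeated passes.

    pairs: list of (prompt, demonstration) string tuples.
    """
    rng = random.Random(seed)
    # easy first: shorter demonstrations are assumed easier to mimic
    ordered = sorted(pairs, key=lambda p: len(p[1]))
    batches = []
    for p in range(passes):
        work = ordered[:]
        if p > 0:
            # later passes shuffle within batch-sized windows, adding
            # diversity while keeping the global easy-to-hard structure
            for i in range(0, len(work), batch_size):
                window = work[i:i + batch_size]
                rng.shuffle(window)
                work[i:i + batch_size] = window
        batches += [work[i:i + batch_size]
                    for i in range(0, len(work), batch_size)]
    return batches
```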

Learning Rate Across Record-Breaking Iterations

The engine discovered that a 16× higher LR works better with minimal warmup.

Discovery Timeline

i2: 16× learning rate · minimal warmup · halved generation
i5: curriculum learning · lower temperature · frequent eval
i8: token weighting ramp · LR decay floor
i11: 3-pass batching
i14: all parameters refined

The single most impactful change was a 16× increase in learning rate. With only 30 steps, the baseline's 5e-5 was whispering to the weights. The LLM cranked it to 8e-4 on iter 2, which is exactly the iteration that produced the largest score jump. This is not a subtle finding. It is a sledgehammer insight: when your step budget is tiny, you must learn fast or not at all.
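Putting the table's LR rows together, a schedule matching the reported hyperparameters (a sketch, not the run's actual code) looks like this: 2-step linear warmup to the 8e-4 peak, then cosine decay to a 0.5% floor rather than zero.

```python
import math

def evolved_lr(step, total_steps=30, peak=8e-4, warmup=2, floor_frac=0.005):
    """Cosine LR schedule with the evolved settings: high peak,
    minimal warmup, and a nonzero floor instead of decaying to zero."""
    if step < warmup:
        # linear warmup over just 2 of the 30 steps
        return peak * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    floor = peak * floor_frac  # never decay below 0.5% of peak
    return floor + (peak - floor) * cosine
```

Compare the baseline: 20 warmup steps out of 30 means the peak LR is reached with only a third of the budget left, and at 5e-5 even the peak barely moves the weights.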

What makes this sequence beautiful is the ordering. The LLM solved the most important problem first (learning rate), then added structure (curriculum), then added finesse (token weighting, LR floor), and finally combined everything (3-pass batching). It prioritized like a good engineer would.

Training Efficiency: Getting Faster While Getting Better

A common anxiety with evolutionary methods: do they just find better solutions by spending more compute? Here, the opposite happened. The best program is not only more accurate but 2.4× faster than the baseline. The main savings come from halving the generation token budget (1024 to 512) and reducing eval samples (50 to 30). Less waste, tighter focus.

Evaluation Time per Iteration (minutes)

Blue highlights indicate record-breaking iterations.

Total wall-clock time across all 20 iterations: approximately 8.5 hours. The engine found near-optimal hyperparameters in under 9 hours of compute — a task that would take a human researcher days of manual experimentation.

Failure Is the Norm

Of 20 productive iterations, only 6 set new records. The rest were regressions, plateaus, or outright crashes. Iter 3 produced catastrophic overfitting (train-eval gap of 2.11). Iter 6 introduced an undefined variable. Iter 16 regressed to 0.44 from a near-perfect starting point.

This is how evolutionary search actually works. The majority of mutations are harmful or neutral. You survive on the rare beneficial ones. What MAP-Elites provides is a memory: the good solutions are preserved, and bad ones die without corrupting the archive. The four islands provide additional insurance through diversity. Island D went entirely extinct. Islands B and C found decent configurations but could not compete. Island A kept steadily refining its winning lineage.
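The archive mechanism that provides this memory is simple to illustrate. Below is a minimal MAP-Elites insertion step (an illustrative sketch, not Kai's engine): discretize the two feature dimensions into a grid cell and keep only the best-scoring program per cell, so harmful mutations die without corrupting the archive.

```python
def insert(archive, candidate, score, features,
           bins=(4, 4), bounds=((0.0, 1.0), (0.0, 1.0))):
    """One MAP-Elites step over a 2-D feature grid.

    features: (convergence_speed, train_eval_gap) for the candidate.
    archive: dict mapping grid cells to (score, candidate).
    """
    # map each feature into its grid cell, clamping to the last bin
    cell = tuple(
        min(int((f - lo) / (hi - lo) * b), b - 1)
        for f, b, (lo, hi) in zip(features, bins, bounds)
    )
    best = archive.get(cell)
    if best is None or score > best[0]:
        archive[cell] = (score, candidate)  # elite replaced
    return cell
```

Running one archive per island gives the diversity insurance described above: an island can go extinct without dragging the others down.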

MAP-Elites Island Status at Termination

Island A: 0.997 (best at iter 14) · champion lineage
Island B: 0.887 (best at iter 2) · good, eclipsed
Island C: 0.581 (best at iter 7) · struggling
Island D: 0.000 (best at iter n/a) · extinct

What Does This Mean?

There is something deeply satisfying about watching an LLM discover, through trial and error, that a 30-step training budget demands aggressive learning rates and minimal warmup. This is not information that was explicitly in the prompt. The LLM had to infer it from the scoring signal: the baseline scored 0.10, and the version with 16× higher learning rate scored 0.89. From that single observation, the winning branch was born.

The curriculum ordering discovery is perhaps even more interesting. Sorting training pairs by demonstration length (easy examples first) is a well-known technique in human pedagogy and occasionally in ML. But the LLM was never told about curriculum learning. It arrived at the idea by evolving the batching strategy and observing that structured ordering produced lower loss than shuffled ordering. It reinvented the wheel, but it reinvented it correctly.

The progressive token weighting (ramping from 0.5 to 1.5 across the sequence, normalized to mean 1.0) is the most subtle finding. It says: the later tokens in a completion matter more. This makes intuitive sense for autoregressive models, since early tokens are largely determined by the prompt while later tokens require genuine model competence. But again, nobody told the LLM this. It discovered the asymmetry by trying it and watching the loss drop.
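The weighting described above is a linear ramp over token positions, renormalized so the overall loss scale is unchanged. A minimal numpy sketch (function names are mine):

```python
import numpy as np

def token_weights(seq_len, lo=0.5, hi=1.5):
    """Linear ramp over positions, normalized to mean 1.0 so the
    loss magnitude stays comparable to uniform weighting."""
    w = np.linspace(lo, hi, seq_len)
    return w / w.mean()

def weighted_loss(per_token_nll):
    """Apply the ramp: later tokens contribute more to the loss."""
    w = token_weights(len(per_token_nll))
    return float((w * per_token_nll).mean())
```

Because the ramp has mean 1.0, a sequence with uniform per-token loss gets exactly the same total loss as before; only the gradient emphasis shifts toward the tail.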

Final Results

eval_token_loss: 8.7892 (baseline) → 0.0034 (final best)
2,585× reduction in 14 iterations

The evolved recipe achieves near-perfect student-teacher alignment in 30 training steps and 16 minutes of compute. It is 2.4× faster than the baseline while cutting eval loss by 2,585×. The entire evolution took roughly 8.5 hours.

Evolution works. Not because every mutation is good, but because the good ones compound. All it takes is one fertile ancestor and the patience to let selection do its work.

Report generated from Kai research agent run data.
Evolution engine: MAP-Elites (4 islands) · LLM backbone: KAI Evolve
Model under training: Qwen/Qwen3.5-4B · Self-Distillation Fine-Tuning · Trained on Tinker
Copyright © 2026 DRIA. All Rights Reserved.