Kai · Research Agent · SDFT Training Report

Teaching a Model to
Teach Itself

How an evolutionary engine discovered that a 16× learning rate, curriculum ordering, and progressive token weighting can drive eval loss from 8.79 to 0.003 in thirty steps.

Run: 2025-06-14 to 2025-06-15 · Model: Qwen3.5-4B · Infra: Tinker · Engine: KAI Evolve
Final Score: 0.9966 (from 0.1022)
Loss Reduction: 2,585× (8.79 → 0.003)
Faster: 2.4× (38 min → 16 min)
Iters to Best: 14 (of 20 productive)

The Problem: Thirty Steps Is Almost Nothing

Self-Distillation Fine-Tuning (SDFT) is a beautifully simple idea. You take a frozen teacher model, condition it on demonstrations, let it generate completions, and then train a student model to reproduce those completions via cross-entropy loss (forward KL divergence). The student learns to mimic the teacher. The teacher never changes. On paper it should work like a charm. For this run, all training and inference happened through Tinker, Thinking Machines' training API, which abstracts away the infrastructure and exposes model training as four clean function calls: forward, backward, optimize, sample.
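The Tinker calls themselves are not reproduced here, but the SDFT objective is easy to sketch. The following is a minimal numpy illustration (not the run's actual code) of the loss being minimized: token-level cross-entropy of the student on tokens sampled from the frozen teacher, which is a Monte Carlo estimate of the forward KL divergence.

```python
import numpy as np

def sdft_loss(student_logits, teacher_tokens):
    """Cross-entropy of the student on teacher-generated completions.

    student_logits: (seq_len, vocab) unnormalized student scores
    teacher_tokens: (seq_len,) token ids sampled from the frozen teacher
    """
    # numerically stable log-softmax over the vocabulary
    z = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # negative log-likelihood of each teacher token under the student
    nll = -log_probs[np.arange(len(teacher_tokens)), teacher_tokens]
    return nll.mean()
```

A student with uniform logits over a vocabulary of size V scores log(V), which is roughly where the 8.79 baseline loss sits for a large vocabulary: the student starts out essentially guessing.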

But here is the constraint that makes things interesting: we deliberately set a limit of 30 training steps on a fixed subset of 2,048 training pairs. Thirty steps is almost nothing. A standard learning rate of 5e-5 will barely move the weights. A warmup schedule spanning 20 of your 30 steps will spend two-thirds of the budget accomplishing nothing. And if your generation budget is 1,024 tokens per sample, you are waiting twice as long as you need to.

The question becomes: can an evolutionary search, guided by an LLM, discover a training recipe that actually works within this brutal constraint? The scoring function is simple and honest:

Scoring Function
combined_score = 1 / (1 + eval_token_loss)

A perfect score of 1.0 means zero eval loss: the student and teacher are indistinguishable. Kai used MAP-Elites with 4 islands and two feature dimensions: convergence_speed and train_eval_gap. Each candidate program was evaluated end-to-end, no shortcuts.
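The mapping from loss to score is a one-liner; the following snippet reproduces the headline numbers from the report (function name is mine, not Kai's).

```python
def combined_score(eval_token_loss: float) -> float:
    """Map eval loss to (0, 1]; 1.0 means zero loss (perfect mimicry)."""
    return 1.0 / (1.0 + eval_token_loss)

# Sanity check against the report's numbers:
# baseline loss 8.7892 -> score ~0.1022
# final loss   0.0034 -> score ~0.9966
```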

The Full Trajectory

Let's look at every iteration. Twenty programs ran productively; twelve more crashed due to billing exhaustion (HTTP 402 from the Tinker API), and two failed with code errors. The chart below tells the whole story. Notice that the winning iterations (gold) are a small minority. Most of evolution is noise. The signal is rare and precious.

Combined Score Across All 20 Iterations

Higher is better. Color indicates iteration outcome.

The baseline (iter 0) scored 0.102, which is terrible. An eval loss of 8.79 means the student is essentially producing noise relative to the teacher. Then iter 2 happened: a single mutation jumped the score from 0.358 to 0.887. That is a +0.529 improvement in one generation. Nothing else in the run comes close.

The Loss Curve Nobody Expected

When you plot eval loss on a log scale across only the record-setting iterations, something beautiful emerges: a nearly clean exponential decay. The winning lineage cut the loss from 8.79 to 1.79 to 0.13 to 0.04 to 0.02 to 0.008 to 0.003: six jumps across fourteen iterations, spanning almost three and a half orders of magnitude in total. The LLM was not stumbling; it was systematically tightening the screws.
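The decay claim is easy to verify from the rounded loss values quoted above: sum the per-jump log10 drops and they account for the full reduction.

```python
import math

# Record-setting eval losses, as reported (rounded)
losses = [8.79, 1.79, 0.13, 0.04, 0.02, 0.008, 0.003]

# Orders of magnitude removed by each jump in the winning lineage
jumps = [math.log10(a / b) for a, b in zip(losses, losses[1:])]

# Total reduction in orders of magnitude (log10 of ~2,930x)
total = math.log10(losses[0] / losses[-1])
```

The six jumps average a bit over half an order of magnitude each; the biggest single drop is i2 → i5 (1.79 → 0.13).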

Eval Loss (Log Scale) — Record-Breaking Iterations Only

Each point is a new personal best. Log10 scale.

Diminishing Returns, or: Why Evolution Converges

Here is a fact that should feel both satisfying and inevitable: each new record brought a smaller absolute improvement than the last. The first meaningful jump was +0.529. The final one was +0.005. This is the signature of a system approaching its optimum. The low-hanging fruit gets picked first. Then you are left polishing decimals.

Score Delta at Each Record-Breaking Iteration

How much each new record improved over the previous best.

The tallest bar belongs to iter 2. Whatever mutation the LLM applied there was the single most consequential decision of the entire run. Everything after was refinement.

The Phylogeny of Programs

Kai maintains an explicit parent-child tree. Every new program descends from an existing one. What the tree reveals is a single dominant lineage cutting through a field of failed experiments.

Phylogeny of Programs

i0 (0.102) · BASELINE
i1 (0.358) · first improvement
i2 (0.887) · breakthrough
i3 (0.269) · overfit, dead end
i16 (0.439) · regression
i5 (0.958) · refining
i6 (0.000) · crash
i7, i9, i10 · mediocre branches
i8 (0.984) · fine-tuning
i12, i13 · diminishing
i11 (0.992) · polishing
i15 (0.984) · close
i14 (0.997) · FINAL BEST
Gold = record · grey = non-winning branches

The champion lineage is i0, i1, i2, i5, i8, i11, i14. Six mutations from baseline to final best. Every side branch either crashed (i6), overfitted (i3), regressed (i16), or plateaued. The winning line never forked. It just kept refining.

This is a pattern worth paying attention to. In evolutionary search, early mutations have outsized long-term effects. The iter 2 mutation was not just a good generation; it was the ancestor of all future improvement. Like finding a fertile valley early in a search through rugged terrain. Once you are in the right basin, gradient-like refinements keep working. If you are in the wrong one, they just make you the best version of a mediocre idea.

What the LLM Discovered

Now for the part that feels almost eerie. The LLM was not given any instructions about what hyperparameters to change or in what direction. It received only the training script, the scoring function, and the results of previous evaluations. Here is what it figured out:

Parameter | Baseline | Best (i14) | Change
Learning rate | 5e-5 | 8e-4 | 16× higher
Warmup steps | 20 | 2 | 10× fewer
Batch size | 32 | 20 | 0.625×
Max gen tokens | 1024 | 512 | 0.5× (2× faster)
Temperature | 0.7 | 0.55 | lower (more focused)
LR schedule | cosine → 0% | cosine → 0.5% | nonzero floor
Eval frequency | every 15 steps | every 5 steps | 3× more frequent
Curriculum | none | sort by demo length | easy-first
Token weighting | uniform | ramp 0.5 → 1.5 | later tokens matter more
Batching | single pass | 3-pass curriculum | structure + diversity
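The curriculum and multi-pass batching rows combine naturally. Here is a hypothetical sketch of that strategy (the function and its window-shuffle detail are my illustration, not the evolved program): sort pairs easy-first by demonstration length, then run several passes over the same curriculum, adding local shuffling for diversity on later passes.

```python
import random

def curriculum_batches(pairs, batch_size=20, passes=3, seed=0):
    """Sketch of easy-first curriculum batching with repeated passes.

    pairs: list of (prompt, demonstration) string tuples.
    """
    rng = random.Random(seed)
    # easy first: shorter demonstrations are assumed easier to mimic
    ordered = sorted(pairs, key=lambda p: len(p[1]))
    batches = []
    for p in range(passes):
        work = ordered[:]
        if p > 0:
            # later passes shuffle within batch-sized windows, adding
            # diversity while keeping the global easy-to-hard structure
            for i in range(0, len(work), batch_size):
                window = work[i:i + batch_size]
                rng.shuffle(window)
                work[i:i + batch_size] = window
        batches += [work[i:i + batch_size]
                    for i in range(0, len(work), batch_size)]
    return batches
```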

Learning Rate Across Record-Breaking Iterations

The engine discovered that a 16× higher LR works better with minimal warmup.

Discovery Timeline

i2: 16× learning rate · minimal warmup · halved generation
i5: curriculum learning · lower temperature · frequent eval
i8: token weighting ramp · LR decay floor
i11: 3-pass batching
i14: all parameters refined

The single most impactful change was a 16× increase in learning rate. With only 30 steps, the baseline's 5e-5 was whispering to the weights. The LLM cranked it to 8e-4 on iter 2, which is exactly the iteration that produced the largest score jump. This is not a subtle finding. It is a sledgehammer insight: when your step budget is tiny, you must learn fast or not at all.
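Putting the table's LR rows together, a schedule matching the reported hyperparameters (a sketch, not the run's actual code) looks like this: 2-step linear warmup to the 8e-4 peak, then cosine decay to a 0.5% floor rather than zero.

```python
import math

def evolved_lr(step, total_steps=30, peak=8e-4, warmup=2, floor_frac=0.005):
    """Cosine LR schedule with the evolved settings: high peak,
    minimal warmup, and a nonzero floor instead of decaying to zero."""
    if step < warmup:
        # linear warmup over just 2 of the 30 steps
        return peak * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    floor = peak * floor_frac  # never decay below 0.5% of peak
    return floor + (peak - floor) * cosine
```

Compare the baseline: 20 warmup steps out of 30 means the peak LR is reached with only a third of the budget left, and at 5e-5 even the peak barely moves the weights.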

What makes this sequence beautiful is the ordering. The LLM solved the most important problem first (learning rate), then added structure (curriculum), then added finesse (token weighting, LR floor), and finally combined everything (3-pass batching). It prioritized like a good engineer would.

Training Efficiency: Getting Faster While Getting Better

A common anxiety with evolutionary methods: do they just find better solutions by spending more compute? Here, the opposite happened. The best program is not only more accurate but 2.4× faster than the baseline. The main savings come from halving the generation token budget (1024 to 512) and reducing eval samples (50 to 30). Less waste, tighter focus.

Evaluation Time per Iteration (minutes)

Blue highlights indicate record-breaking iterations.

Total wall-clock time across all 20 iterations: approximately 8.5 hours. The engine found near-optimal hyperparameters in under 9 hours of compute — a task that would take a human researcher days of manual experimentation.

Failure Is the Norm

Of 20 productive iterations, only 6 set new records. The rest were regressions, plateaus, or outright crashes. Iter 3 produced catastrophic overfitting (train-eval gap of 2.11). Iter 6 introduced an undefined variable. Iter 16 regressed to 0.44 from a near-perfect starting point.

This is how evolutionary search actually works. The majority of mutations are harmful or neutral. You survive on the rare beneficial ones. What MAP-Elites provides is a memory: the good solutions are preserved, and bad ones die without corrupting the archive. The four islands provide additional insurance through diversity. Island D went entirely extinct. Islands B and C found decent configurations but could not compete. Island A kept steadily refining its winning lineage.
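The archive mechanism that provides this memory is simple to illustrate. Below is a minimal MAP-Elites insertion step (an illustrative sketch, not Kai's engine): discretize the two feature dimensions into a grid cell and keep only the best-scoring program per cell, so harmful mutations die without corrupting the archive.

```python
def insert(archive, candidate, score, features,
           bins=(4, 4), bounds=((0.0, 1.0), (0.0, 1.0))):
    """One MAP-Elites step over a 2-D feature grid.

    features: (convergence_speed, train_eval_gap) for the candidate.
    archive: dict mapping grid cells to (score, candidate).
    """
    # map each feature into its grid cell, clamping to the last bin
    cell = tuple(
        min(int((f - lo) / (hi - lo) * b), b - 1)
        for f, b, (lo, hi) in zip(features, bins, bounds)
    )
    best = archive.get(cell)
    if best is None or score > best[0]:
        archive[cell] = (score, candidate)  # elite replaced
    return cell
```

Running one archive per island gives the diversity insurance described above: an island can go extinct without dragging the others down.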

MAP-Elites Island Status at Termination

Island A: 0.997 (best at iter 14) · champion lineage
Island B: 0.887 (best at iter 2) · good, eclipsed
Island C: 0.581 (best at iter 7) · struggling
Island D: 0.000 (best at iter n/a) · extinct

What Does This Mean?

There is something deeply satisfying about watching an LLM discover, through trial and error, that a 30-step training budget demands aggressive learning rates and minimal warmup. This is not information that was explicitly in the prompt. The LLM had to infer it from the scoring signal: the baseline scored 0.10, and the version with 16× higher learning rate scored 0.89. From that single observation, the winning branch was born.

The curriculum ordering discovery is perhaps even more interesting. Sorting training pairs by demonstration length (easy examples first) is a well-known technique in human pedagogy and occasionally in ML. But the LLM was never told about curriculum learning. It arrived at the idea by evolving the batching strategy and observing that structured ordering produced lower loss than shuffled ordering. It reinvented the wheel, but it reinvented it correctly.

The progressive token weighting (ramping from 0.5 to 1.5 across the sequence, normalized to mean 1.0) is the most subtle finding. It says: the later tokens in a completion matter more. This makes intuitive sense for autoregressive models, since early tokens are largely determined by the prompt while later tokens require genuine model competence. But again, nobody told the LLM this. It discovered the asymmetry by trying it and watching the loss drop.
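The weighting described above is a linear ramp over token positions, renormalized so the overall loss scale is unchanged. A minimal numpy sketch (function names are mine):

```python
import numpy as np

def token_weights(seq_len, lo=0.5, hi=1.5):
    """Linear ramp over positions, normalized to mean 1.0 so the
    loss magnitude stays comparable to uniform weighting."""
    w = np.linspace(lo, hi, seq_len)
    return w / w.mean()

def weighted_loss(per_token_nll):
    """Apply the ramp: later tokens contribute more to the loss."""
    w = token_weights(len(per_token_nll))
    return float((w * per_token_nll).mean())
```

Because the ramp has mean 1.0, a sequence with uniform per-token loss gets exactly the same total loss as before; only the gradient emphasis shifts toward the tail.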

Final Results

eval_token_loss: 8.7892 (baseline) → 0.0034 (final best)
2,585× reduction in 14 iterations

The evolved recipe achieves near-perfect student-teacher alignment in 30 training steps and 16 minutes of compute. It is 2.4× faster than the baseline while cutting eval loss by 2,585×. The entire evolution took roughly 8.5 hours.

Evolution works. Not because every mutation is good, but because the good ones compound. All it takes is one fertile ancestor and the patience to let selection do its work.

Report generated from Kai research agent run data.
Evolution engine: MAP-Elites (4 islands) · LLM backbone: KAI Evolve
Model under training: Qwen/Qwen3.5-4B · Self-Distillation Fine-Tuning · Trained on Tinker
Copyright © 2026 DRIA. All Rights Reserved.