Let the Speedrun Search Itself

Eval-gated config-only autoresearch on the canonical super fp8 lane

status: result

Recently, I kept finding myself circling the same question: what did I actually want from all the new autoresearch repos?

I did not want another training stack, another config system, or another place for experiments to disappear and become folklore. What I wanted was the discipline: a written program, a narrow mutable surface, fixed budgets, explicit keep-or-discard decisions, and receipts for every attempt.

I wanted the agent to roam, but only inside a box that I trusted.

For nmoe, that box had to stay small. The whole point of this repo is that there is one real training path, one real metrics surface, and one real eval loop. If “autoresearch” meant bolting on a second stack, the exercise would miss the point.

This post is the first result from doing it the stricter way.

Evidence scope

This is a result post for one bounded campaign on the public nmoe surface.

| Surface | Contract |
| --- | --- |
| mutation tier | config-only |
| lane | canonical super fp8 speedrun |
| data | canonical speedrun train/val |
| primary objective | final_valid_loss |
| veto | CORE cannot drop by more than 0.002 |
| cluster shape | 4 GPU workers in parallel |
| receipt boundary | repro/0011.receipts.json |

So the claim here is narrow on purpose. This is not code-editing autoresearch, and it is not a general statement that the LLM proposer always beats the deterministic fallback. It is one real campaign, on one real lane, with one machine-readable truth boundary.

What I wanted the controller to do

I did not want an “AI scientist” demo. I wanted something much more boring, which is why I trust it more.

The controller had to reuse the canonical nmoe.train path, touch only allowlisted config fields, spend a fixed budget per candidate, promote only when validation improved and CORE stayed inside budget, and write a receipt whether the candidate won, lost, or crashed.
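That loop can be sketched end to end. Everything below is an illustrative shape, not nmoe's actual controller: the callables (`propose`, `run_benchmark`, `write_receipt`) are hypothetical hooks, and only the thresholds and budget come from the contract in this post.

```python
def run_campaign(propose, run_benchmark, write_receipt, champion, budget_steps=512):
    """One keep-or-discard iteration per candidate: run under a fixed step
    budget, gate on validation loss + CORE, and write a receipt whether the
    candidate wins, loses, or crashes. Illustrative sketch only."""
    while (candidate := propose(champion)) is not None:
        try:
            metrics = run_benchmark(candidate, steps=budget_steps)
            improved = (champion["final_valid_loss"] - metrics["final_valid_loss"]) >= 0.001
            core_ok = (champion["core"] - metrics["core"]) <= 0.002
            decision = "keep" if improved and core_ok else "discard"
        except Exception as exc:
            metrics, decision = {"error": str(exc)}, "crash"
        write_receipt(candidate, metrics, decision)  # receipts are unconditional
        if decision == "keep":
            champion = {**champion, **metrics}
    return champion
```

The important property is that `write_receipt` sits outside the keep/discard branch: every attempt leaves a record, including crashes.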

The core contract in campaigns/speedrun_super_research.toml is:

[objective]
primary_metric = "final_valid_loss"
direction = "min"
min_delta_abs = 0.001

[objective.constraints]
required_metrics = ["core"]
max_core_drop = 0.002

[budget.benchmark]
steps = 512

[mutation]
tier = "config_only"

In practice that meant one canonical runner (python -m nmoe.cli.main campaign auto ...), one canonical data source (/data/speedrun/train and /data/speedrun/val), a 512-step benchmark budget per candidate, and a small override surface centered on aux_loss_alpha, lr_dense, lr_router, warmup_steps, and a few nearby dials.
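Config-only means a candidate can only carry overrides for allowlisted fields; anything else is rejected before a run launches. A minimal sketch of that check — the function name is hypothetical, and the allowlist below contains only the fields named in this post, not the full "nearby dials" surface:

```python
# Only the override fields explicitly named in this post; the real
# surface includes a few more dials.
ALLOWED_OVERRIDES = {"aux_loss_alpha", "lr_dense", "lr_router", "warmup_steps"}

def validate_overrides(overrides: dict) -> dict:
    """Reject any candidate that touches a field outside the
    config-only mutation surface. Illustrative sketch."""
    illegal = set(overrides) - ALLOWED_OVERRIDES
    if illegal:
        raise ValueError(f"non-allowlisted overrides: {sorted(illegal)}")
    return overrides
```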

The cluster contract mattered just as much as the metric contract. We ran 4 workers in parallel, with one candidate claim per worker and unique checkpoint roots per experiment. If workers share receipts or checkpoint roots, the whole thing stops being research and turns into scheduler noise.
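The "one claim per worker" rule is the kind of thing an atomic filesystem create handles well. A sketch of that pattern, assuming workers share a filesystem; the paths and names here are hypothetical, not nmoe's actual layout:

```python
import os

def try_claim(candidate_id: str, worker_id: str, claims_dir: str = "claims") -> bool:
    """Atomically claim a candidate for one worker. O_CREAT|O_EXCL makes
    the create fail if another worker already owns the claim file.
    Illustrative sketch, not nmoe's actual claim mechanism."""
    os.makedirs(claims_dir, exist_ok=True)
    path = os.path.join(claims_dir, f"{candidate_id}.claim")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # someone else got there first
    with os.fdopen(fd, "w") as f:
        f.write(worker_id)  # record the owner for later auditing
    return True
```

The same exclusivity argument applies to checkpoint roots: each experiment gets a unique directory, so two workers can never silently overwrite each other's state.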

The first useful negative result showed up immediately

The first wake-up was on aux_loss_alpha.

| candidate | final validation loss | CORE | decision |
| --- | --- | --- | --- |
| seed (aux_loss_alpha=0.0001) | 5.1987 | -0.0169 | keep |
| aux_loss_alpha=0.00015 | 5.1950 | -0.0183 | keep |
| aux_loss_alpha=0.0005 | 5.1920 | -0.0208 | discard |

That third row is the whole reason the eval gate exists.

If I had optimized validation loss alone, 0.0005 would have looked like the next champion. CORE fell past the allowed budget, so the controller rejected it. A seductive wrong answer is still a wrong answer.
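Plugging the table into the contract makes the veto mechanical. This uses the seed as the comparison baseline for illustration; the gate fires either way:

```python
# Thresholds from campaigns/speedrun_super_research.toml.
MIN_DELTA_ABS, MAX_CORE_DROP = 0.001, 0.002
seed_loss, seed_core = 5.1987, -0.0169

# aux_loss_alpha=0.0005: a big raw loss win, but the CORE drop blows the budget.
loss_gain = seed_loss - 5.1920          # 0.0067, well past min_delta_abs
core_drop = seed_core - (-0.0208)       # 0.0039, past the 0.002 budget
kept = loss_gain >= MIN_DELTA_ABS and core_drop <= MAX_CORE_DROP
```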

The first 4-worker wave found the real dense-LR move

Once the controller had a viable wake-up point, I let four workers go at once.

That is where the campaign stopped feeling like a toy. It was no longer one scalar hill-climb. It had to survive concurrency, stale local baselines, and the fact that more than one candidate can be locally “kept” in the same wave.

| candidate | final validation loss | CORE | tokens/s/GPU | mean CV | decision |
| --- | --- | --- | --- | --- | --- |
| lr_dense=0.0016 | 5.2729 | -0.0208 | 94.7k | 237.5 | discard |
| lr_dense=0.0020 | 5.1571 | -0.0153 | 97.8k | 211.6 | keep |
| lr_dense=0.0022 | 5.1270 | -0.0156 | 98.1k | 199.8 | keep |
| lr_router=0.0021 | 5.1932 | -0.0167 | 91.9k | 242.3 | keep |
[Figure: Autoresearch champion progression showing final validation loss and throughput across the kept global winners. Global champion updates only: the controller walks from the seed to the final aux_loss_alpha=0.00012, lr_dense=0.0022 winner while improving both validation loss and throughput.]

lr_dense=0.0022 was the first move that felt like a regime change rather than noise. Relative to the wake-up baseline (aux_loss_alpha=0.00015), it improved final validation loss by 0.0679 nats while also improving CORE. It ran a bit faster and lowered mean router CV too.

There is a subtle systems point hiding in this table. In a parallel wave, more than one candidate can be kept against the same older baseline. That is fine. The global champion still has to be chosen from the best kept receipt instead of whichever worker happened to finish last.
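Receipt-backed global selection can stay dead simple: collect every kept receipt and take the best primary metric, ignoring finish order entirely. A sketch, with hypothetical receipt shapes rather than the actual receipt schema:

```python
def global_champion(receipts: list[dict]) -> dict:
    """Pick the champion from kept receipts only, by minimum
    final_valid_loss; which worker finished last is irrelevant.
    Illustrative sketch, not the controller's actual selection code."""
    kept = [r for r in receipts if r["decision"] == "keep"]
    if not kept:
        raise ValueError("no kept receipts in this wave")
    return min(kept, key=lambda r: r["final_valid_loss"])
```

Run against the first-wave table above, this picks lr_dense=0.0022 even though three candidates were locally kept.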

The refinement wave gave the cleanest answer of the whole run

The next wave refined around the new dense-LR champion.

| candidate | final validation loss | CORE | tokens/s/GPU | mean CV | decision |
| --- | --- | --- | --- | --- | --- |
| aux_loss_alpha=0.00018, lr_dense=0.0022 | 5.1174 | -0.0193 | 97.8k | 210.7 | discard |
| aux_loss_alpha=0.00012, lr_dense=0.0022 | 5.1200 | -0.0136 | 100.7k | 197.0 | keep |
| aux_loss_alpha=0.0002, lr_dense=0.0022 | 5.1286 | -0.0200 | 92.8k | 200.2 | discard |
| lr_router=0.0021, aux_loss_alpha=0.00015, lr_dense=0.0022 | 5.1455 | -0.0180 | 94.5k | 216.5 | discard |
[Figure: Refinement wave around lr_dense=0.0022 showing final validation loss and CORE, with the CORE gate highlighted. The best raw validation-loss point in the refinement wave (aux_loss_alpha=0.00018) was vetoed by CORE; the kept winner (aux_loss_alpha=0.00012) is the one that clears the full contract.]

This was the cleanest result in the whole campaign. 0.00018 posted the best raw validation loss in the wave and still lost because CORE fell through the floor. 0.00012 became the final champion because it improved validation loss and improved CORE. That is exactly the behavior I wanted.

The agent did not just chase the prettiest scalar. It found a tempting loss improvement, got told “no” by eval, and had to keep searching until it found a point that actually cleared the full contract.

The final champion

| field | value |
| --- | --- |
| aux_loss_alpha | 0.00012 |
| lr_dense | 0.0022 |
| final_valid_loss | 5.1200 |
| CORE | -0.0136 |
| throughput | 100.7k tokens/s/GPU |
| mean router CV | 197.0 |

Relative to the seed receipt, that is 0.0787 nats better final validation loss (5.1987 -> 5.1200), about 1.51% lower final validation loss, +0.00323 better CORE (-0.0169 -> -0.0136), and about 3.08% higher throughput (97.7k -> 100.7k tokens/s/GPU).
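Those deltas follow directly from the seed and champion numbers. Spelled out below, recomputed from the rounded table values, so the last digit can differ slightly from the fuller-precision receipt-derived figures quoted above:

```python
# Seed vs. final champion, numbers copied from the tables in this post.
seed_loss, seed_core, seed_tps = 5.1987, -0.0169, 97.7e3
champ_loss, champ_core, champ_tps = 5.1200, -0.0136, 100.7e3

loss_delta = seed_loss - champ_loss          # 0.0787 nats better
loss_rel = 100 * loss_delta / seed_loss      # ~1.51% lower final loss
core_delta = champ_core - seed_core          # ~+0.0033 better CORE (table precision)
tps_rel = 100 * (champ_tps / seed_tps - 1)   # ~3.1% higher throughput
```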

So the final result is more than a smaller loss number. The winner also cleared the eval gate, improved router health, and ran a bit faster.

What I think we learned

The first lesson is that bounded config-only autoresearch is already useful. We did not need code mutation to find a real win on the canonical speedrun lane.

The second is that validation-first promotion still needs an eval veto. Twice, the best raw validation point in a local neighborhood was not the right answer under CORE.

The third is that dense LR mattered more than router LR on this slice. The first large improvement came from lr_dense, not router-bias tuning.

The fourth is that parallel search changes controller semantics. Keep-or-discard can be local to a wave baseline, but champion selection has to stay global and receipt-backed.

This remains a narrow result: one campaign, one public eval suite, one config-only search surface. But it is the first time the public nmoe surface has the loop shape I actually wanted — program, search, receipts, eval gate, cluster fanout, and a real kept winner.

Receipts

The bundle is repro/0011.receipts.json.

It carries the exported campaign artifacts, including blog_artifacts/0011_autoresearch_manifest.json, blog_artifacts/0011_autoresearch_progression.json, blog_artifacts/0011_autoresearch_receipts.json, and blog_artifacts/0011_autoresearch_run_summaries.json. The key raw campaign receipts are linked under campaign_runs/speedrun_super_research/benchmark/....

Schema-validation command:

python3 scripts/repro/verify_post_receipts.py \
  --repo-root . \
  --receipts-dir repro \
  --post 0011

Path validation still requires a runtime that has the referenced campaign_runs/ and blog_artifacts/ surfaces mounted under the chosen data root.
