The Speedrun Loop

A small-model speedrun is our fastest honest instrument for architecture research


Post 0003: a speedrun is our smallest honest architecture lab—anchor it to the outside world, score it on more than loss, then use it to compare dense and expert-sparse designs quickly.

Frontier architecture research has a tempo problem. Full runs are too expensive to think with, but toy runs are too fake to trust. The best compromise we have found is the speedrun: a small model, a hard training contract, and a scoreboard that turns ideas over in hours instead of weeks.

We did not invent this loop. Keller Jordan's modded-nanogpt speedruns showed how much research velocity you can get from a small dense model under a serious contract. We wanted the same thing inside nmoe so we could study the question we actually care about: when does expert sparsity buy something real, and under what contract?

This post introduces that instrument. The point is not that the current loop is perfect. The point is that it is already useful enough to do honest research.

The Instrument

Let a speedrun report for model variant m be

R(m; C_{\mathrm{train}}, C_{\mathrm{eval}}) = \big(L_T(m),\; Q_T(m),\; S_T(m)\big),

where L_T(m) is the validation loss at horizon T, Q_T(m) is the end-of-run capability score, and S_T(m) is the supporting diagnostic surface: throughput, router telemetry, and any other signal that tells us whether an apparent win is real or pathological.
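As a concrete shape for this tuple, a minimal sketch (the `SpeedrunReport` name and field layout are illustrative, not nmoe's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class SpeedrunReport:
    """R(m; C_train, C_eval) = (L_T, Q_T, S_T) for one model variant m."""
    variant: str
    val_loss: float                # L_T(m): validation loss at horizon T
    core_score: float              # Q_T(m): end-of-run capability score
    diagnostics: dict = field(default_factory=dict)  # S_T(m): throughput, router telemetry, ...

# Example values taken from the bf16 dense closure lane reported later in this post.
dense_bf16 = SpeedrunReport("dense-bf16", val_loss=3.2725, core_score=0.060865,
                            diagnostics={"stop_step": 8320})
```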

For that instrument to mean anything, three contracts have to be explicit. The anchor contract keeps us tied to a public reference point. The evaluation contract keeps us from optimizing a single smooth scalar and calling it science. The swap contract says what stays fixed when we compare dense and sparse models.

1. Anchor It To The Outside World

Our public anchor is Keller Jordan's modded-nanogpt June 2024 AdamW dense speedrun. It is a good anchor for exactly the reason we wanted one: published logs, a clear recipe neighborhood, and a concrete target.

The public anchor

The public anchor comes in two pieces. The June 2024 code state tells us what recipe neighborhood we borrowed from. The preserved public run record gives us the exact number we anchor against.

| Surface | Provenance | What it gives us |
|---|---|---|
| June 2024 code state | modded-nanogpt commit b6b0a0d36e6f1758a8d14d5fcd5f15ca5d19b891 | the small dense recipe neighborhood we adapted into nmoe |
| Preserved run record | records/track_1_short/2024-06-06_AdamW/f66d43d7-e449-4029-8adf-e8537bab49ea.log | the exact public anchor value: tel = 3.275959 at step 9536 |
| Public recipe note | records/track_1_short/2024-06-06_AdamW/README.md | a remembered recipe note: lr=0.0018, warmup=250, warmdown=2000, betas=(0.9, 0.95) |

That distinction matters. The log file is the numerical anchor. The code state is the recipe neighborhood. Our nmoe dense calibration lane keeps the same headline shape — 9536 steps, 256 x 2048 tokens/step, AdamW at 1.8e-3 / 0.1 / (0.9, 0.95) — while using the cleaner 256/2048 WSD decomposition that the rest of this series inherits.
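The 256/2048 WSD decomposition can be sketched as a learning-rate schedule. The linear ramps on both ends are an assumption here; the preserved public note remembers 250/2000 rather than this cleaner split:

```python
def wsd_lr(step, total=9536, warmup=256, warmdown=2048, peak=1.8e-3):
    """Warmup-Stable-Decay with the 256/2048 decomposition used by the
    nmoe dense calibration lane. Linear ramps are an assumed shape."""
    if step < warmup:
        return peak * (step + 1) / warmup       # linear warmup to peak
    if step < total - warmdown:
        return peak                             # stable plateau
    return peak * (total - step) / warmdown     # linear warmdown to zero
```

The plateau runs from step 256 through step 7487, and the schedule is continuous at both joints.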

Reference surfaces: modded-nanogpt June 2024 record · modded-nanogpt commit b6b0a0d

What we matched, and what we did not

| Aspect | modded-nanogpt (June 2024) | nmoe | Match? |
|---|---|---|---|
| Tokenizer | GPT-2 (50257) | GPT-2 (50257 padded to 50304) | ~Yes |
| Embeddings | Tied | Untied | No |
| FFN activation | GELU | SwiGLU | No |
| Attention | SDPA (scaled_dot_product_attention) | SDPA (scaled_dot_product_attention) | Yes |
| Embedding/logit gains | None | None in these bf16 calibration runs | Yes |
| Norm eps | 1e-5 | 1e-5 | Yes |
| Data | FineWeb10B (GPT-2 tokenized) | FineWeb10B (GPT-2 tokenized) | ~Yes |
| Optimizer | AdamW | AdamW | Yes |
| μP / init scaling | None | Present in nmoe | No |
| Schedule | WSD; preserved public README remembers 250/2000 from memory | WSD with 256/2048 decomposition | Close |

Those mismatches matter. Exact reproduction would require matching the architecture too. Because we intentionally do not do that, the right question becomes: did we get close enough to know whether our stack lives in the same universe?

We measure that with the anchor gap

\Delta_{\mathrm{anchor}} = L_T\big(m_{\mathrm{nmoe\ dense}}\big) - L_T\big(m_{\mathrm{public\ anchor}}\big).

At step 9536, the current dense calibration lands at:

Public anchor log @ step 9536: tel = 3.275959
Dense SDPA @ step 9536:     valid/loss = 3.480672
Gap:                        +0.204713 (~ +0.205)
Dense calibration curve comparing nmoe and modded-nanogpt.
The gap is real: at step 9536 we are about +0.205 nats behind the pinned public June 2024 log anchor. That is enough to keep us honest, even though it is not parity.
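The gap itself is just a subtraction over the two pinned step-9536 values:

```python
public_anchor = 3.275959   # tel @ step 9536, pinned June 2024 log
nmoe_dense = 3.480672      # valid/loss @ step 9536, nmoe dense SDPA lane

delta_anchor = nmoe_dense - public_anchor
print(f"delta_anchor = {delta_anchor:+.6f}")   # +0.204713
```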

That result is not glamorous. It is exactly the kind of useful annoyance a speedrun is supposed to produce. We now know where the original dense lane sat relative to a public reference, and that was enough to start the research.

The closure pass carried the same idea one step further. Instead of using the public anchor only as a post-hoc comparison, we turned it into the live stop target, target_loss = 3.28. Under that upgraded loop, the current dense bf16 and fp8 lanes both reach the target at step 8320; nvfp4 does not and ends at 3.3047 after the full 9536-step horizon. The anchor has become something better than a historical curiosity: it is now part of the instrument.
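In loop form, the upgraded stop rule looks roughly like this (the `eval_loss_fn` callback and the 64-step eval cadence are illustrative stand-ins, not nmoe's actual interface):

```python
TARGET_LOSS = 3.28   # the pinned public anchor, promoted to a live stop rule
MAX_STEPS = 9536     # full June-style horizon

def run_with_stop(eval_loss_fn, eval_every=64):
    """Stop as soon as validation loss crosses the anchor target,
    otherwise run the full horizon. eval_loss_fn(step) stands in for
    a real train-then-evaluate cycle."""
    for step in range(eval_every, MAX_STEPS + 1, eval_every):
        loss = eval_loss_fn(step)
        if loss <= TARGET_LOSS:
            return step, loss, "target"
    return MAX_STEPS, loss, "full run"
```

A lane that crosses the target reports its stop step; one that never crosses reports the full run, exactly as the closure matrix below records.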

2. Decide What Counts As A Win

A speedrun with only one number teaches you to worship the wrong god.

Loss matters. It also lies by omission. You can improve the benchmark while quietly making the model worse, making the system much harder to run, or making routing dynamics brittle enough that the apparent win does not survive the next regime.

0002 argued that the honest MoE observation surface is a small bundle: optimization, capability, and router health. In the speedrun loop, that means the objective is better written as

J(m) = \big(L_T(m),\; Q_T(m)\big),

with diagnostics S_T(m) used to reject pathological wins.
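One way to read "diagnostics reject pathological wins" as code (a sketch only; the Pareto-style acceptance rule is an assumption, and the real rejection logic lives in router telemetry rather than a single predicate):

```python
from collections import namedtuple

Report = namedtuple("Report", ["val_loss", "core_score"])

def accept_win(candidate, baseline, diag_ok):
    """Accept m only if it is no worse on either axis of J(m) = (L_T, Q_T),
    strictly better on at least one, and passes the diagnostic surface S_T."""
    no_worse = (candidate.val_loss <= baseline.val_loss
                and candidate.core_score >= baseline.core_score)
    strictly_better = (candidate.val_loss < baseline.val_loss
                       or candidate.core_score > baseline.core_score)
    return no_worse and strictly_better and diag_ok(candidate)
```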

For nmoe, the capability surface we care about most is CORE, inspired in part by the way nanochat treats end-of-run evaluation as first-class instead of an afterthought. The historical January calibration lanes that originally shaped this post still matter because they show the transition. Those runs were configured with eval_tasks = core, but they still ran with eval_enabled = false, so they gave us the loss-side answer without the capability-side answer.

The closure pass finishes that upgrade. We reran the loop as a 3 x 3 matrix — dense, MoE-64, MoE-256 crossed with bf16, fp8, and nvfp4 — with end-of-run CORE enabled on every lane. That is the instrument we actually wanted: the speedrun does not end when the loss target is crossed; it ends when the capability report arrives too.

That changes what the loop can say. In bf16, the sparse lanes reach the public loss target earlier than dense, but dense still keeps the best CORE. In fp8, MoE-256 reaches the target earliest and posts the best CORE. In nvfp4, none of the lanes reaches target at all. Without Q_T(m), those three stories collapse into one vague claim about "better loss curves." With it, the speedrun starts acting like a real architecture lab.

3. Use The Speedrun To Compare Dense And Sparse Designs

Once the loop is calibrated well enough, the next use is the one we actually built nmoe for: dense-vs-MoE architecture research.

For the first swap, we used a simple comparison axis: active FFN width per token.

  • dense FFN active width per token: W_active = inter_dim
  • MoE FFN active width per token: W_active = (K + shared) * moe_inter_dim

For the closure matrix, we keep that axis fixed and then cross it with precision. The family members are:

| Config | Experts (E) | Active (K) | Shared | Active FFN dim |
|---|---|---|---|---|
| Dense | 1 | 1 | 0 | 3072 |
| MoE-64 | 64 | 6 | 2 | 8 x 384 = 3072 |
| MoE-256 | 256 | 7 | 1 | 8 x 384 = 3072 |
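The comparison axis can be checked directly: both sparse configs activate the same FFN width per token as the dense lane (the moe_inter_dim = 384 value comes from the table above; dense is the degenerate K=1, shared=0 case):

```python
def active_ffn_width(K, shared, inter_dim):
    """Active FFN width per token: (K + shared) * inter_dim."""
    return (K + shared) * inter_dim

DENSE_WIDTH = active_ffn_width(K=1, shared=0, inter_dim=3072)   # dense lane

# Both sparse lanes land on the same 3072 active width.
assert active_ffn_width(K=6, shared=2, inter_dim=384) == DENSE_WIDTH  # MoE-64
assert active_ffn_width(K=7, shared=1, inter_dim=384) == DENSE_WIDTH  # MoE-256
```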

All nine closure lanes share the same June-style speedrun family: 9536 maximum steps, 256 x 2048 tokens/step, AdamW at 1.8e-3 / 0.1 / (0.9, 0.95), SDPA, target_loss = 3.28, and end-of-run CORE.

The completed matrix says:

| Dtype | Model | Stop | Val Loss | CORE |
|---|---|---|---|---|
| bf16 | Dense | 8320 (target) | 3.2725 | 0.060865 |
| bf16 | MoE-64 | 5760 (target) | 3.2769 | 0.050558 |
| bf16 | MoE-256 | 4864 (target) | 3.2778 | 0.051878 |
| fp8 | Dense | 8320 (target) | 3.2758 | 0.057261 |
| fp8 | MoE-64 | 6272 (target) | 3.2684 | 0.060741 |
| fp8 | MoE-256 | 4864 (target) | 3.2796 | 0.070765 |
| nvfp4 | Dense | 9536 (full run) | 3.3047 | 0.049430 |
| nvfp4 | MoE-64 | 9536 (full run) | 3.5950 | 0.030931 |
| nvfp4 | MoE-256 | 9536 (full run) | 3.4573 | 0.049050 |
Nine-lane speedrun closure matrix across bf16, fp8, and nvfp4 for dense, MoE-64, and MoE-256, showing stop step, final loss, and CORE.
The same June-style speedrun contract tells three different stories. In bf16, sparse reaches target earlier but dense still keeps the best CORE. In fp8, MoE-256 wins both on stop step and on CORE. In nvfp4, none of the lanes reaches target, but MoE-256 is clearly healthier than MoE-64.

Three things jump out.

First, in bf16 and fp8, expert sparsity is not a toy effect. Both sparse families reach the public loss target sooner than dense, and MoE-256 is fastest in both precisions.

Second, CORE prevents the loss answer from pretending to be the whole answer. In bf16, dense still has the strongest capability score even though it reaches the loss target later. In fp8, MoE-256 wins both on stop step and on CORE. The "best" architecture depends on what the instrument is actually asked to optimize.

Third, nvfp4 changes the story rather than just shifting it. None of the three nvfp4 lanes reaches target. But MoE-256 is materially healthier than MoE-64 on both loss and CORE, which is exactly the kind of precision-by-architecture interaction the speedrun loop is supposed to expose quickly.

That is exactly what we wanted from the speedrun: not a final scaling law, but a fast honest answer to whether a sparse direction is worth more of our time under a specific contract.

4. The Model Answer And The Systems Answer Arrive Together

One of the nicest things about the speedrun is that it gives the model answer and the systems answer almost at the same time.

The historical bf16 surface that originally motivated this post is still useful here. Before the closure matrix had end-of-run CORE, it already told us something important: MoE could win on loss while paying a steep systems tax.

Historical speedrun throughput overlay: dense vs MoE-64 vs the longer MoE-256 slice.
Historical bf16 rank-0 telemetry for the same 524,288 tokens/step family. The sparse lanes win on loss before they win on throughput.

Median over steps 1000–7000 on that historical surface:

| Model | tokens/s/GPU | ms/step | TFLOPs |
|---|---|---|---|
| Dense SDPA | ~93k | ~705 | ~137 |
| MoE-64 bf16 | ~48k | ~1367 | ~71 |
| Historical MoE-256 iso-tokens/expert slice | ~47k | ~1404 | ~70 |
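A quick consistency check on those medians. The eight-GPU figure below is inferred from the table, not stated anywhere in the post:

```python
tokens_per_step = 256 * 2048   # 524,288 tokens/step for the whole family

# Median rank-0 numbers from the historical bf16 table above.
lanes = {
    "dense": {"tok_s_gpu": 93_000, "ms_step": 705},
    "moe64": {"tok_s_gpu": 48_000, "ms_step": 1367},
}

for name, lane in lanes.items():
    # Implied GPU count: tokens/step over (step time * per-GPU throughput).
    implied_gpus = tokens_per_step / (lane["ms_step"] / 1000 * lane["tok_s_gpu"])
    print(name, round(implied_gpus))   # both lanes imply ~8 GPUs

# The systems tax: the sparse lane runs at roughly half of dense throughput.
tax = lanes["moe64"]["tok_s_gpu"] / lanes["dense"]["tok_s_gpu"]
print(round(tax, 2))   # 0.52
```

That internal agreement (tokens/s/GPU, ms/step, and tokens/step all implying the same node size) is part of what makes the telemetry surface trustworthy.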

That divergence is one of the main reasons to use a speedrun in the first place. It tells us quickly when a modeling gain is being bought with a systems tax we have not earned back yet.

5. Why Sparse Comparisons Get Messy So Quickly

Even after the closure matrix, the sparse comparison surface gets complicated.

At fixed tokens per step and fixed horizon, the signal available to each routed expert falls roughly as K / E. On the completed full-horizon nvfp4 closure, the difference is already large:

| Config | Experts (E) | Routed K | Total Tokens | Tokens/Expert |
|---|---|---|---|---|
| MoE-64 | 64 | 6 | 4.9996B | 468.7M |
| MoE-256 | 256 | 7 | 4.9996B | 136.7M |
| Super-4096 | 4096 | 7 | ~6.3B | 10.8M |
Tokens per routed expert vs number of experts for MoE-64, MoE-256, and Super-4096.
At fixed tokens/step, increasing E reduces tokens per routed expert roughly as K/E. The dedicated MoE-256 closure lane removes one old confound, but not the deeper one.
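The K/E scaling reproduces the table column directly (the Super-4096 total uses the ~6.3B approximation from the table):

```python
def tokens_per_routed_expert(total_tokens, K, E):
    """Each of the E routed experts sees roughly total_tokens * K / E tokens."""
    return total_tokens * K / E

# Values in millions, matching the closure table.
print(round(tokens_per_routed_expert(4.9996e9, K=6, E=64) / 1e6, 1))   # 468.7 (MoE-64)
print(round(tokens_per_routed_expert(4.9996e9, K=7, E=256) / 1e6, 1))  # 136.7 (MoE-256)
print(round(tokens_per_routed_expert(6.3e9, K=7, E=4096) / 1e6, 1))    # 10.8  (Super-4096)
```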

That is why sparse research gets confusing so quickly. “More experts” is not one clean change. It moves capacity, systems cost, and signal per expert at the same time.

This is the real handoff to 0004. 0003 introduces the speedrun loop and closes the first architecture-by-precision matrix. 0004 asks the next question: once sparse comparisons are this confounded, what fairness contract should organize the family at all?

What The Speedrun Buys Us

After this post, I want the reader to carry five ideas forward:

| Idea | Why it matters |
|---|---|
| Speedruns are real research instruments | They let us turn ideas over in hours instead of weeks. |
| Public anchors can become operational stop rules as well as post-hoc comparisons | The Keller Jordan baseline now acts as both an honesty boundary and an operational target. |
| The objective is loss plus capability | The completed 3 x 3 closure matrix shows why CORE changes the story. |
| Precision can flip the sparse answer | bf16, fp8, and nvfp4 do not tell the same dense-vs-MoE story. |
| Dense-vs-sparse comparisons need a declared contract | Active FFN width per token is a first pass; the harder fairness story starts next. |

Receipts

  • bundle: repro/0003.receipts.json

This bundle backs three layers of evidence: the pinned public June 2024 anchor (code commit, record directory, exact log path, and step-9536 value), the historical January calibration lanes that originally exposed the anchor gap and throughput tax, and the completed March 2026 3 x 3 closure matrix across dense / MoE-64 / MoE-256 and bf16 / fp8 / nvfp4, including exact stop steps, final losses, CORE scores, and cluster log paths for every lane.

Schema-validation command:

python3 scripts/repro/verify_post_receipts.py \
  --repo-root . \
  --receipts-dir repro \
  --post 0003

Path validation still requires --check-paths plus the referenced artifact mount.
