Super-4096
Loss keeps improving while routing collapses under extreme sparsity
Post 0006: push `E` to 4096, hold the token budget fixed, and watch loss stay respectable while the routed system stops acting like the model you thought you were training.
I ran Super-4096 to find the boundary, not to get a healthy model.
After the 0005 corrections, I no longer trusted the old Super-4096 run enough to lean on it casually. So I reran it on the corrected stack before writing this version. The basic surprise survived.
If you hold total tokens fixed and crank routed experts from 64 to 4096, you are starving each expert on purpose. The question is not whether that is dangerous. The question is what fails first, what easy stories survive falsification, and what the dashboard still cannot tell you even after the falsifiers land.
That last part is why 0007 exists.
The stress test
| Aspect | Value |
|---|---|
| Model | Super-4096 (depth=12, dim=768, heads=12, routed E=4096, routed K=7, shared=1) |
| Tokens | 12000 steps x 524k tokens/step = 6.291B |
| Schedule | warmup=256 steps, warmdown=2048 steps (June-style) |
| Precision | bf16 (not nvfp4, to isolate sparsity effects) |
| Data | FineWeb10B (GPT-2 tokenized), deterministic stream |
| Eval | valid/loss every 128 steps |
The arithmetic already says this run is hostile.
| config | experts (E) | active (K) | tokens/routed-expert at 6.291B |
|---|---|---|---|
| MoE-64 | 64 | 6 | 589.8M |
| Ultra-256 | 256 | 7 | 172.0M |
| Super-4096 | 4096 | 7 | 10.8M |
At the same total budget, Super-4096 gives each routed expert about 10.8M tokens. MoE-64 would give each expert 589.8M on the same budget, roughly 55x more signal.
That is the designed stress variable in this run.
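The tokens-per-expert column follows from a one-line expectation: with `K` of `E` routed experts active per token, each expert sees on average `total_tokens * K / E` under uniform routing. A quick sketch reproducing the table's arithmetic (the 524,288 tokens/step figure is the exact value behind the "524k" in the config table):

```python
def tokens_per_expert(total_tokens: int, num_experts: int, top_k: int) -> float:
    """Expected tokens seen by each routed expert under uniform routing."""
    return total_tokens * top_k / num_experts

TOTAL = 12_000 * 524_288  # 12000 steps x 524,288 tokens/step ~= 6.291B

for name, E, K in [("MoE-64", 64, 6), ("Ultra-256", 256, 7), ("Super-4096", 4096, 7)]:
    print(f"{name}: {tokens_per_expert(TOTAL, E, K) / 1e6:.1f}M tokens/expert")
```

Running this reproduces the 589.8M / 172.0M / 10.8M split above, which is where the roughly 55x signal gap comes from.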
What failed
I expected some visible breakdown:
- loss plateau
- obvious router failure
- numerical instability
What I got was more deceptive.
Loss kept improving. Training stayed stable. And the routed system still collapsed into something much smaller than the config advertised.
That is what makes this run useful. It is the cleanest example I have of why MoE needs health metrics beyond the scalar objective.
The corrected-stack collapse signature
The trusted corrected-stack rerun still shows the basic 0006 phenomenon immediately.
| train step | mean CV% | max_load |
|---|---|---|
| 100 | 1589.87 | 11.26% |
| 200 | 1333.80 | 10.68% |
| 400 | 1502.94 | 11.34% |
| 500 | 1544.99 | 11.63% |
| 2000 | 1652.59 | 13.42% |
By valid@512, loss is already down to 4.7179. By valid@2048, it is 3.6494. If you only watch loss, this run still looks alive.
The router telemetry says otherwise.
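For reference, both telemetry columns in that table can be computed from per-expert token counts. This is an illustrative reconstruction, not the repo's actual logging code: CV% is the coefficient of variation of per-expert load, and max_load is the largest single expert's share of routed tokens.

```python
import numpy as np

def router_health(counts):
    """counts: per-expert token counts for one layer, shape (E,).
    Returns (CV%, max_load): coefficient of variation of the load
    distribution, and the largest single expert's share."""
    load = counts / counts.sum()
    cv_pct = 100.0 * load.std() / load.mean()  # population std (ddof=0)
    return cv_pct, load.max()

# A collapsed layer: almost all traffic on a handful of experts.
counts = np.ones(4096)
counts[:7] = 5000.0
cv, max_load = router_health(counts)
```

On a perfectly balanced layer CV% is 0 and max_load is `1 / E`; a mean CV% above 1500 on 4096 experts means the load distribution is nowhere near its nominal shape.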
Depth is nonuniform on the corrected stack
One thing the corrected rerun changed is the layer-order story.
The old draft talked as if layer 00 collapsed first and everything else followed. The corrected receipts do not support that. On the current stack, the earliest saturation happens in higher-index layers.
Using the threshold `max_load >= 0.9 * (1 / K)`, the first-crossing steps on the corrected baseline are:
| Layer | First step crossing 0.9 * (1 / K) |
|---|---|
| 00 | 700 |
| 03 | 3600 |
| 05 | 400 |
| 06 | 100 |
| 11 | 100 |
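That table can be generated mechanically from the per-layer max_load traces. A minimal sketch, assuming telemetry is available as (step, max_load) pairs per layer (the trace values below are illustrative, not logged data):

```python
def first_crossing(trace, k: int = 7, frac: float = 0.9):
    """trace: iterable of (step, max_load) pairs for one layer, in step order.
    Returns the first step where max_load >= frac * (1 / k), else None."""
    ceiling = frac * (1.0 / k)  # 0.9 * (1/7) ~= 0.1286
    for step, max_load in trace:
        if max_load >= ceiling:
            return step
    return None

# A layer-11-style trace: saturated essentially from the first measurement.
trace = [(100, 0.135), (200, 0.138), (400, 0.141)]
step = first_crossing(trace)  # crosses at the first logged step
```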
So the honest statement is not "early layers collapse first." The honest statement is:
- collapse is depth-nonuniform
- a subset of higher-index layers saturates first
- lower-index layers catch up later
- by the time loss looks good, the whole stack is already compromised
That is a stronger precursor to 0007 anyway, because it makes the dashboard less narratively convenient and more revealing.
First falsifier: aux alone matters a lot
The laziest sentence in the old draft was that aux probably would not cure this. The corrected reruns killed that sentence.
I ran a clean bias-off pair so aux could be tested directly:
- control: `router_bias_update_rate = 0.0`, `aux_loss_alpha = 0.0`
- treatment: `router_bias_update_rate = 0.0`, `aux_loss_alpha = 1e-4`
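For context on what the aux knob applies pressure toward: the exact formulation in this codebase is not shown here, so the following is a generic Switch-Transformer-style load-balancing loss, `alpha * E * sum_i f_i * p_i`, where `f_i` is the fraction of tokens dispatched to expert i and `p_i` is the mean router probability for expert i. All names are illustrative.

```python
import numpy as np

def aux_load_balance_loss(probs, assignments, alpha=1e-4):
    """Generic Switch-style auxiliary loss (a sketch, not the repo's exact code).
    probs: (T, E) router softmax outputs; assignments: (T,) chosen expert ids.
    Minimized (value alpha) when dispatch and probability mass are uniform."""
    T, E = probs.shape
    f = np.bincount(assignments, minlength=E) / T  # dispatch fraction per expert
    p = probs.mean(axis=0)                         # mean router prob per expert
    return alpha * E * float(np.dot(f, p))
```

The loss is scale-normalized so that a perfectly uniform router scores `alpha` regardless of `E`, which is why a single `alpha = 1e-4` can be compared across expert counts.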
The treatment is a real falsifier.
| run | valid@512 | mean CV%@500 | max_load@500 | valid@2048 | mean CV%@2000 | max_load@2000 |
|---|---|---|---|---|---|---|
| Baseline 4096 | 4.7179 | 1544.99 | 11.63% | 3.6494 | 1652.59 | 13.42% |
| Aux-only (bias0, aux=1e-4) | 4.6906 | 870.13 | 4.17% | 3.5728 | 955.37 | 4.40% |
That is not a cosmetic change.
Aux materially changes the collapse geometry, and by 2048 it is also better on validation loss. So the old casual line "aux probably would not cure this" is dead.
What survives is the stronger version:
- the corrected Super-4096 baseline still collapses
- ordinary balancing pressure can change that regime a lot
- so collapse is not an inevitable property of the number 4096
Second falsifier: tokens per expert matters, but not enough by itself
Lowering E while keeping the rest of the contract fixed increases tokens per expert. That definitely matters. But it does not solve the problem in the simple threshold-law way the old draft flirted with.
Matched at 512:
| run | valid/loss | mean CV% | min entropy (nats) | worst-layer E_eff = exp(min_entropy) | layer 11 max_load |
|---|---|---|---|---|---|
| 4096 baseline | 4.7179 | 1544.99 | 2.6173 | 13.7 | 14.26% |
| 2048 experts | 4.7129 | 1095.61 | 2.6354 | 14.0 | 14.16% |
| 1024 experts | 4.7174 | 774.34 | 2.7240 | 15.2 | 14.04% |
| 4096 + aux-only | 4.6906 | 878.29 | 3.8554 | 47.3 | 4.03% |
Matched at 2048:
| run | valid/loss | mean CV% |
|---|---|---|
| 4096 baseline | 3.6494 | 1652.59 |
| 2048 experts | 3.5964 | 1075.12 |
| 1024 experts | 3.7272 | 795.86 |
| 4096 + aux-only | 3.5728 | 955.37 |
This is the important shape:
- lowering `E` clearly lowers CV
- lowering `E` materially changes entropy and can help optimization, but not monotonically across the first matched controls
- but the high-index layers still pin near the same `1 / K` ceiling
- `E = 2048` is the best non-aux control in this window, but it still looks much closer to the collapse regime than to the aux-changed regime
So tokens per expert is real. It is just not sufficient by itself.
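The E_eff column above is just the perplexity of the worst layer's load distribution: with per-expert load `l_i`, entropy `H = -sum_i l_i * log(l_i)` in nats, and `exp(H)` is the effective number of experts actually in use. A sanity check against the matched-at-512 table (values land close to the table's E_eff column, within rounding of the logged entropies):

```python
import math

def effective_experts(entropy_nats: float) -> float:
    """E_eff = exp(H): perplexity of a layer's expert-load distribution."""
    return math.exp(entropy_nats)

# Worst-layer entropies from the matched-at-512 table.
for name, h in [("4096 baseline", 2.6173), ("4096 + aux-only", 3.8554)]:
    print(f"{name}: E_eff ~= {effective_experts(h):.1f}")
```

That is the punchline in one number: a config advertising 4096 routed experts is effectively running about 14 in its worst layer, and aux only lifts that to roughly 47.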
What this run now proves
On the corrected stack, I think 0006 earns the following stronger version of the paper:
- extreme sparsity can keep improving loss while collapsing the effective routed system
- the collapse is severe, measurable, and already obvious in router telemetry long before loss would disqualify the run
- ordinary balancing pressure materially changes the collapse geometry and improves loss
- more tokens per expert matters, but it does not by itself remove the hard saturation regime
- the dashboard certifies damage, but it still does not tell me what object was damaged
That fifth point is the bridge.
Why 0007 has to exist
0006 now does exactly what I want an empirical precursor to do.
It proves the failure. It falsifies two easy stories:
- "aux probably does not matter here"
- "tokens per expert is the whole explanation"
And after those falsifiers land, it still leaves me with the same uncomfortable question:
What exactly is the thing that pretraining was building, and what exactly is the thing this collapse damaged?
The router dashboard is enough to convict the run. It is not enough to name the object that failed.
That is where 0007 begins.
Receipts
The receipt bundle for this corrected surface is `nmoe/repro/0006.receipts.json`.
The current scope is:
- trusted corrected-stack Super-4096 baseline
- clean aux-only falsifier
- matched `E = 1024` and `E = 2048` controls
The bundle is still partial. The main remaining public gap is per-expert load histograms / CCDFs.
Super-4096 did what I needed it to do. It proved that loss can hide collapse, forced me to kill a few lazy explanations, and made 0007 necessary.