Super-4096

Loss keeps improving while routing collapses under extreme sparsity

status: result

Post 0006: push E to 4096, hold the token budget fixed, and watch loss stay respectable while the routed system stops acting like the model you thought you were training.

I ran Super-4096 to find the boundary, not to get a healthy model.

After the 0005 corrections, I no longer trusted the old Super-4096 run enough to lean on it casually. So I reran it on the corrected stack before writing this version. The basic surprise survived.

If you hold total tokens fixed and crank routed experts from 64 to 4096, you are starving each expert on purpose. The question is not whether that is dangerous. The question is what fails first, what easy stories survive falsification, and what the dashboard still cannot tell you even after the falsifiers land.

That last part is why 0007 exists.

The stress test

Aspect      Value
Model       Super-4096 (depth=12, dim=768, heads=12, routed E=4096, routed K=7, shared=1)
Tokens      12000 steps x 524k tokens/step = 6.291B
Schedule    warmup=256 steps, warmdown=2048 steps (June-style)
Precision   bf16 (not nvfp4, to isolate sparsity effects)
Data        FineWeb10B (GPT-2 tokenized), deterministic stream
Eval        valid/loss every 128 steps

The arithmetic already says this run is hostile.

config       experts (E)   active (K)   tokens/routed-expert at 6.291B
MoE-64       64            6            589.8M
Ultra-256    256           7            172.0M
Super-4096   4096          7            10.8M

At the same total budget, Super-4096 gives each routed expert about 10.8M tokens. MoE-64 would give each expert 589.8M on the same budget, roughly 55x more signal.

That is the designed stress variable in this run.
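The tokens-per-expert numbers are easy to reproduce. A minimal sketch, assuming perfectly balanced routing (each token contributes K routed assignments, spread evenly over E experts); the helper name is mine, not the codebase's:

```python
# Tokens-per-routed-expert arithmetic from the table above.
TOTAL_TOKENS = 12_000 * 524_288  # 12000 steps x 524k tokens/step ~= 6.291B

def tokens_per_routed_expert(total_tokens: int, num_experts: int, top_k: int) -> float:
    """Expected tokens seen by each routed expert under uniform routing."""
    return total_tokens * top_k / num_experts

for name, e, k in [("MoE-64", 64, 6), ("Ultra-256", 256, 7), ("Super-4096", 4096, 7)]:
    tpe = tokens_per_routed_expert(TOTAL_TOKENS, e, k)
    print(f"{name:<11} {tpe / 1e6:6.1f}M tokens/expert")
```

The 55x gap in the table is just the ratio of the first and last rows.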

What failed

I expected some visible breakdown:

  • loss plateau
  • obvious router failure
  • numerical instability

What I got was more deceptive.

Loss kept improving. Training stayed stable. And the routed system still collapsed into something much smaller than the config advertised.

That is what makes this run useful. It is the cleanest example I have of why MoE needs health metrics beyond the scalar objective.

The corrected-stack collapse signature

The trusted corrected-stack rerun still shows the basic 0006 phenomenon immediately.

train step   mean CV%   max_load
100          1589.87    11.26%
200          1333.80    10.68%
400          1502.94    11.34%
500          1544.99    11.63%
2000         1652.59    13.42%

By valid@512, loss is already down to 4.7179. By valid@2048, it is 3.6494. If you only watch loss, this run still looks alive.

The router telemetry says otherwise.
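For anyone reconstructing the dashboard: both telemetry columns come straight from a vector of per-expert routed token counts. A minimal numpy sketch; `router_health` is a hypothetical helper, and the exact definitions used by the real telemetry are an assumption:

```python
import numpy as np

def router_health(expert_loads: np.ndarray) -> tuple[float, float]:
    """Router health for one layer from per-expert routed token counts.

    Returns (cv_percent, max_load): the coefficient of variation of the
    load distribution in percent, and the largest single expert's share
    of all routed assignments.
    """
    loads = expert_loads.astype(float)
    cv_percent = 100.0 * loads.std() / loads.mean()
    max_load = loads.max() / loads.sum()
    return cv_percent, max_load

# Perfectly balanced E=4096 routing gives CV = 0% and max_load = 1/4096;
# the table above shows CV in the 1300-1700% range instead.
cv, ml = router_health(np.ones(4096))
```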

Depth is nonuniform on the corrected stack

One thing the corrected rerun changed is the layer-order story.

The old draft talked as if layer 00 collapsed first and everything else followed. The corrected receipts do not support that. On the current stack, the earliest saturation happens in higher-index layers.

Using the threshold max_load >= 0.9 * (1 / K), the first crossing times on the corrected baseline are:

Layer   First step crossing 0.9 * (1 / K)
00      700
03      3600
05      400
06      100
11      100
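The crossing times are just a per-layer scan of the max_load series against that threshold. A sketch; the (step, max_load) series format is an assumption:

```python
# First-crossing detector for the saturation threshold above.
# A layer counts as saturated once max_load >= 0.9 * (1 / K), i.e. one
# expert holds ~90% of the theoretical 1/K ceiling for a top-K router.
def first_crossing(max_load_series, top_k=7, frac=0.9):
    """Return the first step at which max_load crosses frac * (1 / top_k),
    or None if it never does.

    max_load_series: iterable of (step, max_load) pairs in step order.
    """
    threshold = frac / top_k
    for step, max_load in max_load_series:
        if max_load >= threshold:
            return step
    return None

# With K=7 the threshold is 0.9 / 7 ~= 0.1286, so this layer crosses at 400:
first_crossing([(100, 0.050), (400, 0.130), (700, 0.141)])  # -> 400
```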

So the honest statement is not "early layers collapse first." The honest statement is:

  • collapse is depth-nonuniform
  • a subset of higher-index layers saturates first
  • lower-index layers catch up later
  • by the time loss looks good, the whole stack is already compromised

That is a stronger precursor to 0007 anyway, because it makes the dashboard less narratively convenient and more revealing.

First falsifier: aux alone matters a lot

The laziest sentence in the old draft was that aux probably would not cure this. The corrected reruns killed that sentence.

I ran a clean bias-off pair so aux could be tested directly:

  • control: router_bias_update_rate = 0.0, aux_loss_alpha = 0.0
  • treatment: router_bias_update_rate = 0.0, aux_loss_alpha = 1e-4
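For context, aux_loss_alpha scales an ordinary load-balancing auxiliary loss. A Switch-Transformer-style sketch in numpy; the exact form in this codebase is an assumption, and `load_balancing_aux_loss` is a hypothetical helper:

```python
import numpy as np

def load_balancing_aux_loss(router_probs: np.ndarray,
                            expert_mask: np.ndarray,
                            alpha: float = 1e-4) -> float:
    """Switch-style balancing loss: alpha * E * sum_i f_i * P_i.

    router_probs: [tokens, E] softmax router probabilities.
    expert_mask:  [tokens, E] 0/1 indicator of the top-K experts
                  actually chosen for each token.
    f_i is expert i's share of routed assignments and P_i its mean
    router probability; the loss bottoms out at alpha when both are
    uniform at 1/E, so any imbalance raises it.
    """
    num_experts = router_probs.shape[-1]
    load_fraction = expert_mask.sum(axis=0) / expert_mask.sum()  # f_i
    prob_fraction = router_probs.mean(axis=0)                    # P_i
    return float(alpha * num_experts * (load_fraction * prob_fraction).sum())
```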

The treatment is a real falsifier.

run                          valid@512   mean CV%@500   max_load@500   valid@2048   mean CV%@2000   max_load@2000
Baseline 4096                4.7179      1544.99        11.63%         3.6494       1652.59         13.42%
Aux-only (bias0, aux=1e-4)   4.6906      870.13         4.17%          3.5728       955.37          4.40%

That is not a cosmetic change.

Aux materially changes the collapse geometry, and by 2048 it is also better on validation loss. So the old casual line "aux probably would not cure this" is dead.

What survives is the stronger version:

  • the corrected Super-4096 baseline still collapses
  • ordinary balancing pressure can change that regime a lot
  • so collapse is not an inevitable property of the number 4096

Second falsifier: tokens per expert matters, but not enough by itself

Lowering E while keeping the rest of the contract fixed increases tokens per expert. That definitely matters. But it does not solve the problem in the simple threshold-law way the old draft flirted with.

Matched at 512:

run               valid/loss   mean CV%   min entropy   worst-layer E_eff = exp(min_entropy)   layer 11 max_load
4096 baseline     4.7179       1544.99    2.6173        13.7                                   14.26%
2048 experts      4.7129       1095.61    2.6354        14.0                                   14.16%
1024 experts      4.7174       774.34     2.7240        15.2                                   14.04%
4096 + aux-only   4.6906       878.29     3.8554        47.3                                   4.03%
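The E_eff column is just exponentiated routing entropy: exp(H) is the number of equally loaded experts that would produce the same entropy. A minimal sketch, assuming the entropy is taken over a layer's normalized expert load distribution:

```python
import numpy as np

def effective_experts(expert_loads: np.ndarray) -> float:
    """E_eff = exp(entropy) of the normalized load distribution.

    Uniform load over E experts gives E_eff = E; a router that only
    really uses ~14 experts gives E_eff ~= 14, which is the shape of
    the worst-layer numbers above.
    """
    p = expert_loads / expert_loads.sum()
    p = p[p > 0]  # drop unused experts; lim p->0 of p*log(p) is 0
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))
```

So an E=4096 layer reporting min entropy 2.6173 is behaving like a ~14-expert layer.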

Matched at 2048:

run               valid/loss   mean CV%
4096 baseline     3.6494       1652.59
2048 experts      3.5964       1075.12
1024 experts      3.7272       795.86
4096 + aux-only   3.5728       955.37

This is the important shape:

  • lowering E clearly lowers CV
  • lowering E materially changes entropy and can help optimization, but not monotonically across the first matched controls
  • but the high-index layers still pin near the same 1 / K ceiling
  • 2048 is the best non-aux control in this window, but it still looks much closer to the collapse regime than to the aux-changed regime

So tokens per expert is real. It is just not sufficient by itself.

What this run now proves

On the corrected stack, I think 0006 earns the following stronger version of the paper:

  1. extreme sparsity can keep improving loss while collapsing the effective routed system
  2. the collapse is severe, measurable, and already obvious in router telemetry long before loss would disqualify the run
  3. ordinary balancing pressure materially changes the collapse geometry and improves loss
  4. more tokens per expert matters, but it does not by itself remove the hard saturation regime
  5. the dashboard certifies damage, but it still does not tell me what object was damaged

That fifth point is the bridge.

Why 0007 has to exist

0006 now does exactly what I want an empirical precursor to do.

It proves the failure. It falsifies two easy stories:

  • "aux probably does not matter here"
  • "tokens per expert is the whole explanation"

And after those falsifiers land, it still leaves me with the same uncomfortable question:

What exactly is the thing that pretraining was building, and what exactly is the thing this collapse damaged?

The router dashboard is enough to convict the run. It is not enough to name the object that failed.

That is where 0007 begins.

Receipts

The receipt bundle for this corrected surface is nmoe/repro/0006.receipts.json.

The current scope is:

  • trusted corrected-stack Super-4096 baseline
  • clean aux-only falsifier
  • matched E=1024 and E=2048 controls

The bundle is still partial. The main remaining public gap is per-expert load histograms / CCDFs.

Super-4096 did what I needed it to do. It proved that loss can hide collapse, forced me to kill a few lazy explanations, and made 0007 necessary.
