Super-4096
Loss keeps improving while routing collapses under extreme sparsity
Post 0006: push `E` to 4096, hold the token budget fixed, and watch loss stay respectable while the routed system stops acting like the model you thought you were training.
I ran Super-4096 to find the boundary, not to get a healthy model.
After the 0005 corrections, I no longer trusted the old Super-4096 run enough to lean on it casually. So I reran it on the corrected stack before writing this version. The basic surprise survived.
If you hold total tokens fixed and crank routed experts from 64 to 4096, you are starving each expert on purpose. The question is not whether that is dangerous. The question is what fails first, what easy stories survive falsification, and what the dashboard still cannot tell you even after the falsifiers land.
That last part is why 0007 exists.
The stress test
| Aspect | Value |
|---|---|
| Model | Super-4096 (depth=12, dim=768, heads=12, routed E=4096, routed K=7, shared=1) |
| Tokens | 12000 steps x 524k tokens/step = 6.291B |
| Schedule | warmup=256 steps, warmdown=2048 steps (June-style) |
| Precision | bf16 (not nvfp4, to isolate sparsity effects) |
| Data | FineWeb10B (GPT-2 tokenized), deterministic stream |
| Eval | valid/loss every 128 steps |
The arithmetic already says this run is hostile.
| config | experts (E) | active (K) | tokens/routed-expert at 6.291B |
|---|---|---|---|
| MoE-64 | 64 | 6 | 589.8M |
| Ultra-256 | 256 | 7 | 172.0M |
| Super-4096 | 4096 | 7 | 10.8M |
At the same total budget, Super-4096 gives each routed expert about 10.8M tokens. MoE-64 would give each expert 589.8M on the same budget, roughly 55x more signal.
That is the designed stress variable in this run.
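The tokens-per-expert column follows from a one-line expectation: with `K` of `E` routed experts active per token, each expert sees on average `total_tokens * K / E` under uniform routing. A quick sketch reproducing the table's arithmetic (the 524,288 tokens/step figure is the exact value behind the "524k" in the config table):

```python
def tokens_per_expert(total_tokens: int, num_experts: int, top_k: int) -> float:
    """Expected tokens seen by each routed expert under uniform routing."""
    return total_tokens * top_k / num_experts

TOTAL = 12_000 * 524_288  # 12000 steps x 524,288 tokens/step ~= 6.291B

for name, E, K in [("MoE-64", 64, 6), ("Ultra-256", 256, 7), ("Super-4096", 4096, 7)]:
    print(f"{name}: {tokens_per_expert(TOTAL, E, K) / 1e6:.1f}M tokens/expert")
```

Running this reproduces the 589.8M / 172.0M / 10.8M split above, which is where the roughly 55x signal gap comes from.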
What failed
I expected some visible breakdown:
- loss plateau
- obvious router failure
- numerical instability
What I got was more deceptive.
Loss kept improving. Training stayed stable. And the routed system still collapsed into something much smaller than the config advertised.
That is what makes this run useful. It is the cleanest example I have of why MoE needs health metrics beyond the scalar objective.
The corrected-stack collapse signature
The trusted corrected-stack rerun still shows the basic 0006 phenomenon immediately.
| train step | mean CV% | max_load |
|---|---|---|
| 100 | 1589.87 | 11.26% |
| 200 | 1333.80 | 10.68% |
| 400 | 1502.94 | 11.34% |
| 500 | 1544.99 | 11.63% |
| 2000 | 1652.59 | 13.42% |
By valid@512, loss is already down to 4.7179. By valid@2048, it is 3.6494. If you only watch loss, this run still looks alive.
The router telemetry says otherwise.
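For reference, both telemetry columns in that table can be computed from per-expert token counts. This is an illustrative reconstruction, not the repo's actual logging code: CV% is the coefficient of variation of per-expert load, and max_load is the largest single expert's share of routed tokens.

```python
import numpy as np

def router_health(counts):
    """counts: per-expert token counts for one layer, shape (E,).
    Returns (CV%, max_load): coefficient of variation of the load
    distribution, and the largest single expert's share."""
    load = counts / counts.sum()
    cv_pct = 100.0 * load.std() / load.mean()  # population std (ddof=0)
    return cv_pct, load.max()

# A collapsed layer: almost all traffic on a handful of experts.
counts = np.ones(4096)
counts[:7] = 5000.0
cv, max_load = router_health(counts)
```

On a perfectly balanced layer CV% is 0 and max_load is `1 / E`; a mean CV% above 1500 on 4096 experts means the load distribution is nowhere near its nominal shape.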
Depth is nonuniform on the corrected stack
One thing the corrected rerun changed is the layer-order story.
The old draft talked as if layer 00 collapsed first and everything else followed. The corrected receipts do not support that. On the current stack, the earliest saturation happens in higher-index layers.
Using the threshold `max_load >= 0.9 * (1 / K)`, the first-crossing steps on the corrected baseline are:
| Layer | First step crossing 0.9 * (1 / K) |
|---|---|
| 00 | 700 |
| 03 | 3600 |
| 05 | 400 |
| 06 | 100 |
| 11 | 100 |
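That table can be generated mechanically from the per-layer max_load traces. A minimal sketch, assuming telemetry is available as (step, max_load) pairs per layer (the trace values below are illustrative, not logged data):

```python
def first_crossing(trace, k: int = 7, frac: float = 0.9):
    """trace: iterable of (step, max_load) pairs for one layer, in step order.
    Returns the first step where max_load >= frac * (1 / k), else None."""
    ceiling = frac * (1.0 / k)  # 0.9 * (1/7) ~= 0.1286
    for step, max_load in trace:
        if max_load >= ceiling:
            return step
    return None

# A layer-11-style trace: saturated essentially from the first measurement.
trace = [(100, 0.135), (200, 0.138), (400, 0.141)]
step = first_crossing(trace)  # crosses at the first logged step
```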
So the honest statement is not "early layers collapse first." The honest statement is:
- collapse is depth-nonuniform
- a subset of higher-index layers saturates first
- lower-index layers catch up later
- by the time loss looks good, the whole stack is already compromised
That is a stronger precursor to 0007 anyway, because it makes the dashboard less narratively convenient and more revealing.
First falsifier: aux alone matters a lot
The laziest sentence in the old draft was that aux probably would not cure this. The corrected reruns killed that sentence.
I ran a clean bias-off pair so aux could be tested directly:
- control: `router_bias_update_rate = 0.0`, `aux_loss_alpha = 0.0`
- treatment: `router_bias_update_rate = 0.0`, `aux_loss_alpha = 1e-4`
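For context on what the aux knob applies pressure toward: the exact formulation in this codebase is not shown here, so the following is a generic Switch-Transformer-style load-balancing loss, `alpha * E * sum_i f_i * p_i`, where `f_i` is the fraction of tokens dispatched to expert i and `p_i` is the mean router probability for expert i. All names are illustrative.

```python
import numpy as np

def aux_load_balance_loss(probs, assignments, alpha=1e-4):
    """Generic Switch-style auxiliary loss (a sketch, not the repo's exact code).
    probs: (T, E) router softmax outputs; assignments: (T,) chosen expert ids.
    Minimized (value alpha) when dispatch and probability mass are uniform."""
    T, E = probs.shape
    f = np.bincount(assignments, minlength=E) / T  # dispatch fraction per expert
    p = probs.mean(axis=0)                         # mean router prob per expert
    return alpha * E * float(np.dot(f, p))
```

The loss is scale-normalized so that a perfectly uniform router scores `alpha` regardless of `E`, which is why a single `alpha = 1e-4` can be compared across expert counts.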
The treatment is a real falsifier.
| run | valid@512 | mean CV%@500 | max_load@500 | valid@2048 | mean CV%@2000 | max_load@2000 |
|---|---|---|---|---|---|---|
| Baseline 4096 | 4.7179 | 1544.99 | 11.63% | 3.6494 | 1652.59 | 13.42% |
| Aux-only (bias0, aux=1e-4) | 4.6906 | 870.13 | 4.17% | 3.5728 | 955.37 | 4.40% |
That is not a cosmetic change.
Aux materially changes the collapse geometry, and by 2048 it is also better on validation loss. So the old casual line "aux probably would not cure this" is dead.
What survives is the stronger version:
- the corrected Super-4096 baseline still collapses
- ordinary balancing pressure can change that regime a lot
- so collapse is not an inevitable property of the number 4096
Second falsifier: tokens per expert matters, but not enough by itself
Lowering E while keeping the rest of the contract fixed increases tokens per expert. That definitely matters. But it does not solve the problem in the simple threshold-law way the old draft flirted with.
Matched at 512:
| run | valid/loss | mean CV% | min entropy (nats) | worst-layer E_eff = exp(min_entropy) | layer 11 max_load |
|---|---|---|---|---|---|
| 4096 baseline | 4.7179 | 1544.99 | 2.6173 | 13.7 | 14.26% |
| 2048 experts | 4.7129 | 1095.61 | 2.6354 | 14.0 | 14.16% |
| 1024 experts | 4.7174 | 774.34 | 2.7240 | 15.2 | 14.04% |
| 4096 + aux-only | 4.6906 | 878.29 | 3.8554 | 47.3 | 4.03% |
Matched at 2048:
| run | valid/loss | mean CV% |
|---|---|---|
| 4096 baseline | 3.6494 | 1652.59 |
| 2048 experts | 3.5964 | 1075.12 |
| 1024 experts | 3.7272 | 795.86 |
| 4096 + aux-only | 3.5728 | 955.37 |
This is the important shape:
- lowering `E` clearly lowers CV
- lowering `E` materially changes entropy and can help optimization, but not monotonically across the first matched controls
- but the high-index layers still pin near the same `1 / K` ceiling
- `E = 2048` is the best non-aux control in this window, but it still looks much closer to the collapse regime than to the aux-changed regime
So tokens per expert is real. It is just not sufficient by itself.
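The E_eff column above is just the perplexity of the worst layer's load distribution: with per-expert load `l_i`, entropy `H = -sum_i l_i * log(l_i)` in nats, and `exp(H)` is the effective number of experts actually in use. A sanity check against the matched-at-512 table (values land close to the table's E_eff column, within rounding of the logged entropies):

```python
import math

def effective_experts(entropy_nats: float) -> float:
    """E_eff = exp(H): perplexity of a layer's expert-load distribution."""
    return math.exp(entropy_nats)

# Worst-layer entropies from the matched-at-512 table.
for name, h in [("4096 baseline", 2.6173), ("4096 + aux-only", 3.8554)]:
    print(f"{name}: E_eff ~= {effective_experts(h):.1f}")
```

That is the punchline in one number: a config advertising 4096 routed experts is effectively running about 14 in its worst layer, and aux only lifts that to roughly 47.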
What this run now proves
On the corrected stack, I think 0006 earns the following stronger version of the paper:
- extreme sparsity can keep improving loss while collapsing the effective routed system
- the collapse is severe, measurable, and already obvious in router telemetry long before loss would disqualify the run
- ordinary balancing pressure materially changes the collapse geometry and improves loss
- more tokens per expert matters, but it does not by itself remove the hard saturation regime
- the dashboard certifies damage, but it still does not tell me what object was damaged
That fifth point is the bridge.
Why 0007 has to exist
0006 now does exactly what I want an empirical precursor to do.
It proves the failure. It falsifies two easy stories:
- "aux probably does not matter here"
- "tokens per expert is the whole explanation"
And after those falsifiers land, it still leaves me with the same uncomfortable question:
What exactly is the thing that pretraining was building, and what exactly is the thing this collapse damaged?
The router dashboard is enough to convict the run. It is not enough to name the object that failed.
That is where 0007 begins.
Receipts
The receipt bundle for this corrected surface is `nmoe/repro/0006.receipts.json`.
The current scope is:
- trusted corrected-stack Super-4096 baseline
- clean aux-only falsifier
- matched `E = 1024` and `E = 2048` controls
The bundle is still partial. The main remaining public gap is per-expert load histograms / CCDFs.
Super-4096 did what I needed it to do. It proved that loss can hide collapse, forced me to kill a few lazy explanations, and made 0007 necessary.