RDEP
keeping sparse expert compute hot across a whole NVLink fabric
Post 0009: treat one NVLink domain as the MoE computer, and let the dense path stay boring.
0000 already hinted at the problem. Ultra-sparse MoEs strand FLOPs in exactly the wrong place: the dense path still fits replicated, but the expert path goes cold because rows fragment across too many experts.
The thing that kept bothering me was that we were inheriting more and more dense-model synchronization machinery for the wrong part of the system. Tensor parallel and collective choreography kept expanding the operational surface even when the real bottleneck was much more physical: too few rows per expert, too much padding, too much cold grouped GEMM.
RDEP came from trying to fix both problems with one move: treat one NVLink fabric domain as the execution object, keep the dense path replicated, and pool the sparse path across the whole domain.
This is a result post. The quantitative claims below come from the paper and the linked receipt bundle.
The lever turned out to be pooled expert batching. As dense replicas widen from DP=1 to DP=8, the routed owner path moves from 905 TFLOPS / 36.2% useful MFU to 1.20 PFLOPS / 48.1% useful MFU. On the shared 8xB200 point, RDEP is 2.00x faster with 2.90x lower memory. The same design also survives the jump from one 8-GPU slice to the full 72-GPU GB300NVL72 fabric.
The problem I wanted to get rid of
Sparse MoE training on modern hardware kept breaking in two coupled ways. The dense path became more ceremonial than it needed to be, while the expert path stayed cold because rows fragmented across experts and grouped GEMMs never really saturated the machine.
RDEP attacks both with the same move: treat one NVLink domain as the main execution object, keep the dense path replicated, and pool the sparse path across the whole domain.
The core move
Inside one NVLink domain, RDEP uses the same physical ranks for two logical roles. Front-end ranks route local tokens and later reconstruct outputs for those same local tokens. Owner ranks execute the shards of experts assigned to them.
The transport unit is a route row: one token-slot selection together with its activation, selected expert, gate weight, and source identity.
The forward path is intentionally explicit. Each rank routes its local tokens, materializes every selected token-slot as a route row, counts and packs rows by destination owner, dispatches them with explicit offsets, lets each owner run only the experts it owns, and then returns the outputs so the source rank can reconstruct the original token-slot outputs with the saved router weights.
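The count/pack/dispatch bookkeeping in that path can be sketched in a few lines. This is a toy single-process illustration of the idea, not RDEP's actual API; the function name and shapes are mine.

```python
import numpy as np

def pack_by_owner(owner_of_row: np.ndarray, n_owners: int):
    """Given each route row's destination owner, return the stable
    permutation that packs rows contiguously per owner, plus the
    explicit per-owner counts and start offsets used for dispatch."""
    counts = np.bincount(owner_of_row, minlength=n_owners)
    offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))  # exclusive prefix sum
    order = np.argsort(owner_of_row, kind="stable")          # pack, preserving row order
    return order, counts, offsets

# Toy example: six route rows destined for three owners.
owner_of_row = np.array([2, 0, 1, 0, 2, 2])
order, counts, offsets = pack_by_owner(owner_of_row, n_owners=3)
# counts -> [2, 1, 3]; offsets -> [0, 2, 3]
# rows for owner o live at order[offsets[o] : offsets[o] + counts[o]]
```

The point of the explicit counts and offsets is that every later stage (owner compute, return, reconstruction) can address rows by arithmetic instead of by convention.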
Two rules keep that path sane. Dispatch and return operate on explicit route rows with explicit identities. And under the uniform-T assumption, the route row from source rank r, token t, and slot k is assigned the globally unique identity row_id = (r·T + t)·K + k.
That identity is carried through dispatch, return, and backward. Reconstruction uses explicit offsets plus row_id rather than assuming valid rows happen to occupy some convenient packed prefix.
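A minimal sketch of that reconstruction, assuming the row id is the natural flattened index (r·T + t)·K + k that the uniform-T assignment suggests. The function names and shapes are illustrative, not RDEP's exact scheme.

```python
import numpy as np

T, K = 4, 2   # tokens per rank, experts per token (toy sizes)

def row_id(r, t, k):
    # Natural flattened identity under uniform T; an assumed concrete
    # form of the assignment described above.
    return (r * T + t) * K + k

def reconstruct(returned_ids, returned_out, gate, n_ranks):
    # Returned rows arrive in arbitrary owner order, each tagged with its
    # row_id; scatter by identity instead of assuming valid rows occupy
    # a convenient packed prefix.
    H = returned_out.shape[1]
    y = np.zeros((n_ranks * T, H))
    for rid, out in zip(returned_ids, returned_out):
        tok, k = divmod(rid, K)
        y[tok] += gate[tok, k] * out   # weighted sum over the K slots
    return y

# Toy usage: both slots of token 0 come back, gate weights 0.5 each.
gate = np.full((T, K), 0.5)
ids = np.array([row_id(0, 0, 0), row_id(0, 0, 1)])
outs = np.array([[2.0], [4.0]])        # H=1 toy expert outputs
y = reconstruct(ids, outs, gate, n_ranks=1)
# y[0] -> 0.5*2.0 + 0.5*4.0 = 3.0
```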
Why pooled rows matter
The first-order performance variable in sparse MoEs is not total model FLOPs. It is rows per expert.
With E total routed experts, K selected experts per token, T local tokens per rank, and DP dense replicas, the expected rows per expert rise from T·K/E on one replica to DP·T·K/E when the same tokens are pooled across the whole NVLink domain.
That is the whole game. It is what changes grouped expert compute from cold to hot.
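Plugging the paper's controlled shape into that formula reproduces the pooled row counts directly:

```python
# Expected pooled rows per expert: DP * T * K / E,
# at the controlled shape E=64, K=6, T=4096 per rank.
E, K, T = 64, 6, 4096
for DP in (1, 2, 4, 8):
    print(DP, DP * T * K // E)   # -> 384, 768, 1536, 3072
```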
The controlled ceiling
On one 8xB200 IPC domain, holding E=64, H=2048, Dff=1408, K=6, and T=4096 per rank fixed, controlled uniform routing gives:
| DP | rows/expert mu | load CV | fwd p50 (ms) | tok/s | fwd+bwd tok/s |
|---|---|---|---|---|---|
| 1 | 384 | 4.26% | 1.282 | 3.19M | 0.81M |
| 2 | 768 | 3.56% | 1.124 | 7.13M | 1.89M |
| 4 | 1,536 | 2.25% | 1.140 | 14.24M | 4.00M |
| 8 | 3,072 | 1.51% | 1.192 | 27.27M | 8.13M |
Three things matter at once: pooled rows per expert double exactly with each doubling of DP, load balance tightens rather than degrading in this controlled regime, and forward latency stays nearly flat while total throughput scales strongly.
The grouped-GEMM measurement that actually matters
The decisive measurement is not an isolated single-GEMM proxy. It is the actual grouped expert path on the routed owner.
Under the same controlled-uniform setup, direct grouped receipts show:
| DP | rows/expert mu | useful TFLOPS | useful MFU | padding factor | grouped-MM p50 (ms) |
|---|---|---|---|---|---|
| 1 | 384 | 905 | 36.2% | 1.33 | 0.470 |
| 2 | 768 | 1,091 | 43.7% | 1.17 | 0.392 |
| 4 | 1,536 | 1,178 | 47.1% | 1.09 | 0.362 |
| 8 | 3,072 | 1,203 | 48.1% | 1.04 | 0.355 |
This is the mechanism result in one table: widening the dense-replica pool makes the real routed owner path hotter and wastes less work on padding.
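The padding-factor column follows from a simple accounting model: each expert's row block is padded up to a tile multiple before the grouped GEMM, so more rows per expert amortize the same padding. This is a toy model with an illustrative tile size of 128, not the measured kernel's actual tiling.

```python
import math

def padding_factor(rows_per_expert, tile=128):
    # Toy model: each expert's row block is padded up to a tile
    # multiple; the factor is padded rows over useful rows.
    padded = sum(math.ceil(r / tile) * tile for r in rows_per_expert)
    useful = sum(rows_per_expert)
    return padded / useful

# With slight imbalance, more rows per expert amortize the padding:
print(padding_factor([300, 420, 390, 426]))          # 1.25, a cold low-DP regime
print(padding_factor([3000, 3120, 3090, 3080]))      # ~1.03, near the DP=8 regime
```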
Skew still matters, just differently
RDEP does not remove routing quality from the problem. It changes the operating point the router gets to exploit.
At DP=8, the mean rows per expert stays fixed at 3072, but the lower tail collapses under skew.
| Routing profile | p10 rows/expert | forward tok/s |
|---|---|---|
| controlled uniform | 3012.9 | 27.27M |
| alpha = 0.5 | 1878.4 | 22.40M |
| alpha = 1.0 | 873.4 | 20.73M |
| alpha = 1.5 | 358.1 | 17.00M |
| alpha = 2.0 | 140.8 | 20.13M |
The important nuance is that skew hurts RDEP mainly through owner makespan imbalance, not by making the hottest owner's grouped GEMM go cold. The bottleneck owner still sustains about 1.20-1.33 PFLOPS of useful grouped throughput across the skew sweep; what degrades is the global makespan, because some owners run much hotter while the lower tail goes cold.
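The tail collapse is easy to see in a toy simulation. This models expert popularity as Zipf(alpha) with independent per-row draws, ignoring top-K-without-replacement and any load-balancing loss, so it illustrates the shape of the effect rather than reproducing the measured numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
E, K, T, DP = 64, 6, 4096, 8   # the paper's controlled shape

def tail_rows(alpha):
    # Toy skew model: expert popularity ~ Zipf(alpha), independent draws.
    p = 1.0 / np.arange(1, E + 1) ** alpha
    p /= p.sum()
    rows = np.bincount(rng.choice(E, size=DP * T * K, p=p), minlength=E)
    return np.percentile(rows, 10), rows.max()

for a in (0.0, 0.5, 1.0, 1.5, 2.0):
    p10, hottest = tail_rows(a)
    # mean stays 3072 for every alpha; p10 collapses while the max grows
    print(a, int(p10), int(hottest))
```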
Where it paid off
The 8xB200 training result
The 8xB200 Moonlight receipt closes a real single-node training result.
| Measurement | Value |
|---|---|
| total throughput | 84,299 tok/s |
| train TFLOPS / GPU | 206.4 |
| model-adjusted ceiling | 62,654 tok/s |
| realized scaling efficiency | 134.5% |
The >100% efficiency is not a measurement error. The ceiling is computed from single-replica active-param throughput × 8. RDEP pools expert rows across all replicas, so the grouped GEMMs run hotter than any single replica could achieve alone. That is the whole point of pooled batching.
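The arithmetic behind that number is one line, using the two throughput figures from the table above:

```python
# Measured pooled throughput over a ceiling that assumes each replica
# runs only as hot as a single replica can alone.
measured = 84_299   # tok/s on the 8xB200 node
ceiling = 62_654    # tok/s, single-replica active-param throughput x 8
print(f"{measured / ceiling:.1%}")  # 134.5%
```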
That closes the loop as training, with a real throughput number and real heat in the routed path.
The overlap-region baseline
The best available shared-point comparison is Torchtitan's DeepSeek-V3 16B recipe on the same 8xB200 node. Not an exact model match, but the closest public TP + NCCL baseline in the overlap region.
At the largest jointly feasible point, 131,072 tokens/step:
| System | Throughput (tok/s) | Mean step time (ms) | Peak memory / GPU |
|---|---|---|---|
| RDEP Moonlight-16B | 24,871 | 5,274 | 58.92 GiB |
| Torchtitan TP + NCCL | 12,447 | 10,532 | 171.10 GiB |
At that shared point, RDEP is 2.00x faster and uses 2.90x less memory.
The feasibility frontier
The overlap-region result is only half the story. The bigger systems result is that RDEP admits workloads the collective baseline cannot.
| System | Largest stable point | First failing point | Observed boundary |
|---|---|---|---|
| RDEP Moonlight-16B | 262,144 tok/step | --- | 200-step run completed |
| Torchtitan TP + NCCL | 131,072 tok/step | 196,608 tok/step | OOM in routed MoE output buffer |
That feasibility gap is part of the result. RDEP is better at the shared point and also enlarges the feasible training region on the same hardware.
Target-hardware validation on GB300 fabric
RDEP was built for GB300NVL72-class fabrics rather than only one 8-GPU IPC node. The target-hardware story has two receipt lanes.
First, a matched 2-tray / 8-GPU fabric slice with the same routed shape (T=4096, H=2048, E_local=8, K=6):
| system | forward p50 / p99 | fwd+bwd p50 / p99 |
|---|---|---|
| B200 IPC | 1.22 / 1.53 ms | 3.99 / 4.24 ms |
| GB300 fabric | 1.26 / 1.43 ms | 2.38 / 2.41 ms |
The protocol survives the IPC-to-fabric transition without giving up forward latency, and the backward path improves materially on GB300 fabric with tighter tails.
Second, the full 18-tray / 72-GPU domain passes smoke, dispatch invariants, transport microbenchmarks, and routed-compute sanity at domain width.
Representative full-width receipts:
| Shape | forward p50 / p99 | fwd+bwd p50 / p99 |
|---|---|---|
| T=1024 | 4.71 / 8.46 ms | 6.32 / 8.23 ms |
| T=4096 | 6.01 / 34.27 ms | 8.48 / 51.12 ms |
At full width, the remaining pressure point is tail latency, not whether the routed design survives domain width at all.
Why the production use matters
This matters to me because RDEP was not only a benchmark trick. We used it in sustained production training on GB300NVL72-class hardware and did not see protocol-level failures.
That is a different confidence level than "the microbench looked good once."
Why I think this matters
RDEP is more than an MoE transport tweak.
Once it worked, the system got simpler exactly where it mattered. Tensor parallelism leaves the in-domain MoE hot path whenever dense replication fits. The NVLink domain becomes the object that pools sparse expert work. Expert compute runs materially hotter. The system beats the closest public overlap-region baseline where both stacks are runnable, admits workloads the collective baseline cannot run on the same hardware, and survives the target hardware from one 8-GPU slice through the full 72-GPU domain.
That is the core result.
What it does not say
Three boundaries matter. Routing quality remains separate from transport efficiency: RDEP creates the opportunity for hot grouped expert compute, but the router still has to realize it. Full-width NVL72 looks strong at the median while remaining tail-sensitive, with dispatch tail latency still the main fabric-width pressure point at the hotter shapes. And the segmented overlap argument remains analytical here; this post carries the transport and compute terms needed for the overlap inequalities, but it does not claim that one integrated training path has already been directly traced to realize the full steady-state overlap bound.
Receipts
The canonical paper is RDEP, and the linked receipt bundle keeps that paper attached to this post. The paper is the truth boundary here; this post is the guided tour. Controlled-routing ceilings, Zipf sensitivity, direct grouped-GEMM results, the overlap-region comparison, the feasibility frontier, and the GB300 fabric results all come from that paper-side surface without inventing a second, thinner story.
The longer paper carries the full analytical treatment and the exact measurement recipes.