RDEP
keeping sparse expert compute hot across a whole NVLink fabric
Post 0009: treat one NVLink domain as the MoE computer, and let the dense path stay boring.
0000 already hinted at the problem. Ultra-sparse MoEs strand FLOPs in exactly the wrong place: the dense path still fits replicated, but the expert path goes cold because rows fragment across too many experts.
The thing that kept bothering me was that we were inheriting more and more dense-model synchronization machinery for the wrong part of the system. Tensor parallel and collective choreography kept expanding the operational surface even when the real bottleneck was much more physical: too few rows per expert, too much padding, too much cold grouped GEMM.
RDEP came from trying to fix both problems with one move: treat one NVLink fabric domain as the execution object, keep the dense path replicated, and pool the sparse path across the whole domain.
This is a result post. The quantitative claims below come from the paper and the linked receipt bundle.
The lever turned out to be pooled expert batching. As dense replicas widen from DP=1 to DP=8, the routed owner path moves from 905 TFLOPS / 36.2% useful MFU to 1.20 PFLOPS / 48.1% useful MFU. On the shared 8xB200 point, RDEP is 2.00x faster with 2.90x lower memory. The same design also survives the jump from one 8-GPU slice to the full 72-GPU GB300NVL72 fabric.
The problem I wanted to get rid of
Sparse MoE training on modern hardware kept breaking in two coupled ways. The dense path became more ceremonial than it needed to be, while the expert path stayed cold because rows fragmented across experts and grouped GEMMs never really saturated the machine.
RDEP attacks both with the same move: treat one NVLink domain as the main execution object, keep the dense path replicated, and pool the sparse path across the whole domain.
The core move
Inside one NVLink domain, RDEP uses the same physical ranks for two logical roles. Front-end ranks route local tokens and later reconstruct outputs for those same local tokens. Owner ranks execute the shards of experts assigned to them.
The transport unit is a route row: one token-slot selection together with its activation, selected expert, gate weight, and source identity.
The forward path is intentionally explicit. Each rank routes its local tokens, materializes every selected token-slot as a route row, counts and packs rows by destination owner, dispatches them with explicit offsets, lets each owner run only the experts it owns, and then returns the outputs so the source rank can reconstruct the original token-slot outputs with the saved router weights.
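The count/pack/dispatch bookkeeping in that path can be sketched in a few lines. This is a toy single-process illustration of the idea, not RDEP's actual API; the function name and shapes are mine.

```python
import numpy as np

def pack_by_owner(owner_of_row: np.ndarray, n_owners: int):
    """Given each route row's destination owner, return the stable
    permutation that packs rows contiguously per owner, plus the
    explicit per-owner counts and start offsets used for dispatch."""
    counts = np.bincount(owner_of_row, minlength=n_owners)
    offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))  # exclusive prefix sum
    order = np.argsort(owner_of_row, kind="stable")          # pack, preserving row order
    return order, counts, offsets

# Toy example: six route rows destined for three owners.
owner_of_row = np.array([2, 0, 1, 0, 2, 2])
order, counts, offsets = pack_by_owner(owner_of_row, n_owners=3)
# counts -> [2, 1, 3]; offsets -> [0, 2, 3]
# rows for owner o live at order[offsets[o] : offsets[o] + counts[o]]
```

The point of the explicit counts and offsets is that every later stage (owner compute, return, reconstruction) can address rows by arithmetic instead of by convention.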
Two rules keep that path sane. Dispatch and return operate on explicit route rows with explicit identities. And under the uniform-T assumption, the route row from source rank r, token t, and slot k is assigned the globally unique identity row_id = (r·T + t)·K + k.
That identity is carried through dispatch, return, and backward. Reconstruction uses explicit offsets plus row_id rather than assuming valid rows happen to occupy some convenient packed prefix.
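A minimal sketch of that reconstruction, assuming the row id is the natural flattened index (r·T + t)·K + k that the uniform-T assignment suggests. The function names and shapes are illustrative, not RDEP's exact scheme.

```python
import numpy as np

T, K = 4, 2   # tokens per rank, experts per token (toy sizes)

def row_id(r, t, k):
    # Natural flattened identity under uniform T; an assumed concrete
    # form of the assignment described above.
    return (r * T + t) * K + k

def reconstruct(returned_ids, returned_out, gate, n_ranks):
    # Returned rows arrive in arbitrary owner order, each tagged with its
    # row_id; scatter by identity instead of assuming valid rows occupy
    # a convenient packed prefix.
    H = returned_out.shape[1]
    y = np.zeros((n_ranks * T, H))
    for rid, out in zip(returned_ids, returned_out):
        tok, k = divmod(rid, K)
        y[tok] += gate[tok, k] * out   # weighted sum over the K slots
    return y

# Toy usage: both slots of token 0 come back, gate weights 0.5 each.
gate = np.full((T, K), 0.5)
ids = np.array([row_id(0, 0, 0), row_id(0, 0, 1)])
outs = np.array([[2.0], [4.0]])        # H=1 toy expert outputs
y = reconstruct(ids, outs, gate, n_ranks=1)
# y[0] -> 0.5*2.0 + 0.5*4.0 = 3.0
```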
Why pooled rows matter
The first-order performance variable in sparse MoEs is not total model FLOPs. It is rows per expert.
With E total routed experts, K selected experts per token, T local tokens per rank, and DP dense replicas, the expected rows per expert rise from T·K/E on one replica to DP·T·K/E when the same tokens are pooled across the whole NVLink domain.
That is the whole game. It is what changes grouped expert compute from cold to hot.
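Plugging the paper's controlled shape into that formula reproduces the pooled row counts directly:

```python
# Expected pooled rows per expert: DP * T * K / E,
# at the controlled shape E=64, K=6, T=4096 per rank.
E, K, T = 64, 6, 4096
for DP in (1, 2, 4, 8):
    print(DP, DP * T * K // E)   # -> 384, 768, 1536, 3072
```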
The controlled ceiling
On one 8xB200 IPC domain, holding E=64, H=2048, Dff=1408, K=6, and T=4096 per rank fixed, controlled uniform routing gives:
| DP | rows/expert mu | load CV | fwd p50 (ms) | tok/s | fwd+bwd tok/s |
|---|---|---|---|---|---|
| 1 | 384 | 4.26% | 1.282 | 3.19M | 0.81M |
| 2 | 768 | 3.56% | 1.124 | 7.13M | 1.89M |
| 4 | 1,536 | 2.25% | 1.140 | 14.24M | 4.00M |
| 8 | 3,072 | 1.51% | 1.192 | 27.27M | 8.13M |
Three things matter at once: pooled rows per expert double exactly with each doubling of DP, load balance tightens rather than degrading in this controlled regime, and forward latency stays nearly flat while total throughput scales strongly.
The grouped-GEMM measurement that actually matters
The decisive measurement is not an isolated single-GEMM proxy. It is the actual grouped expert path on the routed owner.
Under the same controlled-uniform setup, direct grouped receipts show:
| DP | rows/expert mu | useful TFLOPS | useful MFU | padding factor | grouped-MM p50 (ms) |
|---|---|---|---|---|---|
| 1 | 384 | 905 | 36.2% | 1.33 | 0.470 |
| 2 | 768 | 1,091 | 43.7% | 1.17 | 0.392 |
| 4 | 1,536 | 1,178 | 47.1% | 1.09 | 0.362 |
| 8 | 3,072 | 1,203 | 48.1% | 1.04 | 0.355 |
This is the mechanism result in one table: widening the dense-replica pool makes the real routed owner path hotter and wastes less work on padding.
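The padding-factor column follows from a simple accounting model: each expert's row block is padded up to a tile multiple before the grouped GEMM, so more rows per expert amortize the same padding. This is a toy model with an illustrative tile size of 128, not the measured kernel's actual tiling.

```python
import math

def padding_factor(rows_per_expert, tile=128):
    # Toy model: each expert's row block is padded up to a tile
    # multiple; the factor is padded rows over useful rows.
    padded = sum(math.ceil(r / tile) * tile for r in rows_per_expert)
    useful = sum(rows_per_expert)
    return padded / useful

# With slight imbalance, more rows per expert amortize the padding:
print(padding_factor([300, 420, 390, 426]))          # 1.25, a cold low-DP regime
print(padding_factor([3000, 3120, 3090, 3080]))      # ~1.03, near the DP=8 regime
```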
Skew still matters, just differently
RDEP does not remove routing quality from the problem. It changes the operating point the router gets to exploit.
At DP=8, the mean rows per expert stays fixed at 3072, but the lower tail collapses under skew.
| Routing profile | p10 rows/expert | forward tok/s |
|---|---|---|
| controlled uniform | 3012.9 | 27.27M |
| alpha = 0.5 | 1878.4 | 22.40M |
| alpha = 1.0 | 873.4 | 20.73M |
| alpha = 1.5 | 358.1 | 17.00M |
| alpha = 2.0 | 140.8 | 20.13M |
The important nuance is that skew hurts RDEP mainly through owner makespan imbalance, not by making the hottest owner's grouped GEMM go cold. The bottleneck owner still sustains about 1.20-1.33 PFLOPS of useful grouped throughput across the skew sweep; what degrades is the global makespan, because some owners run much hotter while the lower tail goes cold.
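The tail collapse is easy to see in a toy simulation. This models expert popularity as Zipf(alpha) with independent per-row draws, ignoring top-K-without-replacement and any load-balancing loss, so it illustrates the shape of the effect rather than reproducing the measured numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
E, K, T, DP = 64, 6, 4096, 8   # the paper's controlled shape

def tail_rows(alpha):
    # Toy skew model: expert popularity ~ Zipf(alpha), independent draws.
    p = 1.0 / np.arange(1, E + 1) ** alpha
    p /= p.sum()
    rows = np.bincount(rng.choice(E, size=DP * T * K, p=p), minlength=E)
    return np.percentile(rows, 10), rows.max()

for a in (0.0, 0.5, 1.0, 1.5, 2.0):
    p10, hottest = tail_rows(a)
    # mean stays 3072 for every alpha; p10 collapses while the max grows
    print(a, int(p10), int(hottest))
```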
Where it paid off
The 8xB200 training result
The 8xB200 Moonlight receipt closes a real single-node training result.
| Measurement | Value |
|---|---|
| total throughput | 84,299 tok/s |
| train TFLOPS / GPU | 206.4 |
| model-adjusted ceiling | 62,654 tok/s |
| realized scaling efficiency | 134.5% |
The >100% efficiency is not a measurement error. The ceiling is computed from single-replica active-param throughput × 8. RDEP pools expert rows across all replicas, so the grouped GEMMs run hotter than any single replica could achieve alone. That is the whole point of pooled batching.
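The arithmetic behind that number is one line, using the two throughput figures from the table above:

```python
# Measured pooled throughput over a ceiling that assumes each replica
# runs only as hot as a single replica can alone.
measured = 84_299   # tok/s on the 8xB200 node
ceiling = 62_654    # tok/s, single-replica active-param throughput x 8
print(f"{measured / ceiling:.1%}")  # 134.5%
```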
That closes the loop as training, with a real throughput number and real heat in the routed path.
The overlap-region baseline
The best available shared-point comparison is Torchtitan's DeepSeek-V3 16B recipe on the same 8xB200 node. Not an exact model match, but the closest public TP + NCCL baseline in the overlap region.
At the largest jointly feasible point, 131,072 tokens/step:
| System | Throughput (tok/s) | Mean step time (ms) | Peak memory / GPU |
|---|---|---|---|
| RDEP Moonlight-16B | 24,871 | 5,274 | 58.92 GiB |
| Torchtitan TP + NCCL | 12,447 | 10,532 | 171.10 GiB |
At that shared point, RDEP is 2.00x faster and uses 2.90x less memory.
The feasibility frontier
The overlap-region result is only half the story. The bigger systems result is that RDEP admits workloads the collective baseline cannot.
| System | Largest stable point | First failing point | Observed boundary |
|---|---|---|---|
| RDEP Moonlight-16B | 262,144 tok/step | --- | 200-step run completed |
| Torchtitan TP + NCCL | 131,072 tok/step | 196,608 tok/step | OOM in routed MoE output buffer |
That feasibility gap is part of the result. RDEP is better at the shared point and also enlarges the feasible training region on the same hardware.
Target-hardware validation on GB300 fabric
RDEP was built for GB300NVL72-class fabrics rather than only one 8-GPU IPC node. The target-hardware story has two receipt lanes.
First, a matched 2-tray / 8-GPU fabric slice with the same routed shape (T=4096, H=2048, E_local=8, K=6):
| system | forward p50 / p99 | fwd+bwd p50 / p99 |
|---|---|---|
| B200 IPC | 1.22 / 1.53 ms | 3.99 / 4.24 ms |
| GB300 fabric | 1.26 / 1.43 ms | 2.38 / 2.41 ms |
The protocol survives the IPC-to-fabric transition without giving up forward latency, and the backward path improves materially on GB300 fabric with tighter tails.
Second, the full 18-tray / 72-GPU domain passes smoke, dispatch invariants, transport microbenchmarks, and routed-compute sanity at domain width.
Representative full-width receipts:
| Shape | forward p50 / p99 | fwd+bwd p50 / p99 |
|---|---|---|
| T=1024 | 4.71 / 8.46 ms | 6.32 / 8.23 ms |
| T=4096 | 6.01 / 34.27 ms | 8.48 / 51.12 ms |
At full width, the remaining pressure point is tail latency, not whether the routed design survives domain width at all.
Why the production use matters
This matters to me because RDEP was not only a benchmark trick. We used it in sustained production training on GB300NVL72-class hardware and did not see protocol-level failures.
That is a different confidence level than "the microbench looked good once."
Why I think this matters
RDEP is more than an MoE transport tweak.
Once it worked, the system got simpler exactly where it mattered. Tensor parallelism leaves the in-domain MoE hot path whenever dense replication fits. The NVLink domain becomes the object that pools sparse expert work. Expert compute runs materially hotter. The system beats the closest public overlap-region baseline where both stacks are runnable, admits workloads the collective baseline cannot run on the same hardware, and survives the target hardware from one 8-GPU slice through the full 72-GPU domain.
That is the core result.
What it does not say
Three boundaries matter. Routing quality remains separate from transport efficiency: RDEP creates the opportunity for hot grouped expert compute, but the router still has to realize it. Full-width NVL72 looks strong at the median while remaining tail-sensitive, with dispatch tail latency still the main fabric-width pressure point at the hotter shapes. And the segmented overlap argument remains analytical here; this post carries the transport and compute terms needed for the overlap inequalities, but it does not claim that one integrated training path has already been directly traced to realize the full steady-state overlap bound.
Receipts
The canonical paper is RDEP, and the linked receipt bundle keeps that paper attached to this post. The paper is the truth boundary here; this post is the guided tour. Controlled-routing ceilings, Zipf sensitivity, direct grouped-GEMM results, the overlap-region comparison, the feasibility frontier, and the GB300 fabric results all come from that paper-side surface without inventing a second, thinner story.
The longer paper carries the full analytical treatment and the exact measurement recipes.