A training job is only as fast as its slowest GPU finishing a collective. That single sentence explains most of what makes AI networking different from everything that came before it.
Collectives turn the network into a barrier
When 256 GPUs run an all-reduce, every rank waits for the last byte before the next step begins. There is no “mostly done.” A microburst that delays one flow by two milliseconds doesn’t slow that flow — it stalls the entire job for two milliseconds, on every device.
The uncomfortable truth: tail latency is the only latency that matters on a GPU fabric. Your p50 is a vanity metric.
Where the buffers go
The math is unforgiving. To absorb a burst without dropping, a switch needs buffer proportional to bandwidth × round-trip time — and at 400G that number gets large fast:
- Shallow-buffer switches drop, and a drop on a lossless fabric triggers pause, which spreads.
- Deep-buffer switches add latency, which is the thing you were trying to avoid.
- The answer is almost never “more buffer.” It’s PFC, ECN, and a congestion-control loop tuned for the fabric.
Get the buffer and back-pressure story right and the fabric disappears into the background — which is exactly where it belongs.