Scalable Multicast in AI Datacenters
Background
Nowadays, ML training in a datacenter relies heavily on collective communications, yet most frameworks still build algorithms such as BROADCAST, ALL-GATHER, and ALL-REDUCE on top of unicast-based multicast. ML traffic is also bursty, intensive, and repetitive. Can we utilize the switches' Packet Replication Engine (PRE) to perform in-network replication and minimize both latency and bandwidth consumption?
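As a rough illustration of the bandwidth gap, here is a back-of-the-envelope sketch (the function name and parameters are ours, purely for illustration): unicast-based multicast pushes one full copy per receiver through the sender's NIC, while in-network replication lets the sender emit a single copy and the fabric fan it out.

```python
def sender_link_bytes(size_bytes: int, n_receivers: int, in_network: bool) -> int:
    """Bytes the sender's NIC must transmit for one BROADCAST.

    Unicast-based multicast sends a full copy per receiver; with
    switch-side replication (PRE) the sender emits a single copy
    and the switches duplicate it toward each receiver.
    """
    return size_bytes if in_network else size_bytes * n_receivers

# A 1 GiB broadcast to 8 receivers:
one_gib = 1 << 30
print(sender_link_bytes(one_gib, 8, in_network=False))  # 8 GiB on the sender link
print(sender_link_bytes(one_gib, 8, in_network=True))   # 1 GiB on the sender link
```

The gap grows linearly with the number of receivers, which is why the sender link becomes the bottleneck for unicast-based collectives at scale.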
To build a reliable and scalable multicast framework for AI training, we need to provide solutions to:
• Routing
• Host behavior
Routing
Routing is the cooperation between the sender and the switches to deliver every packet precisely to its receivers. Basically, the sender encodes multicast information, and the switches parse it to determine which ports are involved in PRE. The sender can either rely on pre-installed rules, or install and delete rules on the programmable switches on the fly.
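One simple way the sender might encode multicast information is an egress-port bitmap carried in a packet header; the switch then parses the bitmap to decide which ports to hand to the replication engine. This header format is our assumption for illustration, not a description of any particular switch:

```python
def encode_port_bitmap(ports: list[int]) -> int:
    """Sender side: pack the egress-port set into a bitmap (hypothetical header field)."""
    bitmap = 0
    for port in ports:
        bitmap |= 1 << port
    return bitmap

def ports_from_bitmap(bitmap: int) -> list[int]:
    """Switch side: recover the port set to feed into PRE."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

print(encode_port_bitmap([1, 3, 5]))   # 42 (0b101010)
print(ports_from_bitmap(42))           # [1, 3, 5]
```

A flat bitmap is trivially parseable but costs one bit per port per switch, which is one reason a more compact encoding becomes attractive at scale.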
We introduce PEEL, a power-of-two prefix aggregation scheme that compresses per-switch state from exponential to linear. Please check our paper at HotNets '25:
One to Many: Closing the Bandwidth Gap in AI Datacenters with Scalable Multicast
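The paper gives PEEL's actual encoding; as a rough, hypothetical sketch of why power-of-two prefix aggregation shrinks state, consider covering a contiguous receiver-rank range with maximal power-of-two-aligned blocks. Each block corresponds to a single prefix rule, so a range needs only logarithmically many rules instead of one rule per arbitrary subset:

```python
def split_into_pow2_blocks(lo: int, hi: int) -> list[tuple[int, int]]:
    """Cover the half-open range [lo, hi) with maximal aligned
    power-of-two blocks, returned as (start, size) pairs.

    Each block is aligned to its own size, so it can be expressed
    as a single prefix match rule.
    """
    blocks = []
    while lo < hi:
        if lo == 0:
            size = 1 << max((hi - 1).bit_length(), 1) >> 1 or 1
            size = 1 << (hi - 1).bit_length() if hi > 1 else 1
        else:
            size = lo & -lo  # largest power of two that lo is aligned to
        while lo + size > hi:
            size >>= 1      # shrink until the block fits in the range
        blocks.append((lo, size))
        lo += size
    return blocks

print(split_into_pow2_blocks(0, 10))   # [(0, 8), (8, 2)]
print(split_into_pow2_blocks(3, 10))   # [(3, 1), (4, 4), (8, 2)]
```

At most two blocks are needed per bit of the range width, so the rule count stays linear in log(range) rather than exponential in the number of possible receiver subsets.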
Host behavior
On the host side, we have to take congestion control and loss recovery into consideration while building the framework. Also, your network does not stop at the NIC: how fast can your application actually consume the data? How do we support RC semantics when running over RDMA? How do we achieve out-of-order delivery with in-order completion? Should retransmission be implemented in the network layer or the application layer? What is the recovery strategy?
Going back to our scenario, AI training: what issues are we actually encountering? Bandwidth consumption, incast, and QP scalability. We do not need ultra-low latency, but we do want to alleviate the long-tail latency problem. What would be the solution? Direct Data Placement (DDP) and dynamic load balancing.
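The combination of DDP with out-of-order delivery and in-order completion can be sketched as follows (a minimal illustration under our own assumptions, not the framework's actual receive path): each packet carries its destination offset, so the payload lands directly in the final buffer regardless of arrival order, while the completion pointer advances only over the contiguous prefix.

```python
class DdpReceiver:
    """Sketch: out-of-order delivery, in-order completion via DDP.

    Packets are placed straight into the destination buffer by offset
    (no reassembly queue, no copy on reorder); completions are reported
    only up to the contiguous prefix, preserving in-order semantics
    for the application.
    """

    def __init__(self, total_len: int, chunk: int):
        assert total_len % chunk == 0
        self.buf = bytearray(total_len)
        self.chunk = chunk
        self.arrived = set()   # chunk indices that have landed
        self.completed = 0     # bytes completed, in order

    def on_packet(self, offset: int, payload: bytes) -> int:
        # Direct placement: write payload at its final offset immediately.
        self.buf[offset:offset + len(payload)] = payload
        self.arrived.add(offset // self.chunk)
        # Advance the completion pointer only over the contiguous prefix.
        while (self.completed < len(self.buf)
               and self.completed // self.chunk in self.arrived):
            self.completed += self.chunk
        return self.completed

rx = DdpReceiver(total_len=4096, chunk=1024)
print(rx.on_packet(1024, bytes(1024)))  # 0    (hole at offset 0)
print(rx.on_packet(0, bytes(1024)))     # 2048 (prefix now contiguous)
print(rx.on_packet(3072, bytes(1024)))  # 2048 (hole at offset 2048)
print(rx.on_packet(2048, bytes(1024)))  # 4096 (all data completed)
```

Because placement and completion are decoupled, a lost or reordered packet delays only the completion report, not the delivery of later packets, which helps with the long-tail latency problem above.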
As a reminder, here are [RFC1925] The Twelve Networking Truths:
(1) It Has To Work.
(2) No matter how hard you push and no matter what the priority, you can't increase the speed of light.
(2a) (corollary). No matter how hard you try, you can't make a baby in much less than 9 months. Trying to speed this up *might* make it slower, but it won't make it happen any quicker.
(3) With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea. It is hard to be sure where they are going to land, and it could be dangerous sitting under them as they fly overhead.
(4) Some things in life can never be fully appreciated nor understood unless experienced firsthand. Some things in networking can never be fully understood by someone who neither builds commercial networking equipment nor runs an operational network.
(5) It is always possible to aglutenate multiple separate problems into a single complex interdependent solution. In most cases this is a bad idea.
(6) It is easier to move a problem around (for example, by moving the problem to a different part of the overall network architecture) than it is to solve it.
(6a) (corollary). It is always possible to add another level of indirection.
(7) It is always something.
(7a) (corollary). Good, Fast, Cheap: Pick any two (you can't have all three).
(8) It is more complicated than you think.
(9) For all resources, whatever it is, you need more.
(9a) (corollary) Every networking problem always takes longer to solve than it seems like it should.
(10) One size never fits all.
(11) Every old idea will be proposed again with a different name and a different presentation, regardless of whether it works.
(11a) (corollary). See rule 6a.
(12) In protocol design, perfection has been reached not when there is nothing left to add, but when there is nothing left to take away.