Abstract

Current optimizers for collective communication in distributed machine learning training either fail to scale to the sizes today's cloud operators need or sacrifice significant solution quality in pursuit of scalability. Without such optimizers, GPUs in these clusters sit idle for significant periods and resources are wasted.

TE-CCL takes a traffic-engineering-based approach to collective communication. Compared to TACCL, a state-of-the-art optimizer, TE-CCL produced schedules with $2\times$ better performance on the topologies TACCL supports, in the same amount of time. TE-CCL also scales to larger topologies than TACCL. On our GPU testbed, TE-CCL outperformed TACCL by $2.14\times$ and RCCL by $3.18\times$.