Efficient All-Reduce for Distributed DNN Training in Optical Interconnect Systems

Fei Dai; Yawen Chen; Zhiyi Huang; Haibo Zhang; Fangfang Zhang

doi:10.1145/3572848.3577391

Conference proceeding

Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp.422-424

PPoPP '23: The 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

25/02/2023

DOI: https://doi.org/10.1145/3572848.3577391

Abstract

Computing methodologies -- Artificial intelligence -- Distributed artificial intelligence

Computing methodologies -- Parallel computing methodologies -- Parallel algorithms

All-reduce is a crucial communication primitive for reducing model parameters in distributed Deep Neural Network (DNN) training. Most existing all-reduce algorithms are designed for traditional electrical interconnect systems, whose low data bandwidth cannot meet the communication requirements of distributed training for large DNNs. One promising alternative to electrical interconnect is optical interconnect, which can provide high bandwidth, low transmission delay, and low power cost. We propose an efficient scheme called WRHT (Wavelength Reused Hierarchical Tree) for implementing the all-reduce operation in optical interconnect systems. WRHT takes advantage of WDM (Wavelength Division Multiplexing) to reduce the communication time of distributed data-parallel DNN training. Simulations using real DNN models show that, compared to all-reduce algorithms in electrical and optical network systems, our approach reduces communication time by 75.76% and 91.86%, respectively.

Details

Record Identifier: 9926511482101891
Title: Efficient All-Reduce for Distributed DNN Training in Optical Interconnect Systems
Creators: Fei Dai; Yawen Chen; Zhiyi Huang; Haibo Zhang; Fangfang Zhang
Publication Details: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp.422-424
Conference: PPoPP '23: The 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
Academic Unit: Computer Science
Publisher: ACM
Date published; e-published: 25/02/2023
Language: English
Resource Type: Conference proceeding
