Performance Comparison of Distributed DNN Training on Optical Versus Electrical Interconnect Systems

Fei Dai; Yawen Chen; Zhiyi Huang; Haibo Zhang; Hui Tian

doi:10.1007/978-981-97-0834-5_23

Conference proceeding

Peer reviewed

ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2023, PT I, Vol.14487, pp.401-418

Lecture Notes in Computer Science

01/01/2024

DOI: https://doi.org/10.1007/978-981-97-0834-5_23

Abstract

Computer Science

Computer Science, Artificial Intelligence

Computer Science, Software Engineering

Computer Science, Theory & Methods

Science & Technology

Technology

Parallel and distributed Deep Neural Network (DNN) training has become integral in data centers, significantly reducing DNN training time. The type of interconnect among nodes and the chosen all-reduce algorithm critically affect this speed-up. This paper examines the efficiency differences in distributed DNN training across optical and electrical interconnect systems using various all-reduce algorithms. We first explore the Ring and Recursive Doubling (RD) all-reduce algorithms in both systems, then formulate a communication cost model for these algorithms. A performance comparison is then carried out through extensive experiments. Our results reveal that, in 1024-node systems, the Ring algorithm outperforms the RD algorithm when the data transferred exceeds 64 MB in optical interconnects and 1024 MB in electrical interconnects. We also find that both the Ring and RD algorithms in optical interconnect systems reduce average communication time by around 75% compared to electrical interconnect systems across four different DNNs. Interestingly, the communication time of the RD algorithm, but not that of the Ring algorithm, decreases as the number of wavelengths increases in optical interconnects. These findings provide valuable insights into DNN training optimization across various interconnect systems and lay the groundwork for future related research.
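The abstract does not reproduce the paper's communication cost model, so the sketch below is not the authors' formulation. It is a minimal illustration using the textbook latency-bandwidth (alpha-beta) model, an assumption made here, to show why a data-size threshold can exist above which Ring beats RD. The function names and the alpha and beta values in the usage note are hypothetical, and the model ignores reduction compute, wavelength parallelism, and link contention, so it should not be expected to match the 64 MB and 1024 MB thresholds reported above.

```python
import math


def _log2_nodes(p: int) -> int:
    """Return log2(p), requiring p to be a power of two and at least 2."""
    if p < 2 or p & (p - 1) != 0:
        raise ValueError("p must be a power of two and at least 2")
    return p.bit_length() - 1


def ring_allreduce_time(p: int, n_bytes: float, alpha: float, beta: float) -> float:
    """Ring all-reduce under the alpha-beta model (an assumed model).

    Reduce-scatter plus all-gather: 2(p-1) steps, each moving n/p bytes.
    alpha is per-message latency (s), beta is per-byte transfer time (s/byte).
    """
    _log2_nodes(p)  # validate p
    return 2 * (p - 1) * (alpha + (n_bytes / p) * beta)


def rd_allreduce_time(p: int, n_bytes: float, alpha: float, beta: float) -> float:
    """Recursive doubling under the same assumed model.

    log2(p) steps, each exchanging the full n bytes with a partner.
    """
    return _log2_nodes(p) * (alpha + n_bytes * beta)


def crossover_bytes(p: int, alpha: float, beta: float) -> float:
    """Message size at which the two modelled times are equal.

    Above this size the modelled Ring time is lower; below it RD is lower.
    Returns math.inf when Ring never wins in this model (p == 2).
    """
    steps = _log2_nodes(p)
    bandwidth_gap = steps - 2 * (p - 1) / p  # RD's extra per-byte cost factor
    if bandwidth_gap <= 0:
        return math.inf
    latency_gap = 2 * (p - 1) - steps  # Ring's extra latency terms
    return alpha * latency_gap / (beta * bandwidth_gap)
```

For example, calling crossover_bytes(1024, alpha=1e-6, beta=1e-9) with those purely illustrative parameters gives the size at which the two modelled curves cross for 1024 nodes. In this model, RD pays fewer latency terms but sends the full buffer at every step, while Ring pays many latency terms but sends only 1/p of the buffer per step, which is consistent in direction with the abstract's observation that Ring wins for large transfers.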

Details

Record Identifier: 9926549494001891
Title: Performance Comparison of Distributed DNN Training on Optical Versus Electrical Interconnect Systems
Creators: Fei Dai
Yawen Chen
Zhiyi Huang
Haibo Zhang
Hui Tian
Contributors: Z Tari (Editor)
K Li (Editor)
H Wu (Editor)
Publication Details: ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2023, PT I, Vol.14487, pp.401-418
Academic Unit: Computer Science
Publisher: Springer Nature
Date published ; e-published: 01/01/2024
Language: English
Resource Type; Subtype: Conference proceeding
