Abstract
Parallel and distributed Deep Neural Network (DNN) training has become integral to data centers, significantly reducing DNN training time. The type of interconnect among nodes and the choice of all-reduce algorithm critically affect this speed-up. This paper examines how the efficiency of distributed DNN training differs between optical and electrical interconnect systems under different all-reduce algorithms. We first explore the Ring and Recursive Doubling (RD) all-reduce algorithms in both systems, and then formulate a communication cost model for these algorithms. We then compare their performance through extensive experiments. Our results reveal that, in 1024-node systems, the Ring algorithm outperforms the RD algorithm in optical and electrical interconnects when the transferred data exceeds 64 MB and 1024 MB, respectively. We also find that both the Ring and RD algorithms in optical interconnect systems reduce average communication time by around 75% compared to electrical interconnect systems across four different DNNs. Interestingly, the communication time of the RD algorithm, but not that of the Ring algorithm, decreases as the number of wavelengths increases in optical interconnects. These findings provide valuable insights into DNN training optimization across various interconnect systems and lay the groundwork for future related research.
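To illustrate why a data-size threshold separates the two algorithms, the following is a minimal sketch based on the textbook latency-bandwidth (alpha-beta) cost model, not the cost model formulated in this paper. It assumes a power-of-two node count, ignores reduction compute time, and does not model wavelengths or any other interconnect-specific behaviour. The function names and the values of `alpha` (per-message latency in seconds) and `beta` (transfer time per byte in seconds) are hypothetical and chosen only for illustration.

```python
import math


def ring_allreduce_time(p, n_bytes, alpha, beta):
    """Textbook estimate for Ring all-reduce.

    Reduce-scatter plus all-gather take 2(p-1) steps in total,
    and each step moves a chunk of n/p bytes.
    """
    return 2 * (p - 1) * (alpha + (n_bytes / p) * beta)


def rd_allreduce_time(p, n_bytes, alpha, beta):
    """Textbook estimate for Recursive Doubling all-reduce.

    log2(p) steps, each exchanging the full n-byte buffer.
    Assumes p is a power of two.
    """
    if p < 2 or p & (p - 1) != 0:
        raise ValueError("p must be a power of two >= 2")
    steps = p.bit_length() - 1  # exact log2 for powers of two
    return steps * (alpha + n_bytes * beta)


def crossover_bytes(p, alpha, beta):
    """Data size above which Ring is faster than RD in this simplified model.

    Obtained by setting the two estimates equal and solving for n.
    Requires p >= 4 (for p = 2 the bandwidth terms coincide).
    """
    if p < 4 or p & (p - 1) != 0:
        raise ValueError("p must be a power of two >= 4")
    steps = p.bit_length() - 1
    return (2 * (p - 1) - steps) * alpha / ((steps - 2 * (p - 1) / p) * beta)


if __name__ == "__main__":
    # Hypothetical parameters: 1 microsecond latency, 1 ns per byte.
    p, alpha, beta = 1024, 1e-6, 1e-9
    n_star = crossover_bytes(p, alpha, beta)
    print(f"crossover size in this toy model: {n_star:.0f} bytes")
    for n in (n_star / 2, n_star * 2):
        ring = ring_allreduce_time(p, n, alpha, beta)
        rd = rd_allreduce_time(p, n, alpha, beta)
        print(f"n={n:.0f} B  ring={ring:.3e} s  rd={rd:.3e} s")
```

In this simplified model, RD wins for small messages because it needs only log2(p) latency-bound steps, while Ring wins for large messages because each node sends roughly 2n bytes in total rather than n·log2(p). The thresholds reported above come from the paper's own model and experiments and will generally differ from what this sketch produces.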