Abstract
As deep learning (DL) algorithms evolve and data volumes expand, training deep neural networks (DNNs) has become essential across various domains, delivering unprecedented task accuracy. However, with the surge in dataset sizes and advances in DNN models, the training process has become increasingly time-consuming. Traditionally, DNN training has been accelerated by adding more cores or nodes for parallel training within a chip or a distributed system. As it scales, however, this approach encounters communication bottlenecks in electrical interconnect systems. A promising alternative is optical interconnection technology, which provides high-bandwidth, parallel communication through wavelength-division multiplexing (WDM) at various integration levels. Yet the fundamental differences between optical and electrical interconnects make it difficult to apply existing parallel DNN training methods directly, necessitating acceleration schemes designed specifically for DNN training on optical interconnect systems. This thesis investigates such performance optimisation in optical network-on-chip (ONoC) and optical interconnect systems, exploring how optical communication can be harnessed both to accelerate DNN training on ONoC and optical interconnect systems and to optimise communication in distributed DNN training.
Fully connected neural networks (FCNNs) are pivotal in DL, and the fully connected layer is a critical component of both convolutional neural networks and Transformers. With this in mind, we first propose a fine-grained parallel computing model for accelerating FCNN training on ONoC. The model determines the optimal number of cores for each execution stage, thereby minimising the time for one FCNN training epoch. We propose three mapping strategies for core allocation and compare their merits and drawbacks in terms of hotspot level, memory requirement, and state transitions. By balancing computation and communication within the ONoC context, our scheme fills a gap in optimising parallel FCNN training and provides a powerful tool for efficient FCNN training on ONoC.
As dataset sizes and DNN model complexity continue to grow, distributed DNN training becomes increasingly necessary in place of training on a single machine. Given the frequent use of collective communication algorithms (All-reduce and All-gather) in distributed DNN training, we propose two efficient algorithms to minimise communication time in optical interconnect systems. First, we introduce WRHT, an All-reduce algorithm for distributed data-parallel DNN training that groups nodes hierarchically and reuses wavelengths to reduce the number of communication steps and the communication time. Second, we present OpTree, an efficient All-gather algorithm for optical interconnect systems that optimises communication time by computing the ideal m-ary tree for optical routing. With WRHT and OpTree, the communication time of distributed DNN training in optical interconnect systems can be significantly reduced, enhancing the overall efficiency of the training process.
Finally, to tackle the high memory requirements and substantial communication overhead of distributed data-parallel DNN training, we present a layer-wise hybrid-parallel acceleration scheme (LHAS) to expedite distributed DNN training on optical interconnect systems. LHAS comprises an analysis of intra-layer and inter-layer communication, a cost model for communication and computation, and solutions for group communication and for DNNs with multiple branches. By determining the optimal configuration for each layer, including the parallel method and the number of nodes, LHAS minimises the total DNN training time. LHAS thus advances distributed DNN training on optical interconnect systems, offering an efficient approach to DNN model training that can inform future research.