Abstract

As Deep Learning (DL) models grow larger and more complex, training jobs are increasingly distributed across multiple Computing Units (CUs) such as GPUs and TPUs. Each CU processes a sub-part of the model and synchronizes results with the others. Communication among these CUs has emerged as a key bottleneck in the training process. In this work, we present SiPAC, a Silicon Photonic Accelerated Compute cluster. SiPAC accelerates distributed DL training by means of two co-designed components: a photonic physical layer and a novel collective algorithm. The physical layer exploits embedded photonics to bring peta-scale I/O directly to the CUs of a DL-optimized cluster and uses resonator-based optical wavelength selectivity to realize hardware multicasting. The collective algorithm builds on this hardware multicasting primitive. This combination expedites a variety of collective communications commonly employed in DL training and has the potential to drastically ease the communication bottleneck. We demonstrate the feasibility of the SiPAC architecture through 1) an optical testbed experiment in which an array of comb laser wavelengths is shuffled by a cascaded ring switch, with each ring selecting and forwarding multiple wavelengths to increase the effective communication bandwidth, thereby demonstrating the hardware multicasting primitive, and 2) a four-GPU testbed running a realistic DL workload that achieves a 22% system-level performance improvement relative to a similarly sized leaf-spine topology. Large-scale simulations show that SiPAC achieves a 1.4× to 5.9× communication time reduction compared to state-of-the-art compute clusters for representative collective communications.
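To make the role of a multicast primitive concrete, the toy sketch below contrasts the step count of a conventional ring allgather with an allgather built on one-to-many multicasting, under the simplifying assumption that all multicasts proceed concurrently. This is an illustrative model only, not the SiPAC collective algorithm from the paper; the function names and the step-count accounting are assumptions made for the example.

```python
# Illustrative sketch only: NOT the SiPAC collective algorithm. It contrasts a
# ring allgather with an allgather built on a hardware multicast primitive
# (one sender reaching many receivers in a single step). All names are
# hypothetical and chosen for this example.

def ring_allgather(shards):
    """Ring allgather: each worker forwards one shard to its neighbor per step; N-1 steps."""
    n = len(shards)
    buffers = [[s] for s in shards]           # each worker starts with its own shard
    steps = 0
    for _ in range(n - 1):
        # worker i forwards the shard it received most recently to worker i+1
        sending = [buf[-1] for buf in buffers]
        for i in range(n):
            buffers[(i + 1) % n].append(sending[i])
        steps += 1
    return buffers, steps

def multicast_allgather(shards):
    """Multicast allgather: every worker multicasts its shard to all peers; 1 step
    under the assumption that all multicasts proceed concurrently."""
    n = len(shards)
    buffers = [[shards[i]] for i in range(n)]
    for src in range(n):                      # each source drives one multicast
        for dst in range(n):
            if dst != src:
                buffers[dst].append(shards[src])
    return buffers, 1

if __name__ == "__main__":
    shards = [f"grad_shard_{i}" for i in range(8)]
    _, ring_steps = ring_allgather(shards)
    _, mc_steps = multicast_allgather(shards)
    print(f"ring allgather steps:      {ring_steps}")   # 7
    print(f"multicast allgather steps: {mc_steps}")     # 1
```

Running the sketch with eight workers shows the ring variant needing seven sequential exchange steps while the multicast variant completes in one, which is the intuition behind pairing a hardware multicast primitive with multicast-aware collectives.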
