Nvidia assembled the world’s 7th fastest supercomputer in one month
In a nutshell: Nvidia has detailed the assembly process of Selene, which became the world’s seventh-fastest supercomputer in June. The entire machine was assembled amid the pandemic in just three and a half weeks by a socially distanced team of six, plus a handy robot named Trip.
Selene is an unusual supercomputer. It uses Nvidia’s commercially available, GPU-accelerated DGX SuperPOD architecture instead of the custom CPU-heavy designs that dominate most of the Top500 list. It also ranks second on the Green500 list of the most power-efficient supercomputers.
In numbers, Selene uses 560 AMD Epyc 7742 CPUs (64 cores each) and 2,240 Nvidia A100 GPUs. Its peak theoretical performance is just under 35,000 teraflops (35 petaflops).
Nvidia’s previous supercomputers took months to construct and were extremely difficult to maintain and upgrade. When designing Selene, the company tried to make it as simple and modular as possible. Each of Selene’s 280 nodes is a standardized DGX pod containing eight Nvidia A100 GPUs and two AMD Epyc CPUs. A handful of pods are stacked in a glorified filing cabinet (just being honest), and these are strung together in groups of sixteen to form a SuperPOD.
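The per-node figures line up with the system totals quoted earlier, and a few lines of Python make the arithmetic explicit. This is only a sanity check on the numbers in the article; the constant names are illustrative and do not come from Nvidia.

```python
# Figures quoted in the article; the names are illustrative only.
NODES = 280          # standardized DGX pods
GPUS_PER_NODE = 8    # Nvidia A100 GPUs per pod
CPUS_PER_NODE = 2    # AMD Epyc 7742 CPUs per pod
CORES_PER_CPU = 64   # cores per Epyc 7742

total_gpus = NODES * GPUS_PER_NODE
total_cpus = NODES * CPUS_PER_NODE
total_cpu_cores = total_cpus * CORES_PER_CPU

print(f"GPUs:      {total_gpus}")       # 2240
print(f"CPUs:      {total_cpus}")       # 560
print(f"CPU cores: {total_cpu_cores}")  # 35840
```

The 2,240 GPUs and 560 CPUs match the headline numbers exactly, which is what you would expect from a machine built out of identical building blocks.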
Selene’s homogeneity is what enabled it to be assembled so quickly. It was mostly a matter of moving each DGX pod into the right spot, wiring it up, and checking that it worked. Wiring a supercomputer is always a tricky job (particularly when the team has to stay six feet apart), but Nvidia used Mellanox’s InfiniBand switches to reduce the number of cables required while simultaneously increasing bandwidth.
Selene is cooled on a per-SuperPOD basis. All of the SuperPODs reside in one giant air-conditioned warehouse. They’re raised off the ground with fans underneath to push the cool air up into the DGX pods. Nvidia’s tiny assembly team only needed to install the flooring and seal up the SuperPODs to control the flow of air.
Nvidia got creative with the monitoring equipment for Selene. They purchased a little robot called Trip, who can be controlled remotely and wheeled around to observe the goings-on inside Selene. They also built a bot for Slack that sends them notifications when the hardware is misbehaving or when a cable has come loose.
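Nvidia has not described how its Slack bot is built, so the following is only a rough sketch of how such an alert could be posted. It assumes a Slack incoming webhook, which accepts a JSON body with a `text` field; the node name, message format, and function names are all hypothetical.

```python
import json
import urllib.request


def build_alert_payload(node: str, problem: str) -> dict:
    """Build the JSON body for a Slack incoming webhook (hypothetical format)."""
    return {"text": f":warning: {node}: {problem}"}


def send_alert(webhook_url: str, payload: dict) -> int:
    """POST the payload to the given webhook URL and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Build a message for a made-up node; nothing is sent here.
payload = build_alert_payload("dgx-042", "InfiniBand cable appears loose")
print(json.dumps(payload))
```

Calling `send_alert` with a real webhook URL would deliver the message to a channel. How the faults themselves are detected is a separate problem that the article does not go into.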
Selene is currently working on about a thousand tasks, most of them oriented around AI development and neural network training. Its spare cycles are dedicated to coronavirus research.