Cluster Overview

DeepOps & Operating System#

The Apex cluster is an accelerated computing cluster built to support AI training and inference workloads. It is currently based on DeepOps 21.06 and runs as a hybrid cluster with both the Slurm and Kubernetes schedulers installed. Slurm is used for batch and user training jobs, while Kubernetes primarily supports interactive and inference services. Distributed training for users on Kubernetes, and dynamic cluster management spanning both schedulers, are still works in progress.
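
On the batch side, work is submitted to Slurm as a job script. Below is a minimal sketch of such a script; the partition name `batch` is an assumption, so check `sinfo` for the partitions actually configured here.

```bash
#!/bin/bash
# Minimal Slurm batch-training sketch. The partition name is an
# assumption; verify with `sinfo` before submitting.
#SBATCH --job-name=train-example
#SBATCH --partition=batch          # hypothetical partition name
#SBATCH --nodes=1
#SBATCH --gres=gpu:4               # request 4 of a node's 8 A100 GPUs
#SBATCH --time=04:00:00

srun python train.py
```

Submit with `sbatch train.sbatch` and track the job with `squeue -u $USER`.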

Management Cluster#

The management cluster (archon) consists of 3 HPE ProLiant DL385 Gen10 systems, each with 2x AMD EPYC 7452 (32-core) CPUs and 512 GiB of memory. These nodes are labeled control-plane/master in Kubernetes and primarily serve control-plane workloads, including the Slurm head nodes, the Kubernetes control plane (kube-scheduler, kube-controller-manager), the etcd cluster, and FreeIPA/Dex for identity management.
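
With `kubectl` access, the control-plane role of the archon nodes can be confirmed through the standard upstream node labels; this is a generic sketch, not a cluster-specific command.

```bash
# List nodes carrying the legacy control-plane label (pre-1.20 style).
kubectl get nodes -l node-role.kubernetes.io/master

# Equivalent query using the newer label applied since Kubernetes 1.20.
kubectl get nodes -l node-role.kubernetes.io/control-plane
```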

Compute Cluster#

The compute cluster (prism) consists of 6 DGX A100 systems, each with 2x AMD EPYC 7742 (64-core) CPUs, 1 TiB of memory, and 8x NVIDIA A100-SXM4-40GB GPUs. These nodes act as Slurm compute nodes and Kubernetes worker nodes for AI training and inference workloads.
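
What a compute node exposes to the schedulers can be inspected with standard Slurm and NVIDIA tooling; the node name `prism-01` below is a hypothetical example.

```bash
# Show the CPUs, memory, and GRES (GPUs) that Slurm tracks for a node;
# the node name is a hypothetical example.
scontrol show node prism-01

# On the node itself, enumerate the eight A100 GPUs.
nvidia-smi -L
```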

Network#

The systems are connected to two different networks. The primary/external network (odin) is directly attached to a collapsed core of 2x Mellanox SN2700 100 GbE switches serving the spine/leaf/border roles. Links from the compute nodes are aggregated using Multi-Chassis Link Aggregation (MLAG) with LACP. The internal data network is built on 2x Mellanox QM8700 200 Gb/s HDR InfiniBand switches.
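
Link state on both fabrics can be verified from a compute node with standard tools; the bond interface name `bond0` is an assumption and may differ per node.

```bash
# HDR InfiniBand: show port state and rate (expect 200 Gb/sec links).
ibstat

# Ethernet side: inspect LACP aggregation status for the bonded interface.
cat /proc/net/bonding/bond0
```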

Storage#

Cluster storage is provided by 2x DDN A3I AI400X appliances (4 controller units) running the EXAScaler 5 (Lustre-based) parallel filesystem. The storage is divided into multiple classes/filesystems, with a usable capacity of 434 TiB of NVMe and 1.9 PiB of HDD, and is accessible only through the cluster's internal data network. There are currently 3 filesystems provisioned on the storage (basic usage checks are sketched after the list):

  • /lustre/scratch (ddn-lustre-scratch): NVMe-backed fast storage for AI batch training and I/O-intensive tasks
  • /lustre/ai (ddn-lustre-home): HDD-backed slow storage for data retention
  • /lustre/testfs: HDD-backed storage for system experiments and validation (admin-only)
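
Capacity and usage of the mounted filesystems can be checked with the standard Lustre client tools; the quota example assumes user quotas are enabled, which is not confirmed above.

```bash
# Per-filesystem capacity and usage as seen by the Lustre client.
lfs df -h /lustre/scratch /lustre/ai

# Per-user quota on the HDD-backed filesystem (assumes quotas are enabled).
lfs quota -u $USER /lustre/ai
```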

Component Versions#

| Component     | Version          | Check command      | Notes                                              |
| ------------- | ---------------- | ------------------ | -------------------------------------------------- |
| DeepOps       | 21.06            | -                  |                                                    |
| Slurm         | 20.11.3          | `sinfo -V`         |                                                    |
| Kubernetes    | 1.21.1           | `kubectl version`  | Upgraded from DeepOps v1.19.9                      |
| DGX OS        | 5.0.5            | `/etc/dgx-release` |                                                    |
| Ubuntu        | 20.04.2 LTS      | `/etc/lsb-release` |                                                    |
| Linux kernel  | 5.4.0-73-generic | `uname -a`         | Upgraded from DGX OS 5.4.0-72-generic              |
| Mellanox OFED | 5.1-2.6.2.0      | `ofed_info`        | Needs upgrade to 5.3-1.0.5.0 for GPUDirect Storage |
| Lustre client | 2.12.6-ddn3-1    | `lfs --version`    |                                                    |
| NVIDIA driver | 470.42.01        | `nvidia-smi`       | Upgraded from DGX OS v450.119.04                   |
| CUDA Toolkit  | 11.4             | `nvidia-smi`       | Upgraded from DGX OS v11.0                         |
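
The check commands above can be strung together into a quick version audit; this is a convenience sketch, with the file-based checks read via `cat` and summary flags substituted where they exist.

```bash
#!/bin/bash
# Quick version audit based on the check commands in the table above.
sinfo -V                    # Slurm
kubectl version --short    # Kubernetes (client and server)
cat /etc/dgx-release        # DGX OS
cat /etc/lsb-release        # Ubuntu release
uname -r                    # Linux kernel
ofed_info -s                # Mellanox OFED (summary form)
lfs --version               # Lustre client
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1  # NVIDIA driver
nvidia-smi | grep "CUDA Version"  # CUDA version reported by the driver
```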