# Cluster Overview
## DeepOps & Operating System

The Apex cluster is a GPU-accelerated cluster that supports AI training and inference workloads. It is currently based on DeepOps 21.06 and is a hybrid cluster with both the Slurm and Kubernetes schedulers installed. Slurm handles batch and user training jobs, while Kubernetes primarily serves interactive and inference services. Distributed training on Kubernetes and dynamic cluster management across both schedulers are still work in progress.
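Since Slurm handles batch training jobs, the typical user workflow is to submit an `sbatch` script. A minimal sketch — the partition name, job script contents, and GPU count below are illustrative assumptions, not this cluster's actual configuration; check `sinfo` for the real partitions:

```shell
#!/bin/bash
#SBATCH --job-name=train-example   # hypothetical example job
#SBATCH --partition=batch          # assumed partition name; verify with `sinfo`
#SBATCH --nodes=1
#SBATCH --gres=gpu:4               # request 4 of a DGX A100 node's 8 GPUs
#SBATCH --time=04:00:00

# Placeholder training command; replace with your own entry point.
srun python train.py
```

Submit with `sbatch train.sh` and monitor with `squeue -u $USER`.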
## Management Cluster

The management cluster (`archon`) consists of 3 HPE ProLiant DL385 Gen10 systems, each with 2x AMD EPYC 7452 (32-core) CPUs and 512 GiB of memory. The nodes are labeled `control-plane` or `master` in Kubernetes. They primarily serve the cluster's control-plane workloads, including the Slurm head nodes, the Kubernetes control plane (kube-scheduler, kube-controller-manager), the etcd cluster, and FreeIPA/Dex for identity management.
## Compute Cluster

The compute cluster (`prism`) consists of 6 DGX A100 systems, each with 2x AMD EPYC 7742 (64-core) CPUs, 1 TiB of memory, and 8x NVIDIA A100-SXM4-40GB GPUs. These nodes serve as both Slurm compute nodes and Kubernetes worker nodes for AI training and inference workloads.
## Network

The systems are connected to two networks. The primary & external network (`odin`) is directly attached to a collapsed core (spine/leaf/border) of 2x Mellanox SN2700 100 GbE switches. Links from the compute nodes are aggregated using Multi-Chassis Link Aggregation (MLAG) with LACP. The internal data network is connected to 2x Mellanox QM8700 200 Gbps HDR InfiniBand switches.
## Storage

Cluster storage is provided by 2x DDN A3I AI400X appliances (4 controller units) running the EXAScaler 5 (Lustre-based) parallel filesystem. The storage is divided into multiple classes/filesystems with a usable capacity of 434 TiB NVMe and 1.9 PiB HDD, and it is accessible only through the cluster's internal data network. Three filesystems are currently provisioned:

- `/lustre/scratch` (`ddn-lustre-scratch`): NVMe-backed fast storage for AI batch training and I/O-intensive tasks
- `/lustre/ai` (`ddn-lustre-home`): HDD-backed slow storage for data retention
- `/lustre/testfs`: HDD-backed storage for system experiments and validation (admin-only)
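Because these are Lustre filesystems, usage and file layout can be inspected with the standard `lfs` client commands. A sketch, to be run on a node with the Lustre client mounted (the per-user directory path and stripe count are example values, not site policy):

```shell
# Show per-user quota/usage on the scratch filesystem.
lfs quota -u "$USER" /lustre/scratch

# Inspect the stripe layout of a directory; striping large files
# across more OSTs can help I/O-intensive training workloads.
lfs getstripe /lustre/scratch/"$USER"

# Stripe new files created in this directory across 4 OSTs (example value).
lfs setstripe -c 4 /lustre/scratch/"$USER"/dataset
```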
## Component Versions

| Component | Version | Check command | Notes |
|---|---|---|---|
| DeepOps | 21.06 | - | |
| Slurm | 20.11.3 | `sinfo -V` | |
| Kubernetes | 1.21.1 | `kubectl version` | Upgraded from the DeepOps default (1.19.9) |
| DGX OS | 5.0.5 | `cat /etc/dgx-release` | |
| Ubuntu | 20.04.2 LTS | `cat /etc/lsb-release` | |
| Linux kernel | 5.4.0-73-generic | `uname -a` | Upgraded from the DGX OS default (5.4.0-72-generic) |
| Mellanox OFED | 5.1-2.6.2.0 | `ofed_info` | Needs upgrade to 5.3-1.0.5.0 for GPUDirect Storage |
| Lustre client | 2.12.6-ddn3-1 | `lfs --version` | |
| NVIDIA driver | 470.42.01 | `nvidia-smi` | Upgraded from the DGX OS default (450.119.04) |
| CUDA Toolkit | 11.4 | `nvidia-smi` | Upgraded from the DGX OS default (11.0) |
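The check commands above can be combined into a small script that reports whichever components are present on the current host; tools that are not installed are simply skipped. This is a sketch, not an official DeepOps utility:

```shell
#!/bin/sh
# Report component versions from the table above, skipping absent tools.
check() { command -v "$1" >/dev/null 2>&1; }

echo "kernel: $(uname -r)"
check sinfo      && echo "slurm: $(sinfo -V)"
check kubectl    && echo "kubernetes: $(kubectl version --client --short 2>/dev/null)"
check ofed_info  && echo "ofed: $(ofed_info -s 2>/dev/null)"
check lfs        && echo "lustre client: $(lfs --version 2>/dev/null)"
check nvidia-smi && echo "nvidia driver: $(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null)"
if [ -f /etc/dgx-release ]; then cat /etc/dgx-release; fi
```

Run it on a compute node to verify the table after upgrades; the `kernel:` line is always printed, while the remaining lines depend on the installed tools.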