# Cluster Overview
## DeepOps & Operating System

The Apex cluster is a GPU-accelerated cluster that supports AI training and inference workloads. It is currently based on DeepOps 21.06 and is a hybrid cluster with both the Slurm and Kubernetes schedulers installed. Slurm handles batch and user training jobs, while Kubernetes primarily serves interactive and inference services. Distributed training on Kubernetes and dynamic cluster management across both schedulers are still work in progress.
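Since Slurm handles batch training jobs, the typical user workflow is to submit an `sbatch` script. A minimal sketch — the partition name, job script contents, and GPU count below are illustrative assumptions, not this cluster's actual configuration; check `sinfo` for the real partitions:

```shell
#!/bin/bash
#SBATCH --job-name=train-example   # hypothetical example job
#SBATCH --partition=batch          # assumed partition name; verify with `sinfo`
#SBATCH --nodes=1
#SBATCH --gres=gpu:4               # request 4 of a DGX A100 node's 8 GPUs
#SBATCH --time=04:00:00

# Placeholder training command; replace with your own entry point.
srun python train.py
```

Submit with `sbatch train.sh` and monitor with `squeue -u $USER`.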
## Management Cluster

The management cluster (`archon`) consists of 3 HPE ProLiant DL385 Gen10 systems, each with 2x AMD EPYC 7452 (32-core) CPUs and 512 GiB of memory. The nodes are labeled `control-plane` or `master` in Kubernetes. They primarily serve the cluster's control-plane workloads, including the Slurm head nodes, the Kubernetes control plane (kube-scheduler, kube-controller-manager), the etcd cluster, and FreeIPA/Dex for identity management.
## Compute Cluster

The compute cluster (`prism`) consists of 6 DGX A100 systems, each with 2x AMD EPYC 7742 (64-core) CPUs, 1 TiB of memory, and 8x NVIDIA A100-SXM4-40GB GPUs. These nodes serve as both Slurm compute nodes and Kubernetes worker nodes for AI training and inference workloads.
## Network

The systems are connected to two networks. The primary & external network (`odin`) is directly attached to a collapsed core (spine/leaf/border) of 2x Mellanox SN2700 100 GbE switches. Links from the compute nodes are aggregated using Multi-Chassis Link Aggregation (MLAG) with LACP. The internal data network is connected to 2x Mellanox QM8700 200 Gbps HDR InfiniBand switches.
## Storage

Cluster storage is provided by 2x DDN A3I AI400X appliances (4 controller units) running the EXAScaler 5 (Lustre-based) parallel filesystem. The storage is divided into multiple classes/filesystems with a usable capacity of 434 TiB NVMe and 1.9 PiB HDD, and it is accessible only through the cluster's internal data network. Three filesystems are currently provisioned:

- `/lustre/scratch` (`ddn-lustre-scratch`): NVMe-backed fast storage for AI batch training and I/O-intensive tasks
- `/lustre/ai` (`ddn-lustre-home`): HDD-backed slow storage for data retention
- `/lustre/testfs`: HDD-backed storage for system experiments and validation (admin-only)
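Because these are Lustre filesystems, usage and file layout can be inspected with the standard `lfs` client commands. A sketch, to be run on a node with the Lustre client mounted (the per-user directory path and stripe count are example values, not site policy):

```shell
# Show per-user quota/usage on the scratch filesystem.
lfs quota -u "$USER" /lustre/scratch

# Inspect the stripe layout of a directory; striping large files
# across more OSTs can help I/O-intensive training workloads.
lfs getstripe /lustre/scratch/"$USER"

# Stripe new files created in this directory across 4 OSTs (example value).
lfs setstripe -c 4 /lustre/scratch/"$USER"/dataset
```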
## Component Versions

| Component | Version | Check command | Notes |
|---|---|---|---|
| DeepOps | 21.06 | - | |
| Slurm | 20.11.3 | `sinfo -V` | |
| Kubernetes | 1.21.1 | `kubectl version` | Upgraded from the DeepOps default (1.19.9) |
| DGX OS | 5.0.5 | `cat /etc/dgx-release` | |
| Ubuntu | 20.04.2 LTS | `cat /etc/lsb-release` | |
| Linux kernel | 5.4.0-73-generic | `uname -a` | Upgraded from the DGX OS default (5.4.0-72-generic) |
| Mellanox OFED | 5.1-2.6.2.0 | `ofed_info` | Needs upgrade to 5.3-1.0.5.0 for GPUDirect Storage |
| Lustre client | 2.12.6-ddn3-1 | `lfs --version` | |
| NVIDIA driver | 470.42.01 | `nvidia-smi` | Upgraded from the DGX OS default (450.119.04) |
| CUDA Toolkit | 11.4 | `nvidia-smi` | Upgraded from the DGX OS default (11.0) |
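The check commands above can be combined into a small script that reports whichever components are present on the current host; tools that are not installed are simply skipped. This is a sketch, not an official DeepOps utility:

```shell
#!/bin/sh
# Report component versions from the table above, skipping absent tools.
check() { command -v "$1" >/dev/null 2>&1; }

echo "kernel: $(uname -r)"
check sinfo      && echo "slurm: $(sinfo -V)"
check kubectl    && echo "kubernetes: $(kubectl version --client --short 2>/dev/null)"
check ofed_info  && echo "ofed: $(ofed_info -s 2>/dev/null)"
check lfs        && echo "lustre client: $(lfs --version 2>/dev/null)"
check nvidia-smi && echo "nvidia driver: $(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null)"
if [ -f /etc/dgx-release ]; then cat /etc/dgx-release; fi
```

Run it on a compute node to verify the table after upgrades; the `kernel:` line is always printed, while the remaining lines depend on the installed tools.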