Slurm Usage Guide

Introduction#

Slurm is an open-source job scheduling system for Linux clusters, most frequently used for high-performance computing (HPC) applications. This guide covers the basics of using Slurm as a user. For more information, the Slurm documentation is a good place to start.

After Slurm is deployed on a cluster, a slurmd daemon runs on each compute node. Users do not log in to the compute nodes directly to do their work. Instead, they execute Slurm commands (e.g., srun, sinfo, scancel, scontrol) from the login node. These commands communicate with the slurmd daemons on each host to perform work.

Simple Commands#

Cluster state with sinfo#

To "see" the cluster, ssh to apex-login.cmkl.ac.th and run the sinfo command:

$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up 7-00:00:00 4 idle prism-[1-4]
batch* up 7-00:00:00 2 drain prism-[5-6]

Slurm nodes are grouped into partitions. In this case, the cluster has a single partition named batch, which is also the default partition, indicated by the * symbol. There are 6 nodes in this partition: 4 in the idle state and 2 in the drain state. The time limit 7-00:00:00 indicates that the maximum execution time for a job is 7 days. If a node is busy, its state will change from idle to alloc:

$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up 7-00:00:00 1 alloc prism-1
batch* up 7-00:00:00 1 mixed prism-2
batch* up 7-00:00:00 2 idle prism-[3-4]
batch* up 7-00:00:00 1 drain* prism-5
batch* up 7-00:00:00 1 drain prism-6

Nodes in the mixed state have some of their CPUs allocated while others are still available. Nodes in the drain state are not available for job scheduling, usually because they are under maintenance or allocated to the Kubernetes scheduler in a hybrid-cluster environment.

The sinfo command can be used to output a lot more information about the cluster. Check out the sinfo doc for more.
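For example, a node-oriented long listing shows the state and resources of each individual node, and the -o option selects specific fields (here node names, generic resources, and CPU counts). This is a sketch; the exact columns available depend on your Slurm version:

$ sinfo -N -l
$ sinfo -o "%N %G %C"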

Running a job with srun#

To run a job, use the srun command:

$ srun hostname
prism-2

What happened here? With the srun command we instructed Slurm to find the first available node and run hostname on it, and the result was returned to our command prompt. It is just as easy to use srun to run a different command, such as a Python script or a container.
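For example, assuming a Python interpreter is available on the compute node and you have a script named train.py in your current directory (a hypothetical file used here only for illustration), the following runs it on the first available node and prints its output to your terminal:

$ srun python train.py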

Most of the time, scheduling a full node is not necessary, and it's better to request only a portion of its GPUs:

$ srun --gres=gpu:2 env | grep CUDA
CUDA_VISIBLE_DEVICES=0,1

Or, conversely, sometimes it's necessary to run multiple tasks, which may be spread across nodes:

$ srun --ntasks 2 -l hostname
0: prism-2
1: prism-3
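Note that --ntasks launches multiple copies of the command, which may land on different nodes. To give a single task more CPU cores instead, --cpus-per-task can be used; as a sketch, the following reserves 4 cores for one process (the core count is illustrative):

$ srun --cpus-per-task=4 hostname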

Running an interactive job#

Especially when developing and experimenting, it's helpful to run an interactive job, which requests a resource and provides a command prompt as an interface to it:

archon-2:~$ srun --pty /bin/bash
prism-1:~$ hostname
prism-1
prism-1:~$ exit

In interactive mode, the resource remains reserved until the prompt is exited (as shown above), and commands can be run in succession.

Note: before starting an interactive session with srun, it may be helpful to create a session on the login node with a tool like tmux or screen. This prevents a user from losing interactive jobs if there is a network outage or the terminal is closed.
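For example, with tmux (the session name slurm is arbitrary):

$ tmux new -s slurm
$ srun --pty /bin/bash
$ # if the connection drops, log back in to the login node and reattach:
$ tmux attach -t slurm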

More Advanced Use#

Run a batch job#

While the srun command blocks the terminal until the job completes, sbatch queues a job for execution once resources become available in the cluster. A batch job also lets you queue up several jobs that run as nodes become available. It is therefore good practice to encapsulate everything that needs to be run into a script and submit it with sbatch rather than srun:

$ cat script.sh
#!/bin/bash
/bin/hostname
sleep 30
$ sbatch script.sh

You can observe your output in slurm-xxxxx.out, where xxxxx is your job ID.
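Resource requests can also be embedded in the script itself using #SBATCH directives, so the same submission works without extra command-line flags. A minimal sketch (the job name and resource values are examples only):

$ cat script.sh
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --gres=gpu:1
#SBATCH --mem=8G
#SBATCH --time=01:00:00
/bin/hostname
sleep 30
$ sbatch script.sh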

Observing running jobs with squeue#

To see which jobs are running in the cluster, use the squeue command:

$ squeue -a -l
Wed Jul 14 09:08:18 2021
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
9 batch bash user01 RUNNING 5:43 7-00:00:00 1 prism-1

To see just the running jobs for a particular user USERNAME:

$ squeue -l -u USERNAME

Cancel a job with scancel#

To cancel a job, use the squeue command to look up the JOBID and the scancel command to cancel it:

$ squeue
$ scancel JOBID
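To cancel all of your own jobs at once, scancel also accepts a user filter:

$ scancel -u USERNAME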

Resource & Time Limits#

The following command launches an interactive task (bash) for a job named gputest that requests 2 GPUs, 20 CPUs, and 20 GB of memory. It also extends the job time limit (--time) to 1 day from the default of 1 hour. The command uses a container image from the local registry (note the # used to separate the container registry server from the image path). It also mounts your home directory to /userspace instead of the default /root directory.

srun --gres=gpu:2 -c 20 --mem 20G --job-name gputest --time 1-0 --container-image=registry.apex.cmkl.ac.th#nvidia/pytorch:21.05-py3 --no-container-mount-home --container-mounts=/home/yourusername:/userspace --pty bash
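Once the session starts, the granted allocation can be checked from inside it, for example by printing the visible GPUs and the CPU count (whether nproc reflects the allocated cores or the whole node depends on the cluster's cgroup configuration):

$ echo $CUDA_VISIBLE_DEVICES
$ nproc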

Pyxis Container#

Slurm tasks can be wrapped in containers using tools such as enroot, Singularity, and Pyxis. The following commands demonstrate the use of Pyxis, which integrates natively with Slurm.

A regular srun invocation runs the given command (here, grep) directly on a bare-metal compute node. Adding the --container-image flag runs the same command inside a container on the compute node instead.

$ srun grep PRETTY /etc/os-release
PRETTY_NAME="Ubuntu 20.04.2 LTS"
$ # run the same command, but now inside of a container
$ srun --container-image=centos grep PRETTY /etc/os-release
PRETTY_NAME="CentOS Linux 7 (Core)"

You can use pre-built images from NVIDIA NGC, CMKL's local registry, or your own Docker images from Docker Hub.

# Nvidia NGC (nvcr.io)
$ srun --mem 60G --container-image=nvcr.io/nvidia/pytorch:21.05-py3 python -c "import torch; print(torch.__version__)"
1.7.1
# CMKL's local registry (registry.apex.cmkl.ac.th)
$ srun --mem 60G --container-image=registry.apex.cmkl.ac.th#nvidia/pytorch:21.05-py3 python -c "import torch; print(torch.__version__)"
1.7.1
# Docker hub (no prefix)
$ srun --mem 60G --container-image=pytorch/pytorch python -c "import torch; print(torch.__version__)"
1.7.1

By default, Pyxis mounts your home directory to /root inside the container. You can disable this behavior with --no-container-mount-home and then selectively mount specific directories using --container-mounts. Additionally, the --container-workdir flag sets the container's working directory; in the example below it is set to /work1.

# list directory /work1 inside the container
srun --container-image=nvcr.io/nvidia/pytorch:21.05-py3 \
--no-container-mount-home \
--mem 60G \
--container-mounts=/home/yourusername/dir1:/work1 \
--container-workdir=/work1 ls /work1

You can also mount more than one directory using a comma (,) as a separator.

# list directory /work2 inside the container
srun --container-image=nvcr.io/nvidia/pytorch:21.05-py3 \
--no-container-mount-home \
--mem 60G \
--container-mounts=/home/ekapolc/work1:/work1,/home/ekapolc/work2:/work2 \
--container-workdir=/work1 ls /work2

More details can be found at https://github.com/NVIDIA/pyxis.

Multi-GPUs training#

In the following command, Slurm will request <NGPU> GPUs on each of the <NNODE> nodes, for a total of <NGPU> * <NNODE> GPUs. On each node, the process train.py will be launched <NTASKS> times; <NTASKS> should be equal to <NGPU>.

srun --gres=gpu:<NGPU> --ntasks-per-node=<NTASKS> --cpus-per-task=<CPU cores per GPU> --nodes=<NNODE> \
--container-image=nvcr.io/nvidia/pytorch:21.05-py3 \
--no-container-mount-home \
--mem 60G \
--container-mounts=/home/ekapolc/work1:/work1,/home/ekapolc/work2:/work2 \
--container-workdir=/work1 python train.py
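As a concrete sketch, the following requests 4 GPUs on each of 2 nodes (8 GPUs in total), launching train.py 4 times per node; the CPU count per task is illustrative, and the image and mount paths follow the earlier examples:

srun --gres=gpu:4 --ntasks-per-node=4 --cpus-per-task=8 --nodes=2 \
--container-image=nvcr.io/nvidia/pytorch:21.05-py3 \
--no-container-mount-home \
--mem 60G \
--container-mounts=/home/ekapolc/work1:/work1,/home/ekapolc/work2:/work2 \
--container-workdir=/work1 python train.py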

Additional Resources#


Acknowledgement: The content of this chapter has been adapted from the original DeepOps documentation.