sinfo
To “see” the cluster, ssh to apex-login.cmkl.ac.th and run the sinfo command:
cmkladmin@archon-2:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up 7-00:00:00 1 drain prism-2
batch* up 7-00:00:00 4 mix prism-[1,3-5]
batch* up 7-00:00:00 1 idle prism-6
There are 6 nodes on this system. The TIMELIMIT value 7-00:00:00 indicates that the maximum execution time for a job is 7 days. When a node is busy, its state changes from idle to mix or alloc:
cmkladmin@archon-2:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up 7-00:00:00 1 alloc prism-1
batch* up 7-00:00:00 1 mix prism-2
batch* up 7-00:00:00 2 idle prism-[3-4]
batch* up 7-00:00:00 1 drain* prism-5
batch* up 7-00:00:00 1 drain prism-6
Nodes in the mix state have some of their CPUs allocated while others remain available. Nodes in the drain state are not available for job scheduling, usually because they are under maintenance or are allocated to a Kubernetes scheduler in a hybrid-cluster environment. To check the reason, use sinfo -R.
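As a quick sketch, the following standard sinfo options list the drain reasons and give a per-node long view (the exact output depends on the cluster configuration):
# show the reason each node is down, drained, or failing
cmkladmin@archon-2:~$ sinfo -R
# node-oriented long listing with CPU, memory, and state per node
cmkladmin@archon-2:~$ sinfo -N -l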
Check out the sinfo doc for more.
srun
The srun command runs a job on a compute node in real time. For example, to run the hostname command:
cmkladmin@archon-2:~$ srun hostname
prism-3
We can assign an A100 GPU to our job by including the --gres=gpu option. In the example below, we request 1 GPU for our interactive job and run the nvidia-smi command:
cmkladmin@archon-2:~$ srun --gres=gpu:1 --pty nvidia-smi
Mon May 29 15:27:19 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:47:00.0 Off | 0 |
| N/A 24C P0 55W / 400W| 2MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
To run an interactive bash job on a specific node, add these options:
cmkladmin@archon-2:~$ srun -w prism-2 --gres=gpu:2 --mem=16G --pty bash
cmkladmin@prism-2:~$ hostname
prism-2
cmkladmin@prism-2:~$ nvidia-smi
Mon May 29 16:06:52 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 21C P0 50W / 400W| 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-40GB On | 00000000:0F:00.0 Off | 0 |
| N/A 19C P0 51W / 400W| 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
cmkladmin@prism-2:~$ exit
exit
cmkladmin@archon-2:~$
After you finish using your interactive job, type exit or press Ctrl+D to end the session.
squeue
The squeue command shows the jobs currently running or waiting in the queue:
cmkladmin@archon-2:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
53091 batch training cmkladmin R 5:43 1 prism-2
sbatch
Here is an example script for running a batch job on Apex.
test-sbatch.sh
#!/bin/bash
#SBATCH -A <YOUR_USERNAME> # in this case cmkladmin
#SBATCH --job-name=test-sbatch
hostname
sleep 30
To run this script, use the sbatch command.
# submit job
cmkladmin@archon-2:~$ sbatch test-sbatch.sh
Submitted batch job 53093
# check our job status
cmkladmin@archon-2:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
53093 batch test-sba cmkladmi R 0:02 1 prism-3
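Since this script does not set -o or -e, Slurm writes the job's output to slurm-<JOBID>.out in the submit directory by default. For this illustrative run, the file would hold the compute node's hostname:
# default output file name pattern is slurm-<jobid>.out
cmkladmin@archon-2:~$ cat slurm-53093.out
prism-3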
When using the system, there may be more than one job or user running on the cluster. The squeue command accepts options to display only your own jobs:
cmkladmin@archon-2:~$ squeue -l -u $USER
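You can also customize the columns with the standard -o/--format option; the format string below is just an illustration that widens the NAME column so long job names are not truncated:
cmkladmin@archon-2:~$ squeue -u $USER -o "%.10i %.9P %.30j %.8u %.2t %.10M %.6D %R"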
scancel
You can use the scancel command to cancel your job.
# We need your job id (JOBID) to indicate the job you want to cancel
cmkladmin@archon-2:~$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
53093 batch test-sca cmkladmi R 0:25 1 prism-3
# ^
# |
# The job id is 53093, to cancel it use this command
cmkladmin@archon-2:~$ scancel 53093
# check the result
cmkladmin@archon-2:~$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
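To cancel all of your own jobs at once, scancel also accepts a user filter:
# cancel every job belonging to the current user
cmkladmin@archon-2:~$ scancel -u $USER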
Run an interactive job (bash) with 1 A100 GPU for a maximum of 10 minutes
srun --gres=gpu:1 --time 00:10:00 --job-name <job-name> --pty bash
For more options, please see the srun doc.
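As a variant (the resource values here are only examples), you can also pin CPU cores and memory to the interactive session:
srun --gres=gpu:1 --cpus-per-task=4 --mem=16G --time 00:10:00 --job-name <job-name> --pty bash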
Run a batch job with 1 A100 GPU for a maximum of 10 minutes
#!/bin/bash
#SBATCH -A <username>
#SBATCH --job-name=<job-name>
#SBATCH -N 1
#SBATCH --gres=gpu:1
#SBATCH -t 00:10:00
#SBATCH -o out_%j.txt
#SBATCH -e err_%j.txt
source venv/bin/activate
python3 ./main.py
For more options, please see the sbatch doc.
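As a usage sketch (the script file name gpu-job.sh below is hypothetical), submit the script with sbatch and follow its output file, which this script writes to out_<JOBID>.txt:
# submit the batch script (hypothetical file name)
cmkladmin@archon-2:~$ sbatch gpu-job.sh
# follow the job's stdout; replace <JOBID> with the id printed by sbatch
cmkladmin@archon-2:~$ tail -f out_<JOBID>.txt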