
Kubernetes Usage Guide

Introduction#

This guide assumes that you are working in a Linux-based environment and have obtained an access token from the administrator.

During the trial period, please fill out the Apex Trial Request Form to get access, and indicate that you need Kubernetes access.

Accessing Kubernetes Cluster#

You can access the cluster by using the kubectl client provided on apex-login. Alternatively, you can install kubectl on your local machine and use the Kubernetes config described below. Please refer to the official Kubernetes guide for installing the kubectl client.
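
If you install kubectl locally, the sketch below shows one way to do it on a Linux x86_64 machine; it downloads a client matching the cluster version (v1.18.9), and the install location is only an example.

curl -LO https://dl.k8s.io/release/v1.18.9/bin/linux/amd64/kubectl
chmod +x kubectl
sudo mv kubectl /usr/local/bin/kubectl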

Login#

Users should obtain the client access token provided by the administrator. Due to an existing cluster limitation, we have to manually validate your credentials before provisioning your access token. Please contact apex@cmkl.ac.th to request your log-in token.

Note that we DO NOT keep your client access key. If you lose your access key, you will have to go through the validation process again to acquire a new token.

You can use the following ~/.kube/config file as your kubeconfig template. Make sure to replace each placeholder comment with your assigned namespace, username, and Base64-encoded client key and certificate.

apiVersion: v1
clusters:
- cluster:
    # Apex cluster-generated CA
    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUN5RENDQWJDZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJeE1EUXdPVEE1TURRMU5Gb1hEVE14TURRd056QTVNRFExTkZvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBS0F3CmJpRHRRUDk4NTZ5WDZvSEdTZVNXdjNDd3c3YkdDYS9uVGlJeXpNbVJoTUhrOHFKdjFXVFFLbDJRNDVRSm02Q0MKK1Iwa2w4bE5WeE1KeEZiVzU3NnhzV1FldXFza1ZTNG9JVS9HSTBnOCtnRFd4NllzaFptTUU1S0IvNzY5VGVrQgppUGY1ZEhYUGh1MVlQZ1BOL1NUcDRUMEFUbXpwbEdNT3kzVk9yMnFrL1hUOHI2c3JBRFQ1VWg4dHJqZlZIVytMCnZoQTNiZERuWll5eFcwaFFTRUFneHRubTNHdWN1WE1pSDNDWkw3QitaQXdRUDlHSFdNTy93Z01JUFhXYlExV3cKb3QvK3dBTERZNzBSWHR3ZXh5VTN1UkxwMG1ObW1oUU5Bd1JSS29BV1JMVHVKMDVYSXhsQUFMVzVBM2RkK09seApLLzM0U0dHNEozcVZ3ckNwalFFQ0F3RUFBYU1qTUNFd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0RRWUpLb1pJaHZjTkFRRUxCUUFEZ2dFQkFDZlg1aXZTZnRHWkJOQmtXdlpQa09zWVNOeTkKMUgwSkViVVB2RWswSHVBSDNSK3MyalZJRE1BVjlpYWN4S3RadDVFYnYzc3pDZkNSeXJPVlI4aFFzQ1RHZXBwdwpwc0NvZENPYWNKMUxWUVh6UTAvQkxleTZpYmVObnA4dHNZa1BGaTYwcGw2MTJvSS9ubmc2UERtTEFJaHVKMm9XCmxRZ0pwNzRYZGcxWDNSM24xZUJCL0c4VHlIQkk2U0Q3YjBJK1JRRG9Dd0FmZVFWRXNVUVZEOVRCUnA4QXREelQKcE95VkhSODk0dlV6bEYveEpIZzF0ZDZ4V0U4MHppKzZxbXdnNmRqNEFReVhNOUhHeURMRjIrVGxXNVpwSEp6dgp3UTZWOFVadGZoT0FFTUlYWERvMHBmVngzUEk1aTF3bGxIb1FyUm5oQjRZSHluTW96aWxUL1BvaWdZbz0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
    server: https://127.0.0.1:6443
  name: apex
contexts:
- context:
    cluster: apex
    namespace: #Put your assigned namespace here
    user: #Put your username here
  name: apex-default
current-context: apex-default
kind: Config
preferences: {}
users:
- name: #Put your username here
  user:
    client-certificate-data: #Put your Base64 client certificate here
    client-key-data: #Put your Base64 client access key here

If you are not using kubectl from the apex-login host, you have to forward the Kubernetes API port to your local machine, because the cluster's internal certificates are only valid for API access from inside the cluster and from localhost (127.0.0.1).

ssh -L 6443:127.0.0.1:6443 username@apex-login.cmkl.ac.th
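
If you prefer not to keep an interactive SSH session open, the same forwarding can run in the background; this variant is just a sketch using standard OpenSSH options (-f to background the connection, -N to skip running a remote command).

ssh -fN -L 6443:127.0.0.1:6443 username@apex-login.cmkl.ac.th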

Validate that you can access the cluster.

kubectl cluster-info

You should get a result similar to the following output.

Kubernetes master is running at https://127.0.0.1:6443
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
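
You can also check that your token grants the expected permissions in your assigned namespace; for example, kubectl auth can-i should answer yes for basic operations such as creating pods.

kubectl auth can-i create pods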

Simple Commands#

Get a list of the nodes in the cluster:

kubectl get nodes

Our cluster currently consists of 3 master nodes and 6 worker nodes. Only nodes marked as Ready and not SchedulingDisabled are available for Kubernetes jobs.

NAME       STATUS                     ROLES    AGE   VERSION
archon-1   Ready                      master   54d   v1.18.9
archon-2   Ready                      master   54d   v1.18.9
archon-3   Ready                      master   54d   v1.18.9
prism-1    Ready,SchedulingDisabled   <none>   54d   v1.18.9
prism-2    Ready,SchedulingDisabled   <none>   54d   v1.18.9
prism-3    Ready,SchedulingDisabled   <none>   54d   v1.18.9
prism-4    Ready                      <none>   54d   v1.18.9
prism-5    Ready                      <none>   54d   v1.18.9
prism-6    Ready                      <none>   54d   v1.18.9
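
If you only want the nodes that can currently accept work, one simple filter (a sketch using standard shell tools rather than a kubectl-native selector) is:

kubectl get nodes --no-headers | grep -v SchedulingDisabled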

Get a list of running pods in your current namespace:

kubectl get pods
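
The same command accepts the usual kubectl output options; for example, -o wide adds the node each pod is scheduled on, and --watch keeps the listing open and streams updates.

kubectl get pods -o wide
kubectl get pods --watch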

Running an Interactive Session#

For an interactive one-off execution, you can use kubectl run to execute a container.

kubectl run -i --rm --tty pytest --image=registry.apex.cmkl.ac.th/nvidia/pytorch:21.05-py3 --restart=Never -- bash

The above command will run a pod named pytest in your namespace using nvidia/pytorch:21.05-py3 from the local Apex registry. The operation could take a while to complete. Upon success, you should get a bash prompt inside your container. Note that the --rm option will automatically remove the pod after you exit.

If you don't see a command prompt, try pressing enter.
root@pytest:/workspace# nvidia-smi
Thu Jun  3 11:31:14 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   28C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   28C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   29C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   29C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   33C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   31C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   32C    P0    56W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   33C    P0    59W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@pytest:/workspace# exit
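
While inside such an interactive session, you can also quickly confirm that PyTorch itself sees the GPUs before exiting; a one-line check using the PyTorch API shipped in the image:

root@pytest:/workspace# python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

This should print True followed by the number of visible GPUs (8 in the session above).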

Running a Job#

Running a Simple PyTorch Job#

  1. Run the job.

    A simple PyTorch Job can be run via kubectl by creating the following yaml configuration file:

    pytorch-job.yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: pytorch-job
    spec:
      backoffLimit: 5
      template:
        spec:
          containers:
          - name: pytorch-container
            image: registry.apex.cmkl.ac.th/nvidia/pytorch:21.05-py3
            command: ["/bin/sh"]
            args: ["-c", "python /workspace/examples/upstream/mnist/main.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
          restartPolicy: Never

    You can then use the following command to run the job.

    kubectl create -f pytorch-job.yaml

    Note that Kubernetes relies on this YAML configuration, which you can track and version control in order to provide consistent run parameters for your experiments.

    The above job will execute a simple MNIST training run using PyTorch. Observe that:

    • We are pulling a PyTorch container from the local Apex registry
    • The container is limited to a single GPU resource
    • The Kubernetes object we are creating is a Job, which spawns a pod and runs it to completion a single time
  2. Check on the job.

    kubectl get jobs
  3. Monitor the pod that's spawned from the job.

    kubectl get pods

    Follow the logs for the pod:

    kubectl logs -f pytorch-job-<pod_id>
  4. Delete the job (and the corresponding pod).

    kubectl delete job pytorch-job
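
If you want a command that blocks until the job finishes (handy in scripts), kubectl wait can watch for the Complete condition; the timeout below is only an example value.

kubectl wait --for=condition=complete job/pytorch-job --timeout=30m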

Using NGC Containers with Kubernetes and Launching Jobs#

NVIDIA GPU Cloud (NGC) manages a catalog of fully integrated and optimized DL framework containers that take full advantage of NVIDIA GPUs in both single and multi-GPU configurations. They include NVIDIA CUDA® Toolkit, DIGITS workflow, and the following DL frameworks: NVCaffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), MXNet, PyTorch, TensorFlow, Theano, and Torch. These framework containers are delivered ready-to-run, including all necessary dependencies such as the CUDA runtime and NVIDIA libraries.

Note that we already have a local registry mirroring NGC on Apex available at https://registry.apex.cmkl.ac.th/. You may also use NGC if the required image is not available on the local registry.

To access the NGC container registry via Kubernetes, add a secret that Kubernetes will use when pulling container images from NGC.

  1. Generate an NGC API Key, which will be used for the Kubernetes secret.

  2. Using the NGC API Key, create a Kubernetes secret so that Kubernetes can pull container images from the NGC registry. Create the secret by running the following command (substitute your registered email account and NGC API Key in the appropriate locations).

    kubectl create secret docker-registry nvcr.dgxkey --docker-server=nvcr.io --docker-username=\$oauthtoken --docker-email=<email> --docker-password=<NGC API Key>
  3. Check that the secret exists.

    kubectl get secrets
  4. You can now use the secret to pull custom NGC images by using the imagePullSecrets attribute. For example:

    pytorch-job.yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: pytorch-ngc-job
    spec:
      backoffLimit: 5
      template:
        spec:
          imagePullSecrets:
          - name: nvcr.dgxkey
          containers:
          - name: pytorch-container
            image: nvcr.io/nvidia/pytorch:21.05-py3
            command: ["/bin/sh"]
            args: ["-c", "python /workspace/examples/upstream/mnist/main.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
          restartPolicy: Never
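
    You can then launch and inspect this job the same way as before; kubectl describe is a convenient way to confirm that the image was pulled from nvcr.io using the secret (the pod name suffix will differ in your run).

    kubectl create -f pytorch-job.yaml
    kubectl get pods
    kubectl describe pod pytorch-ngc-job-<pod_id>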

Acknowledgement: The content of this chapter has been adapted from the original DeepOps documentation.