
Kubernetes Usage Guide

Introduction#

This guide assumes that you are working in a Linux-based environment and have obtained an access token from the administrator.

During the trial period, please fill out the Apex Trial Request Form to get access, and indicate that you need Kubernetes access.

Accessing Kubernetes Cluster#

You can access the cluster by using the kubectl client provided on apex-login. Alternatively, you can install kubectl on your local machine and use the Kubernetes config described below. Please refer to the official Kubernetes guide for installing the kubectl client.
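
If you install kubectl locally, the sketch below shows one way to do it on a Linux x86_64 machine; it downloads a client matching the cluster version (v1.18.9), and the install location is only an example.

curl -LO https://dl.k8s.io/release/v1.18.9/bin/linux/amd64/kubectl
chmod +x kubectl
sudo mv kubectl /usr/local/bin/kubectl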

Login#

Users should obtain the client access token provided by the administrator. Due to an existing cluster limitation, we have to manually validate your credentials before provisioning your access token. Please contact apex@cmkl.ac.th to request your log-in token.

Note that we DO NOT keep your client access key. If you lose your access key, you will have to go through the validation process again to acquire a new token.

You can use the following ~/.kube/config file as your kubeconfig template. Make sure to replace each placeholder comment with your assigned namespace, username, and Base64-encoded client key and certificate.

apiVersion: v1
clusters:
- cluster:
    # Apex cluster-generated CA
    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUN5RENDQWJDZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJeE1EUXdPVEE1TURRMU5Gb1hEVE14TURRd056QTVNRFExTkZvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBS0F3CmJpRHRRUDk4NTZ5WDZvSEdTZVNXdjNDd3c3YkdDYS9uVGlJeXpNbVJoTUhrOHFKdjFXVFFLbDJRNDVRSm02Q0MKK1Iwa2w4bE5WeE1KeEZiVzU3NnhzV1FldXFza1ZTNG9JVS9HSTBnOCtnRFd4NllzaFptTUU1S0IvNzY5VGVrQgppUGY1ZEhYUGh1MVlQZ1BOL1NUcDRUMEFUbXpwbEdNT3kzVk9yMnFrL1hUOHI2c3JBRFQ1VWg4dHJqZlZIVytMCnZoQTNiZERuWll5eFcwaFFTRUFneHRubTNHdWN1WE1pSDNDWkw3QitaQXdRUDlHSFdNTy93Z01JUFhXYlExV3cKb3QvK3dBTERZNzBSWHR3ZXh5VTN1UkxwMG1ObW1oUU5Bd1JSS29BV1JMVHVKMDVYSXhsQUFMVzVBM2RkK09seApLLzM0U0dHNEozcVZ3ckNwalFFQ0F3RUFBYU1qTUNFd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0RRWUpLb1pJaHZjTkFRRUxCUUFEZ2dFQkFDZlg1aXZTZnRHWkJOQmtXdlpQa09zWVNOeTkKMUgwSkViVVB2RWswSHVBSDNSK3MyalZJRE1BVjlpYWN4S3RadDVFYnYzc3pDZkNSeXJPVlI4aFFzQ1RHZXBwdwpwc0NvZENPYWNKMUxWUVh6UTAvQkxleTZpYmVObnA4dHNZa1BGaTYwcGw2MTJvSS9ubmc2UERtTEFJaHVKMm9XCmxRZ0pwNzRYZGcxWDNSM24xZUJCL0c4VHlIQkk2U0Q3YjBJK1JRRG9Dd0FmZVFWRXNVUVZEOVRCUnA4QXREelQKcE95VkhSODk0dlV6bEYveEpIZzF0ZDZ4V0U4MHppKzZxbXdnNmRqNEFReVhNOUhHeURMRjIrVGxXNVpwSEp6dgp3UTZWOFVadGZoT0FFTUlYWERvMHBmVngzUEk1aTF3bGxIb1FyUm5oQjRZSHluTW96aWxUL1BvaWdZbz0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
    server: https://127.0.0.1:6443
  name: apex
contexts:
- context:
    cluster: apex
    namespace: #Put your assigned namespace here
    user: #Put your username here
  name: apex-default
current-context: apex-default
kind: Config
preferences: {}
users:
- name: #Put your username here
  user:
    client-certificate-data: #Put your Base64 client certificate here
    client-key-data: #Put your Base64 client access key here

If you are not using kubectl from the apex-login host, you have to forward the Kubernetes API port to your local machine, because the cluster's internal certificates are only valid for API access from inside the cluster and from localhost (127.0.0.1).

ssh -L 6443:127.0.0.1:6443 username@apex-login.cmkl.ac.th
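
If you prefer not to keep an interactive SSH session open, the same forwarding can run in the background; this variant is just a sketch using standard OpenSSH options (-f to background the connection, -N to skip running a remote command).

ssh -fN -L 6443:127.0.0.1:6443 username@apex-login.cmkl.ac.th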

Validate that you can access the cluster.

kubectl cluster-info

You should get a result similar to the following output.

Kubernetes master is running at https://127.0.0.1:6443
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
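
You can also check that your token grants the expected permissions in your assigned namespace; for example, kubectl auth can-i should answer yes for basic operations such as creating pods.

kubectl auth can-i create pods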

Simple Commands#

Get a list of the nodes in the cluster:

kubectl get nodes

Our cluster currently consists of 3 master nodes and 6 worker nodes. Only nodes marked as Ready and not SchedulingDisabled are available for Kubernetes jobs.

NAME       STATUS                     ROLES    AGE   VERSION
archon-1   Ready                      master   54d   v1.18.9
archon-2   Ready                      master   54d   v1.18.9
archon-3   Ready                      master   54d   v1.18.9
prism-1    Ready,SchedulingDisabled   <none>   54d   v1.18.9
prism-2    Ready,SchedulingDisabled   <none>   54d   v1.18.9
prism-3    Ready,SchedulingDisabled   <none>   54d   v1.18.9
prism-4    Ready                      <none>   54d   v1.18.9
prism-5    Ready                      <none>   54d   v1.18.9
prism-6    Ready                      <none>   54d   v1.18.9
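
If you only want the nodes that can currently accept work, one simple filter (a sketch using standard shell tools rather than a kubectl-native selector) is:

kubectl get nodes --no-headers | grep -v SchedulingDisabled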

Get a list of running pods in your current namespace:

kubectl get pods
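
The same command accepts the usual kubectl output options; for example, -o wide adds the node each pod is scheduled on, and --watch keeps the listing open and streams updates.

kubectl get pods -o wide
kubectl get pods --watch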

Running an Interactive Session#

For an interactive one-off execution, you can use kubectl run to execute a container.

kubectl run -i --rm --tty pytest --image=registry.apex.cmkl.ac.th/nvidia/pytorch:21.05-py3 --restart=Never -- bash

The above command will run a pod named pytest in your namespace using nvidia/pytorch:21.05-py3 from the local Apex registry. The operation could take a while to complete. Upon success, you should get a bash prompt inside your container. Note that the --rm option will automatically remove the pod after you exit.

If you don't see a command prompt, try pressing enter.
root@pytest:/workspace# nvidia-smi
Thu Jun  3 11:31:14 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   28C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   28C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   29C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   29C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   33C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   31C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   32C    P0    56W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   33C    P0    59W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@pytest:/workspace# exit
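
While inside such an interactive session, you can also quickly confirm that PyTorch itself sees the GPUs before exiting; a one-line check using the PyTorch API shipped in the image:

root@pytest:/workspace# python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

This should print True followed by the number of visible GPUs (8 in the session above).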

Running a Job#

Running a Simple PyTorch Job#

  1. Run the job.

    A simple PyTorch Job can be run via kubectl by creating the following yaml configuration file:

    pytorch-job.yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: pytorch-job
    spec:
      backoffLimit: 5
      template:
        spec:
          containers:
          - name: pytorch-container
            image: registry.apex.cmkl.ac.th/nvidia/pytorch:21.05-py3
            command: ["/bin/sh"]
            args: ["-c", "python /workspace/examples/upstream/mnist/main.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
          restartPolicy: Never

    You can then use the following command to run the job.

    kubectl create -f pytorch-job.yaml

    Note that Kubernetes relies on this YAML configuration, which you can track and version control in order to provide consistent run parameters for your experiments.

    The above job will execute a simple MNIST training run using PyTorch. Observe that:

    • We are pulling a PyTorch container from the local Apex registry
    • The container is limited to a single GPU resource
    • The Kubernetes object we are creating is a Job, which spawns a pod and runs it to completion a single time
  2. Check on the job.

    kubectl get jobs
  3. Monitor the pod that's spawned from the job.

    kubectl get pods

    Follow the logs for the pod:

    kubectl logs -f pytorch-job-<pod_id>
  4. Delete the job (and the corresponding pod).

    kubectl delete job pytorch-job
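
If you want a command that blocks until the job finishes (handy in scripts), kubectl wait can watch for the Complete condition; the timeout below is only an example value.

kubectl wait --for=condition=complete job/pytorch-job --timeout=30m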

Using NGC Containers with Kubernetes and Launching Jobs#

NVIDIA GPU Cloud (NGC) manages a catalog of fully integrated and optimized DL framework containers that take full advantage of NVIDIA GPUs in both single and multi-GPU configurations. They include NVIDIA CUDA® Toolkit, DIGITS workflow, and the following DL frameworks: NVCaffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), MXNet, PyTorch, TensorFlow, Theano, and Torch. These framework containers are delivered ready-to-run, including all necessary dependencies such as the CUDA runtime and NVIDIA libraries.

Note that we already have a local registry mirroring NGC on Apex available at https://registry.apex.cmkl.ac.th/. You may also use NGC if the required image is not available on the local registry.

To access the NGC container registry via Kubernetes, add a secret that Kubernetes will use when pulling container images from NGC.

  1. Generate an NGC API Key, which will be used for the Kubernetes secret.

  2. Using the NGC API Key, create a Kubernetes secret so that Kubernetes can pull container images from the NGC registry. Create the secret by running the following command (substitute your registered email account and NGC API Key in the appropriate locations).

    kubectl create secret docker-registry nvcr.dgxkey --docker-server=nvcr.io --docker-username=\$oauthtoken --docker-email=<email> --docker-password=<NGC API Key>
  3. Check that the secret exists.

    kubectl get secrets
  4. You can now use the secret to pull custom NGC images by using the imagePullSecrets attribute. For example:

    pytorch-job.yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: pytorch-ngc-job
    spec:
      backoffLimit: 5
      template:
        spec:
          imagePullSecrets:
          - name: nvcr.dgxkey
          containers:
          - name: pytorch-container
            image: nvcr.io/nvidia/pytorch:21.05-py3
            command: ["/bin/sh"]
            args: ["-c", "python /workspace/examples/upstream/mnist/main.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
          restartPolicy: Never
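
    You can then launch and inspect this job the same way as before; kubectl describe is a convenient way to confirm that the image was pulled from nvcr.io using the secret (the pod name suffix will differ in your run).

    kubectl create -f pytorch-job.yaml
    kubectl get pods
    kubectl describe pod pytorch-ngc-job-<pod_id>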

Acknowledgement: The content of this chapter has been adapted from the original DeepOps documentation.