# Kubernetes Usage Guide
## Introduction

Our current guide assumes that you are working in a Linux-based environment and have acquired an access token from the administrator.

During our trial period, please fill out the Apex Trial Request Form to get access. Please also indicate that you need Kubernetes access.
## Accessing the Kubernetes Cluster

You can access the cluster by using the provided `kubectl` client on `apex-login`. Alternatively, you can install `kubectl` on your local machine and apply the kubeconfig described below. Please refer to the official Kubernetes documentation for installing the `kubectl` client.
## Login

Users should obtain the client access token provided by the administrator. Due to an existing cluster limitation, we have to validate your credentials manually before provisioning your access token. Please contact apex@cmkl.ac.th to request your log-in token.

Note that we DO NOT keep your client access key. If you lose your access key, you will have to go through the validation process again to acquire a new token.
You can use the following `~/.kube/config` file as your kubeconfig template. Make sure to replace the comments with your assigned namespace, username, client key, and certificate.
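A template along these lines might look as follows. This is a sketch: the cluster and context names, the server address, and all placeholder values are assumptions; use the values supplied by the administrator.

```yaml
apiVersion: v1
kind: Config
clusters:
- cluster:
    certificate-authority-data: # <base64-encoded cluster CA certificate>
    server: https://127.0.0.1:6443   # assumed API endpoint; see the port-forwarding note below
  name: apex
contexts:
- context:
    cluster: apex
    namespace: # <your assigned namespace>
    user: # <your username>
  name: apex
current-context: apex
users:
- name: # <your username>
  user:
    client-certificate-data: # <base64-encoded client certificate>
    client-key-data: # <base64-encoded client key>
```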
If you are not using `kubectl` from the `apex-login` host, you have to forward `kubectl` access to your local machine. The cluster's internal certificates can only be validated for API access from inside the cluster and from localhost (127.0.0.1).
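One way to do this is an SSH tunnel from your machine through `apex-login`. The port 6443 used here is the Kubernetes API default and is an assumption; use the port given in your kubeconfig.

```shell
# Forward the Kubernetes API server to localhost:6443 via apex-login
ssh -N -L 6443:127.0.0.1:6443 <username>@apex-login
```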
Validate that you can access the cluster.
You should get the result similar to the following output.
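For example, any authenticated command will do as a minimal check:

```shell
kubectl cluster-info
```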
## Simple Commands

Get a list of the nodes in the cluster:
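```shell
kubectl get nodes
```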
Our cluster currently consists of 3 master and 6 worker nodes. Only nodes marked as `Ready` and not `SchedulingDisabled` are available for Kubernetes jobs.
Get a list of running pods in your current namespace:
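```shell
kubectl get pods
```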
## Running an Interactive Session

For an interactive one-off execution, you can use `kubectl run` to execute a container.
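A sketch of such a command is shown below. The full image path is an assumption based on the local registry URL given later in this guide; substitute the correct path if it differs.

```shell
# Run a one-off interactive pod; --rm deletes the pod when you exit
kubectl run pytest --rm -it \
  --image=registry.apex.cmkl.ac.th/nvidia/pytorch:21.05-py3 \
  -- bash
```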
The above command runs a pod named `pytest` in your namespace using `nvidia/pytorch:21.05-py3` from the local Apex registry. The operation can take a while to complete. Upon success, you should get a bash prompt inside your container. Note that the `--rm` option automatically removes the pod after you exit.
## Running a Job

### Running a Simple PyTorch Job

A simple PyTorch Job can be run via `kubectl` by creating a YAML configuration file, `pytorch-job.yaml`, and then using `kubectl` to submit it.
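A sketch of such a `pytorch-job.yaml` is shown below; the image path, training command, and exact field values are assumptions for illustration.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-job
spec:
  template:
    spec:
      containers:
      - name: pytorch
        image: registry.apex.cmkl.ac.th/nvidia/pytorch:21.05-py3  # assumed local-registry path
        command: ["python", "/workspace/examples/mnist/main.py"]  # hypothetical MNIST training script
        resources:
          limits:
            nvidia.com/gpu: 1   # limit the container to a single GPU
      restartPolicy: Never
  backoffLimit: 1
```

You can then submit the job with `kubectl apply -f pytorch-job.yaml`.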
Note that Kubernetes relies on the YAML configuration, which you can track and version-control in order to provide consistent run parameters for your experiments. The above job executes a simple training run on the `mnist` dataset using `pytorch`. Observe that:

- We are pulling a PyTorch container from the local Apex registry
- The container is limited to a single GPU resource
- The Kubernetes object we are creating is a `job`, which spawns a `pod` and runs it to completion a single time
Check on the job.
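```shell
kubectl get jobs
```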
Monitor the pod that's spawned from the job.
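The Job controller labels its pods with `job-name`, so you can watch them by label (the job name `pytorch-job` here is an assumption):

```shell
kubectl get pods -l job-name=pytorch-job --watch
```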
Follow the logs for the pod:
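Assuming the job is named `pytorch-job`:

```shell
kubectl logs -f job/pytorch-job
```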
Delete the job (and the corresponding pod).
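Assuming the job is named `pytorch-job`:

```shell
kubectl delete job pytorch-job
```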
## Using NGC Containers with Kubernetes and Launching Jobs

NVIDIA GPU Cloud (NGC) manages a catalog of fully integrated and optimized DL framework containers that take full advantage of NVIDIA GPUs in both single and multi-GPU configurations. They include NVIDIA CUDA® Toolkit, DIGITS workflow, and the following DL frameworks: NVCaffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), MXNet, PyTorch, TensorFlow, Theano, and Torch. These framework containers are delivered ready-to-run, including all necessary dependencies such as the CUDA runtime and NVIDIA libraries.
Note that we already have a local registry mirroring NGC on Apex available at https://registry.apex.cmkl.ac.th/. You may also use NGC if the required image is not available on the local registry.
To access the NGC container registry via Kubernetes, add a secret that Kubernetes will use when pulling container images from NGC.
Generate an NGC API Key, which will be used for the Kubernetes secret.
- Log in to the NGC Registry at https://ngc.nvidia.com/
- Go to https://ngc.nvidia.com/configuration/api-key
- Click on GENERATE API KEY
Using the NGC API Key, create a Kubernetes secret so that Kubernetes will be able to pull container images from the NGC registry. Create the secret by running the following command on the master (substitute the registered email account and secret in the appropriate locations).
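A sketch of the command is shown below. The secret name `ngc-registry-secret` is an assumption; NGC uses the literal string `$oauthtoken` as the username, with your API key as the password.

```shell
kubectl create secret docker-registry ngc-registry-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<NGC API Key> \
  --docker-email=<registered email>
```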
Check that the secret exists.
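```shell
kubectl get secrets
```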
You can now use the secret to pull custom NGC images by adding the `imagePullSecrets` attribute to your job configuration, for example in `pytorch-job.yaml`.
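A sketch of such a configuration is shown below; the secret name, training command, and other field values are assumptions for illustration.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-job
spec:
  template:
    spec:
      imagePullSecrets:
      - name: ngc-registry-secret   # the secret created above (name is an assumption)
      containers:
      - name: pytorch
        image: nvcr.io/nvidia/pytorch:21.05-py3   # pulled directly from NGC
        command: ["python", "/workspace/examples/mnist/main.py"]  # hypothetical training script
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Never
```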
Acknowledgement: The content of this chapter has been adapted from the original DeepOps documentation.