Horovod With OpenMPI

This guide was tested on the APEX system.

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.

Requirements

Python 3.8.10
pip 20.0.2
More than one GPU
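
As a quick sanity check of these requirements, the sketch below prints the Python version and counts the GPUs reported by `nvidia-smi -L`. The helper name `count_gpus` is hypothetical, and the script assumes the NVIDIA driver tools are on the PATH; it reports 0 GPUs when they are not.

```python
import shutil
import subprocess
import sys


def count_gpus() -> int:
    """Count GPUs listed by `nvidia-smi -L`; return 0 if the tool is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return 0
    # Each GPU is reported on its own line that starts with "GPU <index>:".
    return sum(1 for line in result.stdout.splitlines() if line.startswith("GPU "))


if __name__ == "__main__":
    print("Python:", sys.version.split()[0])
    print("GPUs visible:", count_gpus())
```

Run it on a machine where the GPUs are visible, for example inside a Slurm allocation. A count below 2 means the multi-GPU example cannot use one GPU per process.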

Install

Install Horovod with pip3:

    pip3 install horovod[all-frameworks]

Example

Log in to the APEX login node:

    ssh <YOUR ACCOUNT>@apex-login.cmkl.ac.th

Clone the Horovod repository, which includes the example scripts:

    git clone https://github.com/horovod/horovod.git

Change to the TensorFlow 2 example directory inside the cloned repository:

    cd horovod/examples/tensorflow2

You can run the training script with Horovod on the DGX server through Slurm and Open MPI using the command below. It allocates 2 GPUs (--gres=gpu:2) and launches 2 processes with mpirun (-np 2).

    srun --gres=gpu:2 --pty mpirun -np 2 -H localhost:2 --oversubscribe python3 tensorflow2_mnist.py
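
The GPU count appears three times in this command: in `--gres=gpu:2`, in `-np 2`, and in `-H localhost:2`. The sketch below builds the same command line for a given number of GPUs so the three values stay in sync. The function name `build_horovod_command` is hypothetical, and it assumes a single-node run with one process per GPU, as in the command above.

```python
import shlex


def build_horovod_command(num_gpus: int, script: str = "tensorflow2_mnist.py") -> str:
    """Return the srun/mpirun command line for a single-node run on num_gpus GPUs."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    args = [
        "srun", f"--gres=gpu:{num_gpus}", "--pty",        # Slurm: request the GPUs
        "mpirun", "-np", str(num_gpus),                    # Open MPI: one process per GPU
        "-H", f"localhost:{num_gpus}", "--oversubscribe",  # all slots on the allocated node
        "python3", script,
    ]
    return shlex.join(args)


# Prints the same command shown above for 2 GPUs.
print(build_horovod_command(2))
```

Calling `build_horovod_command(4)` would produce the equivalent command for 4 GPUs, provided the node has that many available.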

You should see output similar to the following:

jarukit@archon-2:~/horovod/examples/tensorflow2$ srun --gres=gpu:2 --pty mpirun -np 2 -H localhost:2 --oversubscribe python3 tensorflow2_mnist.py
srun: job 16322 queued and waiting for resources
srun: job 16322 has been allocated resources

2021-07-13 20:07:59.862412: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-13 20:07:59.862412: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              prism-3
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   prism-3
  Local device: mlx5_0
--------------------------------------------------------------------------
2021-07-13 20:08:40.085734: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-07-13 20:08:40.085734: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-07-13 20:08:40.295199: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:b7:00.0 name: NVIDIA A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-07-13 20:08:40.297679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: 
pciBusID: 0000:bd:00.0 name: NVIDIA A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-07-13 20:08:40.297708: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-13 20:08:40.297939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:b7:00.0 name: NVIDIA A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-07-13 20:08:40.300353: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: 
pciBusID: 0000:bd:00.0 name: NVIDIA A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-07-13 20:08:40.300379: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-13 20:08:40.317188: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-07-13 20:08:40.317711: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-07-13 20:08:40.317799: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-07-13 20:08:40.318027: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-07-13 20:08:40.323664: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-07-13 20:08:40.324046: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-07-13 20:08:40.326148: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-07-13 20:08:40.326218: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-07-13 20:08:40.328306: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-07-13 20:08:40.328546: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-07-13 20:08:40.333805: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-07-13 20:08:40.334074: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-07-13 20:08:40.334522: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib:/usr/lib:/usr/local/cuda/lib
2021-07-13 20:08:40.334662: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib:/usr/lib:/usr/local/cuda/lib
2021-07-13 20:08:40.334709: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-07-13 20:08:40.335017: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-07-13 20:08:41.324206: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-13 20:08:41.324300: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-13 20:08:41.363486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-13 20:08:41.363505: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      
2021-07-13 20:08:41.363689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-13 20:08:41.363703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      
[prism-3:3419602] 1 more process has sent help message help-mpi-btl-openib.txt / ib port not selected
[prism-3:3419602] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[prism-3:3419602] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
2021-07-13 20:08:48.772704: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-07-13 20:08:48.772704: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-07-13 20:08:48.874005: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2245655000 Hz
2021-07-13 20:08:48.893431: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2245655000 Hz
Step #0 Loss: 2.327072
Step #10        Loss: 0.667649
Step #20        Loss: 0.546284
Step #30        Loss: 0.313567
Step #40        Loss: 0.441288
Step #50        Loss: 0.197040
Step #60        Loss: 0.245798
Step #70        Loss: 0.187149
Step #80        Loss: 0.254351
Step #90        Loss: 0.151158
Step #100       Loss: 0.167224
Step #110       Loss: 0.196209
Step #120       Loss: 0.053351
Step #130       Loss: 0.148894
Step #140       Loss: 0.075415
Step #150       Loss: 0.232747
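
In the sample output above, both processes report that libcudnn.so.8 could not be loaded and that GPU registration was skipped, which suggests that this particular run trained on the CPU even though two GPUs were detected. The sketch below scans log lines for those two messages. The function name `find_gpu_problems` and the marker strings are illustrative choices taken from this log, not part of Horovod or TensorFlow.

```python
from typing import Iterable, List

# Message fragments copied from the TensorFlow warnings in the output above.
SKIP_MARKER = "Skipping registering GPU devices"
MISSING_LIB_MARKER = "Could not load dynamic library"


def find_gpu_problems(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines that suggest TensorFlow fell back to the CPU."""
    return [
        line.strip()
        for line in log_lines
        if SKIP_MARKER in line or MISSING_LIB_MARKER in line
    ]


# Lines excerpted from the output above (timestamps removed).
sample = [
    "Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: "
    "cannot open shared object file: No such file or directory",
    "Skipping registering GPU devices...",
    "Step #0 Loss: 2.327072",
]

for problem in find_gpu_problems(sample):
    print("GPU warning:", problem)
```

If the function returns any lines for your own run, check that cuDNN is installed and that its directory is listed in LD_LIBRARY_PATH before relying on GPU training.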

Good luck :)