
Horovod With OpenMPI

*This was tested on the APEX system.

Horovod Repo

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.


  • python 3.8.10
  • pip 20.0.2
  • More than one GPU
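
The requirements above can be checked from a shell before installing anything. The following is a minimal sketch, not part of the original guide: the helper name `have_tool` is hypothetical, and it assumes that `nvidia-smi` is the tool used to list GPUs on the node.

```shell
#!/bin/sh
# have_tool is a hypothetical helper: succeeds if the named command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Print the Python and pip versions, or report that the tool is missing.
if have_tool python3; then python3 --version; else echo "missing: python3"; fi
if have_tool pip3; then pip3 --version; else echo "missing: pip3"; fi

# Assumption: nvidia-smi is installed; "nvidia-smi -L" prints one "GPU N: ..." line per GPU.
if have_tool nvidia-smi; then
    echo "GPUs visible: $(nvidia-smi -L | grep -c '^GPU')"
else
    echo "missing: nvidia-smi"
fi
```

On a login node without GPUs the last check may report zero GPUs or a missing `nvidia-smi`; in that case, run it inside a Slurm allocation instead.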


Install Horovod with pip3

pip3 install horovod[all-frameworks]


Log in to APEX


Clone the examples from the Horovod repository

git clone

Change directory to the Horovod example code

cd horovod/examples/tensorflow2

We can run the training code with Horovod on the DGX server via Slurm and Open MPI using the following command. It allocates two GPUs (--gres=gpu:2) and launches two processes with mpirun (-np 2), so the process count matches the number of allocated GPUs.

srun --gres=gpu:2 --pty mpirun -np 2 -H localhost:2 --oversubscribe python3
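
To change the number of GPUs, the same value has to appear in three places in the command above (`--gres=gpu:`, `-np`, and `-H localhost:`). The following is a minimal sketch that builds the command from a single variable and prints it instead of running it. The variable names are hypothetical, and `your_training_script.py` is a placeholder for whichever example script you want to run, not a file named by this guide.

```shell
#!/bin/sh
# Hypothetical variables: set the GPU count once and reuse it everywhere.
NUM_GPUS=2
SCRIPT="your_training_script.py"   # placeholder, replace with the real script

# Same flags as the command above, with the GPU count substituted in.
CMD="srun --gres=gpu:${NUM_GPUS} --pty mpirun -np ${NUM_GPUS} -H localhost:${NUM_GPUS} --oversubscribe python3 ${SCRIPT}"

# Dry run: print the command so it can be inspected before it is submitted.
echo "${CMD}"
```

Once the printed command looks right, it can be pasted into the shell to submit the job.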

For this example, you will see output similar to the following.

jarukit@archon-2:~/horovod/examples/tensorflow2$ srun --gres=gpu:2 --pty mpirun -np 2 -H localhost:2 --oversubscribe python3
srun: job 16322 queued and waiting for resources
srun: job 16322 has been allocated resources
2021-07-13 20:07:59.862412: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-07-13 20:07:59.862412: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: prism-3
Local adapter: mlx5_0
Local port: 1
WARNING: There was an error initializing an OpenFabrics device.
Local host: prism-3
Local device: mlx5_0
2021-07-13 20:08:40.085734: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-07-13 20:08:40.085734: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-07-13 20:08:40.295199: I tensorflow/core/common_runtime/gpu/] Found device 0 with properties:
pciBusID: 0000:b7:00.0 name: NVIDIA A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-07-13 20:08:40.297679: I tensorflow/core/common_runtime/gpu/] Found device 1 with properties:
pciBusID: 0000:bd:00.0 name: NVIDIA A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-07-13 20:08:40.297708: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-07-13 20:08:40.297939: I tensorflow/core/common_runtime/gpu/] Found device 0 with properties:
pciBusID: 0000:b7:00.0 name: NVIDIA A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-07-13 20:08:40.300353: I tensorflow/core/common_runtime/gpu/] Found device 1 with properties:
pciBusID: 0000:bd:00.0 name: NVIDIA A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-07-13 20:08:40.300379: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-07-13 20:08:40.317188: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-07-13 20:08:40.317711: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-07-13 20:08:40.317799: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-07-13 20:08:40.318027: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-07-13 20:08:40.323664: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-07-13 20:08:40.324046: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-07-13 20:08:40.326148: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-07-13 20:08:40.326218: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-07-13 20:08:40.328306: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-07-13 20:08:40.328546: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-07-13 20:08:40.333805: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-07-13 20:08:40.334074: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2021-07-13 20:08:40.334522: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib:/usr/lib:/usr/local/cuda/lib
2021-07-13 20:08:40.334662: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib:/usr/lib:/usr/local/cuda/lib
2021-07-13 20:08:40.334709: W tensorflow/core/common_runtime/gpu/] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-07-13 20:08:40.335017: W tensorflow/core/common_runtime/gpu/] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-07-13 20:08:41.324206: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-13 20:08:41.324300: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-13 20:08:41.363486: I tensorflow/core/common_runtime/gpu/] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-13 20:08:41.363505: I tensorflow/core/common_runtime/gpu/]
2021-07-13 20:08:41.363689: I tensorflow/core/common_runtime/gpu/] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-13 20:08:41.363703: I tensorflow/core/common_runtime/gpu/]
[prism-3:3419602] 1 more process has sent help message help-mpi-btl-openib.txt / ib port not selected
[prism-3:3419602] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[prism-3:3419602] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
2021-07-13 20:08:48.772704: I tensorflow/compiler/mlir/] None of the MLIR Optimization Passes are enabled (registered 2)
2021-07-13 20:08:48.772704: I tensorflow/compiler/mlir/] None of the MLIR Optimization Passes are enabled (registered 2)
2021-07-13 20:08:48.874005: I tensorflow/core/platform/profile_utils/] CPU Frequency: 2245655000 Hz
2021-07-13 20:08:48.893431: I tensorflow/core/platform/profile_utils/] CPU Frequency: 2245655000 Hz
Step #0 Loss: 2.327072
Step #10 Loss: 0.667649
Step #20 Loss: 0.546284
Step #30 Loss: 0.313567
Step #40 Loss: 0.441288
Step #50 Loss: 0.197040
Step #60 Loss: 0.245798
Step #70 Loss: 0.187149
Step #80 Loss: 0.254351
Step #90 Loss: 0.151158
Step #100 Loss: 0.167224
Step #110 Loss: 0.196209
Step #120 Loss: 0.053351
Step #130 Loss: 0.148894
Step #140 Loss: 0.075415
Step #150 Loss: 0.232747

Good luck :)