Hi,
I'm trying to troubleshoot a problem: we don't seem to be getting the
bandwidth we'd expect from our distributed CUDA program, which uses
Open MPI to pass data between GPUs in an HPC cluster.
I thought I found a possible root cause, but now I'm unsure of how to
fix this, since the documentation provides conflicting information.
Running
ompi_info --all | grep "MCA btl"
gives me the following output:
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.2)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.2)
MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.0.2)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.2)
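For completeness, I also plan to double-check that the library itself
was built with CUDA support. If I'm reading the FAQ correctly, something
like the following should report it (assuming the parameter name is the
same in my version):
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value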
According to this FAQ entry, https://www.open-mpi.org/faq/?category=runcuda,
the openib btl is a prerequisite for GPUDirect RDMA, and as the output
above shows, openib isn't in my list of btl components.
However, I'm also reading that UCX is now the preferred way to do RDMA
in Open MPI and that it has CUDA support.
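In case it clarifies what I'm after, this is roughly the invocation I
was going to try based on my reading of the UCX documentation; the
transport list and the application name are just my guesses, not
something I've verified works:
mpirun -np 2 --mca pml ucx -x UCX_TLS=rc,cuda_copy,cuda_ipc,gdr_copy ./my_cuda_app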
Can anyone tell me what a proper configuration for GPUDirect RDMA over
InfiniBand looks like?
Best regards,
Oskar Lappi