Hi,

I'm trying to troubleshoot a problem: we don't seem to be getting the bandwidth we'd expect from our distributed CUDA program, which uses Open MPI to pass data between GPUs in an HPC cluster.

I think I've found a possible root cause, but I'm unsure how to fix it, since the documentation provides conflicting information.

Running

    ompi_info --all | grep "MCA btl"

gives me the following output:

                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.2)
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.2)
                 MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.0.2)
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.2)

According to this FAQ (https://www.open-mpi.org/faq/?category=runcuda), the openib btl is a prerequisite for GPUDirect RDMA, but as the output above shows, openib isn't in our btl list.
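
If it's relevant, I believe the same FAQ suggests something along these lines to check whether Open MPI was built with CUDA support at all (I may be misreading it):

    # check whether this Open MPI build has CUDA support compiled in
    ompi_info --parsable --all | grep mpi_built_with_cuda_support:value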

However, I'm also reading that UCX is the preferred way to do RDMA and that it has CUDA support.
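
I also don't know whether our installation has the UCX PML at all; I assume a check along these lines would show it if it were present:

    # look for UCX components (e.g. pml ucx, osc ucx) in this installation
    ompi_info | grep -i ucx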

Can anyone tell me what a proper configuration for GPUDirect RDMA over InfiniBand looks like?
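
For context, the kind of invocation I had in mind is roughly the following; the UCX_TLS selection is just my guess and ./my_app is a placeholder for our benchmark, so please correct anything that's wrong:

    # hypothetical launch: force the UCX PML and request CUDA-capable transports
    mpirun -np 2 --mca pml ucx \
           -x UCX_TLS=rc,cuda_copy,cuda_ipc,gdr_copy \
           ./my_app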

Best regards,

Oskar Lappi
