Hi Arturo,

Usually, for Open MPI + UCX we use the following recipe.
For UCX:

  ./configure --prefix=/path/to/ucx-cuda-install --with-cuda=/usr/local/cuda --with-gdrcopy=/usr
  make -j install

then for Open MPI:

  ./configure --with-cuda=/usr/local/cuda --with-ucx=/path/to/ucx-cuda-install
  make -j install

Can you run with the following to see if it helps?

  mpirun -np 2 --mca pml ucx --mca btl ^smcuda,openib ./osu_latency D H

There are details here that may be useful:
https://www.open-mpi.org/faq/?category=runcuda#run-ompi-cuda-ucx

Also, note that for short messages the inter-node D->H path may not involve any CUDA API calls
(if you're using nvprof to detect CUDA activity), because the GPUDirect RDMA path and gdrcopy
are used. A few optional checks that may help narrow this down are sketched below.
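If things still look CPU-only after rebuilding, it may be worth double-checking that both builds
really picked up CUDA. A minimal sanity check, assuming the ucx_info and ompi_info from your
CUDA-enabled installs are the ones on your PATH (the exact transport names listed can differ a
bit between UCX versions):

  # UCX: show the configure line used for this build, then confirm the CUDA/gdrcopy
  # transports (cuda_copy, cuda_ipc, gdr_copy) are present
  ucx_info -v
  ucx_info -d | grep -i -e cuda -e gdr

  # Open MPI: confirm CUDA support is compiled in (should print ...:value:true)
  ompi_info --parsable --all | grep mpi_built_with_cuda_support:value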
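To see which path UCX actually takes at run time, a more verbose variant of the command above
can also help. The verbosity knobs below are what I would try first; please treat the exact
parameter names as assumptions for your 4.0.1 / 1.6.0 combination:

  # Same run as above, with the UCX PML, Open MPI CUDA layer, and UCX itself made chatty
  mpirun -np 2 --mca pml ucx --mca btl ^smcuda,openib \
         --mca pml_ucx_verbose 100 --mca opal_cuda_verbose 10 \
         -x UCX_LOG_LEVEL=info \
         ./osu_latency D H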
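And regarding the nvprof point above: one way to confirm that GPU buffers really are in play is
to temporarily disable the GPUDirect RDMA path so the staging copies become visible to nvprof.
UCX_IB_GPU_DIRECT_RDMA is the variable I have in mind; please treat the exact name and behavior
as an assumption for UCX 1.6.0:

  # Temporarily turn off GPUDirect RDMA so D->H transfers go through CUDA staging copies,
  # which nvprof can then observe
  mpirun -np 2 --mca pml ucx --mca btl ^smcuda,openib \
         -x UCX_IB_GPU_DIRECT_RDMA=no \
         ./osu_latency D H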
On Fri, Sep 6, 2019 at 7:36 AM Arturo Fernandez via users <users@lists.open-mpi.org> wrote:

> Josh,
> Thank you. Yes, I built UCX with CUDA and gdrcopy support. I also had to
> disable numa (--disable-numa) as requested during the installation.
> AFernandez
>
> Joshua Ladd wrote:
>> Did you build UCX with CUDA support (--with-cuda)?
>> Josh
>>
>> On Thu, Sep 5, 2019 at 8:45 PM AFernandez via users <users@lists.open-mpi.org> wrote:
>>> Hello OpenMPI Team,
>>>
>>> I'm trying to use CUDA-aware OpenMPI, but the system simply ignores the
>>> GPU and the code runs on the CPUs. I've tried different software but will
>>> focus on the OSU benchmarks (collective and pt2pt communications). Let me
>>> provide some data about the configuration of the system:
>>>
>>> - OFED v4.17-1-rc2 (the NIC is virtualized, but I also tried a Mellanox
>>>   card with MOFED a few days ago and found the same issue)
>>> - CUDA v10.1
>>> - gdrcopy v1.3
>>> - UCX 1.6.0
>>> - OpenMPI 4.0.1
>>>
>>> Everything looks good (CUDA programs work fine, MPI programs run on the
>>> CPUs without any problem), and ompi_info outputs what I was expecting
>>> (but maybe I'm missing something):
>>>
>>> mca:opal:base:param:opal_built_with_cuda_support:synonym:name:mpi_built_with_cuda_support
>>> mca:mpi:base:param:mpi_built_with_cuda_support:value:true
>>> mca:mpi:base:param:mpi_built_with_cuda_support:source:default
>>> mca:mpi:base:param:mpi_built_with_cuda_support:status:read-only
>>> mca:mpi:base:param:mpi_built_with_cuda_support:level:4
>>> mca:mpi:base:param:mpi_built_with_cuda_support:help:Whether CUDA GPU buffer support is built into library or not
>>> mca:mpi:base:param:mpi_built_with_cuda_support:enumerator:value:0:false
>>> mca:mpi:base:param:mpi_built_with_cuda_support:enumerator:value:1:true
>>> mca:mpi:base:param:mpi_built_with_cuda_support:deprecated:no
>>> mca:mpi:base:param:mpi_built_with_cuda_support:type:bool
>>> mca:mpi:base:param:mpi_built_with_cuda_support:synonym_of:name:opal_built_with_cuda_support
>>> mca:mpi:base:param:mpi_built_with_cuda_support:disabled:false
>>>
>>> The available btls are the usual self, openib, tcp & vader, plus smcuda,
>>> uct & usnic. The full output from ompi_info is attached. If I try the flag
>>> '--mca opal_cuda_verbose 10', it doesn't output anything, which seems to
>>> agree with the lack of GPU use. If I try '--mca btl smcuda', it makes no
>>> difference. I have also tried telling the program to use host and device
>>> buffers (e.g. mpirun -np 2 ./osu_latency D H) but get the same result. I am
>>> probably missing something but not sure where else to look or what else to try.
>>>
>>> Thank you,
>>> AFernandez

--
-Akshay
NVIDIA
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users