Re: [OMPI users] CUDA IPC/RDMA Not Working
Re: [OMPI users] CUDA IPC/RDMA Not Working

We have figured this out. It turns out that the first call to each MPI_Isend/MPI_Irecv is staged through the host, but subsequent calls are not.

Thanks,
Justin

From: Justin Luitjens
Sent: Wednesday, March 30, 2016 9:37 AM
To: us...@open-mpi.org
Subject: CUDA IPC/RDMA Not Working

Hello,

I have installed OpenMPI 1.10.2 with CUDA support:

[jluitjens@dt03 repro]$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true

I'm trying to verify that GPUDirect is working and that messages aren't traversing through the host. On a K80 I'm starting 2 MPI processes, where each takes one of the GPUs of the K80. They then do a send/receive of a certain size. In addition, I'm recording a timeline with nvprof to visualize what is happening. What I'm expecting to happen is that there will be one MemCpy D2D on each device, corresponding to the send and the receive. However, what I'm seeing on each device is a D2H copy followed by an H2D copy, suggesting the data is staging through the host.

Here is how I'm currently running the application:

mpirun --mca btl_smcuda_cuda_ipc_verbose 100 --mca btl_smcuda_use_cuda_ipc 1 --mca btl smcuda,self --mca btl_openib_want_cuda_gdr 1 -np 2 nvprof -o profile.%p ./a.out

I'm getting the following diagnostic output:

[dt03:21732] Sending CUDA IPC REQ (try=1): myrank=1, mydev=1, peerrank=0
[dt03:21731] Sending CUDA IPC REQ (try=1): myrank=0, mydev=0, peerrank=1
[dt03:21731] Not sending CUDA IPC ACK because request already initiated
[dt03:21732] Analyzed CUDA IPC request: myrank=1, mydev=1, peerrank=0, peerdev=0 --> ACCESS=1
[dt03:21732] BTL smcuda: rank=1 enabling CUDA IPC to rank=0 on node=dt03
[dt03:21732] Sending CUDA IPC ACK: myrank=1, mydev=1, peerrank=0, peerdev=0
[dt03:21731] Received CUDA IPC ACK, notifying PML: myrank=0, peerrank=1
[dt03:21731] BTL smcuda: rank=0 enabling CUDA IPC to rank=1 on node=dt03

Here it seems like IPC is correctly being enabled between ranks 0 and 1.
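To illustrate the behavior described above, here is a hypothetical repro sketch (not the original a.out; buffer size, tag, and variable names are my own). Each rank takes one GPU and exchanges a device buffer twice: per the finding in this reply, iteration 0 would be staged through the host while the CUDA IPC handshake completes, and iteration 1 should show a single D2D copy in the nvprof timeline. Error checking is omitted for brevity, and exactly two ranks on one node are assumed.

```c
/* Hypothetical sketch: warm-up exchange before the "real" (profiled) exchange.
 * Requires a CUDA-aware Open MPI build; device pointers are passed directly
 * to MPI_Isend/MPI_Irecv. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;                  /* assumes exactly 2 ranks */

    cudaSetDevice(rank);                  /* one K80 GPU per rank */

    const size_t n = 1 << 20;             /* arbitrary message size */
    char *sbuf, *rbuf;
    cudaMalloc((void **)&sbuf, n);
    cudaMalloc((void **)&rbuf, n);

    MPI_Request req[2];
    for (int iter = 0; iter < 2; ++iter) {
        /* iter 0: first Isend/Irecv per peer -- staged D2H + H2D via host.
         * iter 1: CUDA IPC established     -- expect a single D2D copy.  */
        MPI_Irecv(rbuf, (int)n, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sbuf, (int)n, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }

    cudaFree(sbuf);
    cudaFree(rbuf);
    MPI_Finalize();
    return 0;
}
```

Built with the CUDA-aware mpicc and run under nvprof with the mpirun command shown below, the timeline for each iteration can be compared to distinguish the staged first exchange from the subsequent IPC path.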
I have tried both very large and very small messages, and they all seem to stage through the host. What am I doing wrong?

For reference, here is my ompi_info output:

[jluitjens@dt03 repro]$ ompi_info
Package: Open MPI jluitjens@dt04 Distribution
Open MPI: 1.10.2
Open MPI repo revision: v1.10.1-145-g799148f
Open MPI release date: Jan 21, 2016
Open RTE: 1.10.2
Open RTE repo revision: v1.10.1-145-g799148f
Open RTE release date: Jan 21, 2016
OPAL: 1.10.2
OPAL repo revision: v1.10.1-145-g799148f
OPAL release date: Jan 21, 2016
MPI API: 3.0.0
Ident string: 1.10.2
Prefix: /shared/devtechapps/mpi/gnu-4.7.3/openmpi-1.10.2/cuda-7.5
Configured architecture: x86_64-pc-linux-gnu
Configure host: dt04
Configured by: jluitjens
Configured on: Tue Feb 9 10:56:22 PST 2016
Configure host: dt04
Built by: jluitjens
Built on: Tue Feb 9 11:21:51 PST 2016
Built host: dt04
C bindings: yes
C++ bindings: yes
Fort mpif.h: yes (all)
Fort use mpi: yes (limited: overloading)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: no
Fort mpi_f08 compliance: The mpi_f08 module was not built
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/gcc
C compiler absolute:
C compiler family name: GNU
C compiler version: 4.7.3
C++ compiler: /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/g++
C++ compiler absolute: none
Fort compiler: /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/gfortran
Fort compiler abs:
Fort ignore TKR: no
Fort 08 assumed shape: no
Fort optional args: no
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: no
Fort BIND(C) (all): no
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): no
Fort TYPE,BIND(C): no
Fort T,BIND(C,name="a"): no
Fort PRIVATE: no
Fort PROTECTED: no
Fort ABSTRACT: no
Fort ASYNCHRONOUS: no
Fort PROCEDURE: no
Fort USE...ONLY: no
Fort C_FUNLOC: no
Fort f08 using wrappers: no
Fort MPI_SIZEOF: no
C profiling: yes
C++ profiling: yes
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: no
C++ exceptions: no
Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
Sparse Groups: no
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: no
mpirun default --prefix: no
MPI I/O support: yes
MPI_WTIME support: gettimeofday
Symbol vis. support: yes
Host topology support: yes
MPI extensions:
FT Checkpoint support: no (checkpoint thread: no)