We have figured this out. It turns out that the first call to each 
MPI_Isend/MPI_Irecv is staged through the host, but subsequent calls are not.
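
For anyone else profiling this, here is a minimal sketch of the kind of repro 
described in the quoted message below, with an untimed warm-up exchange added so 
that the first-call host staging stays out of the profiled exchange. It is 
illustrative only (not the actual a.out); the device selection, element count, 
datatype, and tag are arbitrary choices.

#include <mpi.h>
#include <cuda_runtime.h>

#define N (1 << 20)  /* arbitrary message size, in doubles */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int peer = 1 - rank;      /* assumes exactly two ranks on one node */
    cudaSetDevice(rank);      /* one GPU of the K80 per rank */

    double *sendbuf, *recvbuf;
    cudaMalloc((void **)&sendbuf, N * sizeof(double));
    cudaMalloc((void **)&recvbuf, N * sizeof(double));

    MPI_Request reqs[2];
    for (int iter = 0; iter < 2; ++iter) {
        /* iter 0: warm-up -- the first exchange is staged through the host
         * while the CUDA IPC handles are set up.
         * iter 1: later exchanges should go GPU-to-GPU over CUDA IPC. */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Finalize();
    return 0;
}

Profiling (or timing) only the second iteration should then show the 
device-to-device copies rather than the D2H/H2D pair from the first exchange.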

Thanks,
Justin

From: Justin Luitjens
Sent: Wednesday, March 30, 2016 9:37 AM
To: us...@open-mpi.org
Subject: CUDA IPC/RDMA Not Working

Hello,

I have installed Open MPI 1.10.2 with CUDA support:

[jluitjens@dt03 repro]$ ompi_info --parsable --all | grep 
mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true


I'm trying to verify that GPUDirect is working and that messages aren't 
traversing the host. On a K80 I'm starting two MPI processes, each of which 
takes one of the K80's two GPUs. They then do a send/receive of a certain size.

In addition, I'm recording a timeline with nvprof to visualize what is 
happening. What I'm expecting to see is one D2D memcpy on each device, 
corresponding to the send and the receive. However, what I'm actually seeing on 
each device is a D2H copy followed by an H2D copy, suggesting the data is 
staging through the host.

Here is how I'm currently running the application:

mpirun --mca btl_smcuda_cuda_ipc_verbose 100 --mca btl_smcuda_use_cuda_ipc 1 
--mca btl smcuda,self --mca btl_openib_want_cuda_gdr 1 -np 2 nvprof -o 
profile.%p ./a.out
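
As a quick check of the copy kinds without opening each timeline file, nvprof's 
GPU trace (--print-gpu-trace) labels every copy as HtoD, DtoH, DtoD, or PtoP 
directly in its output; something along these lines should work:

mpirun --mca btl smcuda,self -np 2 nvprof --print-gpu-trace ./a.out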



I'm getting the following diagnostic output:

[dt03:21732] Sending CUDA IPC REQ (try=1): myrank=1, mydev=1, peerrank=0
[dt03:21731] Sending CUDA IPC REQ (try=1): myrank=0, mydev=0, peerrank=1
[dt03:21731] Not sending CUDA IPC ACK because request already initiated
[dt03:21732] Analyzed CUDA IPC request: myrank=1, mydev=1, peerrank=0, 
peerdev=0 --> ACCESS=1
[dt03:21732] BTL smcuda: rank=1 enabling CUDA IPC to rank=0 on node=dt03
[dt03:21732] Sending CUDA IPC ACK:  myrank=1, mydev=1, peerrank=0, peerdev=0
[dt03:21731] Received CUDA IPC ACK, notifying PML: myrank=0, peerrank=1
[dt03:21731] BTL smcuda: rank=0 enabling CUDA IPC to rank=1 on node=dt03

Here it seems like IPC is correctly being enabled between ranks 0 and 1.

I have tried both very large and very small messages and they all seem to stage 
through the host.

What am I doing wrong?

For reference here is my ompi_info output:

[jluitjens@dt03 repro]$ ompi_info
                 Package: Open MPI jluitjens@dt04 Distribution
                Open MPI: 1.10.2
  Open MPI repo revision: v1.10.1-145-g799148f
   Open MPI release date: Jan 21, 2016
                Open RTE: 1.10.2
  Open RTE repo revision: v1.10.1-145-g799148f
   Open RTE release date: Jan 21, 2016
                    OPAL: 1.10.2
      OPAL repo revision: v1.10.1-145-g799148f
       OPAL release date: Jan 21, 2016
                 MPI API: 3.0.0
            Ident string: 1.10.2
                  Prefix: 
/shared/devtechapps/mpi/gnu-4.7.3/openmpi-1.10.2/cuda-7.5
Configured architecture: x86_64-pc-linux-gnu
          Configure host: dt04
           Configured by: jluitjens
           Configured on: Tue Feb  9 10:56:22 PST 2016
          Configure host: dt04
                Built by: jluitjens
                Built on: Tue Feb  9 11:21:51 PST 2016
              Built host: dt04
              C bindings: yes
            C++ bindings: yes
             Fort mpif.h: yes (all)
            Fort use mpi: yes (limited: overloading)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: no
Fort mpi_f08 compliance: The mpi_f08 module was not built
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/gcc
     C compiler absolute:
  C compiler family name: GNU
      C compiler version: 4.7.3
            C++ compiler: /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/g++
  C++ compiler absolute: none
           Fort compiler: /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/gfortran
       Fort compiler abs:
         Fort ignore TKR: no
   Fort 08 assumed shape: no
      Fort optional args: no
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: no
      Fort BIND(C) (all): no
      Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): no
       Fort TYPE,BIND(C): no
Fort T,BIND(C,name="a"): no
            Fort PRIVATE: no
          Fort PROTECTED: no
           Fort ABSTRACT: no
       Fort ASYNCHRONOUS: no
          Fort PROCEDURE: no
         Fort USE...ONLY: no
           Fort C_FUNLOC: no
Fort f08 using wrappers: no
         Fort MPI_SIZEOF: no
             C profiling: yes
           C++ profiling: yes
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: no
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes,
                          OMPI progress: no, ORTE progress: yes, Event lib:
                          yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
mpirun default --prefix: no
         MPI I/O support: yes
       MPI_WTIME support: gettimeofday
     Symbol vis. support: yes
   Host topology support: yes
          MPI extensions:
   FT Checkpoint support: no (checkpoint thread: no)
   C/R Enabled Debugging: no
     VampirTrace support: yes
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA backtrace: execinfo (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
            MCA compress: gzip (MCA v2.0.0, API v2.0.0, Component v1.10.2)
            MCA compress: bzip (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA crs: none (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                  MCA db: print (MCA v2.0.0, API v1.0.0, Component v1.10.2)
                  MCA db: hash (MCA v2.0.0, API v1.0.0, Component v1.10.2)
                  MCA dl: dlopen (MCA v2.0.0, API v1.0.0, Component v1.10.2)
               MCA event: libevent2021 (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
               MCA hwloc: hwloc191 (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
                  MCA if: posix_ipv4 (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
                  MCA if: linux_ipv6 (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
         MCA installdirs: env (MCA v2.0.0, API v2.0.0, Component v1.10.2)
         MCA installdirs: config (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA memory: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA pstat: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA sec: basic (MCA v2.0.0, API v1.0.0, Component v1.10.2)
               MCA shmem: mmap (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA shmem: posix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA shmem: sysv (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA timer: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA dfs: test (MCA v2.0.0, API v1.0.0, Component v1.10.2)
                 MCA dfs: app (MCA v2.0.0, API v1.0.0, Component v1.10.2)
                 MCA dfs: orted (MCA v2.0.0, API v1.0.0, Component v1.10.2)
              MCA errmgr: default_orted (MCA v2.0.0, API v3.0.0, Component
                          v1.10.2)
              MCA errmgr: default_app (MCA v2.0.0, API v3.0.0, Component
                          v1.10.2)
              MCA errmgr: default_hnp (MCA v2.0.0, API v3.0.0, Component
                          v1.10.2)
              MCA errmgr: default_tool (MCA v2.0.0, API v3.0.0, Component
                          v1.10.2)
                 MCA ess: env (MCA v2.0.0, API v3.0.0, Component v1.10.2)
                 MCA ess: singleton (MCA v2.0.0, API v3.0.0, Component
                          v1.10.2)
                 MCA ess: slurm (MCA v2.0.0, API v3.0.0, Component v1.10.2)
                 MCA ess: hnp (MCA v2.0.0, API v3.0.0, Component v1.10.2)
                 MCA ess: tool (MCA v2.0.0, API v3.0.0, Component v1.10.2)
               MCA filem: raw (MCA v2.0.0, API v2.0.0, Component v1.10.2)
             MCA grpcomm: bad (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA iof: hnp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA iof: tool (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA iof: mr_orted (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
                 MCA iof: mr_hnp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA iof: orted (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA odls: default (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA oob: tcp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA plm: isolated (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
                 MCA plm: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA plm: rsh (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA ras: loadleveler (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
                 MCA ras: simulator (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
                 MCA ras: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA rmaps: staged (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA rmaps: rank_file (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
               MCA rmaps: ppr (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA rmaps: mindist (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA rmaps: seq (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA rmaps: resilient (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
               MCA rmaps: round_robin (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
                 MCA rml: oob (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA routed: debruijn (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
              MCA routed: radix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA routed: binomial (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
              MCA routed: direct (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA state: novm (MCA v2.0.0, API v1.0.0, Component v1.10.2)
               MCA state: dvm (MCA v2.0.0, API v1.0.0, Component v1.10.2)
               MCA state: hnp (MCA v2.0.0, API v1.0.0, Component v1.10.2)
               MCA state: staged_hnp (MCA v2.0.0, API v1.0.0, Component
                          v1.10.2)
               MCA state: tool (MCA v2.0.0, API v1.0.0, Component v1.10.2)
               MCA state: orted (MCA v2.0.0, API v1.0.0, Component v1.10.2)
               MCA state: app (MCA v2.0.0, API v1.0.0, Component v1.10.2)
               MCA state: staged_orted (MCA v2.0.0, API v1.0.0, Component
                          v1.10.2)
           MCA allocator: basic (MCA v2.0.0, API v2.0.0, Component v1.10.2)
           MCA allocator: bucket (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA bcol: basesmuma (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
                MCA bcol: ptpcoll (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA bml: r2 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA btl: self (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA btl: smcuda (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA btl: tcp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA btl: openib (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA btl: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA btl: vader (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA coll: inter (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA coll: self (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA coll: ml (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA coll: basic (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA coll: tuned (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA coll: hierarch (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
                MCA coll: cuda (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA coll: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA coll: libnbc (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA dpm: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA fbtl: posix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA fcoll: static (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA fcoll: two_phase (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
               MCA fcoll: individual (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
               MCA fcoll: dynamic (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA fcoll: ylib (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                  MCA fs: ufs (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                  MCA io: romio (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                  MCA io: ompio (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA mpool: rgpusm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA mpool: gpusm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA mpool: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA mpool: grdma (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA mtl: psm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA osc: sm (MCA v2.0.0, API v3.0.0, Component v1.10.2)
                 MCA osc: pt2pt (MCA v2.0.0, API v3.0.0, Component v1.10.2)
                 MCA pml: v (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA pml: cm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA pml: ob1 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA pml: bfo (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA pubsub: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA rcache: vma (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA rte: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA sbgp: basesmuma (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
                MCA sbgp: basesmsocket (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
                MCA sbgp: p2p (MCA v2.0.0, API v2.0.0, Component v1.10.2)
            MCA sharedfp: individual (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
            MCA sharedfp: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
            MCA sharedfp: lockedfile (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)
                MCA topo: basic (MCA v2.0.0, API v2.1.0, Component v1.10.2)
           MCA vprotocol: pessimist (MCA v2.0.0, API v2.0.0, Component
                          v1.10.2)


Thanks,
Justin


