Hi Justin,

A quick grep reveals several cuMemcpy calls in OpenMPI. Some of them are even synchronous, which means they are issued on stream 0 (the default stream).
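For illustration, here is a minimal sketch of why that matters (the function and buffer names are made up, not from your app): a plain cudaMemcpy goes to the legacy default stream and serializes with work in other streams, while cudaMemcpyAsync can be issued on a stream of your choosing.

#include <cuda_runtime.h>

// Sketch only: cudaMemcpy is synchronous and goes to the legacy default
// stream (stream 0), so it serializes against other streams; the
// cudaMemcpyAsync variant is issued on the given non-default stream.
// For HtoD/DtoH async copies to truly overlap, h_pinned must be
// page-locked (allocated with cudaMallocHost).
void copy_examples(float *d_buf, float *h_pinned, size_t bytes,
                   cudaStream_t stream)
{
  cudaMemcpy(h_pinned, d_buf, bytes, cudaMemcpyDeviceToHost);    // stream 0
  cudaMemcpyAsync(h_pinned, d_buf, bytes,
                  cudaMemcpyDeviceToHost, stream);               // user stream
}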
I think the best way to explore this sort of behavior is to run the OpenMPI runtime under a debugger (thanks to its open-source nature!). Rebuild OpenMPI with -g -O0, and add an initial sleep() to your app, long enough for you to gdb-attach to one of the MPI processes. Once attached, first set a breakpoint at the beginning of your region of interest, and then break on cuMemcpy and cuMemcpyAsync.

Best,
- D.

2012/12/13 Justin Luitjens <jluitj...@nvidia.com>

> Hello,
>
> I'm working on an application using OpenMPI with CUDA and GPUDirect. I
> would like to get the MPI transfers to overlap with computation on the
> CUDA device. To do this I need to ensure that no memory transfers go to
> stream 0. In this application I have one step that performs an
> MPI_Alltoall operation. Ideally I would like this Alltoall operation to
> be asynchronous, so I have implemented my own Alltoall using Isend and
> Irecv, which can be found at the bottom of this email.
>
> The profiler shows that this operation has some very odd PCI-E traffic
> that I was hoping someone could explain and help me eliminate. In this
> example NPES=2 and each process has its own M2090 GPU. I am using CUDA
> 5.0 and OpenMPI-1.7rc5. The behavior I am seeing is the following: once
> the Isend loop occurs, there is a sequence of DtoH followed by HtoD
> transfers. These transfers are 256K in size, and 28 of them occur. Each
> of these transfers is placed in stream 0. After this there are a few
> more small transfers, also placed in stream 0. Finally, when the 3rd
> loop occurs, there are 2 DtoD transfers (this is the actual data being
> exchanged).
>
> Can anyone explain what all of the traffic ping-ponging back and forth
> between the host and device is? Is this traffic necessary?
>
> Thanks,
> Justin
>
> uint64_t scatter_gather(uint128 *input_buffer, uint128 *output_buffer,
>                         uint128 *recv_buckets, int *send_sizes,
>                         int MAX_RECV_SIZE_PER_PE) {
>   std::vector<MPI_Request> srequest(NPES), rrequest(NPES);
>
>   // Start receives
>   for (int p = 0; p < NPES; p++) {
>     MPI_Irecv(recv_buckets + MAX_RECV_SIZE_PER_PE * p,
>               MAX_RECV_SIZE_PER_PE, MPI_INT_128, p, 0, MPI_COMM_WORLD,
>               &rrequest[p]);
>   }
>
>   // Start sends
>   int send_count = 0;
>   for (int p = 0; p < NPES; p++) {
>     MPI_Isend(input_buffer + send_count, send_sizes[p], MPI_INT_128, p,
>               0, MPI_COMM_WORLD, &srequest[p]);
>     send_count += send_sizes[p];
>   }
>
>   // Process outstanding receives
>   int recv_count = 0;
>   for (int p = 0; p < NPES; p++) {
>     MPI_Status status;
>     MPI_Wait(&rrequest[p], &status);
>     int count;
>     MPI_Get_count(&status, MPI_INT_128, &count);
>     assert(count < MAX_RECV_SIZE_PER_PE);
>     cudaMemcpy(output_buffer + recv_count,
>                recv_buckets + MAX_RECV_SIZE_PER_PE * p,
>                count * sizeof(uint128), cudaMemcpyDeviceToDevice);
>     recv_count += count;
>   }
>
>   // Wait for outstanding sends
>   for (int p = 0; p < NPES; p++) {
>     MPI_Status status;
>     MPI_Wait(&srequest[p], &status);
>   }
>   return recv_count;
> }
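P.S. A minimal sketch of the attach hook described above (untested; the MPI_GDB_WAIT variable name is just an example). Call it right after MPI_Init():

#include <cstdio>
#include <cstdlib>
#include <unistd.h>

// When MPI_GDB_WAIT is set in the environment, every rank prints its PID
// and hostname and then sleeps, leaving time to run `gdb -p <pid>` and
// set breakpoints on cuMemcpy / cuMemcpyAsync.
static void wait_for_debugger(void)
{
  if (getenv("MPI_GDB_WAIT") == NULL)
    return;
  char host[256];
  gethostname(host, sizeof(host));
  fprintf(stderr, "pid %d on %s waiting for gdb attach...\n",
          (int)getpid(), host);
  sleep(30);  // long enough to attach; increase as needed
}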
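P.P.S. One more observation on the quoted code: the cudaMemcpy in the receive loop is itself synchronous and is issued on stream 0. If the goal is to keep everything off the default stream, one option is to issue those device-to-device copies asynchronously on a dedicated stream and synchronize once at the end. A sketch (untested, reusing the names from the quoted code):

cudaStream_t copy_stream;
cudaStreamCreate(&copy_stream);

int recv_count = 0;
for (int p = 0; p < NPES; p++) {
  MPI_Status status;
  MPI_Wait(&rrequest[p], &status);
  int count;
  MPI_Get_count(&status, MPI_INT_128, &count);
  assert(count < MAX_RECV_SIZE_PER_PE);
  // DtoD copies need no pinned host memory to run asynchronously
  cudaMemcpyAsync(output_buffer + recv_count,
                  recv_buckets + MAX_RECV_SIZE_PER_PE * p,
                  count * sizeof(uint128), cudaMemcpyDeviceToDevice,
                  copy_stream);
  recv_count += count;
}
cudaStreamSynchronize(copy_stream);  // all copies done before the data is used
cudaStreamDestroy(copy_stream);

Note that this only moves your own copies off stream 0; the 256K DtoH/HtoD staging transfers come from inside the OpenMPI library itself, which is exactly what the debugger session above should help you pinpoint.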