Hi Justin,

A quick grep reveals several cuMemcpy calls inside OpenMPI. Some of them are
even synchronous, which means they go to stream 0.

I think the best way to explore this sort of behavior is to run the OpenMPI
runtime (thanks to its open-source nature!) under a debugger. Rebuild OpenMPI
with -g -O0 and add an initial sleep() to your app, long enough for you to
gdb-attach to one of the MPI processes. Once attached, first set a breakpoint
at the beginning of your region of interest, and then break on cuMemcpy and
cuMemcpyAsync.
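
For illustration, a minimal sketch of that startup sleep (the helper name and
the 60-second window are just my own placeholders) might look like:

  #include <mpi.h>
  #include <unistd.h>
  #include <cstdio>

  // Sketch only: pause each rank at startup so gdb can attach.
  // Call this right after MPI_Init(); the 60-second window is arbitrary.
  static void wait_for_gdb_attach()
  {
      int rank = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      // Print the pid so you know which process to attach to with gdb -p <pid>.
      std::fprintf(stderr, "rank %d: pid %d, sleeping 60s for gdb attach\n",
                   rank, (int)getpid());
      sleep(60);
  }

Once attached with gdb -p <pid>, "break cuMemcpy" and "break cuMemcpyAsync"
together with a breakpoint at the start of your region of interest should
show (via bt) which part of the OpenMPI stack issues the stream-0 copies.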

Best,
- D.

2012/12/13 Justin Luitjens <jluitj...@nvidia.com>

> Hello,
>
> I'm working on an application using OpenMPI with CUDA and GPUDirect. I
> would like the MPI transfers to overlap with computation on the CUDA
> device. To do this I need to ensure that no memory transfers go to
> stream 0. In this application I have one step that performs an
> MPI_Alltoall operation. Ideally I would like this Alltoall to be
> asynchronous, so I have implemented my own Alltoall using Isend and
> Irecv, which can be found at the bottom of this email.
>
> The profiler shows that this operation produces some very odd PCI-E traffic
> that I was hoping someone could explain and help me eliminate. In this
> example NPES=2 and each process has its own M2090 GPU. I am using CUDA 5.0
> and OpenMPI-1.7rc5. The behavior I am seeing is the following: once the
> Isend loop runs, there is a sequence of DtoH followed by HtoD transfers.
> These transfers are 256K in size and 28 of them occur, each placed in
> stream 0. After this there are a few more small transfers, also placed in
> stream 0. Finally, when the third loop runs, there are 2 DtoD transfers
> (this is the actual data being exchanged).
>
> Can anyone explain what all of the traffic ping-ponging back and forth
> between the host and device is?  Is this traffic necessary?
>
> Thanks,
> Justin
>
>
> uint64_t scatter_gather(uint128 *input_buffer, uint128 *output_buffer,
>                         uint128 *recv_buckets, int *send_sizes,
>                         int MAX_RECV_SIZE_PER_PE) {
>
>   std::vector<MPI_Request> srequest(NPES), rrequest(NPES);
>
>   //Start receives
>   for(int p=0;p<NPES;p++) {
>     MPI_Irecv(recv_buckets+MAX_RECV_SIZE_PER_PE*p, MAX_RECV_SIZE_PER_PE,
>               MPI_INT_128, p, 0, MPI_COMM_WORLD, &rrequest[p]);
>   }
>
>   //Start sends
>   int send_count=0;
>   for(int p=0;p<NPES;p++) {
>     MPI_Isend(input_buffer+send_count, send_sizes[p], MPI_INT_128, p, 0,
>               MPI_COMM_WORLD, &srequest[p]);
>     send_count+=send_sizes[p];
>   }
>
>   //Process outstanding receives
>   int recv_count=0;
>   for(int p=0;p<NPES;p++) {
>     MPI_Status status;
>     MPI_Wait(&rrequest[p], &status);
>     int count;
>     MPI_Get_count(&status, MPI_INT_128, &count);
>     assert(count<MAX_RECV_SIZE_PER_PE);
>     cudaMemcpy(output_buffer+recv_count, recv_buckets+MAX_RECV_SIZE_PER_PE*p,
>                count*sizeof(uint128), cudaMemcpyDeviceToDevice);
>     recv_count+=count;
>   }
>
>   //Wait for outstanding sends
>   for(int p=0;p<NPES;p++) {
>     MPI_Status status;
>     MPI_Wait(&srequest[p], &status);
>   }
>   return recv_count;
> }
>
>