Hi Justin

From looking at your code, it seems you are posting receives for more elements 
than you send (I assume MAX_RECV_SIZE_PER_PE > send_sizes[p]). I don't think 
this is a good idea: your transfers should have matched sizes on the sending 
and receiving side. To achieve this, either communicate the message size before 
exchanging the actual data (a simple MPI_Isend/MPI_Irecv pair with one MPI_INT 
will do), or use a mechanism the MPI library provides for this. I believe 
MPI_Probe is made for exactly this purpose.
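For the MPI_Probe route, a minimal sketch (the rank p and the buffer name 
recv_buf are illustrative placeholders; MPI_INT_128 is your derived type):

```c
/* Sketch: block until a message from rank p is available, query its
 * actual element count, then post a receive of exactly that size.
 * recv_buf stands in for your receive buffer. */
MPI_Status status;
int count;
MPI_Probe(p, 0, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_INT_128, &count);
MPI_Recv(recv_buf, count, MPI_INT_128, p, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
```

Note that MPI_Probe blocks, so it serializes the receive loop somewhat; the 
explicit size exchange keeps everything nonblocking.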

As to why the transfers occur, my wild guess would be: you have set 
MAX_RECV_SIZE_PER_PE to something large, which would explain the size and 
number of the H2D transfers. 
I am just guessing, but maybe OMPI divides the data into chunks. Unless you are 
using intra-node peer-to-peer (smcuda), all MPI traffic has to go through the 
host, hence the copies.
I don't know what causes the D2H transfers to be of the same size; the library 
might be doing something strange here, given that you have potentially asked it 
to receive more data than you send - don't do that. Your third loop does not 
actually exchange the data, as you wrote; it just does an extra copy of the 
data, which in principle you could avoid by sending the message sizes first.
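To sketch what I mean (following the names in your snippet; the second tag 
value is arbitrary): exchange the counts first, then post exact-size receives 
straight into output_buffer, which makes the cudaMemcpy repack loop unnecessary:

```c
/* Sketch: exchange element counts first (tag 1), then receive each
 * message at its final offset in output_buffer (tag 0). */
std::vector<int> recv_sizes(NPES);
std::vector<MPI_Request> sreq(NPES), rreq(NPES);
for (int p = 0; p < NPES; p++)
  MPI_Irecv(&recv_sizes[p], 1, MPI_INT, p, 1, MPI_COMM_WORLD, &rreq[p]);
for (int p = 0; p < NPES; p++)
  MPI_Isend(&send_sizes[p], 1, MPI_INT, p, 1, MPI_COMM_WORLD, &sreq[p]);
MPI_Waitall(NPES, &rreq[0], MPI_STATUSES_IGNORE);
MPI_Waitall(NPES, &sreq[0], MPI_STATUSES_IGNORE);

/* Exact-size receives, landing directly where the data belongs. */
int recv_count = 0;
for (int p = 0; p < NPES; p++) {
  MPI_Irecv(output_buffer + recv_count, recv_sizes[p], MPI_INT_128,
            p, 0, MPI_COMM_WORLD, &rreq[p]);
  recv_count += recv_sizes[p];
}
```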

Concerning your question about asynchronous copying: if you are using device 
buffers for MPI (and it seems you do), then you have to rely on the library to 
do the asynchronous copying of the buffers (cudaMemcpyAsync) for you. I don't 
know if OpenMPI does this; you could check the source. I think MVAPICH2 does. 
If you really want control over the streams, you have to do the D2H/H2D 
copying yourself, which is fine unless you are relying on peer-to-peer 
capability - but it seems you don't. If you are copying the data manually, you 
can pass whatever stream you prefer to the cudaMemcpyAsync calls.
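If you go the manual route, the staging could look like this (the h_send/d_send 
pointers, n_elems, send_bytes, and the stream are assumptions for the sketch, 
not part of your code):

```c
/* Sketch: stage the device data through a pinned host buffer on a
 * user-chosen stream, then hand the host pointer to MPI. */
cudaStream_t stream;
cudaStreamCreate(&stream);

uint128 *h_send;                             /* pinned host staging buffer */
cudaMallocHost((void**)&h_send, send_bytes); /* pinned memory is required for truly async copies */

cudaMemcpyAsync(h_send, d_send, send_bytes, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);               /* copy must finish before MPI reads h_send */
MPI_Isend(h_send, n_elems, MPI_INT_128, p, 0, MPI_COMM_WORLD, &request);
```

The synchronization can be per-stream rather than device-wide, so other streams 
keep running while this transfer is prepared.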

My general experience can be summarized as: achieving truly asynchronous MPI 
communication is hard if you use the CUDA support of the library, but very 
easy if you use only the host routines of MPI. Since your kernel calls are 
already asynchronous with respect to the host, all you have to do is copy the 
data asynchronously between host and device.

Jens

On Dec 12, 2012, at 6:30 PM, Justin Luitjens wrote:

> Hello,
> 
> I'm working on an application using OpenMPI with CUDA and GPUDirect.  I would 
> like to get the MPI transfers to overlap with computation on the CUDA device. 
>  To do this I need to ensure that all memory transfers do not go to stream 0. 
>  In this application I have one step that performs an MPI_Alltoall operation. 
>  Ideally I would like this Alltoall operation to be asynchronous.  Thus I 
> have implemented my own Alltoall using Isend and Irecv.  Which can be found 
> at the bottom of this email.
> 
> The profiler shows that this operation has some very odd PCI-E traffic that I 
> was hoping someone could explain and help me eliminate.  In this example 
> NPES=2 and each process has its own M2090 GPU.  I am using cuda 5.0 and 
> OpenMPI-1.7rc5.  The behavior I am seeing is the following.  Once the Isend 
> loop occurs there is a sequence of DtoH followed by HtoD transfers.  These 
> transfers are 256K in size and there are 28 of them that occur.  Each of 
> these transfers are placed in stream0.  After this there are a few more small 
> transfers also placed in stream0.  Finally when the 3rd loop occurs there are 
> 2 DtoD transfers (this is the actual data being exchanged).  
> 
> Can anyone explain what all of the traffic ping-ponging back and forth 
> between the host and device is?  Is this traffic necessary? 
> 
> Thanks,
> Justin
> 
> 
> uint64_t scatter_gather( uint128 * input_buffer, uint128 *output_buffer, 
> uint128 *recv_buckets, int* send_sizes, int MAX_RECV_SIZE_PER_PE) {
> 
>  std::vector<MPI_Request> srequest(NPES), rrequest(NPES);
> 
>  //Start receives
>  for(int p=0;p<NPES;p++) {
>    MPI_Irecv(recv_buckets + MAX_RECV_SIZE_PER_PE*p, MAX_RECV_SIZE_PER_PE, MPI_INT_128, p, 0, MPI_COMM_WORLD, &rrequest[p]);
>  }
> 
>  //Start sends
>  int send_count=0;
>  for(int p=0;p<NPES;p++) {
>    MPI_Isend(input_buffer + send_count, send_sizes[p], MPI_INT_128, p, 0, MPI_COMM_WORLD, &srequest[p]);
>    send_count+=send_sizes[p];
>  }
> 
>  //Process outstanding receives
>  int recv_count=0;
>  for(int p=0;p<NPES;p++) {
>    MPI_Status status;
>    MPI_Wait(&rrequest[p],&status);
>    int count;
>    MPI_Get_count(&status,MPI_INT_128,&count);
>    assert(count<MAX_RECV_SIZE_PER_PE);
>    cudaMemcpy(output_buffer + recv_count, recv_buckets + MAX_RECV_SIZE_PER_PE*p, count*sizeof(uint128), cudaMemcpyDeviceToDevice);
>    recv_count+=count;
>  }
> 
>  //Wait for outstanding sends
>  for(int p=0;p<NPES;p++) {
>    MPI_Status status;
>    MPI_Wait(&srequest[p],&status);
>  }
>  return recv_count;
> }
> 
> -----------------------------------------------------------------------------------
> This email message is for the sole use of the intended recipient(s) and may 
> contain
> confidential information.  Any unauthorized review, use, disclosure or 
> distribution
> is prohibited.  If you are not the intended recipient, please contact the 
> sender by
> reply email and destroy all copies of the original message.
> -----------------------------------------------------------------------------------
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

