That is not guaranteed to work.  There is no streaming concept in the MPI 
standard.  The fundamental issue here is that MPI is only asynchronous on the 
completion, not the initiation, of the send/recv.

It would be nice if the next version of MPI added something like a 
triggered send or receive that is initiated only when it receives a signal saying 
the memory is ready.  This would be vendor neutral and would enable things like 
streaming.

For example, at the end of a kernel that creates data, the GPU could poke a 
memory location to signal that the send is ready.  The IB device could then initiate the transfer.
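
For illustration only, here is a rough CPU-polled approximation of that idea with 
today's APIs.  It only mimics a true triggered operation, because the host, not 
the NIC, initiates the send once it sees the GPU's flag.  All names (producer, 
sbuf, dest, ...) are hypothetical and error checks are omitted:

#include <mpi.h>
#include <cuda_runtime.h>

/* Tiny kernel that raises a flag in host-mapped memory once all prior
 * work on its stream (e.g. the producer kernel) has completed. */
__global__ void signal_ready(volatile unsigned int *flag)
{
    __threadfence_system();   /* make earlier GPU writes visible to the host */
    *flag = 1u;
}

/* Assumes a producer kernel filling sbuf was already enqueued on `stream`;
 * posts the Isend only after the GPU raises the flag. */
static void isend_when_ready(float *sbuf, int count, int dest,
                             cudaStream_t stream, MPI_Request *req)
{
    volatile unsigned int *ready_h;   /* host view of the flag   */
    unsigned int *ready_d;            /* device view of the flag */
    cudaHostAlloc((void **)&ready_h, sizeof(*ready_h), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&ready_d, (void *)ready_h, 0);
    *ready_h = 0u;

    signal_ready<<<1, 1, 0, stream>>>(ready_d);   /* runs after the producer */

    while (*ready_h == 0u)    /* CPU spin-waits on the GPU's signal ...   */
        ;                     /* ... instead of a full stream synchronize */
    MPI_Isend(sbuf, count, MPI_FLOAT, dest, 0, MPI_COMM_WORLD, req);

    cudaFreeHost((void *)ready_h);
}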


On Nov 28, 2019, at 8:21 AM, George Bosilca via users 
<users@lists.open-mpi.org> wrote:


Wonderful maybe but extremely unportable. Thanks but no thanks!

  George.

On Wed, Nov 27, 2019 at 11:07 PM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
Interesting idea. But using MPI_THREAD_MULTIPLE has other side effects. If MPI 
nonblocking calls could take an extra stream argument and work like a kernel 
launch, it would be wonderful.
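
For illustration, the kind of call being wished for here might look like the 
following purely hypothetical prototype (it exists in no MPI standard or 
implementation):

/* Purely hypothetical -- no such call exists in the MPI standard or in
 * Open MPI.  It only illustrates the wish above: a nonblocking send
 * enqueued on a CUDA stream, ordered like a kernel launch.
 *
 * Intended usage (no cudaStreamSynchronize needed):
 *   Kernel1<<<grid, block, 0, stream>>>(sbuf);
 *   MPIX_Isend_on_stream(sbuf, count, MPI_FLOAT, dest, tag,
 *                        MPI_COMM_WORLD, &req, stream);
 *   Kernel2<<<grid, block, 0, stream>>>(rbuf);
 */
int MPIX_Isend_on_stream(const void *buf, int count, MPI_Datatype datatype,
                         int dest, int tag, MPI_Comm comm,
                         MPI_Request *request, cudaStream_t stream);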
--Junchao Zhang


On Wed, Nov 27, 2019 at 6:12 PM Joshua Ladd <josh...@mellanox.com> wrote:
Why not spawn num_threads threads, where num_threads is the number of kernels to 
launch, and compile with the “--default-stream per-thread” option?

Then you could use MPI in thread multiple mode to achieve your objective.

Something like:



#include <stdio.h>
#include <pthread.h>
#include <mpi.h>
#include <cuda_runtime.h>

#define N 1024   /* illustrative buffer size */

/* Illustrative kernel; any data-producing kernel works here. */
__global__ void kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = (float)i;
}

void *launch_kernel(void *dummy)
{
    (void)dummy;
    float *data;
    MPI_Request req;
    cudaMalloc(&data, N * sizeof(float));

    /* With --default-stream per-thread, this launch goes to this
     * thread's own default stream. */
    kernel<<<(N + 255) / 256, 256>>>(data, N);

    /* Synchronizes only this thread's default stream. */
    cudaStreamSynchronize(0);

    /* Illustrative arguments; the matching receive and MPI_Wait are
     * omitted from this sketch. */
    MPI_Isend(data, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &req);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    const int num_threads = 8;

    pthread_t threads[num_threads];

    for (int i = 0; i < num_threads; i++) {
        if (pthread_create(&threads[i], NULL, launch_kernel, NULL)) {
            fprintf(stderr, "Error creating thread\n");
            return 1;
        }
    }

    for (int i = 0; i < num_threads; i++) {
        if (pthread_join(threads[i], NULL)) {
            fprintf(stderr, "Error joining thread\n");
            return 2;
        }
    }
    cudaDeviceReset();

    MPI_Finalize();
    return 0;
}
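
One way the example above might be compiled (the file name and the use of mpicxx 
as host compiler are illustrative; the essential piece is the per-thread default 
stream flag):

nvcc -ccbin mpicxx --default-stream per-thread -o launch launch.cu -lpthread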




From: users <users-boun...@lists.open-mpi.org> On Behalf Of Zhang, Junchao via users
Sent: Wednesday, November 27, 2019 5:43 PM
To: George Bosilca <bosi...@icl.utk.edu>
Cc: Zhang, Junchao <jczh...@mcs.anl.gov>; Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] CUDA mpi question

I was pointed to "2.7. Synchronization and Memory Ordering" of 
https://docs.nvidia.com/pdf/GPUDirect_RDMA.pdf. It is on topic, but unfortunately 
it is too short and I could not understand it.
I also checked cudaStreamAddCallback/cudaLaunchHostFunc, whose documentation says 
the host function "must not make any CUDA API calls". I am not sure whether 
MPI_Isend qualifies as such a function.
--Junchao Zhang


On Wed, Nov 27, 2019 at 4:18 PM George Bosilca <bosi...@icl.utk.edu> wrote:
On Wed, Nov 27, 2019 at 5:02 PM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
On Wed, Nov 27, 2019 at 3:16 PM George Bosilca <bosi...@icl.utk.edu> wrote:
Short and portable answer: you need to sync before the Isend or you will send 
garbage data.
Ideally, I want to formulate my code as a series of asynchronous "kernel 
launch, kernel launch, ..." steps without synchronization, so that I can hide the 
kernel launch overhead. It now seems I have to sync before MPI calls (even 
nonblocking ones).

Then you need a means to ensure sequential execution, and this is what streams 
provide. Unfortunately, I looked into the code and I'm afraid there is 
currently no realistic way to do what you need. My previous comment was based 
on older code that seems to be 1) currently unmaintained and 2) only 
applicable to the OB1 PML + OpenIB BTL combo. As recent versions of OMPI have 
moved away from the OpenIB BTL, relying more heavily on UCX for InfiniBand 
support, the old code is now deprecated. Sorry for giving you hope on this.

Maybe you can delegate the MPI call to a CUDA event callback?
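
For concreteness, here is a minimal sketch of that callback idea via the 
cudaLaunchHostFunc route mentioned earlier in the thread, with the caveat already 
raised: the CUDA documentation forbids the host function from making CUDA API 
calls, and a CUDA-aware MPI_Isend may do exactly that internally, so this 
illustrates the ordering only and is not guaranteed to be legal (all names are 
illustrative):

#include <mpi.h>
#include <cuda_runtime.h>

/* Payload handed to the stream callback; all names are illustrative. */
typedef struct { float *buf; int count; int dest; MPI_Request req; } isend_args_t;

/* Host function run by the CUDA runtime once all prior work in the
 * stream has completed.  See the caveat above about CUDA API calls. */
static void CUDART_CB isend_cb(void *userData)
{
    isend_args_t *a = (isend_args_t *)userData;
    MPI_Isend(a->buf, a->count, MPI_FLOAT, a->dest, 0, MPI_COMM_WORLD, &a->req);
}

/* Enqueue the callback right after the producer kernel on the same
 * stream, so the send is initiated only once a->buf is complete. */
void post_isend_after_kernel(cudaStream_t stream, isend_args_t *a)
{
    /* producer<<<grid, block, 0, stream>>>(a->buf, a->count);  (assumed) */
    cudaLaunchHostFunc(stream, isend_cb, a);
}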

  George.




Assuming you are willing to go for a less portable solution, you can get the 
OMPI streams and add your kernels to them, so that the sequential order will 
guarantee the correctness of your Isend. We have two hidden CUDA streams in OMPI, 
one for device-to-host and one for host-to-device, that can be queried with the 
non-MPI-standard functions mca_common_cuda_get_dtoh_stream and 
mca_common_cuda_get_htod_stream.

Which stream (dtoh or htod) should I use to insert kernels that produce send data, 
and which for kernels that use received data? I imagine MPI uses GPUDirect RDMA to 
move data directly from the GPU to the NIC. Why do we need to bother with the dtoh 
or htod streams?

George.


On Wed, Nov 27, 2019 at 4:02 PM Zhang, Junchao via users <users@lists.open-mpi.org> wrote:
Hi,
  Suppose I have this piece of code and I use CUDA-aware MPI:
   cudaMalloc(&sbuf, sz);
   Kernel1<<<..., stream>>>(..., sbuf);
   MPI_Isend(sbuf, ...);
   Kernel2<<<..., stream>>>();

  Do I need to call cudaStreamSynchronize(stream) before MPI_Isend() to make 
sure data in sbuf is ready to send?  If not, why?
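
  (For reference, the portable pattern discussed above is to synchronize the 
stream before the send; applied to the snippet, and keeping the elided arguments 
as placeholders:)
   cudaMalloc(&sbuf, sz);
   Kernel1<<<..., stream>>>(..., sbuf);
   cudaStreamSynchronize(stream);   /* ensure Kernel1 has finished writing sbuf */
   MPI_Isend(sbuf, ...);
   Kernel2<<<..., stream>>>();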

  Thank you.

--Junchao Zhang

