Thanks, George, for the explanation.

With the default eager size, the first message is received *after* the
last message has been sent, regardless of whether the progress thread
is used or not. Another way to put it is that MPI_Isend() (and
probably MPI_Irecv() too) does not involve any progression, so I
naively thought the progress thread would have helped here.
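
To make sure we are on the same page, here is a minimal sketch of the
pattern I have in mind (buf, count and peer are illustrative names):

    MPI_Request req;
    /* returns almost immediately: the send is only posted here */
    MPI_Isend(buf, count, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req);
    /* with the default eager size, nothing seems to move on the wire
       until we re-enter the MPI library ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* ... progression happens here */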

Just to be 100% sure, could you please confirm this is the intended
behavior and not a bug?

Cheers,

Gilles

On Sat, Jul 22, 2017 at 5:00 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
>
> On Thu, Jul 20, 2017 at 8:57 PM, Gilles Gouaillardet <gil...@rist.or.jp>
> wrote:
>>
>> Sam,
>>
>>
>> This example uses 8 MB messages.
>>
>> If you are fine with using more memory, and your application does not
>> generate too many unexpected messages, then you can bump the eager
>> limit. For example,
>>
>> mpirun --mca btl_tcp_eager_limit $((8*1024*1024+128)) ...
>>
>> worked for me.
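>>
>> As a side note, you can check the current value with ompi_info,
>> something like
>>
>> ompi_info --param btl tcp --level 9 | grep btl_tcp_eager_limit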
>
>
> Ah, interesting. If forcing a very large eager limit helps, then the
> problem might be coming from the pipelining algorithm. Not a good
> solution in general, but handy for seeing what is going on. With many
> sends outstanding, the pipelining might be overwhelmed and interleave
> fragments from different requests. Let me dig a little bit here; I
> think I know exactly what is going on.
>
>>
>> George,
>>
>> In master, I thought
>>
>> mpirun --mca btl_tcp_progress_thread 1 ...
>>
>> would help, but it did not.
>> Did I misunderstand the purpose of the TCP progress thread?
>
>
> Gilles,
>
> In this example most of the time is spent inside an MPI_* function
> (mainly MPI_Wait), so the progress thread has little opportunity to
> help. The role of the progress thread is to make sure communications
> progress while the application is not inside an MPI call.
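>
> To illustrate (a sketch, with a hypothetical long_computation() and
> illustrative arguments), the progress thread pays off in a pattern like
>
>     MPI_Isend(buf, count, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req);
>     long_computation();                /* no MPI calls in here */
>     MPI_Wait(&req, MPI_STATUS_IGNORE);
>
> where the transfer can complete in the background while the
> application computes.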
>
>   George.
>
>
>
>>
>>
>> Cheers,
>>
>> Gilles
>>
>> On 7/21/2017 9:05 AM, George Bosilca wrote:
>>>
>>> Sam,
>>>
>>> Open MPI aggregates messages only when network constraints prevent the
>>> messages from being delivered in a timely manner. In this particular
>>> case, I think our delayed business card exchange and connection setup
>>> are delaying the delivery of the first batch of messages (and the BTL
>>> will aggregate them while waiting for the connection to be correctly
>>> set up).
>>>
>>> Can you reproduce the same behavior after the first batch of messages?
>>>
>>> Assuming the times shown on the left of your messages are correct,
>>> the first MPI implementation seems to deliver the entire set of
>>> messages significantly faster than the second.
>>>
>>>   George.
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jul 20, 2017 at 5:42 PM, Samuel Thibault
>>> <samuel.thiba...@labri.fr> wrote:
>>>
>>>     Hello,
>>>
>>>     We are hitting a severe performance issue, which is due to missing
>>>     pipelining behavior in Open MPI when running over TCP. I have
>>>     attached a test case. Basically, what it does is:
>>>
>>>     if (myrank == 0) {
>>>             for (i = 0; i < N; i++)
>>>                     MPI_Isend(...);
>>>     } else {
>>>             for (i = 0; i < N; i++)
>>>                     MPI_Irecv(...);
>>>     }
>>>     for (i = 0; i < N; i++)
>>>             MPI_Wait(...);
>>>
>>>     with corresponding printfs. And the result is:
>>>
>>>     0.182620: Isend 0 begin
>>>     0.182761: Isend 0 end
>>>     0.182766: Isend 1 begin
>>>     0.182782: Isend 1 end
>>>     ...
>>>     0.183911: Isend 49 begin
>>>     0.183915: Isend 49 end
>>>     0.199028: Irecv 0 begin
>>>     0.199068: Irecv 0 end
>>>     0.199070: Irecv 1 begin
>>>     0.199072: Irecv 1 end
>>>     ...
>>>     0.199187: Irecv 49 begin
>>>     0.199188: Irecv 49 end
>>>     0.233948: Isend 0 done!
>>>     0.269895: Isend 1 done!
>>>     ...
>>>     1.982475: Isend 49 done!
>>>     1.984065: Irecv 0 done!
>>>     1.984078: Irecv 1 done!
>>>     ...
>>>     1.984131: Irecv 49 done!
>>>
>>>     i.e. almost two seconds elapse between the start of the application
>>>     and the completion of the first Irecv, and then all the other
>>>     Irecvs complete immediately too; it seems the communications were
>>>     all grouped together.
>>>
>>>     This is really bad, because in our real use case we trigger
>>>     computations after each MPI_Wait call, and we use several messages
>>>     so as to pipeline things: the first computation can start as soon
>>>     as one message has been received, overlapping with further
>>>     receptions.
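>>>
>>>     For instance (a sketch; compute(), req[] and buf[] are illustrative
>>>     names, not the ones from the attached test case), the receive side
>>>     is roughly:
>>>
>>>     for (i = 0; i < N; i++) {
>>>             MPI_Wait(&req[i], MPI_STATUS_IGNORE);
>>>             compute(buf[i]); /* overlaps receptions of i+1 .. N-1 */
>>>     }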
>>>
>>>     This problem occurs only with Open MPI on TCP; I am not getting
>>>     this behavior with Open MPI on IB, and I am not getting it with
>>>     MPICH or MadMPI either:
>>>
>>>     0.182168: Isend 0 begin
>>>     0.182235: Isend 0 end
>>>     0.182237: Isend 1 begin
>>>     0.182242: Isend 1 end
>>>     ...
>>>     0.182842: Isend 49 begin
>>>     0.182844: Isend 49 end
>>>     0.200505: Irecv 0 begin
>>>     0.200564: Irecv 0 end
>>>     0.200567: Irecv 1 begin
>>>     0.200569: Irecv 1 end
>>>     ...
>>>     0.201233: Irecv 49 begin
>>>     0.201234: Irecv 49 end
>>>     0.269511: Isend 0 done!
>>>     0.273154: Irecv 0 done!
>>>     0.341054: Isend 1 done!
>>>     0.344507: Irecv 1 done!
>>>     ...
>>>     3.767726: Isend 49 done!
>>>     3.770637: Irecv 49 done!
>>>
>>>     There we do have pipelined reception.
>>>
>>>     Is there a way to get the second, pipelined behavior with Open MPI
>>>     on TCP?
>>>
>>>     Samuel
>>>