On Thu, Jul 20, 2017 at 8:57 PM, Gilles Gouaillardet <gil...@rist.or.jp>
wrote:

> Sam,
>
>
> this example uses 8 MB messages
>
> if you are fine with using more memory, and your application does not
> generate too many unexpected messages, then you can bump the eager_limit,
> for example
>
> mpirun --mca btl_tcp_eager_limit $((8*1024*1024+128)) ...
>
> worked for me
>

Ah, interesting. If forcing a very large eager limit helps, then the problem
might be coming from the pipelining algorithm. Not a good solution in
general, but handy for seeing what's going on. With so many sends posted at
once, the pipelining might be overwhelmed and interleave fragments from
different requests. Let me dig a little bit here; I think I know exactly
what is going on.


> George,
>
> in master, I thought
>
> mpirun --mca btl_tcp_progress_thread 1 ...
>
> would help, but it did not.
> Did I misunderstand the purpose of the TCP progress thread?
>

Gilles,

In this example most of the time is spent inside an MPI_* function (mainly
MPI_Wait), so the progress thread has little opportunity to help. The role
of the progress thread is to make sure communications are progressed when
the application is not inside an MPI call.
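
As an aside, the case where asynchronous progress (or manual polling) does
make a difference is when the application computes between posting a request
and completing it. A minimal sketch of manual polling with MPI_Test, with
made-up message size, tag, and work loop (this is only an illustration, not
the test case discussed here):

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    /* Sketch: rank 1 posts an Irecv and keeps computing, calling MPI_Test
     * periodically so pending communications are progressed even though
     * the application never blocks inside an MPI call. Run with 2 ranks. */
    int main(int argc, char *argv[])
    {
        int rank, flag = 0;
        static char buf[1 << 20];            /* 1 MB payload, arbitrary */
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            memset(buf, 1, sizeof(buf));
            MPI_Send(buf, sizeof(buf), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Irecv(buf, sizeof(buf), MPI_BYTE, 0, 0, MPI_COMM_WORLD, &req);
            while (!flag) {
                /* ... do a slice of application work here ... */
                MPI_Test(&req, &flag, MPI_STATUS_IGNORE);  /* drives progress */
            }
            printf("message received while computing\n");
        }

        MPI_Finalize();
        return 0;
    }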

  George.




>
> Cheers,
>
> Gilles
>
> On 7/21/2017 9:05 AM, George Bosilca wrote:
>
>> Sam,
>>
>> Open MPI aggregates messages only when network constraints prevent the
>> messages from being delivered in a timely manner. In this particular case
>> I think that our delayed business card exchange and connection setup is
>> delaying the delivery of the first batch of messages (and the BTL will
>> aggregate them while waiting for the connection to be correctly set up).
>>
>> Can you reproduce the same behavior after the first batch of messages?
>>
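
One way to check that hypothesis would be to establish the connections
before the measured loop, for instance with a small warm-up exchange added
at the top of the test case (this reuses myrank from the test quoted below;
the tag value is an arbitrary choice):

    /* Warm-up sketch: exchange one small message in each direction so the
     * business card exchange and TCP connection setup happen before the
     * timed Isend/Irecv batch. */
    char dummy = 0;
    if (myrank == 0) {
        MPI_Send(&dummy, 1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
        MPI_Recv(&dummy, 1, MPI_CHAR, 1, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(&dummy, 1, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&dummy, 1, MPI_CHAR, 0, 99, MPI_COMM_WORLD);
    }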
>> Assuming the times shown on the left of your messages are correct, the
>> first MPI implementation seems to deliver the entire set of messages
>> significantly faster than the second.
>>
>>   George.
>>
>>
>>
>>
>>
>> On Thu, Jul 20, 2017 at 5:42 PM, Samuel Thibault
>> <samuel.thiba...@labri.fr> wrote:
>>
>>     Hello,
>>
>>     We are hitting a severe performance issue, which is due to missing
>>     pipelining behavior in Open MPI when running over TCP. I have attached
>>     a test case. Basically what it does is
>>
>>     if (myrank == 0) {
>>             for (i = 0; i < N; i++)
>>                     MPI_Isend(...);
>>     } else {
>>             for (i = 0; i < N; i++)
>>                     MPI_Irecv(...);
>>     }
>>     for (i = 0; i < N; i++)
>>             MPI_Wait(...);
>>
>>     with corresponding printfs. And the result is:
>>
>>     0.182620: Isend 0 begin
>>     0.182761: Isend 0 end
>>     0.182766: Isend 1 begin
>>     0.182782: Isend 1 end
>>     ...
>>     0.183911: Isend 49 begin
>>     0.183915: Isend 49 end
>>     0.199028: Irecv 0 begin
>>     0.199068: Irecv 0 end
>>     0.199070: Irecv 1 begin
>>     0.199072: Irecv 1 end
>>     ...
>>     0.199187: Irecv 49 begin
>>     0.199188: Irecv 49 end
>>     0.233948: Isend 0 done!
>>     0.269895: Isend 1 done!
>>     ...
>>     1.982475: Isend 49 done!
>>     1.984065: Irecv 0 done!
>>     1.984078: Irecv 1 done!
>>     ...
>>     1.984131: Irecv 49 done!
>>
>>     i.e. almost two seconds elapse between the start of the application
>>     and the completion of the first Irecv, and then all the remaining
>>     Irecvs complete immediately as well, i.e. it seems the communications
>>     were all grouped together.
>>
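
For reference, a self-contained version of this pattern, assuming 50
messages of 8 MB each (matching the counts and sizes discussed in this
thread; the attached test case may differ in its details), could look like:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define N    50                      /* number of messages */
    #define SIZE (8 * 1024 * 1024)       /* 8 MB per message */

    int main(int argc, char *argv[])
    {
        int myrank, size, i;
        char *bufs[N];
        MPI_Request reqs[N];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2) {
            if (myrank == 0) fprintf(stderr, "run with exactly 2 processes\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        for (i = 0; i < N; i++) {
            bufs[i] = malloc(SIZE);
            memset(bufs[i], 0, SIZE);
        }

        double t0 = MPI_Wtime();

        if (myrank == 0) {
            for (i = 0; i < N; i++)      /* post all sends up front */
                MPI_Isend(bufs[i], SIZE, MPI_BYTE, 1, i, MPI_COMM_WORLD, &reqs[i]);
        } else {
            for (i = 0; i < N; i++)      /* post all receives up front */
                MPI_Irecv(bufs[i], SIZE, MPI_BYTE, 0, i, MPI_COMM_WORLD, &reqs[i]);
        }

        for (i = 0; i < N; i++) {        /* wait in posting order */
            MPI_Wait(&reqs[i], MPI_STATUS_IGNORE);
            printf("%f: %s %d done!\n", MPI_Wtime() - t0,
                   myrank == 0 ? "Isend" : "Irecv", i);
        }

        for (i = 0; i < N; i++)
            free(bufs[i]);
        MPI_Finalize();
        return 0;
    }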
>>     This is really bad, because in our real use case we trigger
>>     computations after each MPI_Wait call, and we use several messages so
>>     as to pipeline things: the first computation can start as soon as one
>>     message has been received, and is thus overlapped with further
>>     receptions.
>>
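
That "compute as each message arrives" pattern is usually written with
MPI_Waitany rather than waiting in posting order; a sketch of the receiver
side, reusing reqs[] and bufs[] from the sketch above, with compute_on()
standing in as a hypothetical placeholder for the per-message work:

    /* Receiver side: handle whichever reception finishes first, so the
     * computation on one message overlaps the remaining transfers. */
    for (i = 0; i < N; i++) {
        int idx;
        MPI_Waitany(N, reqs, &idx, MPI_STATUS_IGNORE);
        compute_on(bufs[idx]);           /* hypothetical application work */
    }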
>>     This problem only occurs with Open MPI on TCP; I'm not getting this
>>     behavior with Open MPI on IB, and I'm not getting it with MPICH or
>>     MadMPI either:
>>
>>     0.182168: Isend 0 begin
>>     0.182235: Isend 0 end
>>     0.182237: Isend 1 begin
>>     0.182242: Isend 1 end
>>     ...
>>     0.182842: Isend 49 begin
>>     0.182844: Isend 49 end
>>     0.200505: Irecv 0 begin
>>     0.200564: Irecv 0 end
>>     0.200567: Irecv 1 begin
>>     0.200569: Irecv 1 end
>>     ...
>>     0.201233: Irecv 49 begin
>>     0.201234: Irecv 49 end
>>     0.269511: Isend 0 done!
>>     0.273154: Irecv 0 done!
>>     0.341054: Isend 1 done!
>>     0.344507: Irecv 1 done!
>>     ...
>>     3.767726: Isend 49 done!
>>     3.770637: Irecv 49 done!
>>
>>     There we do have pipelined reception.
>>
>>     Is there a way to get the second, pipelined behavior with Open MPI on
>>     TCP?
>>
>>     Samuel
>>
>>
>>
>>
>>
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
