Thanks George for the explanation. With the default eager size, the first message is received *after* the last message is sent, regardless of whether the progress thread is used or not. Another way to put it is that MPI_Isend() (and probably MPI_Irecv() too) do not trigger any progression, so I naively thought the progress thread would have helped here.
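
Here is a rough sketch of what I mean by progression (this is not the attached test case; the message size, the do_work() placeholder and the MPI_Test() polling loop are only illustrative): without a progress thread, a pending large transfer only advances when the application re-enters the library.

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define MSG_SIZE (8 * 1024 * 1024)   /* same order as the 8 MB messages discussed here */

    static void do_work(void)
    {
        /* placeholder for application computation between progress polls */
    }

    int main(int argc, char *argv[])
    {
        int rank, flag = 0;
        char *buf;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* run with exactly 2 ranks */

        buf = malloc(MSG_SIZE);
        memset(buf, rank, MSG_SIZE);

        if (rank == 0)
            MPI_Isend(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
        else
            MPI_Irecv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);

        /* Without a progress thread, the pending transfer only advances while
         * we are inside the library, hence the MPI_Test() in the work loop. */
        while (!flag) {
            do_work();
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

My (perhaps naive) expectation was that the progress thread would make this kind of explicit polling unnecessary.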
Just to be 100% sure, could you please confirm this is the intended behavior and not a bug?

Cheers,

Gilles

On Sat, Jul 22, 2017 at 5:00 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> On Thu, Jul 20, 2017 at 8:57 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>
>> Sam,
>>
>> this example is using 8 MB messages.
>>
>> if you are fine with using more memory, and your application should not
>> generate too many unexpected messages, then you can bump the eager_limit,
>> for example
>>
>> mpirun --mca btl_tcp_eager_limit $((8*1024*1024+128)) ...
>>
>> worked for me
>
> Ah, interesting. If forcing a very large eager limit works around it, then
> the problem might be coming from the pipelining algorithm. Not a good
> solution in general, but handy to see what's going on. With so many sends
> available, the pipelining might be overwhelmed and interleave fragments
> from different requests. Let me dig a little bit here, I think I know
> exactly what is going on.
>
>> George,
>>
>> in master, i thought
>>
>> mpirun --mca btl_tcp_progress_thread 1 ...
>>
>> would help, but it did not.
>> did i misunderstand the purpose of the TCP progress thread ?
>
> Gilles,
>
> In this example most of the time is spent in an MPI_* function (mainly
> MPI_Wait), so the progress thread has little opportunity to help. The role
> of the progress thread is to make sure communications are progressed when
> the application is not inside an MPI call.
>
> George.
>
>> Cheers,
>>
>> Gilles
>>
>> On 7/21/2017 9:05 AM, George Bosilca wrote:
>>>
>>> Sam,
>>>
>>> Open MPI aggregates messages only when network constraints prevent the
>>> messages from being delivered in a timely manner. In this particular
>>> case I think that our delayed business card exchange and connection
>>> setup is delaying the delivery of the first batch of messages (and the
>>> BTL will aggregate them while waiting for the connection to be
>>> correctly set up).
>>>
>>> Can you reproduce the same behavior after the first batch of messages?
>>>
>>> Assuming the times shown on the left of your messages are correct, the
>>> first MPI seems to deliver the entire set of messages significantly
>>> faster than the second.
>>>
>>> George.
>>>
>>> On Thu, Jul 20, 2017 at 5:42 PM, Samuel Thibault
>>> <samuel.thiba...@labri.fr> wrote:
>>>
>>>     Hello,
>>>
>>>     We are hitting a severe performance issue, which is due to missing
>>>     pipelining behavior in OpenMPI when running over TCP. I have
>>>     attached a test case. Basically what it does is
>>>
>>>         if (myrank == 0) {
>>>             for (i = 0; i < N; i++)
>>>                 MPI_Isend(...);
>>>         } else {
>>>             for (i = 0; i < N; i++)
>>>                 MPI_Irecv(...);
>>>         }
>>>         for (i = 0; i < N; i++)
>>>             MPI_Wait(...);
>>>
>>>     with corresponding printfs. And the result is:
>>>
>>>         0.182620: Isend 0 begin
>>>         0.182761: Isend 0 end
>>>         0.182766: Isend 1 begin
>>>         0.182782: Isend 1 end
>>>         ...
>>>         0.183911: Isend 49 begin
>>>         0.183915: Isend 49 end
>>>         0.199028: Irecv 0 begin
>>>         0.199068: Irecv 0 end
>>>         0.199070: Irecv 1 begin
>>>         0.199072: Irecv 1 end
>>>         ...
>>>         0.199187: Irecv 49 begin
>>>         0.199188: Irecv 49 end
>>>         0.233948: Isend 0 done!
>>>         0.269895: Isend 1 done!
>>>         ...
>>>         1.982475: Isend 49 done!
>>>         1.984065: Irecv 0 done!
>>>         1.984078: Irecv 1 done!
>>>         ...
>>>         1.984131: Irecv 49 done!
>>>
>>>     i.e. almost two seconds elapse between the start of the application
>>>     and the completion of the first Irecv, and then all the Irecvs
>>>     complete immediately too, i.e. it seems the communications were
>>>     grouped altogether.
>>>
>>>     This is really bad, because in our real use case we trigger
>>>     computations after each MPI_Wait call, and we use several messages
>>>     so as to pipeline things: the first computation can start as soon
>>>     as one message gets received, and is thus overlapped with further
>>>     receptions.
>>>
>>>     This problem only occurs with openmpi on TCP; I'm not getting this
>>>     behavior with openmpi on IB, and I'm not getting it either with
>>>     mpich or madmpi:
>>>
>>>         0.182168: Isend 0 begin
>>>         0.182235: Isend 0 end
>>>         0.182237: Isend 1 begin
>>>         0.182242: Isend 1 end
>>>         ...
>>>         0.182842: Isend 49 begin
>>>         0.182844: Isend 49 end
>>>         0.200505: Irecv 0 begin
>>>         0.200564: Irecv 0 end
>>>         0.200567: Irecv 1 begin
>>>         0.200569: Irecv 1 end
>>>         ...
>>>         0.201233: Irecv 49 begin
>>>         0.201234: Irecv 49 end
>>>         0.269511: Isend 0 done!
>>>         0.273154: Irecv 0 done!
>>>         0.341054: Isend 1 done!
>>>         0.344507: Irecv 1 done!
>>>         ...
>>>         3.767726: Isend 49 done!
>>>         3.770637: Irecv 49 done!
>>>
>>>     There we do have pipelined reception.
>>>
>>>     Is there a way to get the second, pipelined behavior with openmpi
>>>     on TCP?
>>>
>>>     Samuel
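
PS: for anyone who wants to try this without the attachment, below is a self-contained approximation of the test case, reconstructed from the quoted pseudocode; the 8 MB message size, the tags and the timing printfs are my own guesses, not the actual attachment.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 50
    #define MSG_SIZE (8 * 1024 * 1024)   /* guessed from the "8 MB messages" above */

    int main(int argc, char *argv[])
    {
        int myrank, size, i;
        double t0;
        char *bufs[N];
        MPI_Request reqs[N];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2) {
            if (myrank == 0)
                fprintf(stderr, "run with exactly 2 ranks\n");
            MPI_Finalize();
            return 1;
        }

        for (i = 0; i < N; i++)
            bufs[i] = calloc(MSG_SIZE, 1);
        t0 = MPI_Wtime();

        /* Rank 0 posts all the sends, rank 1 posts all the receives ... */
        if (myrank == 0) {
            for (i = 0; i < N; i++) {
                printf("%f: Isend %d begin\n", MPI_Wtime() - t0, i);
                MPI_Isend(bufs[i], MSG_SIZE, MPI_CHAR, 1, i, MPI_COMM_WORLD, &reqs[i]);
                printf("%f: Isend %d end\n", MPI_Wtime() - t0, i);
            }
        } else {
            for (i = 0; i < N; i++) {
                printf("%f: Irecv %d begin\n", MPI_Wtime() - t0, i);
                MPI_Irecv(bufs[i], MSG_SIZE, MPI_CHAR, 0, i, MPI_COMM_WORLD, &reqs[i]);
                printf("%f: Irecv %d end\n", MPI_Wtime() - t0, i);
            }
        }

        /* ... then both wait for the requests in order, printing completion times. */
        for (i = 0; i < N; i++) {
            MPI_Wait(&reqs[i], MPI_STATUS_IGNORE);
            printf("%f: %s %d done!\n", MPI_Wtime() - t0,
                   myrank == 0 ? "Isend" : "Irecv", i);
        }

        for (i = 0; i < N; i++)
            free(bufs[i]);
        MPI_Finalize();
        return 0;
    }

Something like "mpirun -np 2 --mca btl tcp,self ./pipeline_test" (the binary name is mine) should force the TCP path and make it easy to compare against the traces above.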