If barrier/time/barrier/time gives sensible numbers for every measurement, it means the computation before/after your barrier is not well synchronized: its cost differs from process to process. On the 2nd/3rd/... round, the times at which processes enter the barrier are spread very widely, maybe over a range like [1, 1400], and the barrier then shows up as a huge overhead in your code.

Teng
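For reference, here is a minimal, self-contained sketch of the barrier/time/barrier/time pattern suggested above (Fortran, assuming the mpi module; the program name is only illustrative, and the variable names mirror the time/barrier/time snippet quoted at the bottom of this thread). The first barrier absorbs the arrival skew from the preceding computation, so the timing around the second barrier measures the barrier itself rather than the imbalance:

===================
program barrier_timing
  use mpi
  implicit none
  integer :: ierr
  double precision :: start_time, new_time

  call MPI_Init(ierr)

  ! ... the (possibly imbalanced) computation would normally be here ...

  ! First barrier: every rank waits here until the slowest one arrives.
  call MPI_Barrier(MPI_COMM_WORLD, ierr)

  ! Second barrier: all ranks enter almost simultaneously, so new_time
  ! should now reflect only the cost of the barrier (fan-in + fan-out).
  start_time = MPI_Wtime()
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  new_time = MPI_Wtime() - start_time

  write(*,*) "barrier time =", new_time
  call MPI_Finalize(ierr)
end program barrier_timing
===================

If this version prints small, uniform "barrier time" values while the original time/barrier/time version still shows 1400-1700, the spread comes from the code before the barrier, not from the barrier itself.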
> do
>
> barrier/time/barrier/time
>
> and run your code again.
>
> Teng
>> I will check that, but as I said in my first email, this strange behaviour
>> happens only in one place in my code.
>> I have the same time/barrier/time procedure in other places (in the same
>> code) and it works perfectly.
>>
>> At one place I have the following output (sorted):
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 1.0
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 1.0
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 14.2
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 16.3
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 25.1
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 28.4
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 32.6
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 35.3
>> .
>> .
>> .
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 90.1
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 96.3
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 99.5
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 101.2
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 119.3
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 169.3
>>
>> but in the place that concerns me I have (sorted):
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1386.9
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1401.5
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1412.9
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1414.1
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1419.6
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1428.1
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1430.4
>> .
>> .
>> .
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1632.7
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1635.7
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1660.6
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1685.1
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1699.2
>>
>> These are in the same units...
>> You see that in the first place, the "time" to "hit/wait/leave" can be
>> very small compared to the last output...
>>
>> On Sep 8, 2011, at 4:35 PM, Teng Ma wrote:
>>
>>> You'd better check process-core binding in your case. It looks to me
>>> like P0 and P1 are on the same node and P2 is on another node, which makes
>>> the ack to P0/P1 go through shared memory and the ack to P2 go through the
>>> network. 1000x is very possible: sm latency can be about 0.03 microsec,
>>> while ethernet latency is about 20-30 microsec.
>>>
>>> Just my guess......
>>>
>>> Teng
>>>> Thanks,
>>>>
>>>> I understand this, but the delays that I measure are huge compared to a
>>>> classical ack procedure... (1000x more)
>>>> And this is repeatable: as far as I understand it, this shows that the
>>>> network is not involved.
>>>>
>>>> Ghislain.
>>>>
>>>> On Sep 8, 2011, at 4:16 PM, Teng Ma wrote:
>>>>
>>>>> I guess you forgot to count the "leaving time" (fan-out). When everyone
>>>>> has hit the barrier, each process still needs an "ack" to leave. And
>>>>> remember, in most cases the leader process will send out the "acks"
>>>>> sequentially. It's very possible that:
>>>>>
>>>>> P0 barrier time = 29 + send/recv ack 0
>>>>> P1 barrier time = 14 + send ack 0 + send/recv ack 1
>>>>> P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
>>>>>
>>>>> That's your measured time.
>>>>>
>>>>> Teng
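On the process-core binding suggestion a few messages up in this thread: Open MPI's mpirun can report (and enforce) the binding it applies, which makes it easy to see whether some ranks land on a different node or socket than expected. The exact option spelling varies between releases, and the process count and executable name below are only placeholders:

  mpirun -np 16 --report-bindings --bind-to-core ./my_code    (1.4/1.5-era spelling)
  mpirun -np 16 --report-bindings --bind-to core ./my_code    (newer releases)

With --report-bindings, each node's daemon prints which cores every rank was bound to at launch.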
>>>>>> This problem has nothing to do with stdout...
>>>>>>
>>>>>> Example with 3 processes:
>>>>>>
>>>>>> P0 hits barrier at t=12
>>>>>> P1 hits barrier at t=27
>>>>>> P2 hits barrier at t=41
>>>>>>
>>>>>> In this situation:
>>>>>> P0 waits 41-12 = 29
>>>>>> P1 waits 41-27 = 14
>>>>>> P2 waits 41-41 = 00
>>>>>>
>>>>>> So I should see something like (no ordering is expected):
>>>>>> barrier_time = 14
>>>>>> barrier_time = 00
>>>>>> barrier_time = 29
>>>>>>
>>>>>> But what I see is much more like:
>>>>>> barrier_time = 22
>>>>>> barrier_time = 29
>>>>>> barrier_time = 25
>>>>>>
>>>>>> See? No process has a barrier_time equal to zero!!!
>>>>>>
>>>>>> On Sep 8, 2011, at 2:55 PM, Jeff Squyres wrote:
>>>>>>
>>>>>>> The order in which you see stdout printed from mpirun is not necessarily
>>>>>>> reflective of the order in which things were actually printed. Remember
>>>>>>> that the stdout from each MPI process needs to flow through at least 3
>>>>>>> processes, and potentially across the network, before it is actually
>>>>>>> displayed on mpirun's stdout:
>>>>>>>
>>>>>>> MPI process -> local Open MPI daemon -> mpirun -> printed to mpirun's stdout
>>>>>>>
>>>>>>> Hence, the ordering of stdout can get transposed.
>>>>>>>
>>>>>>> On Sep 8, 2011, at 8:49 AM, Ghislain Lartigue wrote:
>>>>>>>
>>>>>>>> Thank you for this explanation, but it actually confirms that the LAST
>>>>>>>> process that hits the barrier should go through nearly instantaneously
>>>>>>>> (except for the broadcast time of the acknowledgment signal).
>>>>>>>> And this is not what happens in my code: EVERY process waits for a
>>>>>>>> very long time before going through the barrier (thousands of times
>>>>>>>> more than a broadcast)...
>>>>>>>>
>>>>>>>> On Sep 8, 2011, at 2:26 PM, Jeff Squyres wrote:
>>>>>>>>
>>>>>>>>> The order in which processes hit the barrier is only one factor in the
>>>>>>>>> time it takes for each process to finish the barrier.
>>>>>>>>>
>>>>>>>>> An easy way to think of a barrier implementation is a "fan in / fan out"
>>>>>>>>> model. When each nonzero-rank process calls MPI_BARRIER, it sends a
>>>>>>>>> message saying "I have hit the barrier!" (it usually sends it to its
>>>>>>>>> parent in a tree of all MPI processes in the communicator, but you can
>>>>>>>>> simplify this model and consider that it sends it to rank 0). Rank 0
>>>>>>>>> collects all of these messages. When it has messages from all
>>>>>>>>> processes in the communicator, it sends out "ok, you can leave the
>>>>>>>>> barrier now" messages (again, usually via a tree distribution, but you
>>>>>>>>> can pretend that it directly, linearly sends a message to each peer
>>>>>>>>> process in the communicator).
>>>>>>>>>
>>>>>>>>> Hence, the time that any individual process spends in the barrier is
>>>>>>>>> relative to when every other process enters the barrier. But it's also
>>>>>>>>> dependent upon communication speed, congestion in the network, etc.
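To make the fan-in/fan-out picture concrete, here is a rough Fortran sketch of that simplified model. This only illustrates the model described above, not Open MPI's actual implementation (which normally uses tree-based algorithms); the subroutine name toy_barrier is made up:

===================
subroutine toy_barrier(comm)
  use mpi
  implicit none
  integer, intent(in) :: comm
  integer :: rank, nprocs, i, dummy, ierr
  integer :: status(MPI_STATUS_SIZE)

  dummy = 0
  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Comm_size(comm, nprocs, ierr)

  if (rank == 0) then
     ! Fan-in: wait until every other rank has said "I hit the barrier".
     do i = 1, nprocs - 1
        call MPI_Recv(dummy, 1, MPI_INTEGER, MPI_ANY_SOURCE, 0, comm, status, ierr)
     end do
     ! Fan-out: release the other ranks one after the other, so even the
     ! last rank to arrive still waits for its release message.
     do i = 1, nprocs - 1
        call MPI_Send(dummy, 1, MPI_INTEGER, i, 1, comm, ierr)
     end do
  else
     call MPI_Send(dummy, 1, MPI_INTEGER, 0, 0, comm, ierr)           ! "I am here"
     call MPI_Recv(dummy, 1, MPI_INTEGER, 0, 1, comm, status, ierr)   ! wait for release
  end if
end subroutine toy_barrier
===================

In this model, the time a rank spends inside the barrier is the time until the last rank arrives plus the time until its own release message comes back, which is why no rank necessarily measures a value close to zero.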
>>>>>>>>>
>>>>>>>>> On Sep 8, 2011, at 6:20 AM, Ghislain Lartigue wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> at a given point in my (Fortran90) program, I write:
>>>>>>>>>>
>>>>>>>>>> ===================
>>>>>>>>>> start_time = MPI_Wtime()
>>>>>>>>>> call MPI_BARRIER(...)
>>>>>>>>>> new_time = MPI_Wtime() - start_time
>>>>>>>>>> write(*,*) "barrier time =", new_time
>>>>>>>>>> ===================
>>>>>>>>>>
>>>>>>>>>> and then I run my code...
>>>>>>>>>>
>>>>>>>>>> I expected that the values of "new_time" would range from 0 to Tmax
>>>>>>>>>> (1700 in my case).
>>>>>>>>>> As I understand it, the first process that hits the barrier should
>>>>>>>>>> print Tmax and the last process that hits the barrier should print 0
>>>>>>>>>> (or a very low value).
>>>>>>>>>>
>>>>>>>>>> But this is not the case: all processes print values in the range
>>>>>>>>>> 1400-1700!
>>>>>>>>>>
>>>>>>>>>> Any explanation?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ghislain.
>>>>>>>>>>
>>>>>>>>>> PS:
>>>>>>>>>> This small code behaves perfectly in other parts of my code...
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jeff Squyres
>>>>>>>>> jsquy...@cisco.com
>>>>>>>>> For corporate legal information go to:
>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>
>>>>>>> --
>>>>>>> Jeff Squyres
>>>>>>> jsquy...@cisco.com
>>>>>>> For corporate legal information go to:
>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>
>>>>> | Teng Ma                        Univ. of Tennessee |
>>>>> | t...@cs.utk.edu                Knoxville, TN      |
>>>>> | http://web.eecs.utk.edu/~tma/                     |
>>>
>>> | Teng Ma                        Univ. of Tennessee |
>>> | t...@cs.utk.edu                Knoxville, TN      |
>>> | http://web.eecs.utk.edu/~tma/                     |