If barrier/time/barrier/time solves your problem in each measurement, that
means the computation before/after your barrier is not well "synchronized":
its cost is different for each process.  On the 2nd/3rd/... round, the times
at which processes enter the barrier are very spread out, perhaps ranging
from 1 to 1400, so this barrier becomes a huge overhead in your code.
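
For clarity, the barrier/time/barrier/time pattern I mean is something like
this (a minimal sketch; time1/time2 and the communicator are placeholders,
and error checking is omitted):

==================
call MPI_BARRIER(MPI_COMM_WORLD, ierr)  ! 1st barrier absorbs the imbalance of the work above
time1 = MPI_Wtime()
call MPI_BARRIER(MPI_COMM_WORLD, ierr)  ! 2nd barrier: every rank enters almost together
time2 = MPI_Wtime()
write(*,*) "2nd barrier time =", time2 - time1
==================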

Teng

> do
>
> barrier/time/barrier/time
>
> and run your code again.
>
> Teng
>> I will check that, but as I said in my first email, this strange behaviour
>> happens only in one place in my code.
>> I have the same time/barrier/time procedure in other places (in the same
>> code) and it works perfectly.
>>
>> At one place I have the following output (sorted)
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier       1.0
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier       1.0
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      14.2
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      16.3
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      25.1
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      28.4
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      32.6
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      35.3
>> .
>> .
>> .
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      90.1
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      96.3
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier      99.5
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier     101.2
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier     119.3
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier     169.3
>>
>> but in the place that concerns me I have (sorted)
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1386.9
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1401.5
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1412.9
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1414.1
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1419.6
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1428.1
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1430.4
>> .
>> .
>> .
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1632.7
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1635.7
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1660.6
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1685.1
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier    1699.2
>>
>>
>> These are in the same units...
>> You see that in the first place, the "hit/wait/leave" time can be very
>> small, which is never the case in the second output...
>>
>>
>> On Sep 8, 2011, at 4:35 PM, Teng Ma wrote:
>>
>>> You'd better check the process-core binding in your case.  It looks to
>>> me like P0 and P1 are on the same node and P2 is on another node, which
>>> makes the acks to P0/P1 go through shared memory and the ack to P2 go
>>> through the network.
>>> 1000x is very possible: shared-memory latency can be about 0.03
>>> microseconds, while Ethernet latency is about 20-30 microseconds.
>>>
>>> Just my guess......
>>>
>>> Teng
>>>> Thanks,
>>>>
>>>> I understand this but the delays that I measure are huge compared to a
>>>> classical ack procedure... (1000x more)
>>>> And this is repeatable: as far as I understand it, this shows that the
>>>> network is not involved.
>>>>
>>>> Ghislain.
>>>>
>>>>
>>>> On Sep 8, 2011, at 4:16 PM, Teng Ma wrote:
>>>>
>>>>> I guess you forgot to count the "leaving time" (fan-out).  When
>>>>> everyone has hit the barrier, each process still needs an "ack" to
>>>>> leave.  And remember that in most cases the leader process sends out
>>>>> the "acks" sequentially.  It's very possible that:
>>>>>
>>>>> P0 barrier time = 29 + send/recv ack 0
>>>>> P1 barrier time = 14 + send ack 0  + send/recv ack 1
>>>>> P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
>>>>>
>>>>> That's your measured time.
>>>>>
>>>>> Teng
>>>>>> This problem has nothing to do with stdout...
>>>>>>
>>>>>> Example with 3 processes:
>>>>>>
>>>>>> P0 hits barrier at t=12
>>>>>> P1 hits barrier at t=27
>>>>>> P2 hits barrier at t=41
>>>>>>
>>>>>> In this situation:
>>>>>> P0 waits 41-12 = 29
>>>>>> P1 waits 41-27 = 14
>>>>>> P2 waits 41-41 = 00
>>>>>
>>>>>
>>>>>
>>>>>> So I should see something like this (no particular ordering is expected):
>>>>>> barrier_time = 14
>>>>>> barrier_time = 00
>>>>>> barrier_time = 29
>>>>>>
>>>>>> But what I see is much more like
>>>>>> barrier_time = 22
>>>>>> barrier_time = 29
>>>>>> barrier_time = 25
>>>>>>
>>>>>> See? No process has a barrier_time equal to zero!!!
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sep 8, 2011, at 2:55 PM, Jeff Squyres wrote:
>>>>>>
>>>>>>> The order in which you see stdout printed from mpirun is not
>>>>>>> necessarily reflective of the order in which things were actually
>>>>>>> printed.  Remember that
>>>>>>> the stdout from each MPI process needs to flow through at least 3
>>>>>>> processes and potentially across the network before it is actually
>>>>>>> displayed on mpirun's stdout.
>>>>>>>
>>>>>>> MPI process -> local Open MPI daemon -> mpirun -> printed to
>>>>>>> mpirun's
>>>>>>> stdout
>>>>>>>
>>>>>>> Hence, the ordering of stdout can get transposed.
>>>>>>>
>>>>>>>
>>>>>>> On Sep 8, 2011, at 8:49 AM, Ghislain Lartigue wrote:
>>>>>>>
>>>>>>>> Thank you for this explanation but indeed this confirms that the
>>>>>>>> LAST
>>>>>>>> process that hits the barrier should go through nearly
>>>>>>>> instantaneously
>>>>>>>> (except for the broadcast time for the acknowledgment signal).
>>>>>>>> And this is not what happens in my code: EVERY process waits for
>>>>>>>> a
>>>>>>>> very long time before going through the barrier (thousands of
>>>>>>>> times
>>>>>>>> more than a broadcast)...
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sep 8, 2011, at 2:26 PM, Jeff Squyres wrote:
>>>>>>>>
>>>>>>>>> Order in which processes hit the barrier is only one factor in
>>>>>>>>> the
>>>>>>>>> time it takes for that process to finish the barrier.
>>>>>>>>>
>>>>>>>>> An easy way to think of a barrier implementation is a "fan in/fan
>>>>>>>>> out"
>>>>>>>>> model.  When each nonzero rank process calls MPI_BARRIER, it
>>>>>>>>> sends
>>>>>>>>> a
>>>>>>>>> message saying "I have hit the barrier!" (it usually sends it to
>>>>>>>>> its
>>>>>>>>> parent in a tree of all MPI processes in the communicator, but
>>>>>>>>> you
>>>>>>>>> can
>>>>>>>>> simplify this model and consider that it sends it to rank 0).
>>>>>>>>> Rank
>>>>>>>>> 0
>>>>>>>>> collects all of these messages.  When it has messages from all
>>>>>>>>> processes in the communicator, it sends out "ok, you can leave
>>>>>>>>> the
>>>>>>>>> barrier now" messages (again, it's usually via a tree
>>>>>>>>> distribution,
>>>>>>>>> but you can pretend that it directly, linearly sends a message to
>>>>>>>>> each
>>>>>>>>> peer process in the communicator).
>>>>>>>>>
>>>>>>>>> Hence, the time that any individual process spends in the barrier
>>>>>>>>> is relative to when every other process enters the barrier.
>>>>>>>>> But
>>>>>>>>> it's also dependent upon communication speed, congestion in the
>>>>>>>>> network, etc.
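>>>>>>>>>
>>>>>>>>> In code, that simplified linear model might look roughly like this
>>>>>>>>> (just an illustration of the idea, not how Open MPI actually
>>>>>>>>> implements MPI_BARRIER):
>>>>>>>>>
>>>>>>>>> ==================
>>>>>>>>> subroutine naive_barrier(comm, ierr)
>>>>>>>>>   use mpi
>>>>>>>>>   integer, intent(in)  :: comm
>>>>>>>>>   integer, intent(out) :: ierr
>>>>>>>>>   integer :: rank, nprocs, i, token(1)
>>>>>>>>>   token(1) = 0
>>>>>>>>>   call MPI_Comm_rank(comm, rank, ierr)
>>>>>>>>>   call MPI_Comm_size(comm, nprocs, ierr)
>>>>>>>>>   if (rank /= 0) then
>>>>>>>>>     ! fan-in: tell rank 0 "I have hit the barrier!"
>>>>>>>>>     call MPI_Send(token, 1, MPI_INTEGER, 0, 0, comm, ierr)
>>>>>>>>>     ! fan-out: wait for the "ok, you can leave" message
>>>>>>>>>     call MPI_Recv(token, 1, MPI_INTEGER, 0, 1, comm, &
>>>>>>>>>                   MPI_STATUS_IGNORE, ierr)
>>>>>>>>>   else
>>>>>>>>>     ! rank 0 collects one message from every other rank...
>>>>>>>>>     do i = 1, nprocs-1
>>>>>>>>>       call MPI_Recv(token, 1, MPI_INTEGER, MPI_ANY_SOURCE, 0, &
>>>>>>>>>                     comm, MPI_STATUS_IGNORE, ierr)
>>>>>>>>>     end do
>>>>>>>>>     ! ...then releases them one by one (linear fan-out)
>>>>>>>>>     do i = 1, nprocs-1
>>>>>>>>>       call MPI_Send(token, 1, MPI_INTEGER, i, 1, comm, ierr)
>>>>>>>>>     end do
>>>>>>>>>   end if
>>>>>>>>> end subroutine naive_barrier
>>>>>>>>> ==================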
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sep 8, 2011, at 6:20 AM, Ghislain Lartigue wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> at a given point in my (Fortran90) program, I write:
>>>>>>>>>>
>>>>>>>>>> ===================
>>>>>>>>>> start_time = MPI_Wtime()
>>>>>>>>>> call MPI_BARRIER(...)
>>>>>>>>>> new_time = MPI_Wtime() - start_time
>>>>>>>>>> write(*,*) "barrier time =",new_time
>>>>>>>>>> ==================
>>>>>>>>>>
>>>>>>>>>> and then I run my code...
>>>>>>>>>>
>>>>>>>>>> I expected that the values of "new_time" would range from 0 to
>>>>>>>>>> Tmax
>>>>>>>>>> (1700 in my case)
>>>>>>>>>> As I understand it, the first process that hits the barrier
>>>>>>>>>> should
>>>>>>>>>> print Tmax and the last process that hits the barrier should
>>>>>>>>>> print
>>>>>>>>>> 0
>>>>>>>>>> (or a very low value).
>>>>>>>>>>
>>>>>>>>>> But this is not the case: all processes print values in the
>>>>>>>>>> range
>>>>>>>>>> 1400-1700!
>>>>>>>>>>
>>>>>>>>>> Any explanation?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ghislain.
>>>>>>>>>>
>>>>>>>>>> PS:
>>>>>>>>>> This small code behaves perfectly in other parts of my code...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jeff Squyres
>>>>>>>>> jsquy...@cisco.com
>>>>>>>>> For corporate legal information go to:
>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jeff Squyres
>>>>>>> jsquy...@cisco.com
>>>>>>> For corporate legal information go to:
>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> | Teng Ma          Univ. of Tennessee |
>>>>> | t...@cs.utk.edu        Knoxville, TN |
>>>>> | http://web.eecs.utk.edu/~tma/       |
>>>>>
>>>>
>>>>
>>>
>>>
>>> | Teng Ma          Univ. of Tennessee |
>>> | t...@cs.utk.edu        Knoxville, TN |
>>> | http://web.eecs.utk.edu/~tma/       |
>>>
>>
>>