On 9 Sep 2010, at 08:31, Terry Frankcombe wrote:

> On Thu, 2010-09-09 at 01:24 -0600, Ralph Castain wrote:
>> As people have said, these time values are to be expected. All they
>> reflect is the time difference spent in reduce waiting for the slowest
>> process to catch up to everyone else. The barrier removes that factor
>> by forcing all processes to start from the same place.
>> 
>> 
>> No mystery here - just a reflection of the fact that your processes
>> arrive at the MPI_Reduce calls at different times.
> 
> 
> Yes, however, it seems Gabriele is saying the total execution time
> *drops* by ~500 s when the barrier is put *in*.  (Is that the right way
> around, Gabriele?)
> 
> That's harder to explain as a sync issue.

Not really: you need some way of keeping processes in sync, or else the slow 
ones get slower while the fast ones stay fast.  If you have an unbalanced 
algorithm you can end up swamping certain ranks, and once they get behind 
they get even slower and performance goes off a cliff edge.

Adding sporadic barriers keeps everything in sync and running nicely: if things 
are performing well then the barrier only slows things down slightly, but if there 
is a problem it brings all processes back together and breaks the positive feedback 
cycle.  This is why you often only need a synchronisation point every so often.

I'm also a huge fan of asynchronous barriers, as a full sync is a blunt and slow 
operation; with asynchronous barriers you can allow small differences in timing 
but prevent them from growing too large, with very little overhead in the common 
case where processes are already in sync.  I'm thinking specifically of starting 
a sync-barrier on iteration N, waiting for it on iteration N+25 and immediately 
starting another one, again waiting for it 25 steps later.
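A minimal sketch of that pattern, assuming an MPI-3 library so the non-blocking
barrier can be expressed with MPI_Ibarrier (the 25-iteration interval and the
loop structure are illustrative, not from the original code):

    /* Sketch: start a non-blocking barrier on iteration N and wait for it
     * 25 iterations later, then immediately start the next one.
     * Requires MPI-3 for MPI_Ibarrier; interval and iteration count are
     * made up for illustration. */
    #include <mpi.h>

    #define SYNC_INTERVAL 25

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Request barrier_req = MPI_REQUEST_NULL;
        int total_iters = 1000;   /* hypothetical iteration count */

        for (int n = 0; n < total_iters; n++) {
            /* ... per-iteration work and MPI_Reduce calls go here ... */

            if (n % SYNC_INTERVAL == 0) {
                /* Wait for the barrier started SYNC_INTERVAL iterations ago.
                 * If all ranks are roughly in step this completes almost
                 * immediately; if one rank has fallen behind, the others
                 * pause here and let it catch up. */
                if (barrier_req != MPI_REQUEST_NULL)
                    MPI_Wait(&barrier_req, MPI_STATUS_IGNORE);

                /* Immediately start the next non-blocking barrier. */
                MPI_Ibarrier(MPI_COMM_WORLD, &barrier_req);
            }
        }

        if (barrier_req != MPI_REQUEST_NULL)
            MPI_Wait(&barrier_req, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }

In the common case where ranks are already in sync the MPI_Wait returns at once,
so the cost is negligible; only when one rank drifts 25 iterations behind does
anyone actually block.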

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

