Re: [OMPI users] MPI_Reduce performance

2010-09-22 Thread Jeff Squyres
On Sep 9, 2010, at 4:31 PM, Ashley Pittman wrote:

>> What is the exact semantics of an asynchronous barrier,
> 
> I'm not sure of the exact semantics, but once you've got your head around the
> concept it's fairly simple to understand how to use it: you call MPI_Ibarrier()
> and it gives you a request handle you can test with MPI_Test() or block on
> with MPI_Wait().  The tricky part is how many times you can call
> MPI_Ibarrier() on a communicator without waiting for the previous barriers to
> complete, but I haven't been following the discussion on this one closely
> enough to know the specifics.

FWIW: here's a brief writeup of MPI_Ibarrier -

http://blogs.cisco.com/ciscotalk/performance/comments/ibarrier/

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] MPI_Reduce performance

2010-09-10 Thread Richard Treumann
The difference is that MPI_Barrier or even MPI_Ibarrier is a crude tool 
for the job. It is likely to leave resources idle that could be doing 
productive work.

I agree it is part of the tool kit, but if this kind of problem is 
significant enough that a textbook should cover it, then I would advocate 
that people be taught where to look for the problem and to consider which 
tool to use. 

The first step is to look for ways to improve load balance.  The second 
step is to look for ways to impede as few processes as possible, and only 
when they must be impeded. 

I am also suggesting that there is a middle class of application that 
improves with a semantically redundant barrier: applications with a 
significant design flaw that it would be better to find and fix than to 
mask.

The MPI standard provides for flow control to prevent eager message 
flooding, but it does not provide flow control for descriptor flooding. 
Descriptor flooding may even bring down a job, and the standard does not 
oblige the MPI implementation to prevent that.  This problem was 
understood from the beginning, but it cannot be solved within the semantic 
rules for non-blocking Send/Recv.  The options were to give the problem to 
the application writer or to say "A non-blocking operation will return 
promptly unless there is a resource shortage".

  Dick 


Dick Treumann  -  MPI Team 
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


users-boun...@open-mpi.org wrote on 09/10/2010 10:27:02 AM:

> 
> Re: [OMPI users] MPI_Reduce performance
> 
> Eugene Loh 
> 
> to:
> 
> Open MPI Users
> 
> 09/10/2010 10:30 AM
> 
> Sent by:
> 
> users-boun...@open-mpi.org
> 
> Please respond to Open MPI Users
> 
> Richard Treumann wrote: 
> 
> Hi Ashley 
> 
> I understand the problem with descriptor flooding can be serious in 
> an application with unidirectional data dependency. Perhaps we have 
> a different perception of how common that is. 
> Ashley speculated it was a "significant minority."  I don't know 
> what that means, but it seems like it is a minority (most 
> computations have causal relationships among the processes holding 
> unbounded imbalances in check) and yet we end up seeing these 
exceptions.
> I think that adding some flow control to the application is a better
> solution than a semantically redundant barrier.
> It seems to me there is no difference.  Flow control, at this level,
> is just semantically redundant synchronization.  A barrier is just a
> special case of that.
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] MPI_Reduce performance

2010-09-10 Thread Eugene Loh




Richard Treumann wrote:

> Hi Ashley
>
> I understand the problem with descriptor flooding can be serious in an
> application with unidirectional data dependency. Perhaps we have a
> different perception of how common that is.

Ashley speculated it was a "significant minority."  I don't know what
that means, but it seems like it is a minority (most computations have
causal relationships among the processes holding unbounded imbalances
in check) and yet we end up seeing these exceptions.

> I think that adding some flow control to the application is a better
> solution than a semantically redundant barrier.

It seems to me there is no difference.  Flow control, at this level, is
just semantically redundant synchronization.  A barrier is just a
special case of that.




Re: [OMPI users] MPI_Reduce performance

2010-09-10 Thread Richard Treumann
Hi Ashley

I understand the problem with descriptor flooding can be serious in an 
application with unidirectional data dependency. Perhaps we have a 
different perception of how common that is.

It seems to me that such programs would be very rare but if they are more 
common than I imagine, then discussion of how to modulate them is 
worthwhile.  In many cases, I think that adding some flow control to the 
application is a better solution than a semantically redundant barrier. (A 
barrier that is there only to affect performance, not correctness, is what 
I mean by semantically redundant.)

For example, a Master/Worker application could have each worker break 
after every 4th send to the master and post an MPI_Recv for an 
OK_to_continue token.  If the token had been sent, this would delay the 
worker a few microseconds. If it had not been sent, the worker would be 
kept waiting.

The Master would keep track of how many messages from each worker it had 
absorbed and on message 3 from a particular worker, send an OK_to_continue 
token to that worker.  The master would keep sending OK_to_continue tokens 
every 4th recv from then on (7, 11, 15 ...)  The descriptor queues would 
all remain short and only a worker that the master could not keep up with 
would ever lose a chance to keep working.  By sending the OK_to_continue 
token a bit early, the application would ensure that when there was no 
backlog, every worker would find a token when it looked for it and there 
would be no significant loss of compute time.
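
For illustration, a minimal sketch of that token scheme in C with MPI; the
tags, the 4-sends-per-token interval, and the integer OK_to_continue payload
are placeholders, not taken from any real application:

#include <mpi.h>
#include <stdlib.h>

#define WORK_TAG  1
#define TOKEN_TAG 2

static void worker(int nsteps)
{
    double result = 0.0;
    int token;
    for (int step = 1; step <= nsteps; step++) {
        /* ... compute result ... */
        MPI_Send(&result, 1, MPI_DOUBLE, 0, WORK_TAG, MPI_COMM_WORLD);
        if (step % 4 == 0)   /* after every 4th send, look for a token */
            MPI_Recv(&token, 1, MPI_INT, 0, TOKEN_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}

static void master(int nworkers, int nsteps)
{
    double result;
    MPI_Status st;
    int token = 1;
    int *absorbed = calloc(nworkers + 1, sizeof(int));

    for (int i = 0; i < nworkers * nsteps; i++) {
        MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, WORK_TAG,
                 MPI_COMM_WORLD, &st);
        absorbed[st.MPI_SOURCE]++;
        /* on message 3, 7, 11, ... from a worker, send its token a bit
         * early so it normally finds one waiting and never stalls */
        if (absorbed[st.MPI_SOURCE] % 4 == 3)
            MPI_Send(&token, 1, MPI_INT, st.MPI_SOURCE, TOKEN_TAG,
                     MPI_COMM_WORLD);
    }
    free(absorbed);
}

int main(int argc, char **argv)
{
    int rank, size, nsteps = 100;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) master(size - 1, nsteps);
    else           worker(nsteps);
    MPI_Finalize();
    return 0;
}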

Even with a non-blocking barrier and a 10-step lag between the Ibarrier and 
the Wait, if some worker is slow for 12 steps, the fast workers end up being 
kept in a non-productive MPI_Wait.

  Dick 


Dick Treumann  -  MPI Team 
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


users-boun...@open-mpi.org wrote on 09/09/2010 05:34:15 PM:

> 
> Re: [OMPI users] MPI_Reduce performance
> 
> Ashley Pittman 
> 
> to:
> 
> Open MPI Users
> 
> 09/09/2010 05:37 PM
> 
> Sent by:
> 
> users-boun...@open-mpi.org
> 
> Please respond to Open MPI Users
> 
> 
> On 9 Sep 2010, at 21:40, Richard Treumann wrote:
> 
> > 
> > Ashley 
> > 
> > Can you provide an example of a situation in which these 
> semantically redundant barriers help? 
> 
> I'm not making the case for semantically redundant barriers, I'm 
> making a case for implicit synchronisation in every iteration of an 
> application.  Many applications have this already by nature of the 
> data-flow required; anything that calls mpi_allgather or 
> mpi_allreduce is the easiest to verify, but there are many other 
> ways of achieving the same thing.  My point is about the subset of 
> programs which don't have this attribute and are therefore 
> susceptible to synchronisation problems.  It's my experience that 
> for low iteration counts these codes can run fine, but once they hit 
> a problem they go over a cliff edge performance-wise and there is no
> way back from there until the end of the job.  The email from 
> Gabriele would appear to be a case that demonstrates this problem, 
> but I've seen it many times before.
> 
> Using your previous email as an example, I would describe adding 
> barriers to a problem as a way of artificially reducing the 
> "elasticity" of the program to ensure balanced use of resources.
> 
> > I may be missing something but my statement for the text book would be 

> > 
> > "If adding a barrier to your MPI program makes it run faster, 
> there is almost certainly a flaw in it that is better solved another 
way." 
> > 
> > The only exception I can think of is some sort of one-directional 
> > data dependency with messages small enough to go eagerly.  A program 
> > that calls MPI_Reduce with a small message and the same root every 
> > iteration and calls no other collective would be an example. 
> > 
> > In that case, fast tasks at leaf positions would run free and a 
> > slow task near the root could pile up early arrivals and end up with 
> > some additional slowing. Unless it was driven into paging I cannot 
> > imagine the slowdown would be significant though. 
> 
> I've diagnosed problems where the cause was a receive queue of tens 
> of thousands of messages; in this case each and every receive 
> performs slowly unless the descriptor is near the front of the queue,
> so the concern is not purely about memory usage at individual 
> processes, although that can also be a factor.
> 
> Ashley,
> 
> -- 
> 
> Ashley Pittman, Bath, UK.
> 
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Alex A. Granovsky
Eugene,

you did not take into account the dispersion/dephasing between different 
processes. As the cluster size and the number of instances of the parallel 
process increase, the dispersion increases as well, leaving the different 
instances somewhat out of sync - not truly out of sync, but skewed simply 
because of different execution speeds on different nodes, delays, etc... 
If you account for this, you get the result I mentioned.

Alex

  - Original Message - 
  From: Eugene Loh 
  To: Open MPI Users 
  Sent: Thursday, September 09, 2010 11:32 PM
  Subject: Re: [OMPI users] MPI_Reduce performance


  Alex A. Granovsky wrote: 

  > Isn't it evident from the theory of random processes and probability
  > theory that in the limit of an infinitely large cluster and parallel
  > process, the probability of deadlocks with the current implementation
  > is unfortunately quite a finite quantity and in the limit approaches
  > unity regardless of any particular details of the program?

  No, not at all.  Consider simulating a physical volume.  Each process is 
assigned to some small subvolume.  It updates conditions locally, but on the 
surface of its simulation subvolume it needs information from "nearby" 
processes.  It cannot proceed along the surface until it has that neighboring 
information.  Its neighbors, in turn, cannot proceed until their neighbors have 
reached some point.  Two distant processes can be quite out of step with one 
another, but only by some bounded amount.  At some point, a leading process has 
to wait for information from a laggard to propagate to it.  All processes 
proceed together, in some loose lock-step fashion.  Many applications behave in 
this fashion.  Actually, in many applications, the synchronization is tightened 
in that "physics" is made to propagate faster than neighbor-by-neighbor.

  As the number of processes increases, the laggard might seem relatively 
slower in comparison, but that isn't deadlock.

  As the size of the cluster increases, the chances of a system component 
failure increase, but that also is a different matter.



--


  ___
  users mailing list
  us...@open-mpi.org
  http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Richard Treumann
Ashley

Can you provide an example of a situation in which these semantically 
redundant barriers help?

I may be missing something but my statement for the text book would be 

"If adding a barrier to your MPI program makes it run faster, there is 
almost certainly a flaw in it that is better solved another way."

The only exception I can think of is some sort of one-directional data 
dependency with messages small enough to go eagerly.  A program that calls 
MPI_Reduce with a small message and the same root every iteration and 
calls no other collective would be an example.

In that case, fast tasks at leaf positions would run free and a slow task 
near the root could pile up early arrivals and end up with some additional 
slowing. Unless it was driven into paging I cannot imagine the slowdown 
would be significant though.

 Even that should not be a problem for an MPI implementation that backs 
off on eager send before it floods early arrival buffers.


Dick Treumann  -  MPI Team 
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Ashley Pittman

On 9 Sep 2010, at 21:10, jody wrote:

> Hi
> @Ashley:
> What is the exact semantics of an asynchronous barrier,

I'm not sure of the exact semantics, but once you've got your head around the 
concept it's fairly simple to understand how to use it: you call MPI_Ibarrier() 
and it gives you a request handle you can test with MPI_Test() or block on with 
MPI_Wait().  The tricky part is how many times you can call MPI_Ibarrier() on a 
communicator without waiting for the previous barriers to complete, but I 
haven't been following the discussion on this one closely enough to know the 
specifics.

> and is it part of the MPI specs?

It will be a part of the next revision of the standard, I believe.  It's been a 
long time coming and there is at least one implementation out there already, but 
I can't comment on its usability today.  To be clear, it's something I've long 
advocated, and have implemented and played around with in the past; however it's 
not yet available to users today.  I believe it will be shortly, and as you'll 
have read, my belief is that it's going to be a very useful addition to the MPI 
offering.
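
For anyone who wants to see the calling sequence, here is a minimal sketch;
MPI_Ibarrier is the proposed non-blocking barrier, so treat this as
illustrative rather than something you can compile against today's MPI
libraries:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Request req;
    int done = 0;

    MPI_Init(&argc, &argv);

    MPI_Ibarrier(MPI_COMM_WORLD, &req);      /* start the barrier, don't block */

    while (!done) {
        /* ... overlap useful work here ... */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);   /* poll for completion */
    }
    /* or simply: MPI_Wait(&req, MPI_STATUS_IGNORE); to block for it */

    MPI_Finalize();
    return 0;
}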

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread jody
Hi
@Ashley:
What is the exact semantics of an asynchronous barrier,
and is it part of the MPI specs?

Thanks
  Jody

On Thu, Sep 9, 2010 at 9:34 PM, Ashley Pittman  wrote:
>
> On 9 Sep 2010, at 17:00, Gus Correa wrote:
>
>> Hello All
>>
>> Gabrielle's question, Ashley's recipe, and Dick Treutmann's cautionary 
>> words, may be part of a larger context of load balance, or not?
>>
>> Would Ashley's recipe of sporadic barriers be a silver bullet to
>> improve load imbalance problems, regardless of which collectives or
>> even point-to-point calls are in use?
>
> No, it only holds where there is no data dependency between some of the 
> ranks, in particular if there are any non-rooted collectives in an iteration 
> of your code then it cannot make any difference at all, likewise if you have 
> a reduce followed by a barrier using the same root for example then you 
> already have global synchronisation each iteration and it won't help.  My 
> feeling is that it applies to a significant minority of problems, certainly 
> the phrase "adding barriers can make codes faster" should be textbook stuff 
> if it isn't already.
>
>> Would sporadic barriers in the flux coupler "shake up" these delays?
>
> I don't fully understand your description but it sounds like it might set the 
> program back to a clean slate, which would give you per-iteration delays only 
> rather than cumulative or worse delays.
>
>> Ashley:  How did you get to the magic number of 25 iterations for the
>> sporadic barriers?
>
> Experience and finger in the air.  The major factors in picking this number 
> are the likelihood of a positive feedback cycle of delays happening, the 
> delay these delays add, and the cost of a barrier itself.  Having too low a 
> value will slightly reduce performance; having too high a value can 
> drastically reduce performance.
>
> As a further item (because I like them), the asynchronous barrier is even 
> better again if used properly: in the good case it never causes any process 
> to block, so the cost is only that of the CPU cycles the code itself takes; 
> in the bad case where it has to delay a rank, this tends to have 
> a positive impact on performance.
>
>> Would it be application/communicator pattern dependent?
>
> Absolutely.
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Ashley Pittman

On 9 Sep 2010, at 17:00, Gus Correa wrote:

> Hello All
> 
> Gabrielle's question, Ashley's recipe, and Dick Treutmann's cautionary words, 
> may be part of a larger context of load balance, or not?
> 
> Would Ashley's recipe of sporadic barriers be a silver bullet to
> improve load imbalance problems, regardless of which collectives or
> even point-to-point calls are in use?

No, it only holds where there is no data dependency between some of the ranks; 
in particular, if there are any non-rooted collectives in an iteration of your 
code then it cannot make any difference at all, and likewise if you have a reduce 
followed by a barrier using the same root, for example, then you already have 
global synchronisation each iteration and it won't help.  My feeling is that it 
applies to a significant minority of problems; certainly the phrase "adding 
barriers can make codes faster" should be textbook stuff if it isn't already.

> Would sporadic barriers in the flux coupler "shake up" these delays?

I don't fully understand your description but it sounds like it might set the 
program back to a clean slate, which would give you per-iteration delays only 
rather than cumulative or worse delays.

> Ashley:  How did you get to the magic number of 25 iterations for the
> sporadic barriers?

Experience and finger in the air.  The major factors in picking this number are 
the likelihood of a positive feedback cycle of delays happening, the delay 
these delays add, and the cost of a barrier itself.  Having too low a value will 
slightly reduce performance; having too high a value can drastically reduce 
performance.

As a further item (because I like them), the asynchronous barrier is even better 
again if used properly: in the good case it never causes any process to block, 
so the cost is only that of the CPU cycles the code itself takes; in the bad 
case where it has to delay a rank, this tends to have a positive impact 
on performance.

> Would it be application/communicator pattern dependent?

Absolutely.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Eugene Loh




Alex A. Granovsky wrote:

> Isn't it evident from the theory of random processes and probability
> theory that in the limit of an infinitely large cluster and parallel
> process, the probability of deadlocks with the current implementation
> is unfortunately quite a finite quantity and in the limit approaches
> unity regardless of any particular details of the program?

No, not at all.  Consider simulating a physical volume.  Each process
is assigned to some small subvolume.  It updates conditions locally,
but on the surface of its simulation subvolume it needs information
from "nearby" processes.  It cannot proceed along the surface until it
has that neighboring information.  Its neighbors, in turn, cannot
proceed until their neighbors have reached some point.  Two distant
processes can be quite out of step with one another, but only by some
bounded amount.  At some point, a leading process has to wait for
information from a laggard to propagate to it.  All processes proceed
together, in some loose lock-step fashion.  Many applications behave in
this fashion.  Actually, in many applications, the synchronization is
tightened in that "physics" is made to propagate faster than
neighbor-by-neighbor.

As the number of processes increases, the laggard might seem relatively
slower in comparison, but that isn't deadlock.

As the size of the cluster increases, the chances of a system component
failure increase, but that also is a different matter.
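
Eugene's bounded-skew argument can be made concrete with a small sketch; the
1-D decomposition, array size and step count below are made up, but the
structure is the usual halo exchange:

#include <mpi.h>

#define N 1024

int main(int argc, char **argv)
{
    double interior[N], left_halo, right_halo;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int i = 0; i < N; i++) interior[i] = rank;

    for (int step = 0; step < 100; step++) {
        /* exchange surface data with both neighbours; a rank cannot get
         * more than about one step ahead of either neighbour because it
         * must receive their current surface values first */
        MPI_Sendrecv(&interior[0],     1, MPI_DOUBLE, left,  0,
                     &right_halo,      1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&interior[N - 1], 1, MPI_DOUBLE, right, 1,
                     &left_halo,       1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... update interior[] using left_halo / right_halo ... */
    }

    MPI_Finalize();
    return 0;
}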




Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Alex A. Granovsky
Isn't it evident from the theory of random processes and probability theory 
that in the limit of an infinitely large cluster and parallel process, the 
probability of deadlocks with the current implementation is unfortunately 
quite a finite quantity and in the limit approaches unity, regardless of any 
particular details of the program?

Just my two cents.
Alex Granovsky

  - Original Message - 
  From: Richard Treumann 
  To: Open MPI Users 
  Sent: Thursday, September 09, 2010 10:10 PM
  Subject: Re: [OMPI users] MPI_Reduce performance



  I was pointing out that most programs have some degree of elastic 
synchronization built in. Tasks (or groups or components in a coupled model) 
seldom only produce data; they also consume what other tasks produce and that 
limits the potential skew.   

  If step n for a task (or group or coupled component) depends on data produced 
by step n-1 in another task  (or group or coupled component)  then no task can 
be farther ahead of the task it depends on than one step.   If there are 2 
tasks that each need the others step n-1 result to compute step n then they can 
never get farther than one step out of synch.  If there were a rank ordered 
loop of  8 tasks so each one needs the output of the prior step on task ((me-1) 
 mod tasks) to compute then you can get more skew because if 
  task 5 gets stalled in step 3, 
  task 6 will finish step 3 and send results to 7 but stall on recv for step 4 
(lacking the end of step 3 send by task 5) 
  task 7 will finish step 4 and send results to 0  but stall on recv for step 5 
  task 0 will finish step 5 and send results to 1  but stall on recv for step 6 
  etc 

  In a 2D or 3D grid, the dependency is tighter so the possible skew is less, 
but it is still significant on a huge grid.  In a program with frequent calls 
to MPI_Allreduce on COMM_WORLD, the skew is very limited. The available skew 
gets harder to predict as the interdependencies grow more complex. 

  I call this "elasticity" because the amount of stretch varies but, like a 
bungee cord or a waistband, only goes so far. Every parallel program has some 
degree of elasticity built into the way its parts interact. 

  I assume a coupler has some elasticity too. That is, ocean and atmosphere 
each model Monday and report in to the coupler, but neither can model Tuesday 
until they get some of the Monday results generated by the other. (I am 
pretending the granularity is day by day.)  Wouldn't the right level of 
synchronization among components result automatically from the data 
dependencies among them? 



  Dick Treumann  -  MPI Team   
  IBM Systems & Technology Group
  Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
  Tele (845) 433-7846 Fax (845) 433-8363



From:  Eugene Loh <eugene@oracle.com>  
To:  Open MPI Users <us...@open-mpi.org>  
Date:  09/09/2010 12:40 PM  
    Subject:  Re: [OMPI users] MPI_Reduce performance  
Sent by:  users-boun...@open-mpi.org 


--



  Gus Correa wrote:

  > More often than not some components lag behind (regardless of how
  > much you tune the number of processors assigned to each component),
  > slowing down the whole scheme.
  > The coupler must sit and wait for that late component,
  > the other components must sit and wait for the coupler,
  > and the (vicious) "positive feedback" cycle that
  > Ashley mentioned goes on and on.

  I think "sit and wait" is the "typical" scenario that Dick mentions.  
  Someone lags, so someone else has to wait.

  In contrast, the "feedback" cycle Ashley mentions is where someone lags 
  and someone else keeps racing ahead, pumping even more data at the 
  laggard, forcing the laggard ever further behind.
  ___
  users mailing list
  us...@open-mpi.org
  http://www.open-mpi.org/mailman/listinfo.cgi/users





--


  ___
  users mailing list
  us...@open-mpi.org
  http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Richard Treumann
I was pointing out that most programs have some degree of elastic 
synchronization built in. Tasks (or groups or components in a coupled 
model) seldom only produce data; they also consume what other tasks produce 
and that limits the potential skew. 

If step n for a task (or group or coupled component) depends on data 
produced by step n-1 in another task  (or group or coupled component) then 
no task can be farther ahead of the task it depends on than one step.   If 
there are 2 tasks that each need the others step n-1 result to compute 
step n then they can never get farther than one step out of synch.  If 
there were a rank ordered loop of  8 tasks so each one needs the output of 
the prior step on task ((me-1)  mod tasks) to compute then you can get 
more skew because if 
task 5 gets stalled in step 3,
task 6 will finish step 3 and send results to 7 but stall on recv for step 
4 (lacking the end of step 3 send by task 5)
task 7 will finish step 4 and send results to 0  but stall on recv for 
step 5
task 0 will finish step 5 and send results to 1  but stall on recv for 
step 6
etc
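
A minimal sketch of that rank-ordered ring (payloads and step count are made
up); the receive at the end of each step is the data dependency that lets the
skew spread out a few steps but no further:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    double my_result = 0.0, prev_result = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int prev = (rank - 1 + size) % size;
    int next = (rank + 1) % size;

    for (int step = 0; step < 8; step++) {
        my_result = rank + step + prev_result;       /* ... "compute" ... */
        /* pass my step-n result downstream and pick up the upstream
         * step-n result needed for step n+1; a stalled rank delays its
         * downstream neighbour one step later, that neighbour's
         * neighbour two steps later, and so on */
        MPI_Sendrecv(&my_result,   1, MPI_DOUBLE, next, step,
                     &prev_result, 1, MPI_DOUBLE, prev, step,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}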

In a 2D or 3D grid, the dependency is tighter so the possible skew is 
less, but it is still significant on a huge grid.  In a program with 
frequent calls to MPI_Allreduce on COMM_WORLD, the skew is very limited. 
The available skew gets harder to predict as the interdependencies grow 
more complex.

I call this "elasticity" because the amount of stretch varies but, like a 
bungee cord or a waistband, only goes so far. Every parallel program has 
some degree of elasticity built into the way its parts interact.

I assume a coupler has some elasticity too. That is, ocean and atmosphere 
each model Monday and report in to the coupler, but neither can model Tuesday 
until they get some of the Monday results generated by the other. (I am 
pretending the granularity is day by day.)  Wouldn't the right level of 
synchronization among components result automatically from the data 
dependencies among them?



Dick Treumann  -  MPI Team 
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363




From:
Eugene Loh <eugene@oracle.com>
To:
Open MPI Users <us...@open-mpi.org>
Date:
09/09/2010 12:40 PM
Subject:
Re: [OMPI users] MPI_Reduce performance
Sent by:
users-boun...@open-mpi.org



Gus Correa wrote:

> More often than not some components lag behind (regardless of how
> much you tune the number of processors assigned to each component),
> slowing down the whole scheme.
> The coupler must sit and wait for that late component,
> the other components must sit and wait for the coupler,
> and the (vicious) "positive feedback" cycle that
> Ashley mentioned goes on and on.

I think "sit and wait" is the "typical" scenario that Dick mentions. 
Someone lags, so someone else has to wait.

In contrast, the "feedback" cycle Ashley mentions is where someone lags 
and someone else keeps racing ahead, pumping even more data at the 
laggard, forcing the laggard ever further behind.
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Eugene Loh

Gus Correa wrote:


More often than not some components lag behind (regardless of how
much you tune the number of processors assigned to each component),
slowing down the whole scheme.
The coupler must sit and wait for that late component,
the other components must sit and wait for the coupler,
and the (vicious) "positive feedback" cycle that
Ashley mentioned goes on and on.


I think "sit and wait" is the "typical" scenario that Dick mentions.  
Someone lags, so someone else has to wait.


In contrast, the "feedback" cycle Ashley mentions is where someone lags 
and someone else keeps racing ahead, pumping even more data at the 
laggard, forcing the laggard ever further behind.


Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Gus Correa

Hello All

Gabrielle's question, Ashley's recipe, and Dick Treutmann's cautionary 
words, may be part of a larger context of load balance, or not?


Would Ashley's recipe of sporadic barriers be a silver bullet to
improve load imbalance problems, regardless of which collectives or
even point-to-point calls are in use?

I have in mind for instance our big climate models.
Some of them work in MPMD mode, where several executables
representing atmosphere, ocean, etc, have their own
communicators, but interact with each other indirectly,
coordinated by a flux coupler (within yet another communicator).
The coupler receives, merges, and sends data across the other (physical) 
components.

The components don't talk to each other:
the coupler is the broker, the master.
This structure may be used in other fields and areas of application,
I'd guess.

More often than not some components lag behind (regardless of how
much you tune the number of processors assigned to each component),
slowing down the whole scheme.
The coupler must sit and wait for that late component,
the other components must sit and wait for the coupler,
and the (vicious) "positive feedback" cycle that
Ashley mentioned goes on and on.

Would sporadic barriers in the flux coupler "shake up" these delays?

Ashley:  How did you get to the magic number of 25 iterations for the
sporadic barriers?
Would it be application/communicator pattern dependent?

Many thanks,
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-


Richard Treumann wrote:


Ashley's observation may apply to an application that iterates on 
many-to-one communication patterns. If the only collective used is 
MPI_Reduce, some non-root tasks can get ahead and keep pushing iteration 
results at tasks that are nearer the root. This could overload them and 
cause some extra slowdown.  

In most parallel applications, there is some web of interdependency 
across tasks between iterations that keeps them roughly in step.  I find 
it hard to believe that there are many programs that need semantically 
redundant MPI_Barriers.


For example -

In a program that does neighbor communication, no task can get very far 
ahead of its neighbors.  It is possible for a task at one corner to be a 
few steps ahead of one at the opposite corner, but only a few steps. In 
this case though, the distant neighbor is not being affected by that 
task that is out ahead anyway. It is only affected by its immediate 
neighbors.


In a program that does an MPI_Bcast from root and an MPI_Reduce to root 
in each iteration, no task gets far ahead because the tasks that finish 
the Bcast early just wait longer at the Reduce.


A program that makes a call to a non-rooted collective every iteration 
will stay in pretty tight sync.


Think carefully before tossing in either MPI_Barrier or some 
non-blocking barrier.  Unless MPI_Bcast or MPI_Reduce is the only 
collective you call, your problem is likely not progress skew.



Dick Treumann  -  MPI Team  
IBM Systems & Technology Group

Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363



From:   Ashley Pittman <ash...@pittman.co.uk>
To: Open MPI Users <us...@open-mpi.org>
Date:   09/09/2010 03:53 AM
Subject:Re: [OMPI users] MPI_Reduce performance
Sent by:users-boun...@open-mpi.org







On 9 Sep 2010, at 08:31, Terry Frankcombe wrote:

 > On Thu, 2010-09-09 at 01:24 -0600, Ralph Castain wrote:
 >> As people have said, these time values are to be expected. All they
 >> reflect is the time difference spent in reduce waiting for the slowest
 >> process to catch up to everyone else. The barrier removes that factor
 >> by forcing all processes to start from the same place.
 >>
 >>
 >> No mystery here - just a reflection of the fact that your processes
 >> arrive at the MPI_Reduce calls at different times.
 >
 >
 > Yes, however, it seems Gabriele is saying the total execution time
 > *drops* by ~500 s when the barrier is put *in*.  (Is that the right way
 > around, Gabriele?)
 >
 > That's harder to explain as a sync issue.

Not really, you need some way of keeping processes in sync or else the 
slow ones get slower and the fast ones stay fast.  If you have an 
un-balanced algorithm then you can end up swamping certain ranks and 
when they get behind they get even slower and performance goes off a 
cliff edge.


Adding sporadic barriers keeps everything in sync and running nicely; if 
things are performing well then the barrier only slows things down, but 
if there is a problem it'll bring all processes back together and destroy 
the positive feedback cycle.

Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Richard Treumann
Ashley's observation may apply to an application that iterates on 
many-to-one communication patterns. If the only collective used is 
MPI_Reduce, some non-root tasks can get ahead and keep pushing iteration 
results at tasks that are nearer the root. This could overload them and 
cause some extra slowdown. 

In most parallel applications, there is some web of interdependency across 
tasks between iterations that keeps them roughly in step.  I find it hard 
to believe that there are many programs that need semantically redundant 
MPI_Barriers.

For example -

In a program that does neighbor communication, no task can get very far 
ahead of its neighbors.  It is possible for a task at one corner to be a 
few steps ahead of one at the opposite corner, but only a few steps. In 
this case though, the distant neighbor is not being affected by that task 
that is out ahead anyway. It is only affected by its immediate neighbors.

In a program that does an MPI_Bcast from root and an MPI_Reduce to root in 
each iteration, no task gets far ahead because the tasks that finish the 
Bcast early just wait longer at the Reduce.

A program that makes a call to a non-rooted collective every iteration 
will stay in pretty tight sync.

Think carefully before tossing in either MPI_Barrier or some non-blocking 
barrier.  Unless MPI_Bcast or MPI_Reduce is the only collective you call, 
your problem is likely not progress skew.


Dick Treumann  -  MPI Team 
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363




From:
Ashley Pittman <ash...@pittman.co.uk>
To:
Open MPI Users <us...@open-mpi.org>
Date:
09/09/2010 03:53 AM
Subject:
Re: [OMPI users] MPI_Reduce performance
Sent by:
users-boun...@open-mpi.org




On 9 Sep 2010, at 08:31, Terry Frankcombe wrote:

> On Thu, 2010-09-09 at 01:24 -0600, Ralph Castain wrote:
>> As people have said, these time values are to be expected. All they
>> reflect is the time difference spent in reduce waiting for the slowest
>> process to catch up to everyone else. The barrier removes that factor
>> by forcing all processes to start from the same place.
>> 
>> 
>> No mystery here - just a reflection of the fact that your processes
>> arrive at the MPI_Reduce calls at different times.
> 
> 
> Yes, however, it seems Gabriele is saying the total execution time
> *drops* by ~500 s when the barrier is put *in*.  (Is that the right way
> around, Gabriele?)
> 
> That's harder to explain as a sync issue.

Not really, you need some way of keeping processes in sync or else the 
slow ones get slower and the fast ones stay fast.  If you have an 
un-balanced algorithm then you can end up swamping certain ranks and when 
they get behind they get even slower and performance goes off a cliff 
edge.

Adding sporadic barriers keeps everything in sync and running nicely; if 
things are performing well then the barrier only slows things down, but if 
there is a problem it'll bring all processes back together and destroy the 
positive feedback cycle.  This is why you often only need a 
synchronisation point every so often.  I'm also a huge fan of asynchronous 
barriers, as a full sync is a blunt and slow operation; using asynchronous 
barriers you can allow small differences in timing but prevent them from 
getting too large, with very little overhead in the common case where 
processes are synced already.  I'm thinking specifically of starting a 
sync-barrier on iteration N, waiting for it on N+25 and immediately 
starting another one, again waiting for it 25 steps later.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Ralph Castain

On Sep 9, 2010, at 1:46 AM, Ashley Pittman wrote:

> 
> On 9 Sep 2010, at 08:31, Terry Frankcombe wrote:
> 
>> On Thu, 2010-09-09 at 01:24 -0600, Ralph Castain wrote:
>>> As people have said, these time values are to be expected. All they
>>> reflect is the time difference spent in reduce waiting for the slowest
>>> process to catch up to everyone else. The barrier removes that factor
>>> by forcing all processes to start from the same place.
>>> 
>>> 
>>> No mystery here - just a reflection of the fact that your processes
>>> arrive at the MPI_Reduce calls at different times.
>> 
>> 
>> Yes, however, it seems Gabriele is saying the total execution time
>> *drops* by ~500 s when the barrier is put *in*.  (Is that the right way
>> around, Gabriele?)
>> 
>> That's harder to explain as a sync issue.
> 
> Not really, you need some way of keeping processes in sync or else the slow 
> ones get slower and the fast ones stay fast.  If you have an un-balanced 
> algorithm then you can end up swamping certain ranks and when they get behind 
> they get even slower and performance goes off a cliff edge.
> 
> Adding sporadic barriers keeps everything in sync and running nicely, if 
> things are performing well then the barrier only slows things down but if 
> there is a problem it'll bring all processes back together and destroy the 
> positive feedback cycle.  This is why you often only need a synchronisation 
> point every so often,

Precisely. And that is why we added the "sync" collective, which inserts a 
barrier every so often, since we don't have async barriers at this time.  It 
also helps clean up memory growth due to unanticipated messages arriving during 
unbalanced algorithms.

See ompi_info --param coll all to see how to enable it.

> I'm also a huge fan of asynchronous barriers as a full sync is a blunt and 
> slow operation, using asynchronous barriers you can allow small differences in 
> timing but prevent them from getting too large with very little overhead in 
> the common case where processes are synced already.  I'm thinking 
> specifically of starting a sync-barrier on iteration N, waiting for it on 
> N+25 and immediately starting another one, again waiting for it 25 steps 
> later.



> 
> Ashley.
> 
> -- 
> 
> Ashley Pittman, Bath, UK.
> 
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Ashley Pittman

On 9 Sep 2010, at 08:31, Terry Frankcombe wrote:

> On Thu, 2010-09-09 at 01:24 -0600, Ralph Castain wrote:
>> As people have said, these time values are to be expected. All they
>> reflect is the time difference spent in reduce waiting for the slowest
>> process to catch up to everyone else. The barrier removes that factor
>> by forcing all processes to start from the same place.
>> 
>> 
>> No mystery here - just a reflection of the fact that your processes
>> arrive at the MPI_Reduce calls at different times.
> 
> 
> Yes, however, it seems Gabriele is saying the total execution time
> *drops* by ~500 s when the barrier is put *in*.  (Is that the right way
> around, Gabriele?)
> 
> That's harder to explain as a sync issue.

Not really, you need some way of keeping processes in sync or else the slow 
ones get slower and the fast ones stay fast.  If you have an un-balanced 
algorithm then you can end up swamping certain ranks and when they get behind 
they get even slower and performance goes off a cliff edge.

Adding sporadic barriers keeps everything in sync and running nicely; if things 
are performing well then the barrier only slows things down, but if there is a 
problem it'll bring all processes back together and destroy the positive feedback 
cycle.  This is why you often only need a synchronisation point every so often.  
I'm also a huge fan of asynchronous barriers, as a full sync is a blunt and slow 
operation; using asynchronous barriers you can allow small differences in timing 
but prevent them from getting too large, with very little overhead in the common 
case where processes are synced already.  I'm thinking specifically of starting 
a sync-barrier on iteration N, waiting for it on N+25 and immediately starting 
another one, again waiting for it 25 steps later.
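
A minimal sketch of that rolling scheme (the interval of 25 comes from the
description above; the work routine is a placeholder, and MPI_Ibarrier is the
proposed non-blocking barrier, shown purely for illustration):

#include <mpi.h>

#define SYNC_INTERVAL 25

static void do_one_iteration(int step) { (void)step; /* app work goes here */ }

int main(int argc, char **argv)
{
    MPI_Request barrier_req = MPI_REQUEST_NULL;
    int nsteps = 1000;

    MPI_Init(&argc, &argv);

    for (int step = 0; step < nsteps; step++) {
        do_one_iteration(step);

        if (step % SYNC_INTERVAL == 0) {
            /* complete the barrier started SYNC_INTERVAL iterations ago;
             * if the ranks are already roughly in sync this returns
             * almost immediately (waiting on MPI_REQUEST_NULL is a no-op) */
            MPI_Wait(&barrier_req, MPI_STATUS_IGNORE);
            /* ... and immediately start the next one */
            MPI_Ibarrier(MPI_COMM_WORLD, &barrier_req);
        }
    }
    MPI_Wait(&barrier_req, MPI_STATUS_IGNORE);       /* drain the last one */

    MPI_Finalize();
    return 0;
}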

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Gabriele Fatigati
Yes Terry,

that's right.

2010/9/9 Terry Frankcombe 

> On Thu, 2010-09-09 at 01:24 -0600, Ralph Castain wrote:
> > As people have said, these time values are to be expected. All they
> > reflect is the time difference spent in reduce waiting for the slowest
> > process to catch up to everyone else. The barrier removes that factor
> > by forcing all processes to start from the same place.
> >
> >
> > No mystery here - just a reflection of the fact that your processes
> > arrive at the MPI_Reduce calls at different times.
>
>
> Yes, however, it seems Gabriele is saying the total execution time
> *drops* by ~500 s when the barrier is put *in*.  (Is that the right way
> around, Gabriele?)
>
> That's harder to explain as a sync issue.
>
>
>
> > On Sep 9, 2010, at 1:14 AM, Gabriele Fatigati wrote:
> >
> > > More in depth,
> > >
> > >
> > > total execution time without Barrier is about 1 sec.
> > >
> > >
> > > Total execution time with Barrier+Reduce is 9453, with 128 procs.
> > >
> > > 2010/9/9 Terry Frankcombe 
> > > Gabriele,
> > >
> > > Can you clarify... those timings are what is reported for
> > > the reduction
> > > call specifically, not the total execution time?
> > >
> > > If so, then the difference is, to a first approximation, the
> > > time you
> > > spend sitting idly by doing absolutely nothing waiting at
> > > the barrier.
> > >
> > > Ciao
> > > Terry
> > >
> > >
> > > --
> > > Dr. Terry Frankcombe
> > > Research School of Chemistry, Australian National University
> > > Ph: (+61) 0417 163 509Skype: terry.frankcombe
> > >
> > >
> > > ___
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Ing. Gabriele Fatigati
> > >
> > > Parallel programmer
> > >
> > > CINECA Systems & Tecnologies Department
> > >
> > > Supercomputing Group
> > >
> > > Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
> > >
> > > www.cineca.itTel:   +39 051 6171722
> > >
> > > g.fatigati [AT] cineca.it
> > >
> > > ___
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>


-- 
Ing. Gabriele Fatigati

Parallel programmer

CINECA Systems & Tecnologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.itTel:   +39 051 6171722

g.fatigati [AT] cineca.it


Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Gabriele Fatigati
Mm, I don't understand.
The experiments on my application show that intensive use of
Barrier+Reduce is faster than a single Reduce.

2010/9/9 Ralph Castain 

> As people have said, these time values are to be expected. All they reflect
> is the time difference spent in reduce waiting for the slowest process to
> catch up to everyone else. The barrier removes that factor by forcing all
> processes to start from the same place.
>
> No mystery here - just a reflection of the fact that your processes arrive
> at the MPI_Reduce calls at different times.
>
>
> On Sep 9, 2010, at 1:14 AM, Gabriele Fatigati wrote:
>
> More in depth,
>
> total execution time without Barrier is about 1 sec.
>
> Total execution time with Barrier+Reduce is 9453, with 128 procs.
>
> 2010/9/9 Terry Frankcombe 
>
>> Gabriele,
>>
>> Can you clarify... those timings are what is reported for the reduction
>> call specifically, not the total execution time?
>>
>> If so, then the difference is, to a first approximation, the time you
>> spend sitting idly by doing absolutely nothing waiting at the barrier.
>>
>> Ciao
>> Terry
>>
>>
>> --
>> Dr. Terry Frankcombe
>> Research School of Chemistry, Australian National University
>> Ph: (+61) 0417 163 509Skype: terry.frankcombe
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>
>
> --
> Ing. Gabriele Fatigati
>
> Parallel programmer
>
> CINECA Systems & Tecnologies Department
>
> Supercomputing Group
>
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>
> www.cineca.itTel:   +39 051 6171722
>
> g.fatigati [AT] cineca.it
>  ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Ing. Gabriele Fatigati

Parallel programmer

CINECA Systems & Tecnologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.itTel:   +39 051 6171722

g.fatigati [AT] cineca.it


Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Terry Frankcombe
On Thu, 2010-09-09 at 01:24 -0600, Ralph Castain wrote:
> As people have said, these time values are to be expected. All they
> reflect is the time difference spent in reduce waiting for the slowest
> process to catch up to everyone else. The barrier removes that factor
> by forcing all processes to start from the same place.
> 
> 
> No mystery here - just a reflection of the fact that your processes
> arrive at the MPI_Reduce calls at different times.


Yes, however, it seems Gabriele is saying the total execution time
*drops* by ~500 s when the barrier is put *in*.  (Is that the right way
around, Gabriele?)

That's harder to explain as a sync issue.



> On Sep 9, 2010, at 1:14 AM, Gabriele Fatigati wrote:
> 
> > More in depth,
> > 
> > 
> > total execution time without Barrier is about 1 sec.
> > 
> > 
> > Total execution time with Barrier+Reduce is 9453, with 128 procs.
> > 
> > 2010/9/9 Terry Frankcombe 
> > Gabriele,
> > 
> > Can you clarify... those timings are what is reported for
> > the reduction
> > call specifically, not the total execution time?
> > 
> > If so, then the difference is, to a first approximation, the
> > time you
> > spend sitting idly by doing absolutely nothing waiting at
> > the barrier.
> > 
> > Ciao
> > Terry
> > 
> > 
> > --
> > Dr. Terry Frankcombe
> > Research School of Chemistry, Australian National University
> > Ph: (+61) 0417 163 509Skype: terry.frankcombe
> > 
> > 
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > 
> > 
> > 
> > 
> > 
> > -- 
> > Ing. Gabriele Fatigati
> > 
> > Parallel programmer
> > 
> > CINECA Systems & Tecnologies Department
> > 
> > Supercomputing Group
> > 
> > Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
> > 
> > www.cineca.itTel:   +39 051 6171722
> > 
> > g.fatigati [AT] cineca.it   
> > 
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Ralph Castain
As people have said, these time values are to be expected. All they reflect is 
the time difference spent in reduce waiting for the slowest process to catch up 
to everyone else. The barrier removes that factor by forcing all processes to 
start from the same place.

No mystery here - just a reflection of the fact that your processes arrive at 
the MPI_Reduce calls at different times.


On Sep 9, 2010, at 1:14 AM, Gabriele Fatigati wrote:

> More in depth,
> 
> total execution time without Barrier is about 1 sec.
> 
> Total execution time with Barrier+Reduce is 9453, with 128 procs.
> 
> 2010/9/9 Terry Frankcombe 
> Gabriele,
> 
> Can you clarify... those timings are what is reported for the reduction
> call specifically, not the total execution time?
> 
> If so, then the difference is, to a first approximation, the time you
> spend sitting idly by doing absolutely nothing waiting at the barrier.
> 
> Ciao
> Terry
> 
> 
> --
> Dr. Terry Frankcombe
> Research School of Chemistry, Australian National University
> Ph: (+61) 0417 163 509Skype: terry.frankcombe
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> 
> 
> -- 
> Ing. Gabriele Fatigati
> 
> Parallel programmer
> 
> CINECA Systems & Tecnologies Department
> 
> Supercomputing Group
> 
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
> 
> www.cineca.itTel:   +39 051 6171722
> 
> g.fatigati [AT] cineca.it   
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Gabriele Fatigati
More in depth,

total execution time without Barrier is about 1 sec.

Total execution time with Barrier+Reduce is 9453, with 128 procs.

2010/9/9 Terry Frankcombe 

> Gabriele,
>
> Can you clarify... those timings are what is reported for the reduction
> call specifically, not the total execution time?
>
> If so, then the difference is, to a first approximation, the time you
> spend sitting idly by doing absolutely nothing waiting at the barrier.
>
> Ciao
> Terry
>
>
> --
> Dr. Terry Frankcombe
> Research School of Chemistry, Australian National University
> Ph: (+61) 0417 163 509Skype: terry.frankcombe
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>


-- 
Ing. Gabriele Fatigati

Parallel programmer

CINECA Systems & Tecnologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.itTel:   +39 051 6171722

g.fatigati [AT] cineca.it


Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread Gabriele Fatigati
Hi Terry,

this time is spent in MPI_Reduce; it isn't the total execution time.

2010/9/9 Terry Frankcombe 

> Gabriele,
>
> Can you clarify... those timings are what is reported for the reduction
> call specifically, not the total execution time?
>
> If so, then the difference is, to a first approximation, the time you
> spend sitting idly by doing absolutely nothing waiting at the barrier.
>
> Ciao
> Terry
>
>
> --
> Dr. Terry Frankcombe
> Research School of Chemistry, Australian National University
> Ph: (+61) 0417 163 509Skype: terry.frankcombe
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>


-- 
Ing. Gabriele Fatigati

Parallel programmer

CINECA Systems & Tecnologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.itTel:   +39 051 6171722

g.fatigati [AT] cineca.it


Re: [OMPI users] MPI_Reduce performance

2010-09-08 Thread Terry Frankcombe
Gabriele,

Can you clarify... those timings are what is reported for the reduction
call specifically, not the total execution time?

If so, then the difference is, to a first approximation, the time you
spend sitting idly by doing absolutely nothing waiting at the barrier.

Ciao
Terry


-- 
Dr. Terry Frankcombe
Research School of Chemistry, Australian National University
Ph: (+61) 0417 163 509Skype: terry.frankcombe



Re: [OMPI users] MPI_Reduce performance

2010-09-08 Thread Ashley Pittman

On 8 Sep 2010, at 10:21, Gabriele Fatigati wrote:
> So, in my opinion, it is better to put MPI_Barrier before any MPI_Reduce to 
> mitigate the "asynchronous" behaviour of MPI_Reduce in OpenMPI. I suspect the 
> same for other collective communications. Can someone explain to me why 
> MPI_Reduce has this strange behaviour?

There are many cases where adding an explicit barrier before a call to 
reduce would be superfluous, so the standard rightly says that it isn't needed 
and need not be performed.  As you've seen, though, there are also cases where it 
can help.  I'd be interested to know the effect if you only added a barrier 
before MPI_Reduce occasionally, perhaps every one or two hundred iterations; 
this can also be beneficial, as a barrier every iteration adds 
significant overhead.
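
A minimal sketch of that sporadic-barrier variant (the interval of 100 and the
local computation are placeholders, not Gabriele's actual code):

#include <mpi.h>

#define BARRIER_EVERY 100

int main(int argc, char **argv)
{
    double local, global;
    int rank, nsteps = 10000;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int step = 0; step < nsteps; step++) {
        local = rank + step;                   /* ... real work here ... */

        if (step % BARRIER_EVERY == 0)         /* occasional resync only */
            MPI_Barrier(MPI_COMM_WORLD);

        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}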

This is a textbook example of where the new asynchronous barrier could help; in 
theory it should have the effect of being able to keep processes in sync without 
any additional overhead in the case that they are already well synchronised.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] MPI_Reduce performance

2010-09-08 Thread David Zhang
Doing Reduce without a Barrier first allows one process to call Reduce and
exit immediately without waiting for other processes to call Reduce.
Therefore, this allows one process to advance faster than other processes.
I suspect the 2671 second result is the difference between the fastest and
slowest process.  Having the Barrier reduces the time difference because it
forces the faster processes to go slower.
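
One way to see this in a toy experiment is to time the reduce itself with
MPI_Wtime; the sketch below fakes the imbalance with usleep and is purely
illustrative - Gabriele's real numbers came from Scalasca instrumentation:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, use_barrier = (argc > 1);   /* any argument enables the barrier */
    double local = 1.0, global, t0, t_reduce = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int step = 0; step < 1000; step++) {
        usleep(1000 * (rank % 4));        /* artificial load imbalance */

        if (use_barrier)
            MPI_Barrier(MPI_COMM_WORLD);

        t0 = MPI_Wtime();
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        t_reduce += MPI_Wtime() - t0;     /* the waiting shows up here */
    }

    printf("rank %d: %.3f s inside MPI_Reduce\n", rank, t_reduce);
    MPI_Finalize();
    return 0;
}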

On Wed, Sep 8, 2010 at 3:21 AM, Gabriele Fatigati wrote:

> Dear OpenMPI users,
>
> i'm using OpenMPI 1.3.3 on an Infiniband 4x interconnection network. My
> parallel application uses intensive MPI_Reduce communication over a
> communicator created with MPI_Comm_split.
>
> I've noted strange behaviour during execution. My code is instrumented with
> Scalasca 1.3 to report subroutine execution time. The first execution shows
> the elapsed time with 128 processors (job_communicator is created with
> MPI_Comm_split). In both cases it is composed of the same ranks as
> MPI_COMM_WORLD:
>
> MPI_Reduce(.,job_communicator)
>
> The elapsed time is 2671 sec.
>
> The second run uses MPI_BARRIER before MPI_Reduce:
>
> MPI_Barrier(job_communicator..)
> MPI_Reduce(.,job_communicator)
>
> The elapsed time of Barrier+Reduce is 2167 sec (about 8 minutes less).
>
> So, in my opinion, it is better to put MPI_Barrier before any MPI_Reduce to
> mitigate the "asynchronous" behaviour of MPI_Reduce in OpenMPI. I suspect the
> same for other collective communications. Can someone explain to me why
> MPI_Reduce has this strange behaviour?
>
> Thanks in advance.
>
>
>
>
> --
> Ing. Gabriele Fatigati
>
> Parallel programmer
>
> CINECA Systems & Tecnologies Department
>
> Supercomputing Group
>
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>
> www.cineca.itTel:   +39 051 6171722
>
> g.fatigati [AT] cineca.it
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
David Zhang
University of California, San Diego