Re: [OMPI devel] Major commit to trunk

2007-06-12 Thread Jeff Squyres

Sounds good Ralph; thanks!

On Jun 12, 2007, at 9:54 AM, Ralph H Castain wrote:


[OMPI devel] Major commit to trunk

2007-06-12 Thread Ralph H Castain
Yo all

I made a major commit to the trunk this morning (r15007) that merits general
notification and some explanation.

   *** IMPORTANT NOTE ***
One major impact of the commit you *may* notice is that support for several
environments will be broken. This commit is known to break support for the
following environments: POE, Xgrid, Xcpu, Windows - these environments will
not compile at this time. It has been tested on rsh, SLURM, and Bproc.
Modifications for TM support have been made but could not be verified due to
machine problems at LANL. Modifications for SGE have been made but could not
be verified. I will send out a separate note to developers of the borked
environments with suggestions on how to fix the problems. These should be
relatively minor, mostly involving a minor change to a couple of function
calls and the addition of one function call in their respective launch
functions.


As many of you have noted, the ORTE launch procedure relies heavily on the
orte_rml.xcast function to communicate messages - some of them quite large -
to every process in a job. This procedure has - until now - been a linear
communication that sent the messages directly to every process. Obviously,
as many of you have pointed out, this was a very inefficient approach.

This commit repairs that problem, but it comes with a few side effects. You
shouldn't notice anything different (except hopefully for faster starts),
but I will note the differences here.

First, orte_rml.xcast has become a general broadcast-like messaging system.
Messages can now be sent to any tag on the daemons or processes. Note that
any message sent via xcast will be delivered to ALL processes in the
specified job - you don't get to pick and choose. At a later date, we will
introduce an augmented capability that will use the daemons as relays, but
will allow you to send to a specified array of process names.

We also modified orte_rml.xcast so it supports more scalable message routing
methodologies. At the moment, we support three:

(a) direct, which sends the message directly to all recipients. By default,
this mode is used whenever we have fewer than 10 daemons. You can adjust that
crossover point via the oob_xcast_linear_xover param - set this param to the
number of daemons where you want direct to give way to linear. Obviously, if
you set this to something very large, then we will only use direct xcast
mode - set it to zero, and we won't use direct at all. Alternatively, you
can force the use of direct at all scales by setting oob_xcast_mode to
"direct".

(b) linear, which sends the message to the local daemon on each node. The
daemon then relays it to its own local procs. Note that the daemons in this
mode do not relay the message between themselves, but only to their local
procs. As per a prior message, I have set linear to be the default mode on
all jobs involving 10 or more daemons. Again, you can adjust this by
setting a lower bound on where linear kicks in (as described above). You can
also set an upper bound where linear gives way to binomial by setting the
oob_xcast_binomial_xover param. Alternatively, you can force the use of
linear at all scales by setting oob_xcast_mode to "linear".
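A rough way to see the difference in fan-out at the root: in direct mode,
mpirun sends one copy per process, while in linear mode it sends one copy per
daemon, and each daemon delivers locally. A minimal sketch (the function
names and layout are illustrative only, not the actual ORTE code):

```python
# Illustrative sketch of root fan-out under the direct vs. linear xcast
# modes described above. Not the ORTE implementation.

def direct_sends(procs_per_daemon):
    """Messages mpirun sends in direct mode: one per process."""
    return sum(procs_per_daemon)

def linear_sends(procs_per_daemon):
    """Messages mpirun sends in linear mode: one per daemon.

    Each daemon then relays the message to its own local procs; daemons
    do not relay between themselves in this mode.
    """
    return len(procs_per_daemon)

# Example: 16 nodes with 4 procs each - direct costs mpirun 64 sends,
# linear only 16, with the remaining delivery done locally on each node.
layout = [4] * 16
```

The point of the linear mode is thus to move the per-process delivery cost
off mpirun and onto the daemons, at the price of one extra hop.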

(c) binomial, which sends the message across all the daemons using a typical
binomial tree algorithm; each daemon then relays it to its own local procs.
At this time, I have set the default on this mode to be "off" so it will
never kick in. If you want to
try it out, you will need to either adjust the xover param (as described
above), or set oob_xcast_mode to "binomial".
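For those unfamiliar with the pattern, a binomial broadcast doubles the
number of daemons holding the message each round, so all of them receive it
in a logarithmic number of steps. A small simulation of the idea (a
hypothetical sketch, not the ORTE code):

```python
# Illustrative binomial broadcast over the daemons. Daemon 0 (mpirun)
# starts with the message; in each round, every daemon that already has
# it forwards one copy to a daemon that does not, so all n daemons
# receive the message in ceil(log2(n)) rounds.

def binomial_rounds(n):
    """Simulate the broadcast and return the number of rounds needed."""
    have = {0}                # daemon 0 holds the message initially
    rounds = 0
    while len(have) < n:
        step = 1 << rounds    # distance each holder forwards this round
        for d in list(have):
            peer = d + step
            if peer < n:
                have.add(peer)   # d relays the message to peer
        rounds += 1
    return rounds

# e.g. 16 daemons are covered in 4 rounds, versus 16 sends from the
# root in direct mode.
```

Each daemon then hands the message to its own local procs, exactly as in the
linear mode.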

Please note that we *do* use the direct messaging mode whenever there is
only one daemon in the system. This is non-negotiable - it is mandated for
singletons and for getting mpirun up and running. Besides, if there is only
one daemon in the system, every message goes "direct" no matter which mode
you pick, so you shouldn't care. ;-)
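Putting the rules above together, the selection policy can be sketched as
follows. This is a hypothetical reconstruction of the behavior described in
this mail, not the actual ORTE source; the defaults model the values
mentioned above (linear crossover at 10 daemons, binomial off):

```python
def select_xcast_mode(num_daemons, mode=None,
                      linear_xover=10, binomial_xover=float("inf")):
    """Sketch of the xcast mode-selection policy described above.

    mode:           a forced oob_xcast_mode value, if any
    linear_xover:   oob_xcast_linear_xover - direct below, linear at/above
    binomial_xover: oob_xcast_binomial_xover - binomial at/above; the
                    default of infinity models binomial being "off"
    """
    if num_daemons == 1:
        return "direct"     # non-negotiable: singletons and mpirun startup
    if mode is not None:
        return mode         # oob_xcast_mode overrides both crossovers
    if num_daemons < linear_xover:
        return "direct"
    if num_daemons < binomial_xover:
        return "linear"
    return "binomial"
```

So setting oob_xcast_linear_xover very large keeps everything direct, setting
it to zero disables direct (beyond the single-daemon case), and lowering
oob_xcast_binomial_xover is what would bring the binomial mode into play.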

Also note that the current crossover points were totally arbitrary. I have
no data to base those crossovers on, so I simply picked something for now.
If those of you with access to larger systems and with some free time could
try various values, then we could come up with something more intelligent.
Any data would be most appreciated!

This commit also involved a significant change to the orteds themselves. The
requirement that orteds *always* be available to relay messages mandated a
change in the way they come alive. In the past, orteds bootstrapped
themselves in two totally different code paths: bootproxy or VM. This is no
longer supported. Orteds now always behave like they are part of a virtual
machine - they simply launch a job if mpirun tells them to do so. This is
another step towards creating an "orteboot" functionality, but also provided
a clean system for supporting message relaying.

Note one major impact of this commit: multiple daemons on a node cannot be
supported any longer! Only a single