Re: [OMPI devel] Major revision to the RML/OOB

2006-12-08 Thread Adrian Knoth
On Thu, Dec 07, 2006 at 11:12:23AM -0500, Jeff Squyres wrote:

Hi,

> > I therefore suggest moving the OPAL changes into the trunk, along with
> > the small hostfile code (the lex code for IPv6) and the btl code.
> Can you describe the changes in opal that were made for IPv6?

These changes are limited to three files: opal/util/if.[ch] and
the new opal/include/opal/ipv6compat.h. The latter one is only
required for compatibility with old SUSv2 systems.

In if.c, I've added IPv6 interface discovery for Linux and Solaris;
Thomas Peiselt also contributed getifaddrs() support for *BSD/OSX.
The helper functions were extended to deal with struct sockaddr_storage.

I've introduced CIDR netmask handling, so the netmask field no longer
holds something like 255.0.0.0, 255.255.0.0 (and so on), but simply 8, 16
or whatever. There are helper functions to convert to and from CIDR
notation.

/* convert a netmask (in network byte order) to CIDR notation */
static int prefix (uint32_t netmask)

/* convert a CIDR prefixlen to netmask (in network byte order) */
uint32_t opal_prefix2netmask (uint32_t prefixlen)
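
Roughly, the conversions boil down to counting and shifting bits. The
following is only a simplified sketch of the idea, not necessarily the
exact code in the branch:

#include <arpa/inet.h>
#include <stdint.h>

/* count the leading one-bits of a netmask given in network byte order,
 * e.g. 255.255.255.0 -> 24 */
static int prefix (uint32_t netmask)
{
    uint32_t mask = ntohl(netmask);
    int plen = 0;

    while (mask & 0x80000000u) {
        plen++;
        mask <<= 1;
    }
    return plen;
}

/* build a netmask in network byte order from a CIDR prefix length,
 * e.g. 24 -> 255.255.255.0 */
uint32_t opal_prefix2netmask (uint32_t prefixlen)
{
    if (prefixlen == 0) {
        return 0;
    }
    return htonl(~((1u << (32 - prefixlen)) - 1));
}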

I've also extended the interface struct. It still contains if_index,
but that's just its position in the opal_list. The new field is called
if_kernel_index and holds the kernel's interface index for this device.
My BTL/TCP code also exchanges this new information so the remote side
can detect whether two or more addresses are assigned to the same
interface, thus preventing oversubscription (multiple connections to the
same interface but to different addresses, which is very likely if you
have at least one IPv6 address and one IPv4 address on the same
interface).

The code in if.c handles both AF_INET and AF_INET6, so it's no problem
to use it even if IPv6 isn't used anywhere else (i.e. in oob/tcp or
btl/tcp).
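
To illustrate what the kernel index buys us (this is just a standalone
sketch built on getifaddrs()/if_nametoindex(), not the opal code itself),
two addresses reported with the same kernel index belong to the same
device:

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <ifaddrs.h>
#include <net/if.h>
#include <netdb.h>

int main (void)
{
    struct ifaddrs *ifap, *ifa;
    char host[NI_MAXHOST];

    if (getifaddrs(&ifap) != 0) {
        perror("getifaddrs");
        return 1;
    }

    for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr == NULL) {
            continue;
        }
        if (ifa->ifa_addr->sa_family != AF_INET &&
            ifa->ifa_addr->sa_family != AF_INET6) {
            continue;
        }
        if (getnameinfo(ifa->ifa_addr,
                        (ifa->ifa_addr->sa_family == AF_INET)
                            ? sizeof(struct sockaddr_in)
                            : sizeof(struct sockaddr_in6),
                        host, sizeof(host), NULL, 0,
                        NI_NUMERICHOST) != 0) {
            continue;
        }
        /* same kernel index -> same physical interface */
        printf("%-8s kernel_index=%u  %s\n",
               ifa->ifa_name, if_nametoindex(ifa->ifa_name), host);
    }
    freeifaddrs(ifap);
    return 0;
}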

HTH

-- 
mail: a...@thur.de  http://adi.thur.de  PGP: v2-key via keyserver

Drink wet cement and get really stoned!


Re: [OMPI devel] Major revision to the RML/OOB

2006-12-07 Thread Jeff Squyres

On Dec 6, 2006, at 9:59 AM, Adrian Knoth wrote:

> > The concern is that we want to leave open the possibility of putting this
> > revision into 1.2 since it will have a major performance impact on both
> > startup time and the max cluster size we can support. The IP6 code is
> > scheduled for 1.3 and we don't know what the performance impact will look
> > like - hence the hesitation.


> I agree not to include IPv6 in v1.2 (you might remove the configure
> patch from the v1.2 line, or leave it there without really using it).
> 
> If one considers the current v1.2 branch as stable, the trunk could
> be used for the new v1.3 line.


That's the plan -- once we fork off a branch for a release series, the
trunk assumes the identity of the next release series.  Hence, there are
branches for 1.0, 1.1, and 1.2, and therefore the trunk is currently the
1.3 series.  Once we branch for 1.3, the trunk will become 1.4.  And so
on.



> I therefore suggest moving the OPAL changes into the trunk, along with
> the small hostfile code (the lex code for IPv6) and the btl code.


Can you describe the changes in opal that were made for IPv6?


> When you've completed all changes to the OOB, we can have a look and do
> the necessary IPv6 changes afterwards. Though I feel the oob/tcp is the
> hardest part of all (it got the most modifications), I hope that it's
> possible to reuse much of the existing patch. Perhaps your rewrite
> simplifies something.


I don't think that it'll change much in your code (a total guess, but
based on what I think needs changing in the oob tcp).  The main things
we'll be changing are *when* socket connections are made and how the tcp
component gets the contact info for the other procs.
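
To make that concrete, here's a generic sketch of the deferred-connection
idea (this is not the actual oob/tcp code, and the host/port below are
made up): contact info just sits in a table, and the socket only gets
created on the first send to that peer.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

/* contact info is loaded up front; the socket is created lazily */
struct peer {
    const char *host;
    const char *port;
    int         fd;       /* -1 until the first communication */
};

static int lazy_send(struct peer *p, const void *buf, size_t len)
{
    if (p->fd < 0) {       /* first communication: connect now */
        struct addrinfo hints, *res, *r;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;        /* works for v4 and v6 */
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(p->host, p->port, &hints, &res) != 0) {
            return -1;
        }
        for (r = res; r != NULL; r = r->ai_next) {
            int fd = socket(r->ai_family, r->ai_socktype, r->ai_protocol);
            if (fd < 0) {
                continue;
            }
            if (connect(fd, r->ai_addr, r->ai_addrlen) == 0) {
                p->fd = fd;
                break;
            }
            close(fd);
        }
        freeaddrinfo(res);
        if (p->fd < 0) {
            return -1;
        }
    }
    return (int) send(p->fd, buf, len, 0);
}

int main(void)
{
    /* loading this entry opens no socket at all */
    struct peer daemon0 = { "localhost", "9999", -1 };

    if (lazy_send(&daemon0, "hello", 5) < 0) {
        perror("lazy_send");
    }
    if (daemon0.fd >= 0) {
        close(daemon0.fd);
    }
    return 0;
}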



> I'm currently not developing new code, so at least the IPv6 codebase
> isn't a moving target.


Excellent.  Thanks for being diligent about this!

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



Re: [OMPI devel] Major revision to the RML/OOB

2006-12-06 Thread Adrian Knoth
On Wed, Dec 06, 2006 at 07:07:42AM -0700, Ralph H Castain wrote:

> The concern is that we want to leave open the possibility of putting this
> revision into 1.2 since it will have a major performance impact on both
> startup time and the max cluster size we can support. The IP6 code is
> scheduled for 1.3 and we don't know what the performance impact will look
> like - hence the hesitation.

I agree not to include IPv6 in v1.2 (you might remove the configure
patch from the v1.2 line, or leave it there without really using it).

If one considers the current v1.2 branch as stable, the trunk could
be used for the new v1.3 line.

I therefore suggest moving the OPAL changes into the trunk, along with
the small hostfile code (the lex code for IPv6) and the btl code.

When you've completed all changes to the OOB, we can have a look and do
the necessary IPv6 changes afterwards. Though I feel the oob/tcp is the
hardest part of all (it got the most modifications), I hope that it's
possible to reuse much of the existing patch. Perhaps your rewrite
simplifies something.

I'm currently not developing new code, so at least the IPv6 codebase
isn't a moving target.


Just let me know if I can help.


-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany


Re: [OMPI devel] Major revision to the RML/OOB

2006-12-06 Thread Ralph H Castain
The changes we are planning will in no way preclude the use of multicast
for the xcast procedure. The changes in the OOB subsystem deal specifically
with how those connections are initialized, which is something we would need
to do for multicast anyway.

The routing method for the xcast is already selectable (at least, on the
trunk) - there is no problem with adding a multicast option in that
procedure. If someone wishes to do so, please feel free! I'm not sure
when/if I'll get around to it.

Ralph


On 12/4/06 3:35 PM, "Jonathan Day"  wrote:

> Whilst I can see these changes being good in the
> general case (most clusters are designed with very
> smart NICs and painfully dumb switches, because that
> produces the best latencies for many topologies), I
> would suggest that we can do better on smarter
> networks.
> 
> There is no obvious reason why you could not establish
> a well-known multicast address/port for out-of-band
> traffic. A reliable multicast protocol, such as SRM,
> NORM or FLUTE could then be used to carry the
> information between nodes.
> 
> The advantage of this approach is that it requires the
> least alteration to the code - a single transmission
> to the group address as opposed to one transmission to
> each target - AND would work perfectly well with the
> new approach described.
> 
> The drawbacks are that it would have to be switchable, though: multicast
> is truly horrible on dumber devices, development resources aren't
> infinite, and the number of cases where it will actually win is limited.
> 
> (It's entirely coincidental that this is a capability
> that I actually need. Well, almost!)
> 
> Jonathan Day
> 
>> Message: 1
>> Date: Mon, 04 Dec 2006 06:26:26 -0700
>> From: Ralph Castain 
>> Subject: [OMPI devel] Major revision to the RML/OOB
>> To: Open MPI Core Developers, Open MPI Developers
>> Message-ID: 
>> Content-Type: text/plain; charset="US-ASCII"
>> 
>> Hello all
>> 
>> If you are interested in the ongoing scalability work, or in the RML/OOB
>> in ORTE, please read on - otherwise, feel free to hit "delete".
>> 
>> As many of you know, we have been working towards solving several
>> problems that affect our ability to operate at large scale. Some of the
>> required modifications to the code base have recently been applied to
>> the trunk.
>> 
>> We have known since it was originally written over two years ago that
>> the OOB contained some inherent scalability limits. For example, the
>> system immediately upon opening obtains contact info for all daemons in
>> the universe, opens sockets to them, and sends an initial message to
>> them. It then does the same with all the application processes in its
>> job.
>> 
>> As a result, for a 2000 process job running on 500 nodes, each
>> application process will immediately open and communicate across 2501
>> sockets (2000 procs + 500 daemons [one per node] + the HNP) during the
>> startup phase.
>> 
>> If you really want to imagine some fun, now have that job comm_spawn
>> 500 processes across the 500 nodes, and *don't* reuse daemons. As each
>> new daemon is spawned, every process in the original job (including the
>> original daemons) is notified, loads the new contact info for that
>> daemon, opens a socket to it, and does an "ack" comm. After all 500 new
>> daemons are running, they now launch the 500 new procs, each of which
>> gets the info on 1000 daemons plus the info for 2000 parents and 500
>> peers, and immediately opens 1000 daemons + 2000 parents + 500 peers +
>> 1 HNP = 3501 sockets!
>> 
>> This was acceptable for small jobs, but causes considerable delay during
>> startup for large jobs. A few other OOB operational characteristics
>> further exacerbate the problem - I will detail those in a document on
>> the wiki to help foster greater understanding.
>> 
>> Jeff Squyres and I are about to begin a major revision of the RML/OOB
>> code to resolve these problems. We will be using a staged approach to
>> the effort:
>> 
>> 1. separate the OOB's actions for loading contact info from actually
>> opening a socket to a process. Currently, the OOB immediately opens a
>> socket and performs an "ack" communication whenever contact info for
>> another process is loaded into it. In addition, the OOB immediately
>> subscribes to the job segment of the provided process, requesting that
>> this process be alerted to *any* change in OOB contact info to any
>> process in that job. These actions need to be separated out.
>> 
>> 2. revise the RML/OOB init/open procedure. These are currently
>> interwoven in a manner that causes the OOB to execute registry
>> operations that are not needed (and actually cause headaches) during
>> orte_init. The procedure will be revised so that connections to the HNP
>> and to the process' 

Re: [OMPI devel] Major revision to the RML/OOB

2006-12-06 Thread Ralph H Castain
We aren't ignoring your situation, Adrian - Jeff and I are talking about how
best to deal with it and with your offer to help. This revision will indeed
see some significant change in the oob/tcp component, mostly in the init and
connect procedures.

The concern is that we want to leave open the possibility of putting this
revision into 1.2 since it will have a major performance impact on both
startup time and the max cluster size we can support. The IP6 code is
scheduled for 1.3 and we don't know what the performance impact will look
like - hence the hesitation.

We are both a little buried at the moment with other crises, but I hope we
can give you a more intelligent reply shortly.

Thanks
Ralph


On 12/5/06 11:18 AM, "Adrian Knoth"  wrote:

> On Mon, Dec 04, 2006 at 06:26:26AM -0700, Ralph Castain wrote:
> 
>> Hello all
> 
> Hi!
>  
>> With some luck and (hopefully) not too many conflicting priorities, Jeff
>> and I may complete this work by Christmas
> [..]
>> As always, feel free to comment and/or make suggestions!
> 
> You wrote a lot about oob, sockets and connections. Does this imply
> changes to oob/tcp? If so, I suggest integrating the IPv6 support first
> (it may be ported from /tmp/adi-ipv6, see
> 
> for details).
> 
> Of course, I'd like to help. Has anybody ever tested the code?
> (we certainly did, but has anyone else?)
> 




Re: [OMPI devel] Major revision to the RML/OOB

2006-12-05 Thread Adrian Knoth
On Mon, Dec 04, 2006 at 06:26:26AM -0700, Ralph Castain wrote:

> Hello all

Hi!

> With some luck and (hopefully) not too many conflicting priorities, Jeff
> and I may complete this work by Christmas
[..]
> As always, feel free to comment and/or make suggestions!

You wrote a lot about oob, sockets and connections. Does this imply
changes to oob/tcp? If so, I suggest integrating the IPv6 support first
(it may be ported from /tmp/adi-ipv6, see

for details).

Of course, I'd like to help. Has anybody ever tested the code?
(we certainly did, but has anyone else?)


-- 
mail: a...@thur.de  http://adi.thur.de  PGP: v2-key via keyserver

Windows 98? Why? I haven't finished playing the old one yet.


Re: [OMPI devel] Major revision to the RML/OOB

2006-12-04 Thread Jonathan Day
Whilst I can see these changes being good in the
general case (most clusters are designed with very
smart NICs and painfully dumb switches, because that
produces the best latencies for many topologies), I
would suggest that we can do better on smarter
networks.

There is no obvious reason why you could not establish
a well-known multicast address/port for out-of-band
traffic. A reliable multicast protocol, such as SRM,
NORM or FLUTE could then be used to carry the
information between nodes.

The advantage of this approach is that it requires the
least alteration to the code - a single transmission
to the group address as opposed to one transmission to
each target - AND would work perfectly well with the
new approach described.
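
For example, the sender side could be as little as one sendto() to a
well-known group (the group address and port here are made up purely for
illustration; reliability would still come from SRM/NORM/FLUTE layered
on top):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    const char *group = "239.255.42.99";  /* hypothetical OOB group    */
    const unsigned short port = 4242;     /* hypothetical OOB port     */
    const char *msg = "contact-info-update";
    unsigned char ttl = 1;                /* stay on the local segment */
    struct sockaddr_in dst;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }
    setsockopt(fd, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);
    inet_pton(AF_INET, group, &dst.sin_addr);

    /* one transmission reaches every node subscribed to the group */
    if (sendto(fd, msg, strlen(msg), 0,
               (struct sockaddr *) &dst, sizeof(dst)) < 0) {
        perror("sendto");
    }
    close(fd);
    return 0;
}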

The drawbacks are that it would have to be switchable, though: multicast
is truly horrible on dumber devices, development resources aren't
infinite, and the number of cases where it will actually win is limited.

(It's entirely coincidental that this is a capability
that I actually need. Well, almost!)

Jonathan Day

> Message: 1
> Date: Mon, 04 Dec 2006 06:26:26 -0700
> From: Ralph Castain 
> Subject: [OMPI devel] Major revision to the RML/OOB
> To: Open MPI Core Developers, Open MPI Developers
> Message-ID: 
> Content-Type: text/plain; charset="US-ASCII"
> 
> Hello all
> 
> If you are interested in the ongoing scalability work, or in the RML/OOB
> in ORTE, please read on - otherwise, feel free to hit "delete".
> 
> As many of you know, we have been working towards solving several
> problems that affect our ability to operate at large scale. Some of the
> required modifications to the code base have recently been applied to
> the trunk.
> 
> We have known since it was originally written over two years ago that
> the OOB contained some inherent scalability limits. For example, the
> system immediately upon opening obtains contact info for all daemons in
> the universe, opens sockets to them, and sends an initial message to
> them. It then does the same with all the application processes in its
> job.
> 
> As a result, for a 2000 process job running on 500 nodes, each
> application process will immediately open and communicate across 2501
> sockets (2000 procs + 500 daemons [one per node] + the HNP) during the
> startup phase.
> 
> If you really want to imagine some fun, now have that job comm_spawn
> 500 processes across the 500 nodes, and *don't* reuse daemons. As each
> new daemon is spawned, every process in the original job (including the
> original daemons) is notified, loads the new contact info for that
> daemon, opens a socket to it, and does an "ack" comm. After all 500 new
> daemons are running, they now launch the 500 new procs, each of which
> gets the info on 1000 daemons plus the info for 2000 parents and 500
> peers, and immediately opens 1000 daemons + 2000 parents + 500 peers +
> 1 HNP = 3501 sockets!
> 
> This was acceptable for small jobs, but causes considerable delay during
> startup for large jobs. A few other OOB operational characteristics
> further exacerbate the problem - I will detail those in a document on
> the wiki to help foster greater understanding.
> 
> Jeff Squyres and I are about to begin a major revision of the RML/OOB
> code to resolve these problems. We will be using a staged approach to
> the effort:
> 
> 1. separate the OOB's actions for loading contact info from actually
> opening a socket to a process. Currently, the OOB immediately opens a
> socket and performs an "ack" communication whenever contact info for
> another process is loaded into it. In addition, the OOB immediately
> subscribes to the job segment of the provided process, requesting that
> this process be alerted to *any* change in OOB contact info to any
> process in that job. These actions need to be separated out.
> 
> 2. revise the RML/OOB init/open procedure. These are currently
> interwoven in a manner that causes the OOB to execute registry
> operations that are not needed (and actually cause headaches) during
> orte_init. The procedure will be revised so that connections to the HNP
> and to the process' local orted are opened, but all other contact info
> (e.g., for the other procs in the job) is simply loaded into the OOB's
> contact tables, but no sockets opened until first communication.
> 
> 3. revise the xcast procedure so that it relays via the daemons and not
> the application processes. For systems that do not use our daemons,
> alternative mechanisms will be developed.
> 
> At some point in the future, a fully routable OOB will be developed to
> remove the need for so many sockets on each application process. For
> now, these steps should improve our startup time considerably.
> 
> With some luck and (hopefully) not too many conflicting priorities, Jeff
> and I may complete this