Re: [OMPI devel] Major revision to the RML/OOB

2006-12-04 Thread Jonathan Day
Whilst I can see these changes being good in the
general case (most clusters are designed with very
smart NICs and painfully dumb switches, because that
produces the best latencies for many topologies), I
would suggest that we can do better on smarter
networks.

There is no obvious reason why you could not establish
a well-known multicast address/port for out-of-band
traffic. A reliable multicast protocol, such as SRM,
NORM or FLUTE could then be used to carry the
information between nodes.

The advantage of this approach is that it requires
minimal alteration to the code - a single transmission
to the group address rather than one transmission to
each target - and it would work perfectly well with
the new approach described.

The drawback is that it would have to be switchable:
multicast is truly horrible on dumber devices,
development resources aren't infinite, and the number
of cases where it would actually win is limited.

(It's entirely coincidental that this is a capability
that I actually need. Well, almost!)

Jonathan Day

> Message: 1
> Date: Mon, 04 Dec 2006 06:26:26 -0700
> From: Ralph Castain
> Subject: [OMPI devel] Major revision to the RML/OOB
> To: Open MPI Core Developers, Open MPI Developers
> Content-Type: text/plain; charset="US-ASCII"
> 
> Hello all
> 
> If you are interested in the ongoing scalability
> work, or in the RML/OOB in
> ORTE, please read on - otherwise, feel free to hit
> "delete".
> 
> As many of you know, we have been working towards
> solving several problems
> that affect our ability to operate at large scale.
> Some of the required
> modifications to the code base have recently been
> applied to the trunk.
> 
> We have known since it was originally written over
> two years ago that the OOB contained some inherent
> scalability limits. For example, immediately upon
> opening, the system obtains contact info for all
> daemons in the universe, opens sockets to them, and
> sends each an initial message. It then does the same
> with all the application processes in its job.
> 
> As a result, for a 2000 process job running on 500
> nodes, each application
> process will immediately open and communicate across
> 2501 sockets (2000
> procs + 500 daemons [one per node] + the HNP) during
> the startup phase.
> 
> If you really want to imagine some fun, now have
> that job comm_spawn 500
> processes across the 500 nodes, and *don't* reuse
> daemons. As each new
> daemon is spawned, every process in the original job
> (including the original
> daemons) is notified, loads the new contact info for
> that daemon, opens a
> socket to it, and does an "ack" comm. After all 500
> new daemons are running,
> they now launch the 500 new procs, each of which
> gets the info on 1000
> daemons plus the info for 2000 parents and 500
> peers, and immediately opens
> 1000 daemons + 2000 parents + 500 peers + 1 HNP =
> 3501 sockets!
> 
> This was acceptable for small jobs, but causes
> considerable delay during
> startup for large jobs. A few other OOB operational
> characteristics further
> exacerbate the problem - I will detail those in a
> document on the wiki to
> help foster greater understanding.
> 
> Jeff Squyres and I are about to begin a major
> revision of the RML/OOB code
> to resolve these problems. We will be using a staged
> approach to the effort:
> 
> 1. separate the OOB's actions for loading contact
> info from actually opening
> a socket to a process. Currently, the OOB
> immediately opens a socket and
> performs an "ack" communication whenever contact
> info for another process is
> loaded into it. In addition, the OOB immediately
> subscribes to the job
> segment of the provided process, requesting that
> this process be alerted to
> *any* change in OOB contact info to any process in
> that job. These actions
> need to be separated out.
> 
> 2. revise the RML/OOB init/open procedure. These are
> currently interwoven in
> a manner that causes the OOB to execute registry
> operations that are not
> needed (and actually cause headaches) during
> orte_init. The procedure will
> be revised so that connections to the HNP and to the
> process' local orted
> are opened, but all other contact info (e.g., for
> the other procs in the job) is simply loaded into
> the OOB's contact tables, with no sockets opened
> until first communication.
> 
> 3. revise the xcast procedure so that it relays via
> the daemons and not the
> application processes. For systems that do not use
> our daemons, alternative
> mechanisms will be developed.
> 

[OMPI devel] Question on "get" operation

2006-06-07 Thread Jonathan Day
Hi,

Sorry if this sounds idiotic, but I'm having problems
with the MPI get operation in Open MPI. I have a test
program that calls Open MPI's get operation, which
internally performs a send. This fails with a null
pointer dereference in the opal library, after
preparing the source.

With the shared memory driver, when performing a get,
the shared memory code appears to be passed a null
pointer. The TCP driver also crashes while
dereferencing a null pointer.

Anyone have any suggestions on what might be causing
the problem? I assume others are using get, so
presumably someone else will have encountered this
problem (assuming it's a quirk that's in a common
component and not in the test program).

Jonathan




[OMPI devel] Query on zero-copy sends

2006-06-02 Thread Jonathan Day
Hi,

I'm working on developing some components for Open
MPI, but am a little unclear as to how to implement
efficient sends and receives. I want to do zero-copy
two-sided MPI, but as far as I can see, this is not
going to be easy. As best I can tell, the receive
mechanism copies data into a temporary buffer and
then, when the receive is actually handled, copies it
into the application's buffer. Would I be correct in
this interpretation?

I'm also a little hazy on how to get information on
messages being passed. What information on the sending
process is visible to the receiving BTL components?

Finally, I'm assuming that developers have, over time,
produced test harnesses and other useful (for
developers) tools that would have no real value to
general users. Has anyone put together a kit of
development aids for coders of new components?

Jonathan Day

