Yo all

As we are discussing functional requirements for the upcoming 1.3 release, I
was asked to provide a little info about what is going to be happening to
the ORTE part of the code base over the remainder of this year.

Short answer: there will be a major code revision to reduce ORTE to the
minimum required to support Open MPI. This includes (a) a major design
change away from event-driven programming that will result in the
consolidation of several frameworks and removal of at least two others; and
(b) general cleanup to reduce memory footprint, startup message size, and
other areas.


Longer explanation:

At the beginning of the Open MPI project, it was quickly determined that
nobody (myself perhaps excepted) really wanted to build/maintain the RTE
underpinning Open MPI. We were, after all, primarily interested in MPI.
Hence, we thought it would be a good thing if we could define an RTE that
would be of adequate general interest to attract partners whose primary
focus would be extension and support of the RTE itself.

Well, after several years, it is clear that the original idea isn't going to
work (for a variety of reasons that aren't worth recounting here). We
therefore decided recently that it is time to accept the inevitable, quit
trying to support a more general RTE, and instead spend some effort reducing
the ORTE layer down to its most basic requirements. In particular, we want
to make the code easier to maintain and debug, faster and more scalable for
startup, and less vulnerable to race conditions.

In its essence, the plan consists of the following:

1. Remove the cellid from the process name, as ORTE will solely support a
single-cluster system. Other interested parties have offered to provide an
overlayer that will cross-connect Open MPI instances across clusters - we
will work with them to help facilitate the necessary hooks, but won't
duplicate that connectivity internally.
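
For concreteness, a rough sketch of what this means for the name struct is
below. The real orte_process_name_t lives in the NS headers and its field
types may differ, so treat the types here as illustrative only:

    #include <stdint.h>

    /* current three-level name: cell (cluster), job, process */
    typedef struct {
        uint32_t cellid;   /* which cluster/cell -- this goes away */
        uint32_t jobid;    /* which job */
        uint32_t vpid;     /* rank of the process within the job */
    } old_name_t;

    /* proposed single-cluster name: just job + process */
    typedef struct {
        uint32_t jobid;
        uint32_t vpid;
    } new_name_t;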

2. Remove the RDS framework. All discovery and allocation will be done in a
single step in the RAS. We will revise the RAS to allow better co-existence
of resource-manager-specified allocations and hostfiles (more on that
later).
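
Just to illustrate the kind of single-step behavior the revised RAS might
provide (the actual policy for mixing RM allocations and hostfiles is still
to be decided, and every name below is invented), one possible rule is: an
RM allocation wins and a hostfile can only down-select from it; with no RM
allocation, the hostfile becomes the allocation:

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    typedef struct { const char *name; int slots; } node_t;

    /* one possible policy, sketch only */
    static size_t ras_allocate(const node_t *rm, size_t n_rm,
                               const node_t *hf, size_t n_hf,
                               node_t *out)
    {
        size_t n_out = 0;
        if (0 == n_rm) {                /* no RM allocation: hostfile rules */
            for (size_t i = 0; i < n_hf; i++) out[n_out++] = hf[i];
            return n_out;
        }
        for (size_t i = 0; i < n_rm; i++) {
            bool keep = (0 == n_hf);    /* no hostfile: keep the RM's nodes */
            for (size_t j = 0; j < n_hf && !keep; j++) {
                keep = (0 == strcmp(rm[i].name, hf[j].name));
            }
            if (keep) out[n_out++] = rm[i];
        }
        return n_out;
    }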

3. Eliminate the GPR framework, or at the very least, remove the
subscribe/trigger functionality from it. We will be moving away from the
current event-driven architecture to reduce our exposure to race conditions
and eliminate the complexity caused by recursive callbacks due to trigger
events. We will explore globalized data storage in simplified arrays as an
alternative to the GPR database - initial tests support the idea, but
further work needs to be done. We know that people like the Eclipse PTP team
need access to certain data - we will work with them to figure out the best
way to do so given the changes to/departure of the GPR.
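
As a rough sketch of the "simplified arrays" idea (names invented), the
per-process startup data could simply live in a flat array indexed by vpid
and be read directly wherever it is needed, rather than being delivered via
subscriptions and trigger callbacks:

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
        int32_t node_rank;     /* example payload: where the proc landed */
        char    uri[64];       /* example payload: contact info */
    } proc_entry_t;

    typedef struct {
        size_t        num_procs;
        proc_entry_t *procs;   /* one slot per vpid -- the index IS the vpid */
    } job_data_t;

    static job_data_t *job_data_create(size_t num_procs)
    {
        job_data_t *jd = calloc(1, sizeof(*jd));
        if (NULL == jd) return NULL;
        jd->num_procs = num_procs;
        jd->procs = calloc(num_procs, sizeof(proc_entry_t));
        return jd;
    }

    /* direct lookup replaces subscribe/trigger: no callbacks firing while
     * the data is still being written, so no recursion to worry about */
    static proc_entry_t *job_data_lookup(job_data_t *jd, uint32_t vpid)
    {
        return (vpid < jd->num_procs) ? &jd->procs[vpid] : NULL;
    }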

4. Consolidate the NS, PLS, RMGR, and SMR framework functionality into a
single process lifecycle management (PLM) framework. PLM components will
still call the ERRMGR to respond to process failures, and will
assume responsibility for storing their own data. The SCHEMA framework will
be eliminated as part of this change. We will move some functions (e.g.,
orte_abort) that are currently in the runtime and util areas into the PLM
components as appropriate.
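
To make the consolidation a little more concrete, a PLM module might end up
looking something like the sketch below -- the function names and signatures
here are invented placeholders, not the eventual API:

    #include <stdint.h>

    typedef struct { uint32_t jobid; uint32_t vpid; } plm_name_t;  /* invented */

    /* one MCA-style table of function pointers covering what NS, PLS,
     * RMGR, and SMR do today, with the module storing its own data */
    typedef struct {
        int (*assign_name)(plm_name_t *name);              /* was NS */
        int (*spawn)(uint32_t jobid, int num_procs);        /* was PLS/RMGR */
        int (*set_proc_state)(plm_name_t proc, int state);  /* was SMR */
        int (*terminate_job)(uint32_t jobid);
        int (*abort)(uint32_t jobid, int exit_code);  /* e.g. orte_abort moves here */
    } plm_module_t;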

5. Each framework will have logic in its "open" function that
specifically prevents it from performing component_open unless we are on
the HNP. If we are not on the HNP, an #if ORTE_WANT_NO_SUPPORT will force
the use of a "no_op" module that does nothing, but whose return codes will
indicate that no error occurred. If that is not set, then a proxy
module will be used that provides appropriate communications to the HNP
to support remote applications. This will reduce memory footprints (since no
components will be opened) and allow us to simply pass through MCA params to
all processes while ensuring proper functionality is available. Note that
environments like CNOS may still require special components in some of the
frameworks as the "no_op" may not be suitable for all API functions.
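
A sketch of what that open-time logic could look like -- the framework and
the HNP flag below are invented for illustration; only the
ORTE_WANT_NO_SUPPORT switch comes from the plan above:

    #include <stdbool.h>

    #define MYFRAMEWORK_SUCCESS 0

    typedef struct {
        int (*do_something)(void);
    } myframework_module_t;

    static int noop_do_something(void)  { return MYFRAMEWORK_SUCCESS; }
    static int proxy_do_something(void) { /* relay the request to the HNP */
                                          return MYFRAMEWORK_SUCCESS; }

    static myframework_module_t noop_module  = { noop_do_something };
    static myframework_module_t proxy_module = { proxy_do_something };

    extern bool orte_proc_is_hnp;              /* stand-in for the real flag */
    myframework_module_t *myframework_module;  /* module callers actually use */

    int myframework_open(void)
    {
        if (!orte_proc_is_hnp) {
    #if ORTE_WANT_NO_SUPPORT
            myframework_module = &noop_module;   /* nothing opened, tiny footprint */
    #else
            myframework_module = &proxy_module;  /* relay API calls to the HNP */
    #endif
            return MYFRAMEWORK_SUCCESS;          /* component_open never runs */
        }
        /* on the HNP: fall through to the normal component open/selection */
        return MYFRAMEWORK_SUCCESS;
    }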

6. The SDS framework will not only support name discovery, but will hold all
backend operations required during startup. For example, the contents of the
message now sent back to the new PLM by each process will be dependent upon
the environment. Hence, a one-to-one correspondence will be established between
PLM and SDS components.
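
To illustrate that pairing (all names below are invented, and whether
selection actually works this way remains to be worked out): one simple way
to enforce the one-to-one correspondence would be to pick the SDS component
by the name of the PLM component in use, so a slurm-launched process gets
slurm-specific startup behavior:

    #include <stddef.h>
    #include <string.h>

    typedef struct {
        const char *name;               /* matches the PLM component name */
        int (*discover_name)(void);     /* figure out our own process name */
        int (*build_startup_msg)(void); /* env-specific data sent to the PLM */
    } sds_component_t;

    static int stub(void) { return 0; }

    static sds_component_t sds_components[] = {
        { "rsh",   stub, stub },
        { "slurm", stub, stub },
    };

    /* pick the SDS component paired with the active PLM component */
    static sds_component_t *sds_select(const char *plm_name)
    {
        size_t n = sizeof(sds_components) / sizeof(sds_components[0]);
        for (size_t i = 0; i < n; i++) {
            if (0 == strcmp(sds_components[i].name, plm_name)) {
                return &sds_components[i];
            }
        }
        return NULL;   /* fall back to some default component */
    }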

7. Consolidate the data in the MPI startup message (currently delivered at
the STG1 stagegate). For example, any data in the MPI startup message that needs
to be indexed will be sent in an array sorted by vpid (no need to send the
entire list of process name structs). Whereas before we couldn't take
advantage of our knowledge of the message contents since it was generated by
the GPR (which by design had no insight into the data), we will now exploit
our knowledge to ensure the message contains only what is required by the specific
environment. We will look at, for example, the direct one-to-one
correspondence of PLM to SDS to see how this can best be implemented.
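
As a sketch of the layout consequence (field names invented): if entries are
packed in vpid order, the receiver recovers the index from position alone,
so the per-entry process name struct no longer has to travel with the data:

    #include <stdint.h>

    /* old style: every entry carried a full process name next to its data */
    typedef struct {
        uint32_t jobid;
        uint32_t vpid;
    } name_t;

    typedef struct {
        int32_t node_rank;             /* example per-proc payload */
        char    uri[64];               /* example per-proc payload */
    } startup_entry_t;

    /* new style: one jobid for the whole message, then num_procs entries
     * in vpid order -- entries[k] belongs to vpid k, no names sent at all */
    typedef struct {
        uint32_t        jobid;
        uint32_t        num_procs;
        startup_entry_t entries[];     /* flexible array member (C99) */
    } startup_msg_t;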


Other things (e.g., routing of RML messages) are either already under
development or under discussion - we will provide more info on these as they
move along.

As always, any thoughts/suggestions are welcomed.

Ralph

