Yo all

As we are discussing functional requirements for the upcoming 1.3 release, I was asked to provide a little info about what is going to be happening to the ORTE part of the code base over the remainder of this year.

Short answer: there will be a major code revision to reduce ORTE to the minimum required to support Open MPI. This includes (a) a major design change away from event-driven programming that will result in the consolidation of several frameworks and removal of at least two others; and (b) general cleanup to reduce memory footprint and startup message size, among other areas.

Longer explanation:

At the beginning of the Open MPI project, it was quickly determined that nobody (myself perhaps excepted) really wanted to build/maintain the RTE underpinning Open MPI. We were, after all, primarily interested in MPI. Hence, we thought it would be a good thing if we could define an RTE that would be of adequate general interest to attract partners whose primary focus would be extension and support of the RTE itself.

Well, after several years, it is clear that the original idea isn't going to work (for a variety of reasons that aren't worth recounting here). We therefore decided recently that it is time to accept the inevitable, quit trying to support a more general RTE, and instead spend some effort reducing the ORTE layer down to its most basic requirements. In particular, we want to make the code easier to maintain and debug, faster and more scalable for startup, and less vulnerable to race conditions.

In its essence, the plan consists of the following:

1. Remove the cellid from the process name, as the code will solely be a single-cluster system. Other interested parties have offered to provide an overlayer that will cross-connect Open MPI instances across clusters - we will work with them to help facilitate the necessary hooks, but won't duplicate that connectivity internally.

2. Remove the RDS framework. All discovery and allocation will be done in a single step in the RAS. We will revise the RAS to allow better co-existence of resource manager specified allocations and hostfiles (more on that later).

3. Eliminate the GPR framework or, at the very least, remove the subscribe/trigger functionality from it. We will be moving away from the current event-driven architecture to reduce our exposure to race conditions and eliminate the complexity caused by recursive callbacks due to trigger events. We will explore globalized data storage in simplified arrays as an alternative to the GPR database - initial tests support the idea, but further work needs to be done. We know that people like the Eclipse PTP team need access to certain data - we will work with them to figure out the best way to provide it given the changes to/departure of the GPR.

4. Consolidate the NS, PLS, RMGR, and SMR framework functionality into a single process lifecycle management (PLM) framework. PLM components will still call the ERRMGR to deal with response to process failures, and will assume responsibility for storing their own data. The SCHEMA framework will be eliminated as part of this change. We will move some functions (e.g., orte_abort) that are currently in the runtime and util areas into the PLM components as appropriate.

5. Each framework will have logic in its "open" function that specifically prevents it from performing component_open unless we are on the HNP. If we are not on the HNP, an #if ORTE_WANT_NO_SUPPORT will force the use of a "no_op" module that does nothing, but whose return codes will indicate that no error occurred. If that is not set, then a proxy module will be used that provides appropriate communications to the HNP to support remote applications. This will reduce memory footprints (since no components will be opened) and allow us to simply pass MCA params through to all processes while ensuring proper functionality is available. Note that environments like CNOS may still require special components in some of the frameworks, as the "no_op" may not be suitable for all API functions.

6. The SDS framework will not only support name discovery, but will also hold all backend operations required during startup. For example, the contents of the message sent back to the new PLM by each process will depend upon the environment. Hence, a one-to-one correspondence will be established between PLM and SDS components.

7. Consolidate the data in the MPI startup message (currently delivered at the STG1 stagegate). For example, any data in the MPI startup message that needs to be indexed will be sent in an array sorted by vpid (no need to send the entire list of process name structs). Whereas before we couldn't take advantage of our knowledge of the message contents since the message was generated by the GPR (which by design had no insight into the data), we will now exploit that knowledge to ensure the message contains only what is required by the specific environment. We will look at, for example, the direct one-to-one correspondence of PLM to SDS to see how this can best be implemented.

Other things (e.g., routing of RML messages) are either already under development or under discussion - we will provide more info on these as they move along.

As always, any thoughts/suggestions are welcomed.

Ralph
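
For anyone who wants to picture item 5, here is a rough sketch of how a framework "open" function might choose its module. Nothing here is existing ORTE code: the sketch_* names, the module struct, and the launch function are made up for illustration, and only ORTE_WANT_NO_SUPPORT comes from the plan above. The counter simply stands in for the cost of opening components.

```c
#include <stdbool.h>
#include <stdio.h>

#define SKETCH_SUCCESS 0

/* Hypothetical module interface: one function pointer per API call. */
typedef struct {
    const char *name;
    int (*launch)(int jobid);
} sketch_module_t;

/* Counts how often the (expensive) component-open path has run. */
int sketch_components_opened = 0;

/* Full implementation, only meaningful on the HNP. */
static int hnp_launch(int jobid)
{
    printf("HNP: launching job %d\n", jobid);
    return SKETCH_SUCCESS;
}

/* "no_op": does nothing, but reports that no error occurred. */
static int noop_launch(int jobid)
{
    (void)jobid;
    return SKETCH_SUCCESS;
}

/* Proxy: would forward the request to the HNP; here it only logs. */
static int proxy_launch(int jobid)
{
    printf("proxy: forwarding launch of job %d to the HNP\n", jobid);
    return SKETCH_SUCCESS;
}

const sketch_module_t sketch_hnp_module   = { "hnp",   hnp_launch };
const sketch_module_t sketch_noop_module  = { "no_op", noop_launch };
const sketch_module_t sketch_proxy_module = { "proxy", proxy_launch };

/* Framework "open": components are opened only when running on the HNP. */
const sketch_module_t *sketch_framework_open(bool on_hnp)
{
    if (on_hnp) {
        sketch_components_opened++;   /* stands in for component_open */
        return &sketch_hnp_module;
    }
#if ORTE_WANT_NO_SUPPORT
    return &sketch_noop_module;       /* no components, no communication */
#else
    return &sketch_proxy_module;      /* no components, talk to the HNP */
#endif
}
```

Callers would use whatever module comes back without caring which one it is, e.g. sketch_framework_open(false)->launch(jobid). Since every module returns a success code, the calling code would not need a special case for non-HNP processes.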
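
Similarly, the "array sorted by vpid" idea in item 7 might look something like the sketch below. The types and function names are hypothetical, and it assumes the vpids in a job are dense (0 through n-1) with no duplicates, so the array position can stand in for the process name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-process datum as it might be collected, in any order. */
typedef struct {
    uint32_t vpid;   /* rank of the process within its job */
    uint32_t value;  /* whatever needs to be indexed, e.g. a port number */
} sketch_entry_t;

/*
 * Pack n entries into out[] so that out[vpid] holds that process's value.
 * Only the values travel in the message; the vpid is implied by position.
 * Assumes vpids are dense and unique. Returns 0 on success, or -1 if a
 * vpid falls outside 0..n-1.
 */
int sketch_pack_by_vpid(const sketch_entry_t *entries, size_t n, uint32_t *out)
{
    for (size_t i = 0; i < n; i++) {
        if (entries[i].vpid >= n) {
            return -1;
        }
        out[entries[i].vpid] = entries[i].value;
    }
    return 0;
}

/* Receiver side: lookup is a plain array index, no name comparison needed. */
uint32_t sketch_lookup(const uint32_t *packed, uint32_t vpid)
{
    return packed[vpid];
}
```

With three processes reporting in the order vpid 2, 0, 1, the packed array holds their values in the order 0, 1, 2, and the receiver finds the data for vpid 1 at packed[1].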