The corresponding PMIx RFC is now available for comment: https://github.com/pmix/RFCs/pull/2
> On Jun 9, 2016, at 8:37 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Hi folks
>
> There is a PR that has cleared Jenkins, but it represents a fairly
> significant change in OMPI capabilities. Thus, I think it merits a little
> more attention.
>
> The PR (https://github.com/open-mpi/ompi/pull/1767) brings the PMIx event
> notification system into OMPI. Quoting from the PMIx RFC:
>
> ===============================
> The PMIx Event Notification system provides a mechanism by which the resource
> manager can communicate system events to applications, thus providing
> applications with an opportunity to generate an appropriate response. In
> addition, applications can use the system to request that the resource
> manager notify their peers of internal events (e.g., computational errors and
> aborted operations), and notify the resource manager of events detected by
> the application.
>
> The resource manager will be aware of a wide range of events that occur
> across the system. For the purposes of this discussion, only events that
> impact the allocated session being served by the PMIx server are considered.
> These events can be divided into two distinct classes:
>
> * Job-specific events that directly relate to a job executing within the
> session. This might include events such as debugger attachment or process
> failure within a related job. These events are characterized by directly
> targeting processes within session jobs - i.e., the "procs" parameter of the
> notification contains members of a job executing within the session. Events in
> this category are to be immediately delivered to the PMIx server library for
> delivery to the specified processes.
>
> Clients can indicate a desire to register solely for job-specific events by
> including the _PMIX\_EVENT\_JOB\_LEVEL_ key in their call to
> _PMIx\_Register\_event_ - i.e., providing this key will explicitly indicate
> that environment events are _not_ to be reported to this callback function.
>
> * Environment events that impact the session, but are not directly sent to
> executing jobs. This is a much broader category of events that includes ECC
> errors, temperature excursions, and other environmental events directly
> affecting the session's resources. Note that although these do impact the
> session's jobs, they are not directly referencing those jobs - i.e., the
> event is generated without specifying a particular target. Thus, events in
> this category are to be delivered to the PMIx server library only upon
> request - i.e., when the PMIx server has registered for those events.
>
> Note that race conditions can cause the registration to come _after_ events
> of possible interest (e.g., a memory ECC event that occurs after start of
> execution but prior to registration). RMs are free to cache events in this
> category for some time to mitigate this situation, but are not required to do
> so. Thus, applications must be aware that environment events prior to
> registration may not be included in notifications.
>
> As above, clients can indicate a desire to register solely for environment
> events of a given type by including the _PMIX\_EVENT\_ENVIRO\_LEVEL_ key in
> their registration call.
>
> The PMIx server will cache any environment events passed to it for a period
> of time to provide notification to clients that have not yet registered for
> them. Currently, the PMIx server uses a ring buffer to cache events. The size
> of the ring buffer defaults to 512 events (as of PMIx 2.0), but can be
> configured using the _PMIx\_server\_cache\_size_ info key during the call to
> the _PMIx\_Server\_init_ API.
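
To make the caching behaviour above concrete, here is a minimal sketch of a fixed-size ring buffer that overwrites its oldest entry when full and replays what it still holds to a late registrant. This is not the PMIx server code: every name in it (`event_cache_t`, `cache_push`, `cache_replay`, `tally_cb`) is hypothetical, and the only detail taken from the RFC text is the default size of 512.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout: a sketch, not the PMIx implementation. */
#define CACHE_SIZE 512          /* default size quoted in the RFC text */

typedef struct {
    int status;                 /* event code, e.g. an ECC error constant */
} cached_event_t;

typedef struct {
    cached_event_t slots[CACHE_SIZE];
    size_t head;                /* index of the oldest cached event */
    size_t count;               /* number of valid events, at most CACHE_SIZE */
} event_cache_t;

typedef void (*event_cb_t)(int status, void *cbdata);

static void cache_init(event_cache_t *c)
{
    c->head = 0;
    c->count = 0;
}

/* Store an event; once the buffer is full the oldest event is overwritten. */
static void cache_push(event_cache_t *c, int status)
{
    size_t tail = (c->head + c->count) % CACHE_SIZE;
    c->slots[tail].status = status;
    if (c->count < CACHE_SIZE) {
        c->count++;
    } else {
        c->head = (c->head + 1) % CACHE_SIZE;
    }
}

/* Deliver every cached event, oldest first, to a newly registered
 * callback. Returns the number of events delivered. */
static size_t cache_replay(const event_cache_t *c, event_cb_t cb, void *cbdata)
{
    assert(cb != NULL);
    for (size_t i = 0; i < c->count; i++) {
        cb(c->slots[(c->head + i) % CACHE_SIZE].status, cbdata);
    }
    return c->count;
}

/* Example callback: counts replayed events and remembers the last one. */
typedef struct {
    int last;
    size_t seen;
} tally_t;

static void tally_cb(int status, void *cbdata)
{
    tally_t *t = cbdata;
    t->last = status;
    t->seen++;
}
```

With this layout, a client that registers after more than 512 events have arrived sees only the most recent 512, which matches the caveat that events prior to registration may be missed.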
>
> Client application processes can also use the PMIx Event Notification system
> to request that the resource manager notify its peers of internal events, and
> notify the resource manager of events detected by the application process.
> Examples of the latter include network communication errors that may not have
> been detected by the fabric manager itself (e.g., data corruption). The
> client must direct the notification to the appropriate target (RM or peers)
> using the corresponding range parameter.
> ===============================
>
> The biggest change for OMPI is that it enables you to register event handlers
> for specific error constants - e.g., for knowing when debugger release has
> been issued. What you do in response to that notification is totally up to
> you, and we do “chain” the handlers (and pass the output of one down to the
> following handlers).
>
> This should not be considered a cast-in-concrete capability - it will evolve
> as folks start to use it. However, we believe the interfaces should now be
> stable and ready for use.
>
> The changes include:
>
> * upgrading the base PMIx installation to 2.0.0a1, tracking (but lagging) the
> PMIx master
> * creating a PMIx 1.1.4-specific external component for backward compatibility
> * adding a PMIx 2.x-specific external component for those wanting to build
> directly against the PMIx master
> * converting debugger support to use PMIx instead of RML for release. Note
> that the OOB/usock component remains for show_help support until the upcoming
> PMIx_Log interface is available.
>
> Please provide any comments or concerns. I’m planning to “hold” this PR a bit
> while we resolve the OMPI 2.0 issues.
>
> Ralph
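
For anyone wondering what "chaining" handlers and passing the output of one to the next might look like, here is a rough sketch in plain C. It is an illustration only and does not use the OMPI or PMIx handler interfaces: the types, the `EVT_DEBUGGER_RELEASE` constant, and all function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constant and types: not the OMPI/PMIx handler API. */
#define EVT_DEBUGGER_RELEASE 1

typedef struct {
    int status;     /* event code being handled */
    int handled;    /* set by a handler that fully dealt with the event */
    int hops;       /* how many handlers have seen the event so far */
} chain_result_t;

/* A handler receives the previous handler's result and returns its own. */
typedef chain_result_t (*handler_fn)(chain_result_t prev);

/* Run the handlers in registration order, feeding each one the output
 * of the handler before it. */
static chain_result_t run_chain(const handler_fn *handlers, size_t n, int status)
{
    chain_result_t res = { status, 0, 0 };
    for (size_t i = 0; i < n; i++) {
        assert(handlers[i] != NULL);
        res = handlers[i](res);
    }
    return res;
}

/* First handler: only observes the event. */
static chain_result_t log_handler(chain_result_t prev)
{
    prev.hops++;
    return prev;
}

/* Second handler: marks the event handled if it is a debugger release. */
static chain_result_t release_handler(chain_result_t prev)
{
    prev.hops++;
    if (prev.status == EVT_DEBUGGER_RELEASE) {
        prev.handled = 1;
    }
    return prev;
}
```

Calling `run_chain` with `{ log_handler, release_handler }` and `EVT_DEBUGGER_RELEASE` returns a result with `handled` set and `hops` equal to 2; any other status passes through both handlers unhandled.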