The corresponding PMIx RFC is now available for comment:
https://github.com/pmix/RFCs/pull/2


> On Jun 9, 2016, at 8:37 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> Hi folks
> 
> There is a PR that has cleared Jenkins, but it represents a fairly 
> significant change in OMPI capabilities. Thus, I think it merits a little 
> more attention.
> 
> The PR (https://github.com/open-mpi/ompi/pull/1767) brings the PMIx event 
> notification system into OMPI. Quoting from the PMIx RFC:
> 
> ===============================
> The PMIx Event Notification system provides a mechanism by which the resource 
> manager can communicate system events to applications, thus providing 
> applications with an opportunity to generate an appropriate response. In 
> addition, applications can use the system to request that the resource 
> manager notify their peers of internal events (e.g., computational errors and 
> aborted operations), and notify the resource manager of events detected by 
> the application.
> 
> The resource manager will be aware of a wide range of events that occur 
> across the system. For the purposes of this discussion, only events that 
> impact the allocated session being served by the PMIx server are considered. 
> These events can be divided into two distinct classes:
> 
> * Job-specific events that directly relate to a job executing within the
>   session. This might include events such as debugger attachment or process 
>   failure within a related job. These events are characterized by directly 
>   targeting processes within session jobs - i.e., the "procs" parameter of 
>   the notification contains members of a job executing within the session. 
>   Events in this category are to be immediately delivered to the PMIx server 
>   library for delivery to the specified processes.
> 
>   Clients can indicate a desire to register solely for job-specific events by 
>   including the _PMIX\_EVENT\_JOB\_LEVEL_ key in their call to 
>   _PMIx\_Register\_event_ - i.e., providing this key will explicitly indicate 
>   that environment events are _not_ to be reported to this callback function.
> 
> * Environment events that impact the session, but are not directly sent to
>   executing jobs. This is a much broader category of events that includes ECC 
>   errors, temperature excursions, and other environmental events directly 
>   affecting the session's resources. Note that although these do impact the 
>   session's jobs, they are not directly referencing those jobs - i.e., the 
>   event is generated without specifying a particular target. Thus, events in 
>   this category are to be delivered to the PMIx server library only upon 
>   request - i.e., when the PMIx server has registered for those events.
> 
> Note that race conditions can cause the registration to come _after_ events 
> of possible interest (e.g., a memory ECC event that occurs after start of 
> execution but prior to registration). RMs are free to cache events in this 
> category for some time to mitigate this situation, but are not required to do 
> so. Thus, applications must be aware that environment events prior to 
> registration may not be included in notifications.
> 
> As above, clients can indicate a desire to register solely for environment 
> events of a given type by including the _PMIX\_EVENT\_ENVIRO\_LEVEL_ key in 
> their registration call.
> 
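To make the two registration modes above concrete, here is a minimal Python sketch of the filtering logic only. It does not use the PMIx API: `Level`, `Event`, `Registry`, and the `only` argument are hypothetical names standing in for the _PMIX\_EVENT\_JOB\_LEVEL_ and _PMIX\_EVENT\_ENVIRO\_LEVEL_ keys, and classifying an event by whether its "procs" list is empty is an assumption drawn from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List, Optional


class Level(Enum):
    JOB = auto()     # event targets specific processes ("procs" is non-empty)
    ENVIRO = auto()  # event was generated without a particular target


@dataclass
class Event:
    code: str
    procs: List[str] = field(default_factory=list)

    @property
    def level(self) -> Level:
        # Assumption: a non-empty "procs" list marks a job-specific event.
        return Level.JOB if self.procs else Level.ENVIRO


@dataclass
class Registration:
    callback: Callable[[Event], None]
    only: Optional[Level] = None  # None means "deliver both classes"


class Registry:
    """Hypothetical stand-in for a client's handler table."""

    def __init__(self) -> None:
        self._regs: List[Registration] = []

    def register(self, callback: Callable[[Event], None],
                 only: Optional[Level] = None) -> None:
        self._regs.append(Registration(callback, only))

    def notify(self, event: Event) -> int:
        """Deliver the event to matching registrations; return how many ran."""
        delivered = 0
        for reg in self._regs:
            if reg.only is None or reg.only is event.level:
                reg.callback(event)
                delivered += 1
        return delivered
```

A handler registered with `only=Level.JOB` would see a process failure that names its target ranks but not an untargeted ECC error, which mirrors the "environment events are not to be reported to this callback" wording.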
> The PMIx server will cache any environment events passed to it for a period 
> of time to provide notification to clients that have not yet registered for 
> them. Currently, the PMIx server uses a ring buffer to cache events. The size 
> of the ring buffer defaults to 512 events (as of PMIx 2.0), but can be 
> configured using the _PMIx\_server\_cache\_size_ info key during the call to 
> the _PMIx\_Server\_init_ API.
> 
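The caching behavior described above can be illustrated with a small Python sketch. This is not the PMIx server code: `EventCache` and its methods are hypothetical, events are plain strings, and replaying every cached event to a late registrant is an assumption about how "notification to clients that have not yet registered" might work. Only the ring buffer and the default of 512 come from the text.

```python
from collections import deque
from typing import Callable, Deque, List


class EventCache:
    """Hypothetical ring-buffer cache of environment events."""

    def __init__(self, size: int = 512) -> None:
        # A deque with maxlen behaves as a ring buffer: once full,
        # appending a new event silently drops the oldest one.
        self._ring: Deque[str] = deque(maxlen=size)
        self._handlers: List[Callable[[str], None]] = []

    def post(self, event: str) -> None:
        """Cache the event, then deliver it to current registrants."""
        self._ring.append(event)
        for handler in self._handlers:
            handler(event)

    def register(self, handler: Callable[[str], None]) -> None:
        """Replay cached events to the new handler, then subscribe it."""
        for event in self._ring:
            handler(event)
        self._handlers.append(handler)
```

With a cache of size 2, a handler that registers after three events have been posted would receive only the last two, which is the race the RFC text warns applications about.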
> Client application processes can also use the PMIx Event Notification system 
> to request that the resource manager notify their peers of internal events, 
> and to notify the resource manager of events detected by the application process. 
> Examples of the latter include network communication errors that may not have 
> been detected by the fabric manager itself (e.g., data corruption). The 
> client must direct the notification to the appropriate target (RM or peers) 
> using the corresponding range parameter.
> ===============================
> 
> The biggest change for OMPI is that it enables you to register event handlers 
> for specific error constants - e.g., for knowing when debugger release has 
> been issued. What you do in response to that notification is totally up to 
> you, and we do “chain” the handlers (and pass the output of one down to the 
> following handlers).
> 
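For anyone wondering what "chaining" handlers looks like in practice, here is a rough Python sketch of the idea only. It is not the OMPI or PMIx implementation: `run_chain`, the handler signature, and the two example handlers are hypothetical, and the sketch assumes each handler simply receives the previous handler's return value.

```python
from typing import Any, Callable, List, Optional

# Hypothetical handler signature: (event name, output of previous handler).
Handler = Callable[[str, Optional[Any]], Any]


def run_chain(handlers: List[Handler], event: str) -> Optional[Any]:
    """Call handlers in registration order, passing each one's output on."""
    result: Optional[Any] = None  # the first handler sees no prior output
    for handler in handlers:
        result = handler(event, result)
    return result


def log_handler(event: str, prior: Optional[Any]) -> List[str]:
    # First in the chain: start a list of actions taken for this event.
    return [f"logged {event}"]


def cleanup_handler(event: str, prior: Optional[Any]) -> List[str]:
    # Later in the chain: extend whatever the earlier handler produced.
    return (prior or []) + ["cleaned up"]
```

Calling `run_chain([log_handler, cleanup_handler], "debugger_release")` runs both handlers in order and returns the combined list built up along the chain.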
> This should not be considered a cast-in-concrete capability - it will evolve 
> as folks start to use it. However, we believe the interfaces should now be 
> stable and ready for use.
> 
> The changes include:
> 
> * upgrading the base PMIx installation to 2.0.0a1, tracking (but lagging) the 
> PMIx master
> * creating a PMIx 1.1.4-specific external component for backward compatibility
> * adding a PMIx 2.x-specific external component for those wanting to build 
> directly against the PMIx master
> * converting debugger support to use PMIx instead of RML for release. Note 
> that the OOB/usock component remains for show_help support until the upcoming 
> PMIx_Log interface is available.
> 
> Please provide any comments or concerns. I’m planning to “hold” this PR a bit 
> while we resolve the OMPI 2.0 issues.
> 
> Ralph
> 
> 
