Re: [OMPI devel] RFC: revised ORTE error handling

2013-07-15 Thread George Bosilca
Ralph,

Sorry for the late answer, we have quite a few things on our todo list right 
now. Here are a few concerns I have about the proposed approach.

1. We would have preferred a list of processes as the argument to the 
ompi_errhandler_runtime_callback function. We don't necessarily care about the 
error code, but having a list would allow us to deliver the notifications in 
bulk instead of one by one.

2. You made the registration of the callbacks ordered, and added special 
arguments to append or prepend callbacks to the list. Right now I can't see 
a good use for this, especially since the order might be imposed by the order 
in which the frameworks load the modules, which is not something we can 
easily control.

3. The callback list. The concept is useful, but I'm not sure about the 
implementation. The current version doesn't support stopping the propagation of 
the error signal, which might be an issue in some cases. I can picture a case 
where one level knows about the issue and knows how to fix it, so the error 
does not need to propagate to other levels. This can be implemented the way 
interrupts were managed in DOS, with basically a simple _get / _set type of 
interface. If a callback wants to propagate the error, it first has to retrieve 
the previously installed handler (its ancestor) at the moment it registers 
itself, and then explicitly call it upon error.
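
A minimal sketch of the _get / _set chaining described in point 3, combined 
with the process-list argument from point 1. Every name here (err_handler_fn, 
errhandler_get, errhandler_set, ft_register, raise_error) is hypothetical and 
is not part of ORTE or OMPI; plain ints stand in for process names, and the 
"default" handler only counts calls instead of aborting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler type: gets the whole list of affected processes. */
typedef void (*err_handler_fn)(const int *procs, size_t nprocs);

/* Single slot holding the most recently installed handler. */
static err_handler_fn current_handler = NULL;

/* DOS-style interface: read the current handler, install a new one. */
static err_handler_fn errhandler_get(void) { return current_handler; }
static void errhandler_set(err_handler_fn fn) { current_handler = fn; }

/* Stand-in for the default (abort) handler; it only counts invocations. */
static int default_calls = 0;
static void default_handler(const int *procs, size_t nprocs)
{
    (void)procs;
    (void)nprocs;
    default_calls++;
}

/* A layer that can repair some errors itself (assumption: single failures). */
static err_handler_fn ft_prev = NULL;  /* ancestor saved at registration */
static int ft_fixed = 0;

static void ft_handler(const int *procs, size_t nprocs)
{
    if (nprocs == 1) {
        ft_fixed++;              /* resolved here: propagation stops */
        return;
    }
    if (ft_prev != NULL) {
        ft_prev(procs, nprocs);  /* explicitly hand the error to the ancestor */
    }
}

static void ft_register(void)
{
    ft_prev = errhandler_get();  /* remember the ancestor first */
    errhandler_set(ft_handler);  /* then take over the slot */
}

/* What the runtime would do when it detects an error. */
static void raise_error(const int *procs, size_t nprocs)
{
    err_handler_fn h = errhandler_get();
    assert(nprocs > 0);          /* a report must name at least one process */
    if (h != NULL) {
        h(procs, nprocs);
    }
}
```

With default_handler installed first and ft_register() called afterwards, a 
single-process failure is absorbed by ft_handler, while a multi-process 
failure reaches default_handler only because ft_handler chooses to forward it.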

Again, nothing major in the short term, as it will take a significant amount of 
work to move the only user of this error handling capability (the FT prototype) 
back over to the current version of ORTE.

Regards,
  George.



On Jul 3, 2013, at 06:45 , Ralph Castain  wrote:

>  NOTICE: This RFC modifies the MPI-RTE interface 
> 
> WHAT: revise the RTE error handling to allow registration of callbacks upon 
> RTE-detected errors
> 
> WHY: currently, the RTE aborts the process if an RTE-detected error occurs. 
> This allows the upper layers (e.g., MPI) no chance to implement their own 
> error response strategy, and it precludes allowing user-defined error 
> handling.
> 
> TIMEOUT:  let's go for July 19th, pending further discussion
> 
> George and I were talking about ORTE's error handling the other day in 
> regards to the right way to deal with errors in the updated OOB. 
> Specifically, it seemed a bad idea for a library such as ORTE to be aborting 
> the job on its own prerogative. If we lose a connection or cannot send a 
> message, then we really should just report it upwards and let the application 
> and/or upper layers decide what to do about it.
> 
> The current code base only allows a single error callback to exist, which 
> seemed unduly limiting. So, based on the conversation, I've modified the 
> errmgr interface to provide a mechanism for registering any number of error 
> handlers (this replaces the current "set_fault_callback" API). When an error 
> occurs, these handlers will be called in order until one responds that the 
> error has been "resolved" - i.e., no further action is required. The default 
> MPI layer error handler is specified to go "last" and calls mpi_abort, so the 
> current "abort" behavior is preserved unless other error handlers are 
> registered.
> 
> In the register_callback function, I provide an "order" param so you can 
> specify "this callback must come first" or "this callback must come last". 
> Seemed to me that we will probably have different code areas registering 
> callbacks, and one might require it go first (the default "abort" will always 
> require it go last). So you can append and prepend, or go first/last.
> 
> The errhandler callback function passes the name of the proc involved (which 
> can be yourself for internal errors) and the error code. This is a change 
> from the current fault callback which returned an opal_pointer_array of 
> process names.
> 
> The work is available for review in my bitbucket:
> 
> https://bitbucket.org/rhc/ompi-errmgr
> 
> I've attached the svn diff as well.
> 
> Appreciate your comments - nothing in concrete.
> Ralph



[OMPI devel] RFC: revised ORTE error handling

2013-07-03 Thread Ralph Castain
NOTICE: This RFC modifies the MPI-RTE interface

WHAT: revise the RTE error handling to allow registration of callbacks upon 
RTE-detected errors

WHY: currently, the RTE aborts the process if an RTE-detected error occurs. 
This allows the upper layers (e.g., MPI) no chance to implement their own 
error response strategy, and it precludes allowing user-defined error 
handling.

TIMEOUT:  let's go for July 19th, pending further discussion

George and I were talking about ORTE's error handling the other day in 
regards to the right way to deal with errors in the updated OOB. 
Specifically, it seemed a bad idea for a library such as ORTE to be aborting 
the job on its own prerogative. If we lose a connection or cannot send a 
message, then we really should just report it upwards and let the application 
and/or upper layers decide what to do about it.

The current code base only allows a single error callback to exist, which 
seemed unduly limiting. So, based on the conversation, I've modified the 
errmgr interface to provide a mechanism for registering any number of error 
handlers (this replaces the current "set_fault_callback" API). When an error 
occurs, these handlers will be called in order until one responds that the 
error has been "resolved" - i.e., no further action is required. The default 
MPI layer error handler is specified to go "last" and calls mpi_abort, so the 
current "abort" behavior is preserved unless other error handlers are 
registered.

In the register_callback function, I provide an "order" param so you can 
specify "this callback must come first" or "this callback must come last". 
Seemed to me that we will probably have different code areas registering 
callbacks, and one might require it go first (the default "abort" will always 
require it go last). So you can append and prepend, or go first/last.

The errhandler callback function passes the name of the proc involved (which 
can be yourself for internal errors) and the error code. This is a change 
from the current fault callback which returned an opal_pointer_array of 
process names.

The work is available for review in my bitbucket:

https://bitbucket.org/rhc/ompi-errmgr

I've attached the svn diff as well.

Appreciate your comments - nothing in concrete.
Ralph
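
For readers of the archive, the registration and dispatch rules described in 
this RFC can be sketched roughly as below. This is an illustration under 
stated assumptions, not the code in the attached diff: the type and function 
names (cb_order_t, err_cb_fn, dispatch_error, the fixed-size table) are 
hypothetical, an int stands in for the process name, and the exact meaning of 
"prepend" versus "first" (a prepended callback lands after a pinned "first" 
one, an appended callback lands before a pinned "last" one) is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical ordering hints for registration. */
typedef enum { CB_FIRST, CB_PREPEND, CB_APPEND, CB_LAST } cb_order_t;

/* Hypothetical callback: returns nonzero when the error is "resolved". */
typedef int (*err_cb_fn)(int proc, int err_code);

#define MAX_CBS 16
typedef struct { err_cb_fn fn; cb_order_t order; } cb_entry_t;

static cb_entry_t cbs[MAX_CBS];
static size_t ncbs = 0;

/* Insert fn according to order; returns 0 on success, -1 on conflict/full. */
static int register_callback(err_cb_fn fn, cb_order_t order)
{
    size_t idx;
    int has_first = ncbs > 0 && cbs[0].order == CB_FIRST;
    int has_last  = ncbs > 0 && cbs[ncbs - 1].order == CB_LAST;

    assert(fn != NULL);
    if (ncbs == MAX_CBS) return -1;

    switch (order) {
    case CB_FIRST:   if (has_first) return -1; idx = 0;          break;
    case CB_PREPEND: idx = has_first ? 1 : 0;                    break;
    case CB_APPEND:  idx = has_last ? ncbs - 1 : ncbs;           break;
    case CB_LAST:    if (has_last) return -1;  idx = ncbs;       break;
    default:         return -1;
    }

    /* Shift the tail one slot to the right and drop the new entry in. */
    memmove(&cbs[idx + 1], &cbs[idx], (ncbs - idx) * sizeof(cbs[0]));
    cbs[idx].fn = fn;
    cbs[idx].order = order;
    ncbs++;
    return 0;
}

/* Call handlers in order until one reports the error as resolved. */
static int dispatch_error(int proc, int err_code)
{
    size_t i;
    for (i = 0; i < ncbs; i++) {
        if (cbs[i].fn(proc, err_code)) return 1;
    }
    return 0;  /* nobody resolved it */
}

/* Stand-in for the default MPI-layer handler; records instead of aborting. */
static int abort_requested = 0;
static int default_abort_cb(int proc, int err_code)
{
    (void)proc;
    (void)err_code;
    abort_requested = 1;
    return 1;
}

/* Example upper-layer handler: resolves only the made-up error code 7. */
static int retry_cb(int proc, int err_code)
{
    (void)proc;
    return err_code == 7;
}
```

Registering default_abort_cb with CB_LAST and retry_cb with CB_APPEND leaves 
the abort handler at the tail, so it only runs for errors that retry_cb does 
not resolve.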

err.diff
Description: Binary data