OK, I see your points. I did not expect this behavior to be so
problematic, but all of your reasons look convincing. The requirement
comes simply from the wishes of our users.
I will try to find another approach then.

Thanks for the detailed explanation, Ralph!


On Tue, Jun 4, 2013 at 3:34 AM, Ralph Castain <[email protected]> wrote:

>  The OMPI developers were meeting this afternoon, so we took advantage of
> it to discuss this topic. We would recommend not changing the current
> behavior for two reasons. First, there is a long precedent for returning
> the first non-zero status, and returning a non-zero status if any process
> causes the entire job to abort even if they all abort with status zero.
> This is the only way the user (and any script they are using) can know that
> an "abort" was ordered.
>
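The rule described above can be sketched as follows. This is only an
illustration of the stated policy, not Open MPI or SLURM code: the names
job_exit_status and ABORT_STATUS are hypothetical, and the value 1 is an
assumed placeholder, since the thread does not say which non-zero status
is reported for an abort.

```python
# Placeholder for "some non-zero status"; the thread does not say which
# value the launcher actually reports, so 1 is an assumption.
ABORT_STATUS = 1


def job_exit_status(exit_codes, abort_requested):
    """Return the status a launcher would report for the whole job.

    exit_codes: per-process exit codes, in the order the processes ended.
    abort_requested: True if any process ordered a job-wide abort.
    """
    # Rule 1: the first non-zero status wins.
    for code in exit_codes:
        if code != 0:
            return code
    # Rule 2: an abort is never reported as success, even if every
    # process exited with status zero.
    if abort_requested:
        return ABORT_STATUS
    return 0
```

Under this rule, job_exit_status([0, 0], abort_requested=True) is non-zero,
which is what lets a calling script tell an ordered abort from a normal exit.
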
> Second, we have looked at the OpenShmem standard and confirmed that
> nothing is said there about returning zero status in such situations. We
> don't know the source of this proposed requirement, but feel that it
> shouldn't override the community's expected behavior.
>
> Just our $0.02
> Ralph
>
>
>
> On Mon, Jun 3, 2013 at 1:45 PM, Ralph Castain <[email protected]> wrote:
>
>>  I'm leery of this patch - will discuss with other MPI folks as this
>> could cause problems for existing apps
>>
>>
>>> On Jun 3, 2013, at 5:17 AM, "Riebs, Andy" <[email protected]> wrote:
>>>
>>> Hi Victor,
>>>
>>> If the patch is straightforward, and the reason for it is clear,
>>> patches sent to this list tend to be adopted quickly. However, since this
>>> changes behavior that someone else may be counting on, it might get held
>>> for the next major release if it is accepted.
>>>
>>> Andy
>>>
>>> From: Victor Kocheganov [mailto:[email protected]]
>>> Sent: Monday, June 03, 2013 6:48 AM
>>> To: slurm-dev
>>> Subject: [slurm-dev] Re: PMI_Abort with zero value
>>>
>>> OK, I see.
>>>
>>> I've got the SLURM sources with PMI in them and found the reason for the
>>> "strange" behavior (I mean that the rank 0 process behaves differently
>>> from the others in PMI_Abort()).
>>>
>>> It seems clear how to deal with it. Is it a complex procedure to provide
>>> a minor fix to the community (via a patch)?
>>>
>>> On Mon, Jun 3, 2013 at 12:18 AM, Brian Gilmer <[email protected]>
>>> wrote:
>>>
>>> We are able to use _exit() so I did not go any further.  The behavior of
>>> PMI_Abort() and exit() was odd in both cases, so I thought that might
>>> save you some time.  I am interested if you find another solution.
>>>
>>> On Sun, Jun 2, 2013 at 3:39 AM, Victor Kocheganov <
>>> [email protected]> wrote:
>>>
>>> Thank you for the rapid answer! But I still have several questions;
>>> please see inline.
>>>
>>> On Fri, May 31, 2013 at 6:47 PM, Brian Gilmer <[email protected]>
>>> wrote:
>>>
>>> I addressed a similar problem with _exit(<value>).
>>>
>>> [Victor Kocheganov] Where can I find it? I cannot find any clue in the
>>> archive of the slurm-dev list (
>>> http://dir.gmane.org/gmane.comp.distributed.slurm.devel )
>>>
>>> Slurm will kill off the rest of the PEs in a job step if one exits with
>>> a non-zero code.
>>>
>>> [Victor Kocheganov] Unfortunately, as far as I know, it depends on the
>>> slurm configuration (whether the '-K' flag is set or not; it could be
>>> set implicitly). So I cannot rely on such behavior...
>>>
>>> The exit() function doesn't work under mx shmem because the exit()
>>> function is overridden and does not propagate the exit code.
>>> PMI_Abort(exit_code) uses exit() so in our case it always returns an exit
>>> code of 9 regardless of the value of exit_code.
>>>
>>> [Victor Kocheganov] And this is interesting, because I see that SLURM
>>> always returns zero to the system when PMI_Abort(0,NULL) is invoked by
>>> some process, except when the process with rank zero (the PMI daemon,
>>> as I suspect) invokes it. Therefore I still have a little hope that I
>>> can make PMI_Abort work for me (always return zero in the case of
>>> PMI_Abort(0,NULL)).
>>>
>>> But you are saying that there is no hope in PMI_Abort(), do I understand
>>> that right? Do you have any other way to make SLURM (using PMI or without
>>> it) terminate all the processes if one of them requests it (with the exit
>>> status passed on, of course)?
>>>
>>> On Fri, May 31, 2013 at 5:23 AM, Victor Kocheganov <
>>> [email protected]> wrote:
>>>
>>> Hello,
>>>
>>> I am a SHMEM library developer and I am looking for a way to terminate
>>> the whole slurm job with a specific exit status when one of the processes
>>> initiates it. That is, the SHMEM library should have an API routine named
>>> 'globalexit(int status);', which terminates the job, including the other
>>> processes in it, with status as the exit code.
>>>
>>> The only way I have found is to use PMI_Abort(status), but it does not
>>> work for a zero status value when PMI_Abort is invoked by the rank zero
>>> process (the daemon for PMI, as I understand). Is this normal behavior or
>>> a bug? Could you please help me find any other approaches, if this one
>>> does not seem proper for slurm?
>>>
>>> Thank you in advance,
>>>
>>> Victor Kocheganov.
>>>
>>> --
>>> Speak when you are angry--and you will make the best speech you'll ever
>>> regret.
>>>   - Laurence J. Peter
>>>
>>
>
