The PMI group is handled elsewhere, but I agree that we need a different
API for this purpose. We see conditional execution in Hadoop and elsewhere,
so it is a reasonable use-case - just need to leave "abort" as something
different. I'd propose we create something like PMI_Terminate as a separate
way of implementing it.
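
To make the distinction concrete, here is a minimal sketch of the launcher-side policy discussed in this thread, written as a small Python model rather than real SLURM or PMI code. The names TaskResult, job_exit_status and terminate_status are hypothetical, PMI_Terminate does not exist today, and the value 1 for "aborted with status zero" is only a placeholder for "some non-zero status". It also assumes that an explicit terminate request takes precedence over everything else.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class TaskResult:
    rank: int
    exit_status: int                        # status the task exited with
    aborted: bool = False                   # task called an abort-style routine
    terminate_status: Optional[int] = None  # hypothetical PMI_Terminate-style request


def job_exit_status(results: Iterable[TaskResult]) -> int:
    """Model of the status a launcher could report for a job step.

    `results` is assumed to be ordered by the time each task finished.
    """
    results = list(results)

    # Assumption: an explicit "terminate" request is propagated verbatim,
    # even when the requested status is zero.
    for r in results:
        if r.terminate_status is not None:
            return r.terminate_status

    # Existing precedent: report the first non-zero status.
    for r in results:
        if r.exit_status != 0:
            return r.exit_status

    # Existing precedent: an abort is never reported as success,
    # even if every task aborted with status zero.
    if any(r.aborted for r in results):
        return 1  # placeholder for "some non-zero status"

    return 0
```

With this model, an abort with status zero still yields a non-zero job status, while a terminate request with status zero yields zero, which is the behavior globalexit(0) would need.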

I'll pose the question to the PMI folks and post the conclusion here as
well.
Ralph



On Tue, Jun 4, 2013 at 6:52 AM, Andy Riebs <[email protected]> wrote:

>  I don't know the provenance of the PMI specification, but would it be
> possible to add a new function (at least within SLURM's PMI implementation)
> with the effect that Victor describes? Legacy SHMEM implementations have
> provided globalexit() and, should OpenSHMEM evolve to include it, it will
> likely have the semantics that globalexit(0) should cause the launcher to
> exit with 0.
>
> Andy
>
>
> On 06/04/2013 07:59 AM, Victor Kocheganov wrote:
>
>  OK, I see your points: I did not expect this behavior to be so
> inconvenient, but all the reasons look convincing. The source of the
> requirement is just the "will" of our users.
> I will try to find another approach then.
>
>  Thanks for the detailed explanation, Ralph!
>
>
> On Tue, Jun 4, 2013 at 3:34 AM, Ralph Castain <[email protected]> wrote:
>
>>  The OMPI developers were meeting this afternoon, so we took advantage
>> of it to discuss this topic. We would recommend not changing the current
>> behavior for two reasons. First, there is a long precedent for returning
>> the first non-zero status, and returning a non-zero status if any process
>> causes the entire job to abort even if they all abort with status zero.
>> This is the only way the user (and any script they are using) can know that
>> an "abort" was ordered.
>>
>>  Second, we have looked at the OpenSHMEM standard and confirmed that
>> nothing is said there about returning zero status in such situations. We
>> don't know the source of this proposed requirement, but feel that it
>> shouldn't override the community's expected behavior.
>>
>>  Just our $0.02
>> Ralph
>>
>>
>>
>> On Mon, Jun 3, 2013 at 1:45 PM, Ralph Castain <[email protected]> wrote:
>>
>>>  I'm leery of this patch - will discuss with other MPI folks as this
>>> could cause problems for existing apps
>>>
>>>
>>>>
>>>> On Jun 3, 2013, at 5:17 AM, "Riebs, Andy" <[email protected]> wrote:
>>>>
>>>>     Hi Victor,
>>>>
>>>>
>>>>
>>>> If the patch is straightforward, and the reason for it is clear,
>>>> patches sent to this list tend to be adopted quickly. However, since this
>>>> changes behavior that someone else may be counting on, it might get held
>>>> for the next major release if it is accepted.
>>>>
>>>>
>>>>
>>>> Andy
>>>>
>>>>
>>>>
>>>> *From:* Victor Kocheganov [mailto:[email protected]]
>>>>
>>>> *Sent:* Monday, June 03, 2013 6:48 AM
>>>> *To:* slurm-dev
>>>> *Subject:* [slurm-dev] Re: PMI_Abort with zero value
>>>>
>>>>
>>>>
>>>> OK, I see.
>>>>
>>>>
>>>>
>>>> I've got the SLURM sources with PMI in them and found the reason for the
>>>> "strange" behavior (I mean that the rank 0 process behaves differently
>>>> from the others in PMI_Abort()).
>>>>
>>>> It seems clear how to deal with it. Is it a complex procedure to provide
>>>> a minor fix to the community (via a patch)?
>>>>
>>>>
>>>>
>>>> On Mon, Jun 3, 2013 at 12:18 AM, Brian Gilmer <[email protected]>
>>>> wrote:
>>>>
>>>> We are able to use _exit() so I did not go any further.  PMI_Abort()
>>>> and exit() both behaved oddly, so I thought that might save you some
>>>> time.  I am interested if you find another solution.
>>>>
>>>>
>>>>
>>>> On Sun, Jun 2, 2013 at 3:39 AM, Victor Kocheganov <
>>>> [email protected]> wrote:
>>>>
>>>> Thank you for the rapid answer! But still I have several questions,
>>>> please see inline.
>>>>
>>>>
>>>>
>>>> On Fri, May 31, 2013 at 6:47 PM, Brian Gilmer <[email protected]>
>>>> wrote:
>>>>
>>>> I addressed a similar problem with _exit(<value>).
>>>>
>>>> [Victor Kocheganov] Where can I find it? I cannot find any clue in the
>>>> archive of the slurm-dev list (
>>>> http://dir.gmane.org/gmane.comp.distributed.slurm.devel
>>>> )
>>>>
>>>>
>>>>
>>>>  Slurm will kill off the rest of the PEs in a job step if one exits
>>>> with a non-zero code.
>>>>
>>>>  [Victor Kocheganov] Unfortunately, as far as I know, that depends on the
>>>> SLURM configuration (whether the '-K' flag is set or not; it could be set
>>>> implicitly). So I cannot rely on such behavior...
>>>>
>>>>
>>>>
>>>>  The exit() function doesn't work under mx shmem because the exit()
>>>> function is overridden and does not propagate the exit code.
>>>> PMI_Abort(exit_code) uses exit() so in our case it always returns an exit
>>>> code of 9 regardless of the value of exit_code.
>>>>
>>>>  [Victor Kocheganov] This is interesting, because I see that SLURM
>>>> always returns zero to the system when PMI_Abort(0,NULL) is invoked by
>>>> some process, except when the process with rank zero (the PMI daemon,
>>>> as I suspect) invokes it. So I still have a little hope that I can
>>>> make PMI_Abort work for me (always return zero in the
>>>> PMI_Abort(0,NULL) case).
>>>>
>>>>
>>>>
>>>> But you are saying that there is no hope for PMI_Abort(), do I
>>>> understand correctly? Do you know any other way to make SLURM (using PMI
>>>> or without it) terminate all the processes if one of them requests it
>>>> (with the exit status passed through, of course)?
>>>>
>>>>
>>>>
>>>> On Fri, May 31, 2013 at 5:23 AM, Victor Kocheganov <
>>>> [email protected]> wrote:
>>>>
>>>>   Hello,
>>>>
>>>>
>>>>
>>>> I am a SHMEM library developer and I am looking for a way to terminate
>>>> the whole SLURM job with a specific exit status when one of the processes
>>>> initiates it. That is, the SHMEM library should have an API routine named
>>>> 'globalexit(int status);' that terminates the job, including the other
>>>> processes in it, with 'status' as the exit code.
>>>>
>>>>
>>>>
>>>> The only way I have found is to use PMI_Abort(status), but it does not
>>>> work for a zero status value when PMI_Abort is invoked by the rank zero
>>>> process (the daemon for PMI, as I understand it). Is this normal behavior
>>>> or a bug? Could you please help me find another approach if this one does
>>>> not seem proper for SLURM?
>>>>
>>>>
>>>>
>>>> Thank you in advance,
>>>>
>>>> Victor Kocheganov.
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Speak when you are angry--and you will make the best speech you'll ever
>>>> regret.
>>>>   - Laurence J. Peter
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
> --
> Andy Riebs
> Hewlett-Packard Company
> High Performance Computing
> +1 404 648 9024
> My opinions are not necessarily those of HP
>
>
