OK, I see.

I've got the SLURM sources with PMI in them and found the reason for the "strange"
behavior (I mean that the rank 0 process behaves differently from the others in
PMI_Abort()).
The fix looks straightforward. Is it a complex procedure to contribute a
minor fix to the community (via a patch)?
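
For anyone following the thread, the exit() versus _exit() difference described in the quoted messages can be reproduced in isolation. This is only a minimal sketch under assumptions: the atexit() handler clobber_status() is a hypothetical stand-in for whatever the SHMEM runtime does when it overrides exit(), and the value 9 is simply borrowed from the report below. It is not code from SLURM, PMI, or any SHMEM library.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a runtime that hijacks normal exit():
 * this handler runs during exit() and replaces the requested status. */
static void clobber_status(void)
{
    _exit(9);
}

/* Fork a child that terminates with `status`, either through exit()
 * (use_underscore_exit == 0) or through _exit() (non-zero).
 * Returns the exit status the parent observes, or -1 on error. */
int observed_status(int status, int use_underscore_exit)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        atexit(clobber_status);
        if (use_underscore_exit)
            _exit(status);   /* skips atexit handlers, status is kept */
        exit(status);        /* runs clobber_status() first */
    }

    int wstatus = 0;
    if (waitpid(pid, &wstatus, 0) < 0 || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);
}
```

Calling observed_status(3, 0) goes through exit() and lets the handler replace the status, while observed_status(3, 1) goes through _exit() and bypasses the handler. This only shows why _exit() can preserve a status that exit() loses; it does not by itself terminate the other processes of a job step.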


On Mon, Jun 3, 2013 at 12:18 AM, Brian Gilmer <[email protected]> wrote:

> We are able to use _exit(), so I did not go any further. PMI_Abort() and
> exit() both behaved oddly, so I thought that might save you some
> time. I am interested if you find another solution.
>
>
> On Sun, Jun 2, 2013 at 3:39 AM, Victor Kocheganov <
> [email protected]> wrote:
>
>> Thank you for the quick answer! I still have several questions, though;
>> please see inline.
>>
>>
>> On Fri, May 31, 2013 at 6:47 PM, Brian Gilmer <[email protected]> wrote:
>>
>>>  I addressed a similar problem with _exit(<value>).
>>>
>> [Victor Kocheganov] Where can I find it? I cannot find any clue in the
>> archive of the slurm-dev list
>> (http://dir.gmane.org/gmane.comp.distributed.slurm.devel)
>>
>>> Slurm will kill off the rest of the PEs in a job step if one exits with a
>>> non-zero code.
>>>
>> [Victor Kocheganov] Unfortunately, as far as I know, that depends on the
>> Slurm configuration (whether the '-K' flag is set or not; it could be set
>> implicitly). So I cannot rely on such behavior...
>>
>>> The exit() function doesn't work under mx shmem because the exit()
>>> function is overridden and does not propagate the exit code.
>>> PMI_Abort(exit_code) uses exit() so in our case it always returns an exit
>>> code of 9 regardless of the value of exit_code.
>>>
>> [Victor Kocheganov] This is interesting, because I see that SLURM
>> always returns zero to the system when PMI_Abort(0,NULL) is invoked by
>> some process, except when the process with rank zero (the PMI daemon,
>> as I suspect) invokes it. So I still have a little hope
>> that I can make PMI_Abort work for me (always return zero for
>> PMI_Abort(0,NULL)).
>>
>> But you are saying that there is no hope for PMI_Abort(), do I understand
>> correctly? Do you know any other way to make SLURM (using PMI or without it)
>> terminate all the processes if one of them requests it (passing the exit
>> status, of course)?
>>
>>>
>>>
>>> On Fri, May 31, 2013 at 5:23 AM, Victor Kocheganov <
>>> [email protected]> wrote:
>>>
>>>>  Hello,
>>>>
>>>> I am a SHMEM library developer and I am looking for a way to terminate
>>>> the whole Slurm job with a specific exit status when one of the processes
>>>> initiates it. That is, the SHMEM library should have an API routine named
>>>> 'globalexit(int status);', which terminates the job, including the other
>>>> processes in it, with status as the exit code.
>>>>
>>>> The only way I have found is to use PMI_Abort(status), but it does not
>>>> work for a zero status value when PMI_Abort is invoked by the rank zero
>>>> process (the daemon for PMI, as I understand). Is this normal behavior or
>>>> a bug? Could you please help me find other approaches, if this one does
>>>> not seem proper for Slurm?
>>>>
>>>> Thank you in advance,
>>>> Victor Kocheganov.
>>>>
>>>
>>>
>>>
>>> --
>>> Speak when you are angry--and you will make the best speech you'll ever
>>> regret.
>>>   - Laurence J. Peter
>>>
>>
>>
>
>
> --
> Speak when you are angry--and you will make the best speech you'll ever
> regret.
>   - Laurence J. Peter
>
