Hi Victor,
If the patch is straightforward and the reason for it
is clear, patches sent to this list tend to be adopted
quickly. However, since this changes behavior that
someone else may be counting on, it might be held for
the next major release if it is accepted.
Andy
*From:* Victor Kocheganov
[mailto:[email protected]]
*Sent:* Monday, June 03, 2013 6:48 AM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: PMI_Abort with zero value
OK, I see.
I've got the SLURM sources with PMI in them and found the
reason for the "strange" behavior (I mean that the rank 0
process behaves differently from the others in PMI_Abort()).
It seems clear how to deal with it. Is it a complex procedure
to contribute a minor fix to the community (via a patch)?
On Mon, Jun 3, 2013 at 12:18 AM, Brian Gilmer
<[email protected] <mailto:[email protected]>> wrote:
We are able to use _exit() so I did not go any further.
The behavior of PMI_Abort() and of exit() was odd in both
cases, so I thought that might save you some time. I would
be interested to hear if you find another solution.
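The difference between exit() and _exit() mentioned above can be sketched in plain C. This is only an illustration under stated assumptions: the thread does not say how the SHMEM runtime overrides exit(), so the atexit() hook below (clobber_exit_code) is a hypothetical stand-in for it, and observed_status is likewise a made-up helper name. The point it shows is that exit() runs registered handlers, which may replace the exit code, while _exit() terminates the process at once with the code it was given.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a runtime that hooks exit(): it discards the
 * requested exit code and terminates the process with 9 instead. */
static void clobber_exit_code(void)
{
    _exit(9);
}

/* Fork a child that terminates with `status`, either through exit()
 * (use_underscore_exit == 0) or through _exit() (use_underscore_exit != 0).
 * Returns the exit code the parent observes, or -1 on error. */
int observed_status(int use_underscore_exit, int status)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: install the hook, then terminate. */
        atexit(clobber_exit_code);
        if (use_underscore_exit)
            _exit(status);   /* skips atexit handlers entirely */
        exit(status);        /* runs the hook, which overrides the code */
    }

    /* Parent: collect the child's exit code. */
    int wstatus = 0;
    if (waitpid(pid, &wstatus, 0) < 0 || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);
}
```

With this hook installed, observed_status(0, 3) reports 9 because the handler overrides the code, while observed_status(1, 3) reports 3. Whether the real library behaves this way is an assumption; the sketch only shows why _exit() can propagate a code that exit() loses.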
On Sun, Jun 2, 2013 at 3:39 AM, Victor Kocheganov
<[email protected]
<mailto:[email protected]>> wrote:
Thank you for the rapid answer! But still I have several
questions, please see inline.
On Fri, May 31, 2013 at 6:47 PM, Brian Gilmer
<[email protected] <mailto:[email protected]>> wrote:
I addressed a similar problem with _exit(<value>).
[Victor Kocheganov] Where can I find it? I cannot find any
clue in the archive of the slurm-dev list
(http://dir.gmane.org/gmane.comp.distributed.slurm.devel)
Slurm will kill off the rest of the PEs in a job step
if one exits with a non-zero code.
[Victor Kocheganov] Unfortunately, as far as I know, that
depends on the slurm configuration (whether the '-K' flag is
set or not; it could be set implicitly). So I cannot rely on
such behavior...
The exit() function doesn't work under mx shmem
because the exit() function is overridden and does
not propagate the exit code. PMI_Abort(exit_code)
uses exit() so in our case it always returns an exit
code of 9 regardless of the value of exit_code.
[Victor Kocheganov] And this is interesting, because I
see that SLURM always returns zero to the system when
PMI_Abort(0,NULL) is invoked by some process, except for
the case when the process with rank zero (the PMI daemon, I
suspect) invokes it. So I still have a little hope
that I can make PMI_Abort work for me (always return
zero in the PMI_Abort(0,NULL) case).
But you are saying that there is no hope for PMI_Abort(),
do I understand correctly? Do you know any other way to make
SLURM (using PMI or without it) terminate all the
processes if one of them requests it (with the exit status
passed along, of course)?
On Fri, May 31, 2013 at 5:23 AM, Victor Kocheganov
<[email protected]
<mailto:[email protected]>> wrote:
Hello,
I am a SHMEM library developer and I am looking for
an approach to terminate the whole slurm job with
a specific exit status when one of the processes
initiates it. That is, the SHMEM library should have
an API routine named 'globalexit(int status);',
which terminates the job, including the other processes
in it, with status as the exit code.
The only way I have found is to use
PMI_Abort(status), but it does not work for a zero
status value when PMI_Abort is invoked by the rank zero
process (the daemon for PMI, as I understand it). Is this
normal behavior or a bug? Could you please help me
find another approach, if this one does
not seem proper for slurm?
Thank you in advance,
Victor Kocheganov.
--
Speak when you are angry--and you will make the best
speech you'll ever regret.
- Laurence J. Peter