Hi Victor,

If the patch is straightforward and the reason for it is clear, patches sent 
to this list tend to be adopted quickly. However, since this changes behavior 
that someone else may be counting on, it might be held for the next major 
release if it is accepted.

Andy

From: Victor Kocheganov [mailto:[email protected]]
Sent: Monday, June 03, 2013 6:48 AM
To: slurm-dev
Subject: [slurm-dev] Re: PMI_Abort with zero value

OK, I see.

I got the SLURM sources with PMI in them and found the reason for the "strange" 
behavior (I mean that the rank 0 process behaves differently from the others in 
PMI_Abort()). The fix looks straightforward. Is it a complex procedure to 
contribute a minor fix to the community (via a patch)?

On Mon, Jun 3, 2013 at 12:18 AM, Brian Gilmer <[email protected]> wrote:
We are able to use _exit(), so I did not go any further. The behavior of 
PMI_Abort() and exit() were both odd, so I thought that might save you some 
time. I am interested if you find another solution.

On Sun, Jun 2, 2013 at 3:39 AM, Victor Kocheganov <[email protected]> wrote:
Thank you for the rapid answer! But I still have several questions; please see 
inline.

On Fri, May 31, 2013 at 6:47 PM, Brian Gilmer <[email protected]> wrote:
I addressed a similar problem with _exit(<value>).
[Victor Kocheganov] Where can I find it? I cannot find any clue in the archive 
of the slurm-dev list 
(http://dir.gmane.org/gmane.comp.distributed.slurm.devel).

Slurm will kill off the rest of the PEs in a job step if one exits with a 
non-zero code.
[Victor Kocheganov] Unfortunately, as far as I know, that depends on the Slurm 
configuration (whether the '-K' flag is set or not; it could be set 
implicitly). So I cannot rely on such behavior...

The exit() function doesn't work under mx shmem because the exit() function is 
overridden and does not propagate the exit code.  PMI_Abort(exit_code) uses 
exit() so in our case it always returns an exit code of 9 regardless of the 
value of exit_code.
[Victor Kocheganov] This is interesting, because I see that SLURM always 
returns zero to the system when PMI_Abort(0,NULL) is invoked by some process, 
except when the process with rank zero (the PMI daemon, I suspect) invokes it. 
So I still have a little hope that I can make PMI_Abort work for me (always 
return zero in the PMI_Abort(0,NULL) case).

But you are saying that there is no hope for PMI_Abort(), do I understand 
correctly? Do you know any other way to make SLURM (with or without PMI) 
terminate all the processes if one of them requests it (passing along the exit 
status, of course)?

On Fri, May 31, 2013 at 5:23 AM, Victor Kocheganov <[email protected]> wrote:
Hello,

I am a SHMEM library developer and I am looking for a way to terminate the 
whole Slurm job with a specific exit status when one of the processes 
initiates it. That is, the SHMEM library should have an API routine named 
'globalexit(int status);', which terminates the job, including the other 
processes in it, with status as the exit code.

The only way I have found is to use PMI_Abort(status), but it does not work for 
a zero status value when PMI_Abort is invoked by the rank zero process (the 
daemon for PMI, as I understand). Is this normal behavior or a bug? Could you 
please help me find another approach, if this one is not appropriate for Slurm?

Thank you in advance,
Victor Kocheganov.



--
Speak when you are angry--and you will make the best speech you'll ever regret.
  - Laurence J. Peter



