OK, I see. I got the SLURM sources, which include PMI, and found the reason for the "strange" behavior (I mean that the rank 0 process behaves differently from the others in PMI_Abort()). It seems clear how to deal with it. Is it a complicated procedure to contribute a minor fix to the community (via a patch)?
On Mon, Jun 3, 2013 at 12:18 AM, Brian Gilmer <[email protected]> wrote:

> We are able to use _exit() so I did not go any further. The behavior of
> PMI_Abort() and exit() were both odd so I thought that might save you some
> time. I am interested if you find another solution.
>
>
> On Sun, Jun 2, 2013 at 3:39 AM, Victor Kocheganov <[email protected]> wrote:
>
>> Thank you for the rapid answer! But I still have several questions,
>> please see inline.
>>
>>
>> On Fri, May 31, 2013 at 6:47 PM, Brian Gilmer <[email protected]> wrote:
>>
>>> I addressed a similar problem with _exit(<value>).
>>>
>> [Victor Kocheganov] Where can I find it? I cannot find any clue in the
>> archive of the slurm-dev list
>> (http://dir.gmane.org/gmane.comp.distributed.slurm.devel)
>>
>>> Slurm will kill off the rest of the pe in a job step if one exits with a
>>> non-zero code.
>>>
>> [Victor Kocheganov] Unfortunately it depends on the slurm configuration as
>> far as I know (whether the '-K' flag is set or not; it could be set
>> implicitly). So I cannot rely on such behavior...
>>
>>> The exit() function doesn't work under mx shmem because the exit()
>>> function is overridden and does not propagate the exit code.
>>> PMI_Abort(exit_code) uses exit() so in our case it always returns an exit
>>> code of 9 regardless of the value of exit_code.
>>>
>> [Victor Kocheganov] And this is interesting, because I see that SLURM
>> always returns a zero value to the system when PMI_Abort(0,NULL) is invoked
>> by some process, except for the case when the process with rank zero (the
>> PMI daemon, as I suspect) invokes it. Therefore a little hope still exists
>> in my mind that I can make PMI_Abort work for me (always return zero in the
>> case of PMI_Abort(0,NULL)).
>>
>> But you are saying that there is no hope in PMI_Abort(), do I understand
>> correctly? Do you have any other ways to make SLURM (using PMI or without
>> it) terminate all the processes if one of them requests it (with the exit
>> status passed along, of course)?
>>
>>>
>>>
>>> On Fri, May 31, 2013 at 5:23 AM, Victor Kocheganov <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am a SHMEM library developer and I am looking for an approach to
>>>> terminate the whole slurm job with a specific exit status when one of the
>>>> processes initiates it. That is, the SHMEM library should have some API
>>>> routine named 'globalexit(int status);', which terminates the job,
>>>> including the other processes in it, with status as the exit code.
>>>>
>>>> The only way I have found is to use PMI_Abort(status), but it does not
>>>> work for a zero status value when PMI_Abort is invoked by the rank zero
>>>> process (the daemon for PMI, as I understand). Is it normal behavior or a
>>>> bug? Could you please help to find any other approaches, if this one does
>>>> not seem proper for slurm?
>>>>
>>>> Thank you in advance,
>>>> Victor Kocheganov.
>>>>
>>>
>>>
>>>
>>> --
>>> Speak when you are angry--and you will make the best speech you'll ever
>>> regret.
>>> - Laurence J. Peter
>>>
>>
>>
>
>
> --
> Speak when you are angry--and you will make the best speech you'll ever
> regret.
> - Laurence J. Peter
>
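
For readers following the exit() versus _exit() point in the quoted thread, here is a minimal C sketch of why _exit() can preserve a status that exit() loses. It is only an illustration under stated assumptions: the handler clobber_status() and the helper child_exit_status() are hypothetical names, and the atexit() handler is a stand-in for whatever the mx shmem library actually does when it overrides exit(), which the thread does not describe. It uses only POSIX calls and no SLURM or PMI API.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a library that interferes with normal exit
 * processing: it runs during exit() and replaces the status with 9. */
static void clobber_status(void)
{
    _exit(9);
}

/* Fork a child that registers the handler and then terminates with
 * `status`, either via exit() (handlers run) or via _exit() (handlers
 * are skipped). Returns the exit status observed by the parent, or -1
 * if fork/waitpid fails or the child did not exit normally. */
static int child_exit_status(int use_underscore_exit, int status)
{
    /* Only the low 8 bits of an exit status are reported by waitpid(). */
    assert(status >= 0 && status <= 255);

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: install the handler, then terminate. */
        if (atexit(clobber_status) != 0)
            _exit(255);
        if (use_underscore_exit)
            _exit(status);   /* bypasses atexit handlers */
        exit(status);        /* runs clobber_status() first */
    }

    /* Parent: collect the child's termination status. */
    int wstatus = 0;
    if (waitpid(pid, &wstatus, 0) == -1 || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);
}
```

With this stand-in, child_exit_status(0, 3) reports 9 because the handler runs inside exit(), while child_exit_status(1, 3) reports 3 because _exit() skips the handler. Whether the real library behaves the same way would need to be checked against its sources.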
