I addressed a similar problem with _exit(<value>). Slurm kills the remaining PEs in a job step if one of them exits with a non-zero code. exit() doesn't work under mx shmem because the function is overridden there and does not propagate the exit code. PMI_Abort(exit_code) uses exit(), so in our case it always returns an exit code of 9 regardless of the value of exit_code.
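To illustrate the _exit() point, here is a minimal sketch in plain POSIX C. It does not involve Slurm or SHMEM at all: fork()/waitpid() stand in for the launcher watching a PE, and the names global_exit_sketch and observed_exit_code are hypothetical, made up for this example. It only shows that a status passed to _exit() reaches the parent unchanged, since _exit() skips atexit handlers and any library-level replacement of exit().

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: end the calling process right away with `status`.
 * _exit() does not run atexit handlers and does not go through exit(),
 * so an overridden exit() cannot change the code. */
static void global_exit_sketch(int status)
{
    fflush(NULL);   /* _exit() does not flush stdio buffers itself */
    _exit(status);
}

/* Stand-in for the launcher: fork a child that calls
 * global_exit_sketch(status) and return the exit code the parent sees,
 * or -1 if something went wrong. */
int observed_exit_code(int status)
{
    fflush(NULL);   /* avoid duplicating buffered output in the child */

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        global_exit_sketch(status);   /* child: never returns */

    int wstatus = 0;
    if (waitpid(pid, &wstatus, 0) < 0)
        return -1;
    return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
}
```

Calling observed_exit_code(3) from a small main() and printing the result is enough to try it. Whether the remaining PEs are then torn down, and which code the job step finally reports, depends on Slurm and is not something this sketch checks.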

On Fri, May 31, 2013 at 5:23 AM, Victor Kocheganov <[email protected]> wrote:

> Hello,
>
> I am SHMEM library developer and I am looking for approach to terminate
> the whole slurm job with the specific exit status, when one of processes
> initiate it. That is SHMEM library should have some API routine named
> 'globalexit(int status);', which terminates the job with other processes in
> it with status exit code.
>
> The only way I found out is to use PMI_Abort(status), but it does not work
> for zero status value, when PMI_Abort is invoked by zero process (daemon
> for PMI, as I understand). Is it normal behavior or a bug? Could you please
> help to find any other approaches, if this one does not seem proper for
> slurm?
>
> Thank you in advance,
> Victor Kocheganov.

--
Speak when you are angry--and you will make the best speech you'll ever regret.
- Laurence J. Peter
