I addressed a similar problem with _exit(<value>).  Slurm kills the
remaining PEs in a job step if one of them exits with a non-zero code.
The exit() function doesn't work under mx shmem because exit() is
overridden there and does not propagate the exit code.
PMI_Abort(exit_code) uses exit(), so in our case it always returns an exit
code of 9 regardless of the value of exit_code.


On Fri, May 31, 2013 at 5:23 AM, Victor Kocheganov <
[email protected]> wrote:

>  Hello,
>
> I am a SHMEM library developer and I am looking for a way to terminate
> the whole Slurm job with a specific exit status when one of the processes
> initiates it. That is, the SHMEM library should have an API routine named
> 'globalexit(int status);', which terminates the job, along with the other
> processes in it, with exit code status.
>
> The only way I have found is to use PMI_Abort(status), but it does not
> work for a zero status value when PMI_Abort is invoked by process zero
> (the daemon for PMI, as I understand it). Is this normal behavior or a
> bug? Could you please help me find another approach, if this one does not
> seem proper for Slurm?
>
> Thank you in advance,
> Victor Kocheganov.
>



-- 
Speak when you are angry--and you will make the best speech you'll ever
regret.
  - Laurence J. Peter
