The --no-kill option to salloc keeps the _job_ allocation active if
any nodes in that allocation fail, but salloc itself exits as soon as
the command it runs (here mpirun) exits. mpirun (or srun by default)
will exit if any nodes in the _step_ allocation fail.
In short, mpirun exited due to the node failure, and once mpirun
exited, salloc exited too.
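If you want the allocation to outlive a failed mpirun, one option is to have salloc run a wrapper script rather than mpirun directly, since salloc only exits when its command exits. What follows is a minimal sketch, not a tested recipe: the script name retry_step.sh and the function run_with_retries are made up for illustration, and the retry count is arbitrary.

```shell
#!/bin/bash
# retry_step.sh: hypothetical wrapper meant to be the command that salloc runs.
# salloc stays alive for as long as this script keeps running.

# Run a command up to $1 times, stopping at the first success.
# Returns 0 on success, 1 if every attempt failed.
run_with_retries() {
    local max_attempts=$1
    shift
    local attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Report the failure on stderr and try again.
        echo "attempt $attempt of $max_attempts failed: $*" >&2
        attempt=$((attempt + 1))
    done
    return 1
}
```

To use it, you would add a line such as "run_with_retries 3 mpirun ./my-mpi-application" at the end of the script and start it with "salloc -N 10 --no-kill ./retry_step.sh". Note that this only keeps salloc running across a failed launch. Whether the relaunched mpirun avoids the failed node is a separate question: it may need an explicit host list that leaves that node out, and your application must be able to restart from wherever it left off.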
Quoting Steven Chow <wulingaoshou_...@163.com>:
Hi,
I am new to Slurm.
I have a problem with fault tolerance when running an MPI
application on a cluster with Slurm.
My Slurm version is 14.03.6, and the MPI version is Open MPI 1.6.5.
I am not using the Checkpoint or Nonstop plugins.
I submit the job with the command "salloc -N 10 --no-kill mpirun
./my-mpi-application".
While the job is running, if one node crashes, the WHOLE job
is killed on all allocated nodes.
It seems that the "--no-kill" option doesn't work.
I want the job to continue running without being killed, even if
some nodes fail or a network connection breaks,
because I will handle the node failures myself.
Can anyone give some suggestions?
Also, if I want to use the Nonstop plugin to handle failures,
according to http://slurm.schedmd.com/nonstop.html, an additional
package named smd also needs to be installed.
How can I get this package?
Thanks!
-Steven Chow
--
Morris "Moe" Jette
CTO, SchedMD LLC