The --no-kill option to salloc keeps the _job_ allocation active if any nodes in that allocation fail, but salloc itself exits as soon as the command it runs (here mpirun) exits. mpirun (or srun by default) will exit if any nodes in the _step_ allocation fail.

In short, mpirun exited because of the node failure, and once mpirun exited, salloc exited too.
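
To illustrate that salloc lives only as long as the command it runs, here is a minimal sketch of a wrapper that salloc could run in place of a bare mpirun. The function name run_with_retries and the attempt count are hypothetical, not part of Slurm or Open MPI, and the sketch assumes the application can usefully be relaunched after a failure. It does nothing to steer the relaunch away from the failed node.

```shell
#!/bin/sh
# Hypothetical helper (not a Slurm or Open MPI feature):
# rerun a command until it succeeds or the attempts run out.
# Usage: run_with_retries MAX_ATTEMPTS command [args...]
run_with_retries() {
    max_attempts=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Run the command exactly as given; stop on the first success.
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
    done
    # Every attempt failed.
    return 1
}
```

Saved in a file (say retry.sh, a hypothetical name), it might be used as: salloc -N 10 --no-kill sh -c '. ./retry.sh; run_with_retries 3 mpirun ./my-mpi-application'. salloc then exits when the wrapper returns rather than on the first mpirun failure.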


Quoting Steven Chow <wulingaoshou_...@163.com>:

Hi,
I am new to Slurm.
I have a problem with fault tolerance when running an MPI application on a cluster with Slurm.


My Slurm version is 14.03.6, and the MPI version is Open MPI 1.6.5.
I am not using the Checkpoint or Nonstop plugins.


I submit the job with the command "salloc -N 10 --no-kill mpirun ./my-mpi-application".


While the job is running, if one node crashes, the WHOLE job is killed on all allocated nodes.
It seems that the "--no-kill" option doesn't work.


I want the job to keep running without being killed, even if some nodes fail or a network connection breaks,
because I will handle the node failures myself.


Can anyone give some suggestions?


Besides, if I want to use the Nonstop plugin to handle failures, then according to http://slurm.schedmd.com/nonstop.html an additional package named smd also needs to be installed.
How can I get this package?


Thanks!



-Steven Chow


--
Morris "Moe" Jette
CTO, SchedMD LLC
