Hi, I am new to Slurm. I have a problem with failure tolerance when running an MPI application on a cluster managed by Slurm.
My Slurm version is 14.03.6, and the MPI version is Open MPI 1.6.5. I am not using the Checkpoint or Nonstop plugins. I submit the job with the command "salloc -N 10 --no-kill mpirun ./my-mpi-application". While the job is running, if one node crashes, the WHOLE job is killed on all allocated nodes. It seems that the "--no-kill" option doesn't work. I want the job to keep running without being killed, even if some nodes fail or a network connection breaks, because I will handle node failures myself. Can anyone give some suggestions? Also, if I want to use the Nonstop plugin to handle failures, according to http://slurm.schedmd.com/nonstop.html an additional package named smd also needs to be installed. How can I get this package? Thanks! -Steven Chow