Hi,
I am new to Slurm.
I have a problem with failure tolerance when running an MPI
application on a cluster with Slurm.


My Slurm version is 14.03.6, and the MPI version is Open MPI 1.6.5.
I am not using the Checkpoint or Nonstop plugins.


I submit the job with the command "salloc -N 10 --no-kill mpirun
./my-mpi-application".


While the job is running, if one node crashes, the WHOLE job is killed
on all allocated nodes.
It seems that the "--no-kill" option doesn't work.


I want the job to keep running without being killed, even if some nodes
fail or the network connection is broken,
because I will handle node failures myself.


Can anyone give me some suggestions?


Besides, if I want to use the Nonstop plugin to handle failures, according to
http://slurm.schedmd.com/nonstop.html, an additional package named smd
also needs to be installed.
How can I get this package?


Thanks!



-Steven Chow
