Configure SlurmdTimeout to be sufficiently large and you should be fine, _except_ that if the node running the batch script itself reboots, that job will be killed.
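
As a rough sketch, the slurm.conf fragment below shows the kind of setting meant here. SlurmdTimeout is given in seconds. The 1800 is only an illustrative assumption, not a recommendation; choose a value comfortably longer than the slowest reboot you expect on the QA nodes.

```
# slurm.conf (fragment)
# Seconds slurmctld waits for slurmd on a node to respond before
# marking that node DOWN. 1800 is an assumed example value; it should
# exceed the worst-case reboot time of the nodes.
SlurmdTimeout=1800
```

The value currently in effect can be checked with `scontrol show config | grep SlurmdTimeout`.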

Quoting "Jeff Squyres (jsquyres)" <[email protected]>:

> Is there a mode in SLURM where I can make it ok to reboot nodes during a job?
>
> Specifically, we want to use SLURM to manage a QA cluster here in
> Cisco. Some of the things that we need QA jobs to do is actually
> reboot nodes -- but we don't want the SLURM job to end because the
> job rebooted; the reboot was part of the job. We want the node to
> reboot and have SLURM say "oh, ok, you're back -- you can re-join
> the job now."
>
> Is that possible?
>
> --
> Jeff Squyres
> [email protected]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
