One more thing. Configure FastSchedule=2 to avoid having the node marked DOWN due to an unexpected reboot.
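
Taken together with the SlurmdTimeout advice quoted below, the two settings might look like this in slurm.conf. This is only a sketch: the timeout value is an assumed placeholder, not a recommendation, and should be longer than the worst-case reboot time of your nodes.

```conf
# slurm.conf (fragment)

# Assumption: 600 seconds is a placeholder; pick a value longer than
# the time a node needs to reboot and restart slurmd.
SlurmdTimeout=600

# Suggested above to avoid the node being marked DOWN after an
# unexpected reboot.
FastSchedule=2
```

Both slurmctld and the slurmd daemons need to pick up the changed configuration before it takes effect.
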
Quoting Moe Jette <[email protected]>:

> Configure SlurmdTimeout sufficiently large and you should be fine
> _except_ when the node running a batch script reboots that job will be
> killed.
>
> Quoting "Jeff Squyres (jsquyres)" <[email protected]>:
>
>> Is there a mode in SLURM where I can make it ok to reboot nodes
>> during a job?
>>
>> Specifically, we want to use SLURM to manage a QA cluster here in
>> Cisco. Some of the things that we need QA jobs to do is actually
>> reboot nodes -- but we don't want the SLURM job to end because the
>> job rebooted; the reboot was part of the job. We want the node to
>> reboot and have SLURM say "oh, ok, you're back -- you can re-join
>> the job now."
>>
>> Is that possible?
>>
>> --
>> Jeff Squyres
>> [email protected]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
