Cool -- we'll try this. Thank you!
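
For reference, the two settings suggested below might look like this in slurm.conf. This is only a sketch: the 1800-second timeout is an assumed value, not one given in the thread, and it should be longer than the slowest reboot expected on the cluster.

```
# slurm.conf fragment (sketch; the timeout value is an assumption)

# Seconds slurmctld waits for a slurmd to respond before marking the node DOWN.
# Must exceed the longest reboot time of any node a job may reboot.
SlurmdTimeout=1800

# Suggested in the thread to avoid having the node marked DOWN
# due to an unexpected reboot.
FastSchedule=2
```

Per the advice below, this does not cover the node running the batch script itself: if that node reboots, the job is still killed.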
On Feb 6, 2013, at 1:36 PM, Moe Jette <[email protected]> wrote:

>
> One more thing. Configure FastSchedule=2 to avoid having the node
> marked DOWN due to an unexpected reboot.
>
> Quoting Moe Jette <[email protected]>:
>
>>
>> Configure SlurmdTimeout sufficiently large and you should be fine
>> _except_ when the node running a batch script reboots that job will be
>> killed.
>>
>> Quoting "Jeff Squyres (jsquyres)" <[email protected]>:
>>
>>>
>>> Is there a mode in SLURM where I can make it ok to reboot nodes
>>> during a job?
>>>
>>> Specifically, we want to use SLURM to manage a QA cluster here in
>>> Cisco. Some of the things that we need QA jobs to do is actually
>>> reboot nodes -- but we don't want the SLURM job to end because the
>>> job rebooted; the reboot was part of the job. We want the node to
>>> reboot and have SLURM say "oh, ok, you're back -- you can re-join
>>> the job now."
>>>
>>> Is that possible?
>>>
>>> --
>>> Jeff Squyres
>>> [email protected]
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>
>>
>

--
Jeff Squyres
[email protected]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
