One more thing. Configure FastSchedule=2 to avoid having the node  
marked DOWN due to an unexpected reboot.
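
For illustration, the two settings mentioned in this thread would look
something like this in slurm.conf. This is only a sketch: the timeout
value is an assumption, so pick something comfortably longer than the
slowest reboot you expect on your nodes.

```
# slurm.conf fragment (sketch; the value 1800 is an assumed example)

# Seconds slurmctld waits for slurmd to respond before setting the
# node DOWN. Should exceed the longest expected reboot time.
SlurmdTimeout=1800

# Suggested in this thread to avoid the node being marked DOWN
# after an unexpected reboot.
FastSchedule=2
```

After editing slurm.conf, the daemons need to pick up the change (for
example with "scontrol reconfigure" or a restart).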

Quoting Moe Jette <[email protected]>:

>
> Configure SlurmdTimeout sufficiently large and you should be fine,
> _except_ that when the node running a batch script reboots, that job
> will be killed.
>
> Quoting "Jeff Squyres (jsquyres)" <[email protected]>:
>
>>
>> Is there a mode in SLURM where I can make it ok to reboot nodes  
>> during a job?
>>
>> Specifically, we want to use SLURM to manage a QA cluster here in
>> Cisco.  One of the things that we need QA jobs to do is actually
>> reboot nodes -- but we don't want the SLURM job to end because a
>> node rebooted; the reboot was part of the job.  We want the node to
>> reboot and have SLURM say "oh, ok, you're back -- you can re-join
>> the job now."
>>
>> Is that possible?
>>
>> --
>> Jeff Squyres
>> [email protected]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>
>
