Hi,
    Have a look at the ReturnToService parameter in 'man slurm.conf'.
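
As a minimal sketch, the relevant line in slurm.conf might look like the
following. The value 2 is only my assumption about what fits the setup you
describe; check the man page for your Slurm version before relying on it.

    # slurm.conf (fragment)
    # 0: a DOWN node stays DOWN until an administrator resumes it
    # 1: a DOWN node returns to service only if it was set DOWN
    #    for being non-responsive
    # 2: a DOWN node returns to service once it registers with a
    #    valid configuration
    ReturnToService=2

The change should take effect after 'scontrol reconfigure' or a restart of
slurmctld. Note that a node registering with fewer CPUs than slurm.conf
declares is not a valid configuration, so that case would still need the
node definition corrected.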

/David


On Fri, Mar 1, 2013 at 9:12 PM, Tim <[email protected]> wrote:

>  Hi folks,
>
> I'm having some trouble with some of my slurm nodes in that the slightest
> thing seems to make them go into a down state: low CPU count, node
> unexpectedly restarted, etc.
>
> I was wondering, is there a way to just turn this off?
>
> Let's say that I have my farm designed in such a way to tolerate failures.
> Node fails for whatever reason, fine, stand up another node, move on. Or
> fine, resume the software and away you go.
>
> Is there a way to configure slurm to just tolerate these errors and resume
> the node so it can keep churning away? Or do I have to manually do a
>
>     scontrol update Node=my_node State=RESUME
>
> when it fails?
>
> Thanks in advance for any insight!
>
> -Tim
>
