This is controlled by the slurm.conf directive 'ReturnToService'. From https://computing.llnl.gov/linux/slurm/slurm.conf.html
*ReturnToService*
Controls when a DOWN node will be returned to service. The default value is 0. Supported values include:

  *0*  A node will remain in the DOWN state until a system administrator explicitly changes its state (even if the slurmd daemon registers and resumes communications).

  *1*  A DOWN node will become available for use upon registration with a valid configuration only if it was set DOWN due to being non-responsive. If the node was set DOWN for any other reason (low memory, prolog failure, epilog failure, unexpected reboot, etc.), its state will not automatically be changed.

  *2*  A DOWN node will become available for use upon registration with a valid configuration. The node could have been set DOWN for any reason. (Disabled on Cray systems.)

Brian

On Tue, Sep 9, 2014 at 9:11 AM, Brian B <[email protected]> wrote:
> Greetings,
>
> Is it possible to have slurm compute nodes added back into a partition
> after they perform an unscheduled reboot? That is, we have some machines
> that fail on brown outs (we are working on solving that problem) and will
> reboot when this occurs. They come back fine but slurm doesn’t add them
> back into the partition. I am able to do so using scontrol by updating their
> state to IDLE. Is this able to be automated?
>
> Regards,
> Brian
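
For the unscheduled-reboot case in the question, the documentation text quoted in this reply suggests that a value of 1 would not be enough, since "unexpected reboot" is listed among the reasons that are not cleared automatically. A value of 2 looks like the relevant setting. A minimal sketch of the slurm.conf line, assuming a non-Cray system and that returning nodes to service regardless of the DOWN reason is acceptable at your site:

```
# slurm.conf (fragment)
# 2 = a DOWN node returns to service once it registers with a valid
# configuration, whatever the reason it was set DOWN.
ReturnToService=2
```

The daemons have to reread the configuration before the change applies, for example with `scontrol reconfigure`; whether that is sufficient may depend on the Slurm version, so check it against the documentation for your installation.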
