Dear all,

Last night, all our compute nodes rebooted unexpectedly at the exact same time (00:00:00). Since this was not scheduled by us, we're trying to find out who or what orchestrated this mass reboot. Unfortunately, there is nothing in any of the system or SLURM controller logs.

It is almost certainly related to the maintenance reservation (for all nodes) that ended at that same time. However, as far as I know, a reboot is not standard procedure when a maintenance reservation ends, or is it?

Also, we didn't run any 'scontrol reboot_nodes' during the maintenance reservation or before. (Or did we? Are these logged somewhere?)

During the maintenance, the primary controller aborted due to a network disconnect, which we missed because the backup controller happily took over. Could the reboots have resulted from this? If so, is there a way to prevent them?

Are there any other likely causes that we've missed?



Robbert Eggermont
Intelligent Systems
Electr.Eng., Mathematics & Comp.Science
Delft University of Technology
+31 15 27 83234
