Last night, all our compute nodes rebooted unexpectedly at the exact
same time (00:00:00). Since this was not scheduled by us, we're trying
to find out who/what orchestrated this mass reboot. Unfortunately, there
is nothing about it in any of the system or SLURM controller logs.
It is almost certainly related to the maintenance reservation (for
all nodes) that ended at that same time. However, afaik a reboot is not
standard procedure when a maintenance reservation ends, or is it?
Also, we didn't run any 'scontrol reboot_nodes' during the maintenance
reservation or before. (Or did we? Are these logged somewhere?)
During the maintenance, the primary controller aborted due to a network
disconnect, which we missed because the backup controller happily took
over. Could the reboots have resulted from this? If so, is there a way
to prevent them?
Are there any other likely causes that we've missed?
Robbert Eggermont
Intelligent Systems
Electr.Eng., Mathematics & Comp.Science
Delft University of Technology
r.eggerm...@tudelft.nl
+31 15 27 83234