One way to work around this is to set the node definition(s) in slurm.conf with "State=DOWN". That way, manual intervention will be required when a node is rebooted, allowing the rest of the system to finish coming up.
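As a sketch of that workaround, the node definition in slurm.conf might look like the following. The node name, CPU count, and Gres values here are placeholders, not taken from the thread:

```
# Hypothetical node entry in slurm.conf: State=DOWN keeps the node out
# of service after a reboot until an administrator resumes it manually.
NodeName=gpu01 CPUs=8 Gres=gpu:2 State=DOWN
```

Once the GPU device files are present after a reboot, the node can be returned to service with `scontrol update NodeName=gpu01 State=RESUME`.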
Andy

On 08/29/2014 12:13 PM, Lev Givon wrote:
I recently set up slurm 2.6.5 on a cluster of Ubuntu 14.04.1 systems hosting several NVIDIA GPUs configured as generic resources. I noticed that when the compute nodes are rebooted, they attempt to start slurmd before the device files initialized by the nvidia kernel module appear; that is, the following message shows up in syslog several lines before the GPU kernel driver load messages:

slurmd[1453]: fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory

Is there a recommended way (on Ubuntu, at least) to ensure that slurmd isn't started before the GPU device files appear?