One way to work around this is to set the node definition(s) in slurm.conf with "State=DOWN". That way, manual intervention will be required when a node is rebooted, allowing the rest of the system to finish coming up.
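
As a rough illustration of what that might look like (the node names and GPU count below are made up, and the rest of the node definition is omitted), the slurm.conf entry and the command an administrator would run once the node is fully up could be sketched as:

```
# slurm.conf: hypothetical node definition, nodes are marked DOWN
NodeName=gpu[01-04] Gres=gpu:2 State=DOWN

# Run by hand once the node and its GPU device files are ready
# (gpu01 is a hypothetical node name):
scontrol update NodeName=gpu01 State=RESUME
```

This is only a sketch under those assumptions; check the slurm.conf and scontrol man pages for the release in use before relying on it.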

Andy

On 08/29/2014 12:13 PM, Lev Givon wrote:
I recently set up slurm 2.6.5 on a cluster of Ubuntu 14.04.1 systems hosting
several NVIDIA GPUs set up as generic resources. When the compute nodes are
rebooted, I noticed that they attempt to start slurmd before the device files
initialized by the nvidia kernel module appear, i.e., the following message
appears in syslog some number of lines before the GPU kernel driver load
messages.

slurmd[1453]: fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory

Is there a recommended way (on Ubuntu, at least) to ensure that slurmd isn't
started before any GPU device files appear?
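
One possible approach to the question above, offered as a sketch and not as a recommended Slurm or Ubuntu mechanism, is a small wrapper that polls for the device file named in the error message before slurmd is launched. The function name wait_for_path and the timeout values are hypothetical, and where such a wrapper would be hooked into the boot sequence is left open.

```shell
#!/bin/sh
# Hypothetical helper: block until a path exists, or give up after a timeout.
# Usage: wait_for_path PATH [TIMEOUT_SECONDS]   (default timeout: 60 seconds)
wait_for_path() {
    path=$1
    timeout=${2:-60}
    elapsed=0
    # Poll once per second until the path appears.
    while [ ! -e "$path" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out waiting for $path" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Example use (assumption: slurmd is started from a script you control):
#   wait_for_path /dev/nvidia0 120 && exec slurmd
```

The function returns non-zero on timeout, so the caller can decide whether to start slurmd anyway or leave the node out of service.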
