One way to work around this is to set the node definition(s) in slurm.conf with "State=DOWN". That way, manual intervention will be required when a node is rebooted, allowing the rest of the system to finish coming up.
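As a sketch of that workaround, the node definition in slurm.conf might look like the following. The node name, CPU count, and Gres values here are placeholders, not taken from the thread:

```
# Hypothetical node entry in slurm.conf: State=DOWN keeps the node out
# of service after a reboot until an administrator resumes it manually.
NodeName=gpu01 CPUs=8 Gres=gpu:2 State=DOWN
```

Once the GPU device files are present after a reboot, the node can be returned to service with `scontrol update NodeName=gpu01 State=RESUME`.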
Andy

On 08/29/2014 12:13 PM, Lev Givon wrote:
I recently set up slurm 2.6.5 on a cluster of Ubuntu 14.04.1 systems hosting several NVIDIA GPUs configured as generic resources. I noticed that when the compute nodes are rebooted, they attempt to start slurmd before the device files initialized by the nvidia kernel module appear; that is, the following message shows up in syslog several lines before the GPU kernel driver load messages:

slurmd[1453]: fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory

Is there a recommended way (on Ubuntu, at least) to ensure that slurmd isn't started before the GPU device files appear?