Does your slurmd detect the devices? I see these messages in the slurmd.log on 
my GPU nodes:

[2014-08-08T17:21:37.273] gpu 0 is device number 0
[2014-08-08T17:21:37.273] gpu 1 is device number 1

I am also assuming that the nodes are fully configured to use the GPUs, with 
the /dev/nvidia* device files present:

[root@gpu001 slurm]# ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Aug  4 17:33 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Aug  4 17:33 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Aug  4 17:33 /dev/nvidiactl
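
As far as I know, "gres/gpu count too low" means the node registered fewer 
GPUs with slurmctld than its Gres= entry in slurm.conf declares, so it is 
worth checking that the two files agree. A rough sketch (the node name and 
GPU count here are placeholders; adjust them to your hardware):

# slurm.conf (keep your existing CPU/memory attributes on the NodeName line)
GresTypes=gpu
NodeName=gpu001 Gres=gpu:2

# /etc/slurm/gres.conf on the node: one line per device
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1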

Mike

On Aug 8, 2014, at 12:35 PM, Krishna Teja <[email protected]> wrote:

> I did restart slurmctld and slurmd on the master node, and slurmd on the 
> compute nodes. When I do "scontrol show nodes", the nodes do have a Gres 
> entry, but at the end I get "Reason=gres/gpu count too low". slurmd.log 
> doesn't show any errors (without turning on debugging).
> 
> Any ideas on how to fix that?
> 
> Krishna
> 
> 
> On Fri, Aug 8, 2014 at 2:24 PM, Michael Robbert <[email protected]> wrote:
> Have you restarted slurmd on the nodes? I'm not sure if it is needed, but 
> restarting slurmctld on the master would be a good idea as well. A good check 
> is to look at the output of "scontrol show node compute-0-4"; it should have 
> a Gres= entry. If all else fails, look at slurmd.log on the compute nodes, 
> and maybe try turning up the debugging if that doesn't show enough info.
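> 
> For example (just a sketch; adjust to your setup), you can run slurmd in the 
> foreground on one of the GPU nodes with extra verbosity:
> 
> slurmd -D -vvvv
> 
> and watch whether it reports the gres devices as it starts up.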
> 
> Mike Robbert
> 
> On Aug 8, 2014, at 9:09 AM, Krishna Teja <[email protected]> wrote:
> 
>> OK, I've taken care of that part; I no longer get the "Duplicated 
>> NodeHostName..." errors, but the nodes still aren't configured for GPUs. Am 
>> I missing some step to be done after editing slurm.conf and creating the 
>> gres.conf files?
>> 
>> 
>> On Fri, Aug 8, 2014 at 10:37 AM, Seren Soner <[email protected]> wrote:
>> You probably have redefined compute-0-[4-6] in /etc/slurm/nodenames.conf.
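>> 
>> In other words, the likely conflict looks something like this (a sketch; I 
>> am guessing at your exact lines):
>> 
>> # /etc/slurm/nodenames.conf (generated for the cluster)
>> NodeName=compute-0-[0-6] CPUs=...
>> 
>> # slurm.conf: declaring the same hosts again triggers the
>> # "Duplicated NodeHostName" errors
>> NodeName=compute-0-[4-6] Gres=gpu:2
>> 
>> The fix is to add the Gres= attribute to the existing NodeName lines instead 
>> of defining the nodes a second time.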
>> 
>> 
>> On Fri, Aug 8, 2014 at 5:29 PM, Krishna Teja <[email protected]> wrote:
>> I have been trying to configure SLURM to use the GPUs available in some of 
>> the nodes in our cluster (compute-0-4, compute-0-5 and compute-0-6, to be 
>> precise). I have followed the instructions given on the SLURM website:
>> 
>> http://slurm.schedmd.com/gres.html
>> 
>> But that doesn't seem to work. I still get the same error as if the GPUs 
>> weren't configured:
>> 
>> srun: error: Unable to allocate resources: Requested node configuration is 
>> not available
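>> 
>> (For reference, the kind of request that fails looks like this; this is a 
>> representative example, not necessarily my exact command:
>> 
>> srun --gres=gpu:1 hostname
>> 
>> Any job that asks for a gres/gpu resource is rejected the same way.)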
>> 
>> Furthermore, I ran a simple command to test whether everything is fine with 
>> SLURM, printing the hostnames of all the nodes using
>> 
>> srun -N7 -l /bin/hostname
>> 
>> and I get the following output.
>> 
>> srun: error: Duplicated NodeHostName compute-0-4 in the config file
>> srun: error: Duplicated NodeHostName compute-0-5 in the config file
>> srun: error: Duplicated NodeHostName compute-0-6 in the config file
>> 4: compute-0-4.local
>> 5: compute-0-5.local
>> 6: compute-0-6.local
>> 3: compute-0-3.local
>> 1: compute-0-1.local
>> 0: compute-0-0.local
>> 2: compute-0-2.local
>> 
>> I have attached the slurm.conf file and the gres.conf file. Can someone 
>> please point out what I am doing wrong? Any help is appreciated!
>> 
>> 
>> 
>> 
>> -- 
>> Seren Soner
>>  
>> 
> 
> 
> 
> 
