Does your slurmd detect the devices? I see these messages in the slurmd.log on my GPU nodes:
[2014-08-08T17:21:37.273] gpu 0 is device number 0
[2014-08-08T17:21:37.273] gpu 1 is device number 1

I am also assuming that the nodes are fully configured to use the GPUs, with /dev/nvidia* device files present:

[root@gpu001 slurm]# ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Aug  4 17:33 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Aug  4 17:33 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Aug  4 17:33 /dev/nvidiactl

Mike

On Aug 8, 2014, at 12:35 PM, Krishna Teja <[email protected]> wrote:

> I did restart slurmctld and slurmd on the master node and slurmd on the
> compute nodes. When I do "scontrol show nodes", the nodes do have a Gres
> entry, but at the end I get "Reason=gres/gpu count too low". slurmd.log
> doesn't have any errors (without turning on debugging).
>
> Any ideas on how to fix that?
>
> Krishna
>
>
> On Fri, Aug 8, 2014 at 2:24 PM, Michael Robbert <[email protected]> wrote:
> Have you restarted slurmd on the nodes? I'm not sure if it is needed, but
> restarting slurmctld on the master would be a good idea as well. A good
> check is to look at the output of "scontrol show node compute-0-4"; it
> should have a Gres= entry. If all else fails, look at slurmd.log on the
> compute nodes, and maybe try turning up the debugging if that doesn't
> show enough info.
>
> Mike Robbert
>
> On Aug 8, 2014, at 9:09 AM, Krishna Teja <[email protected]> wrote:
>
>> OK, I've taken care of that part: no more "Duplicated NodeHostName"
>> errors, but the nodes still aren't configured for GPUs. Am I missing
>> some step to be done after editing slurm.conf and creating the
>> gres.conf file?
>>
>>
>> On Fri, Aug 8, 2014 at 10:37 AM, Seren Soner <[email protected]> wrote:
>> You probably have redefined compute-0-[4-6] in /etc/slurm/nodenames.conf.
>>
>>
>> On Fri, Aug 8, 2014 at 5:29 PM, Krishna Teja <[email protected]> wrote:
>> I have been trying to configure SLURM to be able to use the GPUs
>> available in some of the nodes in our cluster (compute-0-4, compute-0-5,
>> and compute-0-6, to be precise). I have followed the instructions given
>> on the SLURM website:
>>
>> http://slurm.schedmd.com/gres.html
>>
>> But that doesn't seem to work. I still get the same error as if the
>> GPUs weren't configured:
>>
>> srun: error: Unable to allocate resources: Requested node configuration
>> is not available
>>
>> Furthermore, I ran a simple command to test whether everything is fine
>> with SLURM, printing the hostnames of all the nodes:
>>
>> srun -N7 -l /bin/hostname
>>
>> and I get the following output:
>>
>> srun: error: Duplicated NodeHostName compute-0-4 in the config file
>> srun: error: Duplicated NodeHostName compute-0-5 in the config file
>> srun: error: Duplicated NodeHostName compute-0-6 in the config file
>> 4: compute-0-4.local
>> 5: compute-0-5.local
>> 6: compute-0-6.local
>> 3: compute-0-3.local
>> 1: compute-0-1.local
>> 0: compute-0-0.local
>> 2: compute-0-2.local
>>
>> I have attached the slurm.conf file and gres.conf file. Can someone
>> please point out what I am doing wrong? Any help appreciated!
>>
>>
>> --
>> Seren Soner
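
As Seren's reply suggests, "Duplicated NodeHostName" errors typically mean the same nodes are defined on two NodeName lines. A sketch of how that can happen, assuming a Rocks-style /etc/slurm/nodenames.conf that is included from slurm.conf (the file names are from this thread; the CPUs value is a placeholder):

```
# /etc/slurm/nodenames.conf (auto-generated, included by slurm.conf)
NodeName=compute-0-[0-6] CPUs=8 State=UNKNOWN

# Adding a second definition for the GPU nodes in slurm.conf
# produces "Duplicated NodeHostName compute-0-4 in the config file":
NodeName=compute-0-[4-6] CPUs=8 Gres=gpu:2 State=UNKNOWN

# Fix: keep exactly one NodeName line per node, e.g. split the
# existing range so the GPU nodes are defined only once:
NodeName=compute-0-[0-3] CPUs=8 State=UNKNOWN
NodeName=compute-0-[4-6] CPUs=8 Gres=gpu:2 State=UNKNOWN
```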
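
For the "Reason=gres/gpu count too low" state, the usual cause is that gres.conf on the node advertises fewer devices than the Gres= count claimed in slurm.conf. A minimal matching pair for a two-GPU node, assuming the /dev/nvidia* device files shown above (the gpu:2 count and node names follow this thread; adjust to the actual hardware):

```
# slurm.conf (identical on the controller and all nodes)
GresTypes=gpu
NodeName=compute-0-[4-6] CPUs=8 Gres=gpu:2 State=UNKNOWN

# /etc/slurm/gres.conf on each GPU node -- one line per device,
# so the lines here must add up to the gpu:2 count above:
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
```

After editing, restart slurmd on the nodes and slurmctld on the master, confirm with "scontrol show node compute-0-4" that Gres=gpu:2 appears and the drain reason is gone, and then test an allocation with something like "srun --gres=gpu:1 -N1 /bin/hostname".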
