Isn't your slurm.conf saying you have 1 GPU when your gres.conf says you
have 2? For reference, here's what I have:
$ cat /etc/slurm/gres.conf
Name=gpu Type=tesla File=/dev/nvidia0
Name=gpu Type=tesla File=/dev/nvidia1
Name=gpu Type=tesla File=/dev/nvidia2
Name=gpu Type=tesla File=/dev/nvidia3
$ grep Gres /etc/slurm/slurm.conf
GresTypes=gpu
NodeName=sn[2-3] State=UNKNOWN Boards=1 SocketsPerBoard=2
CoresPerSocket=12 ThreadsPerCore=1 Weight=10000 Gres=gpu:tesla:4
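A quick sanity check is to compare what slurmctld actually registered
for the node against the File= lines in gres.conf (substitute your own
node name):
$ scontrol show node node21 | grep -i gres
If the Gres count reported there doesn't match the number of devices
listed in gres.conf, I'd start by fixing that.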
--
Jeff White
HPC Systems Engineer
Information Technology Services - WSU
On 06/21/2016 05:44 AM, Tom Deakin wrote:
Hello everyone,
I’m having trouble getting SLURM to select the second GPU on this node.
gres.conf:
NodeName=node21 Name=gpu Type=gtx680 File=/dev/nvidia0
NodeName=node21 Name=gpu Type=gtx580 File=/dev/nvidia1
slurm.conf:
NodeName=node21 Gres=gpu:gtx680:1,gpu:gtx580:1
If I run srun --gres=gpu:gtx580, I get CUDA_VISIBLE_DEVICES=0.
If I instead run srun --gres=gpu:gtx680, I also get CUDA_VISIBLE_DEVICES=0.
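(Since nvidia-smi enumerates every device regardless of
CUDA_VISIBLE_DEVICES, I assume the way to see which card CUDA actually
picks is to run deviceQuery from the CUDA samples under the same srun;
the path below is just a guess for where it lives on this cluster:
$ srun --gres=gpu:gtx580 /usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery | grep 'Device 0'
That should print the model of whichever card ends up as device 0
inside the job.)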
I also get some errors that I don’t understand in the slurmctld log
file when I specify --gres=gpu:gtx580:
[2016-06-21T13:41:27.968] error: gres/gpu: job 228 dealloc node node21 topo gres count underflow (0 1)
[2016-06-21T13:41:27.968] error: gres/gpu: job 228 dealloc node node21 type gtx680 gres count underflow (0 1)
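(If more detail would help, I can temporarily raise the controller's
log level and re-run the job:
$ scontrol setdebug debug2
then turn it back down afterwards with scontrol setdebug info.)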
Can anyone please offer some advice?
Many thanks,
Tom Deakin