Isn't your slurm.conf saying you have 1 GPU when your gres.conf says you have 2? For reference here's what I have:

$ cat /etc/slurm/gres.conf
Name=gpu Type=tesla File=/dev/nvidia0
Name=gpu Type=tesla File=/dev/nvidia1
Name=gpu Type=tesla File=/dev/nvidia2
Name=gpu Type=tesla File=/dev/nvidia3

$ grep Gres /etc/slurm/slurm.conf
GresTypes=gpu
NodeName=sn[2-3] State=UNKNOWN Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 weight=10000 Gres=gpu:tesla:4
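
Also, on the CUDA_VISIBLE_DEVICES=0 result below: I'm guessing at your setup here, but if you have device constraints enabled via cgroups, Slurm renumbers CUDA_VISIBLE_DEVICES relative to the job's allocation, so a single-GPU job sees its card as device 0 even when the card is /dev/nvidia1. That would look something like this (hypothetical cgroup.conf, not taken from your mail):

$ cat /etc/slurm/cgroup.conf
ConstrainDevices=yes

If that's the case, CUDA_VISIBLE_DEVICES=0 by itself doesn't prove the wrong GPU was chosen; running nvidia-smi inside the job would show which card was actually handed out.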

--
Jeff White
HPC Systems Engineer
Information Technology Services - WSU

On 06/21/2016 05:44 AM, Tom Deakin wrote:
Hello everyone,

I’m having trouble getting SLURM to choose the 2nd GPU on this node.

gres.conf:

NodeName=node21 Name=gpu Type=gtx680 File=/dev/nvidia0
NodeName=node21 Name=gpu Type=gtx580 File=/dev/nvidia1

slurm.conf:

NodeName=node21 Gres=gpu:gtx680:1,gpu:gtx580:1

If I run srun --gres=gpu:gtx580 I get CUDA_VISIBLE_DEVICES=0
If I instead run srun --gres=gpu:gtx680 I also get CUDA_VISIBLE_DEVICES=0

I also get some errors in the slurmctld log file when I specify --gres=gpu:gtx580, which I don’t understand:

[2016-06-21T13:41:27.968] error: gres/gpu: job 228 dealloc node node21 topo gres count underflow (0 1)
[2016-06-21T13:41:27.968] error: gres/gpu: job 228 dealloc node node21 type gtx680 gres count underflow (0 1)


Can anyone please offer some advice?

Many thanks,

Tom Deakin
