Hi Tom,

On Tue, Jun 21, 2016 at 5:44 AM, Tom Deakin <tom.dea...@bristol.ac.uk> wrote:
> I’m having trouble getting SLURM to choose the 2nd GPU on this node.
> If I then run srun --gres=gpu:gtx580 I get CUDA_VISIBLE_DEVICES=0
> If I also run srun --gres=gpu:gtx680 I get CUDA_VISIBLE_DEVICES=0

Do you have ConstrainDevices=yes in cgroup.conf? If so, this is perfectly
normal. CUDA_VISIBLE_DEVICES works in a context-dependent fashion: it is
always relative to the set of devices accessible in the current
environment, so the first visible GPU is always device 0. In each of your
jobs only one GPU is accessible, due to the cgroup device constraint, and
Slurm does the right thing by setting CUDA_VISIBLE_DEVICES to 0 inside
that job.

You can make sure of this by running "nvidia-smi -L" in each job: it
prints the unique identifier (UUID) of every GPU visible to the job, so
you can verify that the correct physical GPU is selected in each case
(see the quick sketch after my signature).

It's completely counter-intuitive in multi-tenant environments, but
that's unfortunately the way it works right now. For more background on
this, please see https://bugs.schedmd.com/show_bug.cgi?id=1421

Cheers,
--
Kilian
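
A minimal sketch of that check, reusing the gres names from your own srun
lines above (the GPU names and UUIDs printed will of course depend on
your node and driver):

    # Each job sees only the GPU it was granted; both will report it as
    # "GPU 0", but the UUIDs printed by nvidia-smi -L will differ,
    # confirming that a different physical GPU is bound in each case.
    srun --gres=gpu:gtx580 nvidia-smi -L
    srun --gres=gpu:gtx680 nvidia-smi -L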