Hi Tom,

On Tue, Jun 21, 2016 at 5:44 AM, Tom Deakin <tom.dea...@bristol.ac.uk> wrote:
> I’m having trouble getting SLURM to choose the 2nd GPU on this node.

> If I then run srun --gres=gpu:gtx580 I get CUDA_VISIBLE_DEVICES=0
> If I also run srun --gres=gpu:gtx680 I get CUDA_VISIBLE_DEVICES=0

Do you have ConstrainDevices=yes in cgroup.conf?
If that's the case, it's perfectly normal. CUDA_VISIBLE_DEVICES is
always relative to the set of devices visible in the current
environment, so the first visible GPU is always 0. In each of your
jobs, the cgroup device constraints leave only one GPU accessible,
and Slurm does the right thing by setting CUDA_VISIBLE_DEVICES to 0
for it.
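(For reference, that corresponds to something like the following in
cgroup.conf; this is just a minimal sketch, the exact set of options
will depend on your setup:)

    # cgroup.conf -- minimal sketch, other options omitted
    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainDevices=yes
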
You can confirm this by running "nvidia-smi -L" inside each job: it
prints the unique identifier (UUID) of every GPU visible to the job,
so you can verify that the correct GPU is selected in each case.
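For instance, something along these lines (GPU names and UUIDs below
are just placeholders for illustration):

    $ srun --gres=gpu:gtx580 nvidia-smi -L
    GPU 0: GeForce GTX 580 (UUID: GPU-aaaaaaaa-...)
    $ srun --gres=gpu:gtx680 nvidia-smi -L
    GPU 0: GeForce GTX 680 (UUID: GPU-bbbbbbbb-...)

If the two UUIDs differ, each job is indeed getting its own GPU, even
though CUDA_VISIBLE_DEVICES is 0 in both.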

It's completely counter-intuitive in multi-tenant environments, but
that's unfortunately the way it works right now.
For more background on this, please see
https://bugs.schedmd.com/show_bug.cgi?id=1421

Cheers,
-- 
Kilian
