>
> CUDA_VISIBLE_DEVICES is a CUDA API variable, and CUDA doesn't know
> anything about "SLURM's table" of GPUs

SLURM does set CUDA_VISIBLE_DEVICES:
https://github.com/SchedMD/slurm/blob/master/src/plugins/gres/gpu/gres_gpu.c#L165


> If you are going to set this variable from SLURM, it needs to set relative
> to the device numbering known to CUDA, not SLURM's own device numbering.

Slurm needs a way to assign a specific device (from gres.conf's File= list)
and set CUDA_VISIBLE_DEVICES to match that specific file. Right now, Slurm
does not do this correctly: it numbers the devices starting from 0 and
increments until the maximum number of devices is reached, completely
ignoring what is set in gres.conf's File= list. Is that what you meant?

I couldn't find how to extract which File= entry the job was allocated, so
that I could set CUDA_VISIBLE_DEVICES correctly in gres_gpu.c. My best bet
is gres_bit_alloc, but its bit-field nature makes it complicated...
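
To make the idea concrete, here is a rough standalone sketch of the mapping I
have in mind. It is not Slurm code: a plain uint64_t stands in for
gres_bit_alloc, the File= entries are passed in as an array of strings, and
both function names are made up for illustration. It also assumes that the
trailing number of each /dev/nvidiaN path is the index CUDA expects, which
may not hold on every system.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the trailing decimal number of a device
 * path such as "/dev/nvidia3", or -1 if the path does not end in digits. */
static int device_number_from_path(const char *path)
{
    size_t len = strlen(path);
    size_t i = len;

    while (i > 0 && isdigit((unsigned char)path[i - 1]))
        i--;
    if (i == len)
        return -1;              /* no trailing digits */
    return atoi(path + i);
}

/* Hypothetical helper: build a CUDA_VISIBLE_DEVICES value such as "2,5".
 * alloc_mask is a stand-in for gres_bit_alloc: bit i set means the i-th
 * File= entry was allocated to the job. Returns 0 on success, -1 if a
 * path has no trailing number or the output buffer is too small. */
static int build_cuda_visible_devices(uint64_t alloc_mask,
                                      const char *const *files, int nfiles,
                                      char *out, size_t outlen)
{
    size_t used = 0;

    if (outlen == 0)
        return -1;
    out[0] = '\0';

    for (int i = 0; i < nfiles && i < 64; i++) {
        if (!(alloc_mask & (UINT64_C(1) << i)))
            continue;           /* this File= entry is not allocated */

        int num = device_number_from_path(files[i]);
        if (num < 0)
            return -1;

        int n = snprintf(out + used, outlen - used, "%s%d",
                         used ? "," : "", num);
        if (n < 0 || (size_t)n >= outlen - used)
            return -1;          /* output was truncated */
        used += (size_t)n;
    }
    return 0;
}
```

With File= entries /dev/nvidia2, /dev/nvidia3 and /dev/nvidia5 and bits 0
and 2 set, this produces "2,5" rather than "0,1". The real code would have
to walk the actual bitmap instead of a uint64_t.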
