> > CUDA_VISIBLE_DEVICES is a CUDA API variable, and CUDA doesn't know
> > anything about "SLURM's table" of GPUs

SLURM does set CUDA_VISIBLE_DEVICES:
https://github.com/SchedMD/slurm/blob/master/src/plugins/gres/gpu/gres_gpu.c#L165

> If you are going to set this variable from SLURM, it needs to set relative
> to the device numbering known to CUDA, not SLURM's own device numbering.

Slurm needs a way to assign a specific device (from gres.conf's File= list) and set CUDA_VISIBLE_DEVICES to match that specific file. Right now, Slurm does not do this correctly: it numbers the devices starting from 0 and increments until the maximum number of devices is reached, completely ignoring what is set in gres.conf's File= list. Is that what you meant?

I couldn't find how to extract which File= entry the job was allocated, so that I could set CUDA_VISIBLE_DEVICES correctly in gres_gpu.c. My best bet is gres_bit_alloc, but its bit-field nature makes it complicated...
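
To make the idea concrete, here is a rough, standalone sketch of the mapping I have in mind. It is not Slurm code: the allocation is modelled as a plain 64-bit mask rather than Slurm's own bitmap type, and the helper names dev_num_from_path() and build_visible_devices() are made up for illustration. It also assumes that the trailing number of each File= path (e.g. /dev/nvidia2) is the index CUDA should see, which may not hold on every system.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the trailing decimal number of a device
 * path such as "/dev/nvidia2", or -1 if the path does not end in digits. */
static int dev_num_from_path(const char *path)
{
    size_t len = strlen(path);
    size_t i = len;

    /* Walk backwards over the trailing digits. */
    while (i > 0 && isdigit((unsigned char)path[i - 1]))
        i--;
    if (i == len)
        return -1;              /* no trailing digits */
    return atoi(path + i);
}

/* Hypothetical helper: build a CUDA_VISIBLE_DEVICES value from the File=
 * list and a mask of allocated entries (bit i set => files[i] allocated).
 * Assumption: the mask stands in for whatever the real allocation bitmap
 * is. Returns 0 on success, -1 on a bad path or a too-small buffer. */
static int build_visible_devices(const char *const files[], int nfiles,
                                 uint64_t alloc_mask,
                                 char *out, size_t out_len)
{
    size_t used = 0;

    assert(out != NULL && out_len > 0);
    out[0] = '\0';

    for (int i = 0; i < nfiles && i < 64; i++) {
        if (!(alloc_mask & (UINT64_C(1) << i)))
            continue;           /* entry i not allocated to this job */

        int dev = dev_num_from_path(files[i]);
        if (dev < 0)
            return -1;

        /* Append "N" or ",N" depending on whether this is the first. */
        int n = snprintf(out + used, out_len - used, "%s%d",
                         used ? "," : "", dev);
        if (n < 0 || (size_t)n >= out_len - used)
            return -1;          /* output would be truncated */
        used += (size_t)n;
    }
    return 0;
}
```

With a File= list of /dev/nvidia2, /dev/nvidia3, /dev/nvidia0 and entries 0 and 2 allocated, this yields "2,0" instead of "0,1". The resulting string could then be exported with the POSIX setenv() call, e.g. setenv("CUDA_VISIBLE_DEVICES", buf, 1).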