Hi all, I'm using Slurm to submit jobs on workstations which have 3 GPUs. Two are used for GPGPU and one drives the monitor. Clearly, I don't want jobs submitted to the monitor card.
Thus, my /etc/slurm/gres.conf contains the following:

> Name=gpu File=/dev/nvidia0
> Name=gpu File=/dev/nvidia2

Note that /dev/nvidia1 is not listed: that dev entry is the underpowered card that drives the user's monitor. Unfortunately, it seems Slurm will still submit jobs to /dev/nvidia1 when /dev/nvidia0 is busy (or when a job asks for 2 GPUs).

I've checked the source to see where the decision is taken. In src/plugins/gres/gpu/gres_gpu.c, the function job_set_env() sets CUDA_VISIBLE_DEVICES, which tells CUDA which devices Slurm has chosen for the job. What I have trouble understanding is where the choice of GPU device is actually made. I think this information is encoded in gres_job_ptr->gres_bit_alloc[0], and that job_set_env() itself is fine, but I can't find the logic where that bitmap is set. Could anyone provide a clue? Thanks a lot. Nicolas
