Hi all,

I'm using Slurm to submit jobs on workstations that have 3 GPUs. Two are
used for GPGPU and one drives the monitor. Clearly, I don't want jobs
submitted to the monitor card.

Thus, my /etc/slurm/gres.conf contains the following:

> Name=gpu File=/dev/nvidia0
> Name=gpu File=/dev/nvidia2
>
Note that /dev/nvidia1 is not listed: that dev entry is the
underpowered card that drives the user's monitor.

Unfortunately, it seems Slurm will submit jobs to /dev/nvidia1 when
/dev/nvidia0 is busy (or when a job asks for 2 GPUs).

I've checked the source to see where the decision is made. In
src/plugins/gres/gpu/gres_gpu.c, the function job_set_env() sets
"CUDA_VISIBLE_DEVICES", which tells CUDA which device(s) Slurm has
chosen to run on.

I have trouble understanding where the choice of GPU device is made.
I think this information is encoded in gres_job_ptr->gres_bit_alloc[0] and
that job_set_env() itself is fine, but I can't find the logic where this
bitmap is set.

Could anyone provide a clue?

Thanks a lot.

Nicolas
