(Sorry if this comes twice; the copy I sent yesterday doesn't seem to
have gone through, at least not back to me.)

We're in the process of setting up a few GPU nodes in our cluster, and
want to use Gres to control access to them.

Currently, we have activated one node with 2 GPUs.  The gres.conf file
on that node reads

----------------
## /etc/slurm/gres.conf for c19-1.
## This file is generated by /hpc/sbin/update_slurm_gres.
## Any modifications will be lost on next restart of slurm!

Name=gpu Count=2 File=/dev/nvidia[0-1]
Name=localtmp Count=1800
----------------

(The localtmp gres just counts access to the local tmp disk.)  Nodes without
GPUs have gres.conf files like this:

----------------
## /etc/slurm/gres.conf for c18-1.
## This file is generated by /hpc/sbin/update_slurm_gres.
## Any modifications will be lost on next restart of slurm!

Name=gpu Count=0
Name=localtmp Count=90
----------------

slurm.conf contains the following:

GresTypes=gpu,localtmp
Nodename=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62976 Gres=localtmp:90 State=unknown
[...]
Nodename=c19-[1-16] NodeHostname=compute-19-[1-16] Weight=15848 CoresPerSocket=4 Gres=localtmp:1800,gpu:2 Feature=rack19,intel,ib


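In case it helps, this is one way to double-check what the controller has
registered for the GPU node (just a generic scontrol query, not part of our
setup scripts):

----------------
# Show the gres slurmctld has registered for the GPU node:
scontrol show node c19-1 | grep -i gres
----------------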
Submitting a job with sbatch --gres=gpu:1 ... sets CUDA_VISIBLE_DEVICES for
the job.  However, the values seem a bit strange (a minimal test job along
the lines of the sketch after the list is enough to see this):

- If we submit one job with --gres=gpu:1, CUDA_VISIBLE_DEVICES gets the value 0.

- If we submit two jobs with --gres=gpu:1 at the same time,
  CUDA_VISIBLE_DEVICES gets the value 0 for one job, and 1633906540 for
  the other.

- If we submit one job with --gres=gpu:2, CUDA_VISIBLE_DEVICES gets the
  value 0,1633906540.
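
For reference, a minimal test job along these lines is enough to show the
values above (a simplified sketch; the nvidia-smi call is just for
illustration and not part of our setup):

----------------
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00

# Print what slurm hands the job; with two GPUs we would expect 0 or 1.
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

# Optional: list the devices the job can actually see.
nvidia-smi -L
----------------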

Is this correct?  Are we doing something wrong?

(This is slurm 2.4.3, running on Rocks 6.0 based on CentOS 6.2.)


-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
