[slurm-dev] Gres GPU Problem with new slurm cluster

Jagga Soorma Sat, 29 Mar 2014 12:01:28 -0700

Hi Everyone,

I am switching over from Torque to Slurm on a new cluster with GPU
resources.  I have installed the latest stable release, 14.03.0-1.
Each node has 2 NVIDIA GPUs:


--
amber203:/etc/slurm # ls -l /dev/nvidia*
crw-rw-rw- 1 root video 195,   0 Mar 29 11:46 /dev/nvidia0
crw-rw-rw- 1 root video 195,   1 Mar 29 11:46 /dev/nvidia1
crw-rw-rw- 1 root video 195, 255 Mar 29 11:46 /dev/nvidiactl

amber203:/etc/slurm # nvidia-smi | grep Tesla
|   0  Tesla K20Xm         Off  | 0000:08:00.0     Off |                    0 |
|   1  Tesla K20Xm         Off  | 0000:27:00.0     Off |                    0 |
--

I have also updated the slurm.conf and gres.conf files across the
cluster with the following:

--
amber203:/etc/slurm # grep -i gpu /etc/slurm/slurm.conf
GresTypes=gpu
NodeName=amber[201-240] CPUs=20 RealMemory=32074 Sockets=2 CoresPerSocket=10 Gres=gpu:2 State=UNKNOWN
PartitionName=ambergpuprod Nodes=amber[201-240] Default=YES MaxTime=INFINITE State=UP

amber203:/etc/slurm # cat /etc/slurm/gres.conf
NodeName=amber[201-240] Name=gpu File=/dev/nvidia[0-1]
--
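
As a sanity check on the two files above, here is a minimal sketch that
compares the GPU count advertised in slurm.conf (Gres=gpu:N) with the
number of devices covered by a bracketed File= range in gres.conf.  The
function names (conf_gpu_count, gres_file_count, check_gres) are
hypothetical helpers, not part of Slurm, and the sketch assumes the
single-line File=/dev/nvidia[A-B] form shown above.  It only confirms
that the two files agree with each other; it does not explain what
slurmd itself detects on a node.

```shell
#!/bin/sh
# Hypothetical helpers (not Slurm tools): check that slurm.conf and
# gres.conf agree on the number of GPUs per node.

# Print N from the first "Gres=gpu:N" found in the given slurm.conf.
conf_gpu_count() {
    sed -n 's|.*Gres=gpu:\([0-9][0-9]*\).*|\1|p' "$1" | head -n 1
}

# Print how many devices the first "File=/dev/nvidia[A-B]" range covers
# in the given gres.conf. Prints nothing if no such range is found.
gres_file_count() {
    sed -n 's|.*File=/dev/nvidia\[\([0-9][0-9]*\)-\([0-9][0-9]*\)].*|\1 \2|p' "$1" |
        head -n 1 |
        if read lo hi; then
            echo $((hi - lo + 1))
        fi
}

# Compare both counts; returns non-zero on a mismatch or missing value.
check_gres() {
    want=$(conf_gpu_count "$1")
    have=$(gres_file_count "$2")
    if [ -n "$want" ] && [ "$want" = "$have" ]; then
        echo "ok: $want GPUs configured in both files"
    else
        echo "mismatch: slurm.conf says ${want:-?}, gres.conf lists ${have:-?}"
        return 1
    fi
}
```

After sourcing these functions, running
check_gres /etc/slurm/slurm.conf /etc/slurm/gres.conf on a node would
report whether the two counts match.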

However, after restarting all slurm services I am still getting the
following "gres/gpu count too low" message when running sinfo:

--

amber203:/etc/slurm # sinfo -lNe
Sat Mar 29 11:57:40 2014
NODELIST                            NODES     PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
amber201                                1 ambergpuprod*        idle   20   2:10:1  32074        0      1   (null) none
amber[202,210,222,224-226,228-240]     19 ambergpuprod*       down*   20   2:10:1  32074        0      1   (null) Not responding
amber203                                1 ambergpuprod*    drained*   20   2:10:1  32074        0      1   (null) gres/gpu count too l
amber[204-209,211-221,223,227]         19 ambergpuprod*     drained   20   2:10:1  32074        0      1   (null) gres/gpu count too l
--

What am I missing here, and how can I get more information about why
sinfo is reporting that the gpu count is too low?  I have also tried the
following format in the gres.conf file without any luck:

--
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
--

Any help would be greatly appreciated!

Thanks,
-J
