[slurm-dev] Re: Gres GPU Problem with new slurm cluster

David Bigagli Mon, 31 Mar 2014 09:46:32 -0700


You may want to look at the ReturnToService parameter in slurm.conf:


http://slurm.schedmd.com/slurm.conf.html
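
As a rough illustration of the parameter mentioned above: per the slurm.conf documentation, ReturnToService defaults to 0 (a DOWN node stays down until an administrator changes its state), while a value of 2 lets a DOWN node return to service once it registers with a valid configuration. Whether this also clears nodes that were drained with "gres/gpu count too low" on 14.03 is an assumption worth testing on one node first. The helper name `return_to_service` and the default path are hypothetical; the sketch only reads the file and changes nothing.

```shell
# Hypothetical helper: print the ReturnToService value set in a slurm.conf-style
# file, or 0 (the documented default) when the parameter is absent.
return_to_service() {
    conf="${1:-/etc/slurm/slurm.conf}"
    # Match the keyword case-insensitively; this helper reports the last occurrence.
    value=$(grep -i '^[[:space:]]*ReturnToService[[:space:]]*=' "$conf" \
        | tail -n 1 \
        | sed 's/^[^=]*=[[:space:]]*//; s/[[:space:]#].*$//')
    echo "${value:-0}"
}

# Example slurm.conf line to let DOWN nodes come back after a valid registration:
#   ReturnToService=2
```

Calling `return_to_service /etc/slurm/slurm.conf` on the controller shows what is currently configured; after editing slurm.conf, the daemons need to re-read it (for example with `scontrol reconfigure`).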

On 03/29/2014 01:11 PM, Jagga Soorma wrote:


Okay, so it looks like I just had to clear the nodes manually using
scontrol after updating gres.conf on each node.  Is there a way to have
slurm do this automatically, without manual intervention through the
scontrol command or sview?
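
For readers landing on this thread, the manual step described above can be sketched as follows. It assumes the nodes were cleared with `scontrol update ... State=RESUME`, which is the usual way to return a drained or down node to service; the exact command used is not stated in the message. The helper name `resume_nodes` and the `RUN` switch are hypothetical, added so the command can be previewed before it is executed.

```shell
# Hypothetical helper: return drained/down nodes to service once gres.conf has
# been corrected and slurmd restarted on them.
# Prints the command by default; set RUN=1 to actually invoke scontrol.
resume_nodes() {
    nodes="$1"
    if [ "${RUN:-0}" = "1" ]; then
        # Quote the node list so the shell does not treat [..] as a glob.
        scontrol update "NodeName=$nodes" State=RESUME
    else
        echo "scontrol update NodeName=$nodes State=RESUME"
    fi
}

# Dry run for the drained nodes listed in the sinfo output of this thread:
resume_nodes 'amber[204-209,211-221,223,227]'
```

Running `RUN=1 resume_nodes 'amber[204-209,211-221,223,227]'` would issue the real command; it requires Slurm administrator privileges.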

Thanks,
-J

On Sat, Mar 29, 2014 at 11:59 AM, Jagga Soorma <[email protected]> wrote:

Hi Everyone,

I am switching over from torque to slurm on a new cluster with GPU
resources.  I have installed the latest stable release, 14.03.0-1.  I
have 2 NVIDIA GPUs on each node:

--
amber203:/etc/slurm # ls -l /dev/nvidia*
crw-rw-rw- 1 root video 195,   0 Mar 29 11:46 /dev/nvidia0
crw-rw-rw- 1 root video 195,   1 Mar 29 11:46 /dev/nvidia1
crw-rw-rw- 1 root video 195, 255 Mar 29 11:46 /dev/nvidiactl

amber203:/etc/slurm # nvidia-smi | grep Tesla
|   0  Tesla K20Xm         Off  | 0000:08:00.0     Off |                    0 |
|   1  Tesla K20Xm         Off  | 0000:27:00.0     Off |                    0 |
--

I have also updated the slurm.conf and gres.conf files across the
cluster with the following:

--
amber203:/etc/slurm # grep -i gpu /etc/slurm/slurm.conf
GresTypes=gpu
NodeName=amber[201-240] CPUs=20 RealMemory=32074 Sockets=2 CoresPerSocket=10 Gres=gpu:2 State=UNKNOWN
PartitionName=ambergpuprod Nodes=amber[201-240] Default=YES MaxTime=INFINITE State=UP

amber203:/etc/slurm # cat /etc/slurm/gres.conf
NodeName=amber[201-240] Name=gpu File=/dev/nvidia[0-1]
--
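
A quick consistency check for the configuration above can be sketched as follows: compare the GPU count declared in slurm.conf with the number of numbered NVIDIA device files on the node. This is only a sanity check under stated assumptions. It assumes the plain `Gres=gpu:N` form shown in this thread (a typed form such as `gpu:name:N` is not parsed), and it does not inspect what slurmd actually reports to the controller, which is what the "count too low" reason is based on. Both helper names are hypothetical.

```shell
# Hypothetical helper: count numbered NVIDIA device files (nvidia0, nvidia1, ...)
# in a directory; nvidiactl and other non-numbered entries are not matched.
count_gpu_devices() {
    dev_dir="${1:-/dev}"
    count=0
    for f in "$dev_dir"/nvidia[0-9]*; do
        # The glob stays literal when nothing matches, so check for existence.
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Hypothetical helper: extract N from the first "Gres=gpu:N" in a
# slurm.conf-style file (prints an empty line if the count is not in that form).
configured_gpu_count() {
    conf="${1:-/etc/slurm/slurm.conf}"
    grep -io 'Gres=gpu:[0-9]*' "$conf" | head -n 1 | sed 's/.*://'
}
```

On a node, `count_gpu_devices /dev` and `configured_gpu_count /etc/slurm/slurm.conf` should print the same number. If they do and the node is still drained, `scontrol show node <nodename>` shows the Gres value and the full, untruncated Reason string recorded by the controller.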

However, after restarting all slurm services I am still getting the
following "gres/gpu count too low" message when running sinfo:

--

amber203:/etc/slurm # sinfo -lNe
Sat Mar 29 11:57:40 2014
NODELIST                            NODES PARTITION     STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT FEATURES REASON
amber201                            1     ambergpuprod* idle     20   2:10:1 32074  0        1      (null)   none
amber[202,210,222,224-226,228-240]  19    ambergpuprod* down*    20   2:10:1 32074  0        1      (null)   Not responding
amber203                            1     ambergpuprod* drained* 20   2:10:1 32074  0        1      (null)   gres/gpu count too l
amber[204-209,211-221,223,227]      19    ambergpuprod* drained  20   2:10:1 32074  0        1      (null)   gres/gpu count too l
--

What am I missing here, and how can I get more information about why
sinfo is reporting that the gpu count is too low?  I also tried the
following format in the gres.conf file, without any luck:

--
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
--

Any help would be greatly appreciated!

Thanks,
-J


--

Thanks,
      /David/Bigagli

www.schedmd.com
