[slurm-dev] Re: Gres GPU Problem with new slurm cluster

Jagga Soorma Sat, 29 Mar 2014 13:11:25 -0700

Okay, so it looks like I just had to clear the node manually using
scontrol after updating gres.conf on each node.  Is there a way to
have Slurm do this automatically, without manual intervention through
the scontrol command or sview?
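
For reference, the manual step can be sketched roughly as below. This is
only a sketch under assumptions: the node list is copied from the drained
entries in the sinfo output quoted further down, and the NODES and DRY_RUN
variables and the resume_cmd helper are hypothetical names used for
illustration. It prints the command by default rather than running it.

```shell
#!/bin/sh
# Sketch: return drained nodes to service after gres.conf has been fixed.
# NODES is an assumed example; use the nodes sinfo actually reports as drained.
NODES='amber[203-209,211-221,223,227]'

# Hypothetical helper: build the scontrol command for a node list expression.
resume_cmd() {
    printf 'scontrol update NodeName=%s State=RESUME\n' "$1"
}

# Dry run by default: show the command instead of executing it.
if [ "${DRY_RUN:-1}" = "1" ]; then
    resume_cmd "$NODES"
else
    scontrol update NodeName="$NODES" State=RESUME
fi
```

Running it with DRY_RUN=0 on a host with admin rights would issue the
update for real; checking sinfo afterwards shows whether the nodes left
the drained state.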


Thanks,
-J

On Sat, Mar 29, 2014 at 11:59 AM, Jagga Soorma <[email protected]> wrote:
> Hi Everyone,
>
> I am switching over from Torque to Slurm on a new cluster with GPU
> resources.  I have installed the latest stable release, 14.03.0-1.  I
> have 2 NVIDIA GPUs on each node:
>
> --
> amber203:/etc/slurm # ls -l /dev/nvidia*
> crw-rw-rw- 1 root video 195,   0 Mar 29 11:46 /dev/nvidia0
> crw-rw-rw- 1 root video 195,   1 Mar 29 11:46 /dev/nvidia1
> crw-rw-rw- 1 root video 195, 255 Mar 29 11:46 /dev/nvidiactl
>
> amber203:/etc/slurm # nvidia-smi | grep Tesla
> |   0  Tesla K20Xm         Off  | 0000:08:00.0     Off |                    0 |
> |   1  Tesla K20Xm         Off  | 0000:27:00.0     Off |                    0 |
> --
>
> I have also updated the slurm.conf and gres.conf files across the
> cluster with the following:
>
> --
> amber203:/etc/slurm # grep -i gpu /etc/slurm/slurm.conf
> GresTypes=gpu
> NodeName=amber[201-240] CPUs=20 RealMemory=32074 Sockets=2 CoresPerSocket=10 Gres=gpu:2 State=UNKNOWN
> PartitionName=ambergpuprod Nodes=amber[201-240] Default=YES MaxTime=INFINITE State=UP
>
> amber203:/etc/slurm # cat /etc/slurm/gres.conf
> NodeName=amber[201-240] Name=gpu File=/dev/nvidia[0-1]
> --
>
> However, after restarting all slurm services I am still getting the
> following "gres/gpu count too low" message when running sinfo:
>
> --
>
> amber203:/etc/slurm # sinfo -lNe
> Sat Mar 29 11:57:40 2014
> NODELIST                            NODES     PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
> amber201                                1 ambergpuprod*        idle   20   2:10:1  32074        0      1   (null) none
> amber[202,210,222,224-226,228-240]     19 ambergpuprod*       down*   20   2:10:1  32074        0      1   (null) Not responding
> amber203                                1 ambergpuprod*    drained*   20   2:10:1  32074        0      1   (null) gres/gpu count too l
> amber[204-209,211-221,223,227]         19 ambergpuprod*     drained   20   2:10:1  32074        0      1   (null) gres/gpu count too l
> --
>
> What am I missing here, or how can I get more information about why
> sinfo is reporting that the gpu count is too low?  I also tried the
> following format in the gres.conf file without any luck:
>
> --
> Name=gpu File=/dev/nvidia0
> Name=gpu File=/dev/nvidia1
> --
>
> Any help would be greatly appreciated!
>
> Thanks,
> -J
