Okay, so it looks like I just had to clear the node state manually with scontrol after updating gres.conf on each node. Is there a way to have Slurm do this automatically, without manual intervention through the scontrol command or sview?
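
A minimal sketch of that manual step, assuming the drained nodes only need their state reset once gres.conf is correct. resume_cmd is a hypothetical helper, not part of Slurm: it only prints the scontrol invocation so it can be reviewed before being run. The node list is copied from the sinfo output quoted below.

```shell
#!/bin/sh
# Hypothetical helper: build (but do not run) the scontrol command
# that returns drained nodes to service.
resume_cmd() {
    # $1: Slurm hostlist expression, e.g. 'amber[204-209]'
    # The node list is single-quoted in the output so the shell does not
    # treat the brackets as a glob pattern when the command is pasted.
    printf "scontrol update NodeName='%s' State=RESUME\n" "$1"
}

# Print the command for the nodes drained with "gres/gpu count too l".
resume_cmd 'amber[204-209,211-221,223,227]'
```

Once the printed command looks right, it can be run as root on the controller, for example by piping the output to sh.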

Thanks,
-J

On Sat, Mar 29, 2014 at 11:59 AM, Jagga Soorma <[email protected]> wrote:
> Hi Everyone,
>
> I am switching over from torque to slurm on a new cluster with gpu
> resources.  I have installed the latest stable release 14.03.0-1.  I
> have 2 nvidia GPUs on each node:
>
> --
> amber203:/etc/slurm # ls -l /dev/nvidia*
> crw-rw-rw- 1 root video 195,   0 Mar 29 11:46 /dev/nvidia0
> crw-rw-rw- 1 root video 195,   1 Mar 29 11:46 /dev/nvidia1
> crw-rw-rw- 1 root video 195, 255 Mar 29 11:46 /dev/nvidiactl
>
> amber203:/etc/slurm # nvidia-smi | grep Tesla
> | 0 Tesla K20Xm Off | 0000:08:00.0 Off | 0 |
> | 1 Tesla K20Xm Off | 0000:27:00.0 Off | 0 |
> --
>
> I have also updated the slurm.conf and gres.conf files across the
> cluster with the following:
>
> --
> amber203:/etc/slurm # grep -i gpu /etc/slurm/slurm.conf
> GresTypes=gpu
> NodeName=amber[201-240] CPUs=20 RealMemory=32074 Sockets=2 CoresPerSocket=10 Gres=gpu:2 State=UNKNOWN
> PartitionName=ambergpuprod Nodes=amber[201-240] Default=YES MaxTime=INFINITE State=UP
>
> amber203:/etc/slurm # cat /etc/slurm/gres.conf
> NodeName=amber[201-240] Name=gpu File=/dev/nvidia[0-1]
> --
>
> However, after restarting all slurm services I am still getting the
> following "gres/gpu count too low" message when running sinfo:
>
> --
> amber203:/etc/slurm # sinfo -lNe
> Sat Mar 29 11:57:40 2014
> NODELIST                            NODES PARTITION     STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT FEATURES REASON
> amber201                            1     ambergpuprod* idle     20   2:10:1 32074  0        1      (null)   none
> amber[202,210,222,224-226,228-240]  19    ambergpuprod* down*    20   2:10:1 32074  0        1      (null)   Not responding
> amber203                            1     ambergpuprod* drained* 20   2:10:1 32074  0        1      (null)   gres/gpu count too l
> amber[204-209,211-221,223,227]      19    ambergpuprod* drained  20   2:10:1 32074  0        1      (null)   gres/gpu count too l
> --
>
> What am I missing here, or how can I get more information about why
> sinfo is reporting that the gpu count is too low?  I also tried the
> following format in the gres.conf file without any luck:
>
> --
> Name=gpu File=/dev/nvidia0
> Name=gpu File=/dev/nvidia1
> --
>
> Any help would be greatly appreciated!
>
> Thanks,
> -J
