Hi everyone,

I am switching over from Torque to Slurm on a new cluster with GPU resources. I have installed the latest stable release, 14.03.0-1. Each node has two NVIDIA GPUs:
--
amber203:/etc/slurm # ls -l /dev/nvidia*
crw-rw-rw- 1 root video 195, 0 Mar 29 11:46 /dev/nvidia0
crw-rw-rw- 1 root video 195, 1 Mar 29 11:46 /dev/nvidia1
crw-rw-rw- 1 root video 195, 255 Mar 29 11:46 /dev/nvidiactl
amber203:/etc/slurm # nvidia-smi | grep Tesla
| 0 Tesla K20Xm Off | 0000:08:00.0 Off | 0 |
| 1 Tesla K20Xm Off | 0000:27:00.0 Off | 0 |
--

I have also updated the slurm.conf and gres.conf files across the cluster with the following:

--
amber203:/etc/slurm # grep -i gpu /etc/slurm/slurm.conf
GresTypes=gpu
NodeName=amber[201-240] CPUs=20 RealMemory=32074 Sockets=2 CoresPerSocket=10 Gres=gpu:2 State=UNKNOWN
PartitionName=ambergpuprod Nodes=amber[201-240] Default=YES MaxTime=INFINITE State=UP
amber203:/etc/slurm # cat /etc/slurm/gres.conf
NodeName=amber[201-240] Name=gpu File=/dev/nvidia[0-1]
--

However, after restarting all Slurm services, I am still getting the following "gres/gpu count too low" message when running sinfo:

--
amber203:/etc/slurm # sinfo -lNe
Sat Mar 29 11:57:40 2014
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
amber201 1 ambergpuprod* idle 20 2:10:1 32074 0 1 (null) none
amber[202,210,222,224-226,228-240] 19 ambergpuprod* down* 20 2:10:1 32074 0 1 (null) Not responding
amber203 1 ambergpuprod* drained* 20 2:10:1 32074 0 1 (null) gres/gpu count too l
amber[204-209,211-221,223,227] 19 ambergpuprod* drained 20 2:10:1 32074 0 1 (null) gres/gpu count too l
--

What am I missing here, and how can I get more information about why sinfo is reporting that the GPU count is too low? I have also tried the following format in the gres.conf file, without any luck:

--
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
--

Any help would be greatly appreciated!

Thanks,
-J
