Dear all,

I configured the GPU nodes in slurm.conf like this:

    ...
    NodeName=nodgpu[01-05] Procs=24 CoresPerSocket=12 RealMemory=128000 Sockets=2 ThreadsPerCore=1 TmpDisk=703488 Gres=gpu:4 Feature=Haswell,Tesla,k40m
    ...
    GresTypes=Haswell,Tesla,Westmere,gpu,k40m

and

    SelectType=select/cons_res
    SelectTypeParameters=CR_Socket_Memory
    ...

The gres.conf file on the five nodes:

    Name=gpu File=/dev/nvidia0 CPUs=0,2,4,6,8,10,12,14,16,18,20,22
    Name=gpu File=/dev/nvidia1 CPUs=1,3,5,7,9,11,13,15,17,19,21,23
    Name=gpu File=/dev/nvidia2 CPUs=0,2,4,6,8,10,12,14,16,18,20,22
    Name=gpu File=/dev/nvidia3 CPUs=1,3,5,7,9,11,13,15,17,19,21,23
    Name=mic Count=0

The cgroup.conf on each node:

    CgroupMountpoint="/sys/fs/cgroup"
    CgroupAutomount=yes
    CgroupReleaseAgentDir="/etc/slurm/cgroup"
    ConstrainRAMSpace=yes
    AllowedRAMSpace=100
    ConstrainCores=yes
    TaskAffinity=no

The Slurm version used is 14.11.11.

When I ask for one node with all of its GPUs, Slurm reports that the requested node configuration is not available:

    salloc -p testgpu -N1 --ntasks-per-node 24 --gres=gpu:4
    salloc: Job allocation 101944 has been revoked.
    salloc: error: Job submit/allocate failed: Requested node configuration is not available

The node configuration as read by Slurm:

    scontrol show node nodgpu
    NodeName=nodgpu Arch=x86_64 CoresPerSocket=12
       CPUAlloc=24 CPUErr=0 CPUTot=24 CPULoad=0.94 Features=Haswell,Tesla,k40m
       Gres=gpu:4
       NodeAddr=nodgpu NodeHostName=nodgpu Version=14.11
       OS=Linux RealMemory=128704 AllocMem=64416 Sockets=2 Boards=1
       State=ALLOCATED ThreadsPerCore=1 TmpDisk=1726637 Weight=1
       BootTime=2016-02-18T16:48:22 SlurmdStartTime=2016-02-18T17:14:56
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

I don't know what the problem is. Any ideas?

Regards
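
For anyone who wants to double-check the gres.conf quoted above, here is a minimal sketch in Python. It is not part of Slurm. It assumes each gres.conf line is a set of whitespace-separated Key=Value pairs and that CPUs is a plain comma-separated list, as in this post (ranges such as 0-11 are not handled). The names parse_gres_conf and summarize_gpus are hypothetical helpers. It only counts the gpu device lines and the CPUs bound to them, to compare against Gres=gpu:4 and Procs=24 in slurm.conf. It does not diagnose the salloc error.

```python
# Sanity check for the gres.conf shown in this post (illustrative only).
GRES_CONF = """\
Name=gpu File=/dev/nvidia0 CPUs=0,2,4,6,8,10,12,14,16,18,20,22
Name=gpu File=/dev/nvidia1 CPUs=1,3,5,7,9,11,13,15,17,19,21,23
Name=gpu File=/dev/nvidia2 CPUs=0,2,4,6,8,10,12,14,16,18,20,22
Name=gpu File=/dev/nvidia3 CPUs=1,3,5,7,9,11,13,15,17,19,21,23
Name=mic Count=0
"""


def parse_gres_conf(text):
    """Turn each non-empty line into a dict of its Key=Value fields."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        entries.append(dict(field.split("=", 1) for field in line.split()))
    return entries


def summarize_gpus(entries):
    """Return (number of gpu device lines, set of CPU ids bound to any gpu)."""
    gpus = [e for e in entries if e.get("Name") == "gpu" and "File" in e]
    covered = set()
    for e in gpus:
        # Assumes a plain comma-separated list; ranges are not expanded.
        covered.update(int(c) for c in e.get("CPUs", "").split(",") if c)
    return len(gpus), covered


if __name__ == "__main__":
    count, covered = summarize_gpus(parse_gres_conf(GRES_CONF))
    print(f"gpu device lines: {count}")
    print(f"CPUs bound to at least one gpu: {len(covered)}")
```

The two printed numbers can be compared by hand with Gres=gpu:4 and Procs=24 from the NodeName line.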
