I forgot to mention that there are GPU nodes available. This is the command I used in my previous mail, run for nodgpu01:

    scontrol show node nodgpu01
    NodeName=nodgpu01 Arch=x86_64 CoresPerSocket=12
       CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.10 Features=Haswell,Tesla,k40m
       Gres=gpu:4
       NodeAddr=sirocco03 NodeHostName=sirocco03 Version=14.11
       OS=Linux RealMemory=128704 AllocMem=0 Sockets=2 Boards=1
       State=IDLE ThreadsPerCore=1 TmpDisk=1726637 Weight=1
       BootTime=2016-02-18T16:48:24 SlurmdStartTime=2016-02-29T10:17:24
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Regards

2016-03-03 11:37 GMT+01:00 Redouane Bouchouirbat <[email protected]>:

> Dear all,
>
> I configured the GPU nodes in slurm.conf like this:
>
>     ...
>     NodeName=nodgpu[01-05] Procs=24 CoresPerSocket=12 RealMemory=128000
>       Sockets=2 ThreadsPerCore=1 TmpDisk=703488 Gres=gpu:4
>       Feature=Haswell,Tesla,k40m
>     ...
>     GresTypes=Haswell,Tesla,Westmere,gpu,k40m
>
> and
>
>     SelectType=select/cons_res
>     SelectTypeParameters=CR_Socket_Memory
>     ...
>
> The gres.conf file on the five nodes:
>
>     Name=gpu File=/dev/nvidia0 CPUs=0,2,4,6,8,10,12,14,16,18,20,22
>     Name=gpu File=/dev/nvidia1 CPUs=1,3,5,7,9,11,13,15,17,19,21,23
>     Name=gpu File=/dev/nvidia2 CPUs=0,2,4,6,8,10,12,14,16,18,20,22
>     Name=gpu File=/dev/nvidia3 CPUs=1,3,5,7,9,11,13,15,17,19,21,23
>     Name=mic Count=0
>
> The cgroup.conf on each node:
>
>     CgroupMountpoint="/sys/fs/cgroup"
>     CgroupAutomount=yes
>     CgroupReleaseAgentDir="/etc/slurm/cgroup"
>     ConstrainRAMSpace=yes
>     AllowedRAMSpace=100
>     ConstrainCores=yes
>     TaskAffinity=no
>
> The Slurm version used is 14.11.11.
>
> When I ask for one node with all of its GPUs, Slurm reports that the
> requested node configuration is not available:
>
>     salloc -p testgpu -N1 --ntasks-per-node 24 --gres=gpu:4
>     salloc: Job allocation 101944 has been revoked.
>     salloc: error: Job submit/allocate failed: Requested node configuration is not available
>
> The node configuration as read by Slurm:
>
>     scontrol show node nodgpu
>     NodeName=nodgpu Arch=x86_64 CoresPerSocket=12
>        CPUAlloc=24 CPUErr=0 CPUTot=24 CPULoad=0.94 Features=Haswell,Tesla,k40m
>        Gres=gpu:4
>        NodeAddr=nodgpu NodeHostName=nodgpu Version=14.11
>        OS=Linux RealMemory=128704 AllocMem=64416 Sockets=2 Boards=1
>        State=ALLOCATED ThreadsPerCore=1 TmpDisk=1726637 Weight=1
>        BootTime=2016-02-18T16:48:22 SlurmdStartTime=2016-02-18T17:14:56
>        CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>        ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> I don't know what the problem is.
> Any ideas?
>
> Regards
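
To make the comparison between the request and the node totals explicit, here is a minimal Python sketch. It assumes the Key=Value layout of the scontrol show node output quoted above, with no whitespace inside values. The helper names parse_scontrol_node, gres_count and request_fits are hypothetical, the sample string is abridged from the nodgpu01 output, and the check only compares totals; it is not how Slurm itself decides whether a configuration is available.

```python
import re


def parse_scontrol_node(text):
    """Parse the Key=Value pairs printed by `scontrol show node` into a dict.

    Assumption: values contain no whitespace, as in the output quoted above.
    """
    return dict(re.findall(r"(\w+)=(\S+)", text))


def gres_count(gres, name):
    """Return the count for `name` in a Gres string such as 'gpu:4'."""
    for item in gres.split(","):
        fields = item.split(":")
        if fields[0] == name:
            # 'gpu:4' carries an explicit count; a bare 'gpu' is treated as one.
            return int(fields[-1]) if fields[-1].isdigit() else 1
    return 0


def request_fits(node, cpus, gpus):
    """Compare a request with the node's free CPUs and configured GPUs."""
    free_cpus = int(node["CPUTot"]) - int(node["CPUAlloc"])
    return free_cpus >= cpus and gres_count(node.get("Gres", ""), "gpu") >= gpus


# Abridged from the nodgpu01 output above.
sample = (
    "NodeName=nodgpu01 Arch=x86_64 CoresPerSocket=12 "
    "CPUAlloc=0 CPUErr=0 CPUTot=24 Features=Haswell,Tesla,k40m "
    "Gres=gpu:4 RealMemory=128704 State=IDLE"
)
node = parse_scontrol_node(sample)
print(request_fits(node, cpus=24, gpus=4))
```

With the nodgpu01 values the check passes for 24 CPUs and 4 GPUs, so the totals reported by scontrol are at least consistent with the salloc request. The sketch does not look at memory, partition settings or the GresTypes line.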
