Dear all,

I configured the GPU nodes in slurm.conf like this:
...
NodeName=nodgpu[01-05] Procs=24 CoresPerSocket=12 RealMemory=128000 Sockets=2 ThreadsPerCore=1 TmpDisk=703488 Gres=gpu:4 Feature=Haswell,Tesla,k40m
...

GresTypes=Haswell,Tesla,Westmere,gpu,k40m

and

SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory
...

The gres.conf file on the five nodes:

Name=gpu File=/dev/nvidia0 CPUs=0,2,4,6,8,10,12,14,16,18,20,22
Name=gpu File=/dev/nvidia1 CPUs=1,3,5,7,9,11,13,15,17,19,21,23
Name=gpu File=/dev/nvidia2 CPUs=0,2,4,6,8,10,12,14,16,18,20,22
Name=gpu File=/dev/nvidia3 CPUs=1,3,5,7,9,11,13,15,17,19,21,23
Name=mic Count=0
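As a sanity check on the file above, here is a minimal Python sketch that counts how many GPU device entries gres.conf declares, so the number can be compared with the Gres=gpu:4 set in slurm.conf. The helper names (parse_line, count_gres_devices) are hypothetical and not part of Slurm, and the parser only handles the simple one-entry-per-line form shown here.

```python
# Sketch: count the device entries per Gres name in a gres.conf text.
# GRES_CONF is copied from the gres.conf shown above; parse_line and
# count_gres_devices are hypothetical helpers, not Slurm tools.

GRES_CONF = """\
Name=gpu File=/dev/nvidia0 CPUs=0,2,4,6,8,10,12,14,16,18,20,22
Name=gpu File=/dev/nvidia1 CPUs=1,3,5,7,9,11,13,15,17,19,21,23
Name=gpu File=/dev/nvidia2 CPUs=0,2,4,6,8,10,12,14,16,18,20,22
Name=gpu File=/dev/nvidia3 CPUs=1,3,5,7,9,11,13,15,17,19,21,23
Name=mic Count=0
"""


def parse_line(line):
    """Split one 'Key=Value Key=Value' line into a dict."""
    return dict(field.split("=", 1) for field in line.split())


def count_gres_devices(text, name):
    """Count entries for the given Gres name that point at a device file."""
    count = 0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        entry = parse_line(line)
        if entry.get("Name") == name and "File" in entry:
            count += 1
    return count


if __name__ == "__main__":
    print("gpu devices declared:", count_gres_devices(GRES_CONF, "gpu"))
```

With the content above this counts four gpu entries, which agrees with Gres=gpu:4 on the NodeName line.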

The cgroup.conf on each node:

CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainRAMSpace=yes
AllowedRAMSpace=100
ConstrainCores=yes
TaskAffinity=no

The slurm version used is 14.11.11.

When I ask for one node with all of its GPUs, Slurm reports that the
requested node configuration is not available.

salloc -p testgpu -N1 --ntasks-per-node 24 --gres=gpu:4

salloc: Job allocation 101944 has been revoked.
salloc: error: Job submit/allocate failed: Requested node configuration is not available

The node configuration as read by Slurm:

scontrol show node nodgpu
NodeName=nodgpu Arch=x86_64 CoresPerSocket=12
   CPUAlloc=24 CPUErr=0 CPUTot=24 CPULoad=0.94 Features=Haswell,Tesla,k40m
   Gres=gpu:4
   NodeAddr=nodgpu NodeHostName=nodgpu Version=14.11
   OS=Linux RealMemory=128704 AllocMem=64416 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=1726637 Weight=1
   BootTime=2016-02-18T16:48:22 SlurmdStartTime=2016-02-18T17:14:56
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
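
To make the comparison with slurm.conf easier, here is a minimal Python sketch that parses the scontrol output above into key/value pairs and checks the reported CPU topology and Gres against the request (24 tasks, gpu:4). The function names are hypothetical and the parsing assumes every token has the plain Key=Value form shown above; it is only a consistency check, not a diagnosis.

```python
# Sketch: parse `scontrol show node` output and compare it with the request.
# SCONTROL_OUTPUT is copied from the output shown above; parse_scontrol and
# cpus_from_topology are hypothetical helpers, not Slurm tools.

SCONTROL_OUTPUT = """\
NodeName=nodgpu Arch=x86_64 CoresPerSocket=12
   CPUAlloc=24 CPUErr=0 CPUTot=24 CPULoad=0.94 Features=Haswell,Tesla,k40m
   Gres=gpu:4
   NodeAddr=nodgpu NodeHostName=nodgpu Version=14.11
   OS=Linux RealMemory=128704 AllocMem=64416 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=1726637 Weight=1
"""


def parse_scontrol(text):
    """Return a dict of the Key=Value tokens found in the output."""
    fields = {}
    for token in text.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields


def cpus_from_topology(node):
    """CPU count implied by Sockets x CoresPerSocket x ThreadsPerCore."""
    return (int(node["Sockets"])
            * int(node["CoresPerSocket"])
            * int(node["ThreadsPerCore"]))


if __name__ == "__main__":
    node = parse_scontrol(SCONTROL_OUTPUT)
    requested_tasks = 24    # from --ntasks-per-node 24
    requested_gres = "gpu:4"  # from --gres=gpu:4

    print("CPUs from topology:", cpus_from_topology(node))
    print("CPUTot reported:   ", node["CPUTot"])
    print("Tasks fit in CPUTot:", requested_tasks <= int(node["CPUTot"]))
    print("Gres matches request:", node["Gres"] == requested_gres)
    print("State / CPUAlloc:", node["State"], node["CPUAlloc"])
```

On the values above, the topology gives 24 CPUs, which equals CPUTot, and the reported Gres equals the requested gpu:4.
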
I don't know what the problem is.
Any ideas?

Regards
