Dear all,
Please consider the following partition configuration:
PartitionName=DEFAULT Nodes=cn-[1-8] MaxTime=3:00:00 State=UP Priority=1 Shared=YES
PartitionName=normal Priority=1 Default=YES
PartitionName=gpu Priority=2
PartitionName=highlong Priority=10 MaxTime=48:00:00 AllowGroups=brmlstaff
Users submit their cpu jobs to partition "normal" and their gpu jobs to
partition "gpu". All nodes are allocated to both partitions, but gpu jobs are
prioritized by giving the corresponding partition a higher priority. Partition
"highlong" is used for special purposes and does not concern this problem.
The nodes cn-[1-4,6] are equipped with a gpu; the nodes cn-[5,7-8] are cpu-only:
NodeName=cn-1 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=32086 TmpDisk=187612 Gres=gpu
NodeName=cn-2 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=32086 TmpDisk=187612 Gres=gpu
NodeName=cn-3 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=32086 TmpDisk=187612 Gres=gpu
NodeName=cn-4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=32086 TmpDisk=187612 Gres=gpu
NodeName=cn-5 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7984 TmpDisk=187611
NodeName=cn-6 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=5928 TmpDisk=187611 Gres=gpu
NodeName=cn-7 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=32398 TmpDisk=187611
NodeName=cn-8 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=15852 TmpDisk=187611
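To make the node inventory easier to compare, here is a minimal sketch in Python that derives the number of physical cores and logical CPUs per node from the definitions above. The helper names (parse_node, summarize) are made up for illustration, and the node lines are abbreviated to the fields the calculation needs; it assumes logical CPUs = boards x sockets x cores x threads.

```python
# Node definitions copied from the configuration above, abbreviated to the
# fields needed here (RealMemory and TmpDisk are left out).
NODE_LINES = """\
NodeName=cn-1 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 Gres=gpu
NodeName=cn-2 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 Gres=gpu
NodeName=cn-3 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 Gres=gpu
NodeName=cn-4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 Gres=gpu
NodeName=cn-5 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1
NodeName=cn-6 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 Gres=gpu
NodeName=cn-7 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1
NodeName=cn-8 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1
"""


def parse_node(line):
    """Split one NodeName line into a dict of key=value fields."""
    return dict(field.split("=", 1) for field in line.split())


def summarize(line):
    """Return (name, physical cores, logical CPUs, has_gpu) for one node line."""
    f = parse_node(line)
    cores = int(f["Boards"]) * int(f["SocketsPerBoard"]) * int(f["CoresPerSocket"])
    cpus = cores * int(f["ThreadsPerCore"])
    return f["NodeName"], cores, cpus, "Gres" in f


if __name__ == "__main__":
    for line in NODE_LINES.splitlines():
        name, cores, cpus, has_gpu = summarize(line)
        print(f"{name}: {cores} cores, {cpus} logical CPUs, gpu={has_gpu}")
```

Under that assumption each gpu node has 4 cores and 8 logical CPUs, which matters below because a job asking for one cpu on such a node is reported with NumCPUs=2.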
I submitted a collection of jobs that request two cpu cores and no gpu and
another collection that requests one cpu core and one gpu. The output of
"scontrol show job" for a cpu job reads:
JobId=24874 Name=107(cpu)
UserId=surban(10010) GroupId=brmlstaff(10013)
Priority=1000001 Account=normal QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=4 BatchFlag=1 ExitCode=0:0
RunTime=02:12:00 TimeLimit=03:00:00 TimeMin=N/A
SubmitTime=2014-03-27T10:24:34 EligibleTime=2014-03-27T10:24:44
StartTime=2014-03-27T10:25:08 EndTime=2014-03-27T13:25:08
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=normal AllocNode:Sid=cn-login:43701
ReqNodeList=(null) ExcNodeList=(null)
NodeList=cn-5
BatchHost=cn-5
NumNodes=1 NumCPUs=2 CPUs/Task=2 ReqS:C:T=*:*:*
MinCPUsNode=2 MinMemoryNode=1024M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/uhome/surban/dev/submit/submit/job-script.sh python -m apps.shiftnet.shiftnet_mb 107 cpu 24873 /uhome/surban/dev/submit/submit/prolog.sh
WorkDir=/uhome/surban/dev/ml_jobs/apps/shiftnet/cfgs/hpsearch/500
And the corresponding output for a gpu job:
JobId=24924 Name=124(gpu)
UserId=surban(10010) GroupId=brmlstaff(10013)
Priority=2006169 Account=gpu QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=2 BatchFlag=1 ExitCode=0:0
RunTime=00:23:16 TimeLimit=03:00:00 TimeMin=N/A
SubmitTime=2014-03-27T11:48:39 EligibleTime=2014-03-27T11:48:49
StartTime=2014-03-27T12:16:15 EndTime=2014-03-27T15:16:15
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=gpu AllocNode:Sid=cn-login:43701
ReqNodeList=(null) ExcNodeList=(null)
NodeList=cn-1
BatchHost=cn-1
NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=1024M MinTmpDiskNode=0
Features=(null) Gres=gpu Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/uhome/surban/dev/submit/submit/job-script.sh python -m apps.shiftnet.shiftnet_mb 124 gpu 24925 /uhome/surban/dev/submit/submit/prolog.sh
WorkDir=/uhome/surban/dev/ml_jobs/apps/shiftnet/cfgs/hpsearch/500
Now consider, for example, the node cn-3, which has 4 cpu cores (8 hardware
threads) and 1 gpu. I would expect SLURM to allocate one gpu job and one cpu
job to this node. However, what actually happens is that only one gpu job is
allocated to it and the remaining 3 cpu cores are left idle.
Thus our job queue looks like this:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
25829 gpu 425(gpu) surban PD 0:00 1 (Resources)
25838 gpu 428(gpu) surban PD 0:00 1 (Resources)
...
24957 gpu 135(gpu) surban PD 0:00 1 (Resources)
24954 gpu 134(gpu) surban R 1:53:40 1 cn-4
24918 gpu 122(gpu) surban R 42:16 1 cn-3
24924 gpu 124(gpu) surban R 14:40 1 cn-1
24951 gpu 133(gpu) surban R 10:27 1 cn-6
25893 highlong runit.sh jbayer R 19:54:41 1 cn-2
24856 normal 097(cpu) surban PD 0:00 1 (Resources)
24937 normal 128(cpu) surban PD 0:00 1 (Resources)
24940 normal 129(cpu) surban PD 0:00 1 (Resources)
24958 normal 135(cpu) surban PD 0:00 1 (Resources)
...
25851 normal 432(cpu) surban PD 0:00 1 (Resources)
24934 normal 127(cpu) surban R 1:40:42 1 cn-8
24853 normal 095(cpu) surban R 1:47:20 1 cn-8
24850 normal 094(cpu) surban R 1:51:35 1 cn-7
24727 normal 045(cpu) surban R 1:53:28 1 cn-7
24730 normal 046(cpu) surban R 1:53:28 1 cn-7
24733 normal 047(cpu) surban R 1:53:28 1 cn-7
24736 normal 048(cpu) surban R 1:53:28 1 cn-7
24739 normal 049(cpu) surban R 1:53:28 1 cn-7
24763 normal 063(cpu) surban R 1:53:28 1 cn-7
24769 normal 065(cpu) surban R 1:53:28 1 cn-7
24775 normal 067(cpu) surban R 1:53:28 1 cn-7
24778 normal 068(cpu) surban R 1:53:28 1 cn-7
24799 normal 075(cpu) surban R 1:53:28 1 cn-7
24838 normal 089(cpu) surban R 1:53:28 1 cn-7
24706 normal 037(cpu) surban R 1:53:30 1 cn-8
24721 normal 043(cpu) surban R 1:53:30 1 cn-8
24670 normal 025(cpu) surban R 1:55:39 1 cn-8
24631 normal 012(cpu) surban R 2:02:13 1 cn-5
24634 normal 013(cpu) surban R 2:02:13 1 cn-8
24643 normal 016(cpu) surban R 2:02:13 1 cn-8
24667 normal 024(cpu) surban R 2:02:13 1 cn-8
24862 normal 100(cpu) surban R 2:05:47 1 cn-5
24871 normal 106(cpu) surban R 2:05:47 1 cn-5
I.e., SLURM only allocates jobs requesting no gpu to cpu-only nodes, although
the nodes containing a gpu have idle cpu cores!
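The pattern can be checked mechanically. The following is a small Python sketch, not part of any SLURM tooling, that groups the nodes of running jobs by partition. It uses a handful of rows copied from the listing above and assumes the eight whitespace-separated squeue columns shown in the header; the function name nodes_by_partition is made up for illustration.

```python
from collections import defaultdict

# A few rows copied from the squeue listing above. Columns:
# JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
SQUEUE_ROWS = """\
24954 gpu 134(gpu) surban R 1:53:40 1 cn-4
24918 gpu 122(gpu) surban R 42:16 1 cn-3
24924 gpu 124(gpu) surban R 14:40 1 cn-1
24951 gpu 133(gpu) surban R 10:27 1 cn-6
25893 highlong runit.sh jbayer R 19:54:41 1 cn-2
24856 normal 097(cpu) surban PD 0:00 1 (Resources)
24934 normal 127(cpu) surban R 1:40:42 1 cn-8
24850 normal 094(cpu) surban R 1:51:35 1 cn-7
24631 normal 012(cpu) surban R 2:02:13 1 cn-5
"""


def nodes_by_partition(text):
    """Map each partition to the sorted list of nodes running its jobs."""
    result = defaultdict(set)
    for row in text.splitlines():
        _jobid, partition, _name, _user, state, _time, _nodes, nodelist = row.split()
        if state == "R":  # skip pending jobs, whose last column is a reason
            result[partition].add(nodelist)
    return {part: sorted(nodes) for part, nodes in result.items()}


if __name__ == "__main__":
    for part, nodes in nodes_by_partition(SQUEUE_ROWS).items():
        print(part, nodes)
```

For these rows, the jobs in partition "normal" run only on cn-5, cn-7 and cn-8, and no gpu node appears in that set.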
Here is the output of "scontrol show node cn-3":
NodeName=cn-3 Arch=x86_64 CoresPerSocket=4
CPUAlloc=2 CPUErr=0 CPUTot=8 CPULoad=1.04 Features=(null)
Gres=gpu
NodeAddr=cn-3 NodeHostName=cn-3
OS=Linux RealMemory=32086 AllocMem=1024 Sockets=1 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=187612 Weight=1
BootTime=2014-03-24T09:34:56 SlurmdStartTime=2014-03-24T18:01:37
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
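As a quick cross-check of the "3 idle cores" figure, here is a minimal Python sketch that computes it from the node record above. It assumes that CPUAlloc and CPUTot count hardware threads, so dividing by ThreadsPerCore gives cores; the function name idle_cores is made up for illustration, and only the lines carrying the needed fields are copied.

```python
# Selected lines from the "scontrol show node cn-3" output above.
SCONTROL_NODE = """\
NodeName=cn-3 Arch=x86_64 CoresPerSocket=4
CPUAlloc=2 CPUErr=0 CPUTot=8 CPULoad=1.04 Features=(null)
Gres=gpu
OS=Linux RealMemory=32086 AllocMem=1024 Sockets=1 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=187612 Weight=1
"""


def idle_cores(text):
    """Return the number of unallocated physical cores in a node record."""
    fields = dict(token.split("=", 1) for token in text.split())
    threads_per_core = int(fields["ThreadsPerCore"])
    idle_cpus = int(fields["CPUTot"]) - int(fields["CPUAlloc"])
    return idle_cpus // threads_per_core


if __name__ == "__main__":
    print(idle_cores(SCONTROL_NODE))  # (8 - 2) // 2 = 3
```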
Could anybody help with debugging or fixing this problem?
The full slurm.conf and the output of "sdiag" are attached. I am using SLURM
2.6.7.
Thanks,
Sebastian Urban
*******************************************************
sdiag output at Thu Mar 27 12:54:14 2014
Data since Thu Mar 27 01:00:04 2014
*******************************************************
Server thread count: 3
Agent queue size: 0
Jobs submitted: 0
Jobs started: 120
Jobs completed: 108
Jobs canceled: 10
Jobs failed: 1
Main schedule statistics (microseconds):
Last cycle: 30137
Max cycle: 162051
Total cycles: 850
Mean cycle: 30382
Mean depth cycle: 613
Cycles per minute: 1
Last queue length: 600
Backfilling stats
Total backfilled jobs (since last slurm start): 501
Total backfilled jobs (since last stats cycle start): 102
Total cycles: 732
Last cycle when: Thu Mar 27 12:53:58 2014
Last cycle: 10482842
Max cycle: 12779696
Mean cycle: 3696589
Last depth cycle: 600
Last depth cycle (try sched): 600
Depth Mean: 556
Depth Mean (try depth): 556
Last queue length: 600
Queue length mean: 611
