[slurm-dev] Jobs waiting for gres gpu are blocking other jobs not requiring a gpu

Urban, Sebastian Thu, 27 Mar 2014 05:03:34 -0700

Dear all,

Please consider the following partition configuration:


PartitionName=DEFAULT Nodes=cn-[1-8] MaxTime=3:00:00 State=UP Priority=1 Shared=YES
PartitionName=normal Priority=1 Default=YES
PartitionName=gpu Priority=2
PartitionName=highlong Priority=10 MaxTime=48:00:00 AllowGroups=brmlstaff

Users submit their cpu jobs to partition "normal" and their gpu jobs to 
partition "gpu". All nodes are allocated to both partitions but gpu jobs are 
prioritized by giving the corresponding partition a higher priority. Partition 
"highlong" is used for special purposes but it does not concern this problem.

The nodes cn-[1-4,6] are equipped with a gpu, the nodes cn-[5,7-8] are cpu only:

NodeName=cn-1  Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=32086 TmpDisk=187612 Gres=gpu
NodeName=cn-2  Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=32086 TmpDisk=187612 Gres=gpu
NodeName=cn-3  Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=32086 TmpDisk=187612 Gres=gpu
NodeName=cn-4  Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=32086 TmpDisk=187612 Gres=gpu
NodeName=cn-5  Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7984 TmpDisk=187611
NodeName=cn-6  Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=5928 TmpDisk=187611 Gres=gpu
NodeName=cn-7  Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=32398 TmpDisk=187611
NodeName=cn-8  Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=15852 TmpDisk=187611
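
For reference, the number of logical CPUs implied by each node definition is the product Boards x SocketsPerBoard x CoresPerSocket x ThreadsPerCore, since no explicit CPUs value is given. The following is only an illustrative Python sketch with the topology values copied from the lines above; the names NODES and logical_cpus are made up for this example and are not part of SLURM.

```python
# Topology copied from the NodeName lines above:
# (Boards, SocketsPerBoard, CoresPerSocket, ThreadsPerCore, has_gpu)
NODES = {
    "cn-1": (1, 1, 4, 2, True),
    "cn-2": (1, 1, 4, 2, True),
    "cn-3": (1, 1, 4, 2, True),
    "cn-4": (1, 1, 4, 2, True),
    "cn-5": (1, 2, 4, 1, False),
    "cn-6": (1, 1, 4, 2, True),
    "cn-7": (1, 2, 12, 1, False),
    "cn-8": (1, 2, 8, 1, False),
}


def logical_cpus(boards, sockets, cores, threads):
    """Logical CPUs = boards * sockets per board * cores per socket * threads per core."""
    return boards * sockets * cores * threads


# Hypothetical helper table: node name -> logical CPU count.
cpus = {name: logical_cpus(*spec[:4]) for name, spec in NODES.items()}

for name, spec in NODES.items():
    kind = "gpu" if spec[4] else "cpu only"
    print(f"{name}: {cpus[name]} logical CPUs ({kind})")
```

Under this assumption every gpu node exposes 8 logical CPUs (4 cores with 2 threads each), so a gpu job that occupies one core still leaves three cores unallocated.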

I submitted a collection of jobs that request two cpu cores and no gpu and 
another collection that requests one cpu core and one gpu. The output of 
"scontrol show job" for a cpu job reads:

JobId=24874 Name=107(cpu)
   UserId=surban(10010) GroupId=brmlstaff(10013)
   Priority=1000001 Account=normal QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=4 BatchFlag=1 ExitCode=0:0
   RunTime=02:12:00 TimeLimit=03:00:00 TimeMin=N/A
   SubmitTime=2014-03-27T10:24:34 EligibleTime=2014-03-27T10:24:44
   StartTime=2014-03-27T10:25:08 EndTime=2014-03-27T13:25:08
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=normal AllocNode:Sid=cn-login:43701
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cn-5
   BatchHost=cn-5
   NumNodes=1 NumCPUs=2 CPUs/Task=2 ReqS:C:T=*:*:*
   MinCPUsNode=2 MinMemoryNode=1024M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/uhome/surban/dev/submit/submit/job-script.sh python -m apps.shiftnet.shiftnet_mb 107 cpu 24873 /uhome/surban/dev/submit/submit/prolog.sh
   WorkDir=/uhome/surban/dev/ml_jobs/apps/shiftnet/cfgs/hpsearch/500

And the corresponding output for a gpu job:

JobId=24924 Name=124(gpu)
   UserId=surban(10010) GroupId=brmlstaff(10013)
   Priority=2006169 Account=gpu QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=2 BatchFlag=1 ExitCode=0:0
   RunTime=00:23:16 TimeLimit=03:00:00 TimeMin=N/A
   SubmitTime=2014-03-27T11:48:39 EligibleTime=2014-03-27T11:48:49
   StartTime=2014-03-27T12:16:15 EndTime=2014-03-27T15:16:15
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=gpu AllocNode:Sid=cn-login:43701
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cn-1
   BatchHost=cn-1
   NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=1024M MinTmpDiskNode=0
   Features=(null) Gres=gpu Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/uhome/surban/dev/submit/submit/job-script.sh python -m apps.shiftnet.shiftnet_mb 124 gpu 24925 /uhome/surban/dev/submit/submit/prolog.sh
   WorkDir=/uhome/surban/dev/ml_jobs/apps/shiftnet/cfgs/hpsearch/500

Now consider, for example, node cn-3, which has 4 cpu cores (8 hardware 
threads) and 1 gpu. I would expect SLURM to allocate one gpu job and one cpu 
job to this node. However, only one gpu job is allocated and the remaining 3 
cpu cores are left idle.

Thus our job queue looks like this:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             25829       gpu 425(gpu)   surban PD       0:00      1 (Resources)
             25838       gpu 428(gpu)   surban PD       0:00      1 (Resources)
...
             24957       gpu 135(gpu)   surban PD       0:00      1 (Resources)
             24954       gpu 134(gpu)   surban  R    1:53:40      1 cn-4
             24918       gpu 122(gpu)   surban  R      42:16      1 cn-3
             24924       gpu 124(gpu)   surban  R      14:40      1 cn-1
             24951       gpu 133(gpu)   surban  R      10:27      1 cn-6
             25893  highlong runit.sh   jbayer  R   19:54:41      1 cn-2
             24856    normal 097(cpu)   surban PD       0:00      1 (Resources)
             24937    normal 128(cpu)   surban PD       0:00      1 (Resources)
             24940    normal 129(cpu)   surban PD       0:00      1 (Resources)
             24958    normal 135(cpu)   surban PD       0:00      1 (Resources)
...
             25851    normal 432(cpu)   surban PD       0:00      1 (Resources)
             24934    normal 127(cpu)   surban  R    1:40:42      1 cn-8
             24853    normal 095(cpu)   surban  R    1:47:20      1 cn-8
             24850    normal 094(cpu)   surban  R    1:51:35      1 cn-7
             24727    normal 045(cpu)   surban  R    1:53:28      1 cn-7
             24730    normal 046(cpu)   surban  R    1:53:28      1 cn-7
             24733    normal 047(cpu)   surban  R    1:53:28      1 cn-7
             24736    normal 048(cpu)   surban  R    1:53:28      1 cn-7
             24739    normal 049(cpu)   surban  R    1:53:28      1 cn-7
             24763    normal 063(cpu)   surban  R    1:53:28      1 cn-7
             24769    normal 065(cpu)   surban  R    1:53:28      1 cn-7
             24775    normal 067(cpu)   surban  R    1:53:28      1 cn-7
             24778    normal 068(cpu)   surban  R    1:53:28      1 cn-7
             24799    normal 075(cpu)   surban  R    1:53:28      1 cn-7
             24838    normal 089(cpu)   surban  R    1:53:28      1 cn-7
             24706    normal 037(cpu)   surban  R    1:53:30      1 cn-8
             24721    normal 043(cpu)   surban  R    1:53:30      1 cn-8
             24670    normal 025(cpu)   surban  R    1:55:39      1 cn-8
             24631    normal 012(cpu)   surban  R    2:02:13      1 cn-5
             24634    normal 013(cpu)   surban  R    2:02:13      1 cn-8
             24643    normal 016(cpu)   surban  R    2:02:13      1 cn-8
             24667    normal 024(cpu)   surban  R    2:02:13      1 cn-8
             24862    normal 100(cpu)   surban  R    2:05:47      1 cn-5
             24871    normal 106(cpu)   surban  R    2:05:47      1 cn-5

That is, SLURM allocates jobs that request no gpu only to the cpu-only nodes, 
although the nodes with a gpu have idle cpu cores!
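
This pattern can be checked mechanically by tallying, per node, which partitions have a running job there. The following is a minimal Python sketch, assuming the squeue column layout shown above and using only a few rows copied from the listing (not the full queue); running_by_node is a made-up helper name.

```python
from collections import defaultdict

# A few rows copied from the squeue listing above (not the full queue).
SQUEUE_ROWS = """
24954       gpu 134(gpu)   surban  R    1:53:40      1 cn-4
24918       gpu 122(gpu)   surban  R      42:16      1 cn-3
24924       gpu 124(gpu)   surban  R      14:40      1 cn-1
24951       gpu 133(gpu)   surban  R      10:27      1 cn-6
25893  highlong runit.sh   jbayer  R   19:54:41      1 cn-2
24856    normal 097(cpu)   surban PD       0:00      1 (Resources)
24934    normal 127(cpu)   surban  R    1:40:42      1 cn-8
24850    normal 094(cpu)   surban  R    1:51:35      1 cn-7
24631    normal 012(cpu)   surban  R    2:02:13      1 cn-5
"""

# Nodes equipped with a gpu, i.e. cn-[1-4,6].
GPU_NODES = {"cn-1", "cn-2", "cn-3", "cn-4", "cn-6"}


def running_by_node(text):
    """Map each node to the set of partitions that have a running job on it."""
    result = defaultdict(set)
    for line in text.strip().splitlines():
        # Columns: JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
        _jobid, partition, _name, _user, state, _time, _nodes, nodelist = line.split()
        if state == "R":
            result[nodelist].add(partition)
    return result


by_node = running_by_node(SQUEUE_ROWS)

# gpu nodes that run at least one job from partition "normal"
normal_on_gpu_nodes = sorted(
    node for node, parts in by_node.items()
    if node in GPU_NODES and "normal" in parts
)
print(normal_on_gpu_nodes)  # prints [] for the rows above
```

For the rows used here the result is empty: every running "normal" job sits on cn-5, cn-7 or cn-8.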

Here is the output of "scontrol show node cn-3":

NodeName=cn-3 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=2 CPUErr=0 CPUTot=8 CPULoad=1.04 Features=(null)
   Gres=gpu
   NodeAddr=cn-3 NodeHostName=cn-3
   OS=Linux RealMemory=32086 AllocMem=1024 Sockets=1 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=187612 Weight=1
   BootTime=2014-03-24T09:34:56 SlurmdStartTime=2014-03-24T18:01:37
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
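
As a sanity check of these numbers, the following Python sketch parses the key=value fields and tests whether a cpu job like the ones described earlier (2 CPUs, 1024 MB) would still fit on cn-3. It is only a rough illustration: parse_fields and fits are made-up helpers, and the check compares plain CPU and memory counts, which is an assumption and not how the SLURM scheduler itself decides.

```python
# Output of "scontrol show node cn-3", copied from above (trimmed).
NODE_TEXT = """
NodeName=cn-3 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=2 CPUErr=0 CPUTot=8 CPULoad=1.04 Features=(null)
   Gres=gpu
   OS=Linux RealMemory=32086 AllocMem=1024 Sockets=1 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=187612 Weight=1
"""


def parse_fields(text):
    """Split whitespace-separated key=value tokens into a dict of strings."""
    return dict(tok.split("=", 1) for tok in text.split() if "=" in tok)


def fits(node, cpus, mem_mb):
    """Naive check (assumption): enough unallocated CPUs and memory on the node?"""
    idle_cpus = int(node["CPUTot"]) - int(node["CPUAlloc"])
    idle_mem = int(node["RealMemory"]) - int(node["AllocMem"])
    return idle_cpus >= cpus and idle_mem >= mem_mb


node = parse_fields(NODE_TEXT)
idle = int(node["CPUTot"]) - int(node["CPUAlloc"])
print(idle, fits(node, cpus=2, mem_mb=1024))  # prints: 6 True
```

By this simple count, 6 of the 8 logical CPUs on cn-3 are unallocated, which is enough for a 2-CPU job from partition "normal".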


Could anybody help with debugging or fixing this problem? 

The full slurm.conf and the output of "sdiag" are attached. I am using SLURM 
2.6.7.

Thanks,
Sebastian Urban


*******************************************************
sdiag output at Thu Mar 27 12:54:14 2014
Data since      Thu Mar 27 01:00:04 2014
*******************************************************
Server thread count: 3
Agent queue size:    0

Jobs submitted: 0
Jobs started:   120
Jobs completed: 108
Jobs canceled:  10
Jobs failed:    1

Main schedule statistics (microseconds):
        Last cycle:   30137
        Max cycle:    162051
        Total cycles: 850
        Mean cycle:   30382
        Mean depth cycle:  613
        Cycles per minute: 1
        Last queue length: 600

Backfilling stats
        Total backfilled jobs (since last slurm start): 501
        Total backfilled jobs (since last stats cycle start): 102
        Total cycles: 732
        Last cycle when: Thu Mar 27 12:53:58 2014
        Last cycle: 10482842
        Max cycle:  12779696
        Mean cycle: 3696589
        Last depth cycle: 600
        Last depth cycle (try sched): 600
        Depth Mean: 556
        Depth Mean (try depth): 556
        Last queue length: 600
        Queue length mean: 611
