Hi, I was wondering about the following. I have this situation:
84974 multiscal testpock sdoerr PD 0:00 1 (Priority)
84973 multiscal testpock sdoerr PD 0:00 1 (Priority)
81538 multiscal RC_f7 miha R 17:41:56 1 ace2
81537 multiscal RC_f6 miha R 17:42:00 1 ace2
81536 multiscal RC_f5 miha R 17:42:04 1 ace2
81535 multiscal RC_f4 miha R 17:42:08 1 ace2
81534 multiscal RC_f3 miha R 17:42:12 1 ace1
81533 multiscal RC_f2 miha R 17:42:16 1 ace1
81532 multiscal RC_f1 miha R 17:42:20 1 ace1
81531 multiscal RC_f0 miha R 17:42:24 1 ace1
[sdoerr@xxx Fri10:35 slurmtest] sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
multiscale* up infinite 1 drain* ace3
multiscale* up infinite 1 down* giallo
multiscale* up infinite 2 mix ace[1-2]
multiscale* up infinite 5 idle arancio,loro,oliva,rosa,suo
The miha jobs exclude nodes so that they run only on the machines with good
GPUs (ace1, ace2).
As you can see, I have 5 idle machines which could serve my jobs, but my jobs
are for some reason stuck in pending with reason "Priority". I am quite sure
that these 5 nodes satisfy the hardware requirements for my jobs (I also ran
them yesterday).
It's just that for some reason, and we have seen this before, these
node-excluding miha jobs seem to get the rest stuck in Priority. If we
cancel them, mine go through to the idle machines. However, we cannot
figure out what the cause is. I paste below the scontrol show job output
for one miha job and one of my jobs.
Many thanks!
JobId=81534 JobName=RC_f3
UserId=miha(3056) GroupId=lab(3000)
Priority=33 Nice=0 Account=lab QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=17:44:07 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2017-03-23T16:53:14 EligibleTime=2017-03-23T16:53:14
StartTime=2017-03-23T16:53:15 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=multiscale AllocNode:Sid=blu:26225
ReqNodeList=(null) ExcNodeList=arancio,giallo,loro,oliva,pink,rosa,suo
NodeList=ace1
BatchHost=ace1
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=11500,node=1,gres/gpu=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=11500M MinTmpDiskNode=0
Features=(null) Gres=gpu:1 Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Power= SICP=0
JobId=84973 JobName=testpock2
UserId=sdoerr(3041) GroupId=lab(3000)
Priority=33 Nice=0 Account=lab QOS=normal
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2017-03-24T10:31:20 EligibleTime=2017-03-24T10:31:20
StartTime=2019-03-24T10:38:10 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=multiscale AllocNode:Sid=loro:15424
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=4000,node=1,gres/gpu=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=4000M MinTmpDiskNode=0
Features=(null) Gres=gpu:1 Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Power= SICP=0
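
In case it helps when comparing the two records above, here is a minimal Python sketch (not part of Slurm) that parses the Key=Value tokens of a scontrol show job record and reports only the fields that differ. The helper names parse_job and diff_jobs are hypothetical, and the two records are shortened to a handful of fields copied from the output above.

```python
def parse_job(record: str) -> dict:
    """Parse whitespace-separated Key=Value tokens from one job record."""
    fields = {}
    for token in record.split():
        # Split on the first '=' only, so values such as
        # "cpu=1,mem=4000" (the TRES field) stay intact.
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def diff_jobs(a: dict, b: dict, keys: list) -> dict:
    """Return {key: (value_in_a, value_in_b)} for the keys whose values differ."""
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Shortened records, with field values copied from the scontrol output above.
running = parse_job(
    "JobId=81534 JobName=RC_f3 Priority=33 Partition=multiscale "
    "ExcNodeList=arancio,giallo,loro,oliva,pink,rosa,suo "
    "MinMemoryNode=11500M Gres=gpu:1"
)
pending = parse_job(
    "JobId=84973 JobName=testpock2 Priority=33 Partition=multiscale "
    "ExcNodeList=(null) MinMemoryNode=4000M Gres=gpu:1"
)

# Fields chosen for illustration; any other field names could be listed here.
keys = ["Priority", "Partition", "ExcNodeList", "MinMemoryNode", "Gres"]
differences = diff_jobs(running, pending, keys)

for key, (run_value, pend_value) in differences.items():
    print(f"{key}: running={run_value} pending={pend_value}")
```

For these two jobs, the priority, partition and GPU request are identical; of the fields compared, only the excluded node list and the memory request differ.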