Hi, I was wondering about the following. I have this situation:

             84974 multiscal testpock   sdoerr PD       0:00      1 (Priority)
             84973 multiscal testpock   sdoerr PD       0:00      1 (Priority)
             81538 multiscal    RC_f7     miha  R   17:41:56      1 ace2
             81537 multiscal    RC_f6     miha  R   17:42:00      1 ace2
             81536 multiscal    RC_f5     miha  R   17:42:04      1 ace2
             81535 multiscal    RC_f4     miha  R   17:42:08      1 ace2
             81534 multiscal    RC_f3     miha  R   17:42:12      1 ace1
             81533 multiscal    RC_f2     miha  R   17:42:16      1 ace1
             81532 multiscal    RC_f1     miha  R   17:42:20      1 ace1
             81531 multiscal    RC_f0     miha  R   17:42:24      1 ace1


[sdoerr@xxx Fri10:35 slurmtest]  sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
multiscale*      up   infinite      1 drain* ace3
multiscale*      up   infinite      1  down* giallo
multiscale*      up   infinite      2    mix ace[1-2]
multiscale*      up   infinite      5   idle arancio,loro,oliva,rosa,suo


The miha jobs use an exclude-node list so that they run only on the machines
with good GPUs (ace1, ace2).
As you can see, I have 5 idle machines that could serve my jobs, but my jobs
are for some reason stuck pending with reason "Priority". I am quite sure
that these 5 nodes satisfy the hardware requirements for my jobs (I also
ran them yesterday).

We have seen this before: these node-excluding miha jobs seem to leave
everyone else's jobs stuck pending on priority. If we cancel them, mine go
through to the idle machines. However, we cannot figure out the cause.
Below I paste the scontrol show job output for one miha job and one of my
jobs.

Many thanks!


JobId=81534 JobName=RC_f3
   UserId=miha(3056) GroupId=lab(3000)
   Priority=33 Nice=0 Account=lab QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=17:44:07 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2017-03-23T16:53:14 EligibleTime=2017-03-23T16:53:14
   StartTime=2017-03-23T16:53:15 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=multiscale AllocNode:Sid=blu:26225
   ReqNodeList=(null) ExcNodeList=arancio,giallo,loro,oliva,pink,rosa,suo
   NodeList=ace1
   BatchHost=ace1
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=11500,node=1,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=11500M MinTmpDiskNode=0
   Features=(null) Gres=gpu:1 Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Power= SICP=0


JobId=84973 JobName=testpock2
   UserId=sdoerr(3041) GroupId=lab(3000)
   Priority=33 Nice=0 Account=lab QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2017-03-24T10:31:20 EligibleTime=2017-03-24T10:31:20
   StartTime=2019-03-24T10:38:10 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=multiscale AllocNode:Sid=loro:15424
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000,node=1,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=4000M MinTmpDiskNode=0
   Features=(null) Gres=gpu:1 Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Power= SICP=0
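
P.S. In case it helps anyone compare the two records above field by field,
here is a small Python sketch. It is not anything Slurm ships: parse_job and
diff_jobs are hypothetical helper names, and the sketch assumes that no field
value contains spaces (true for the two dumps here, not in general).

```python
def parse_job(text):
    """Parse `scontrol show job` text into a dict of field -> value.

    Assumption: every field is a whitespace-free Key=Value token.
    Only the first '=' splits, so TRES=cpu=1,... stays one field.
    """
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore anything that is not a Key=Value token
            fields[key] = value
    return fields


def diff_jobs(a, b, ignore=()):
    """Return {field: (value_a, value_b)} for fields whose values differ."""
    keys = sorted(set(a) | set(b))
    return {
        k: (a.get(k), b.get(k))
        for k in keys
        if k not in ignore and a.get(k) != b.get(k)
    }


if __name__ == "__main__":
    # Abbreviated copies of the two records pasted above.
    running = parse_job(
        "JobId=81534 JobName=RC_f3 Priority=33 "
        "ExcNodeList=arancio,giallo,loro,oliva,pink,rosa,suo Gres=gpu:1"
    )
    pending = parse_job(
        "JobId=84973 JobName=testpock2 Priority=33 "
        "ExcNodeList=(null) Gres=gpu:1"
    )
    for field, (r, p) in diff_jobs(running, pending,
                                   ignore=("JobId", "JobName")).items():
        print(f"{field}: {r} vs {p}")
```

In practice one would feed it the full saved output of scontrol show job for
each job rather than the shortened strings used here.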
