Stefan, I believe I experienced this as well. Very similar situation at least (and still looking into it). Can you provide your conf files?
Best, Jared

From: Roe Zohar <[email protected]>
Reply-To: slurm-dev <[email protected]>
Date: Friday, March 24, 2017 at 6:17 AM
To: slurm-dev <[email protected]>
Subject: [slurm-dev] Re: Priority blocking jobs despite idle machines

I had the same problem. Didn't find a solution.

On Mar 24, 2017 12:59 PM, "Stefan Doerr" <[email protected]> wrote:

OK, so after some investigation it seems the problem occurs when there are miha jobs pending in the queue (I truncated the squeue output before for brevity). Those miha jobs require the ace1 and ace2 machines, so they are pending because those machines are full right now. For some reason SLURM decides that, because the miha jobs are pending, it cannot work on my jobs (which don't require specific machines) and leaves mine pending as well. Once we cancelled the pending miha jobs (leaving the running ones running), I also cancelled my pending jobs, resubmitted them, and it worked.

This seems to me like a quite problematic limitation in SLURM. Any opinions on this?

On Fri, Mar 24, 2017 at 10:43 AM, Stefan Doerr <[email protected]> wrote:

Hi, I was wondering about the following.
I have this situation:

  JOBID PARTITION     NAME   USER ST     TIME NODES NODELIST(REASON)
  84974 multiscal testpock sdoerr PD     0:00     1 (Priority)
  84973 multiscal testpock sdoerr PD     0:00     1 (Priority)
  81538 multiscal    RC_f7   miha  R 17:41:56     1 ace2
  81537 multiscal    RC_f6   miha  R 17:42:00     1 ace2
  81536 multiscal    RC_f5   miha  R 17:42:04     1 ace2
  81535 multiscal    RC_f4   miha  R 17:42:08     1 ace2
  81534 multiscal    RC_f3   miha  R 17:42:12     1 ace1
  81533 multiscal    RC_f2   miha  R 17:42:16     1 ace1
  81532 multiscal    RC_f1   miha  R 17:42:20     1 ace1
  81531 multiscal    RC_f0   miha  R 17:42:24     1 ace1

[sdoerr@xxx Fri10:35 slurmtest] sinfo
  PARTITION   AVAIL TIMELIMIT NODES STATE  NODELIST
  multiscale* up    infinite      1 drain* ace3
  multiscale* up    infinite      1 down*  giallo
  multiscale* up    infinite      2 mix    ace[1-2]
  multiscale* up    infinite      5 idle   arancio,loro,oliva,rosa,suo

The miha jobs use node exclusion to run only on the machines with good GPUs (ace1, ace2). As you can see, I have 5 idle machines that could serve my jobs, yet my jobs are stuck pending with reason "Priority". I am quite sure these 5 nodes satisfy the hardware requirements of my jobs (they also ran there yesterday). It's just that, for some reason we have seen before, these node-excluding miha jobs seem to get everything else stuck on priority. If we cancel them, mine go through to the idle machines; however, we cannot figure out the cause. I paste below the scontrol show job output for one miha job and one of mine. Many thanks!
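One plausible explanation, assuming the cluster runs SLURM's builtin scheduler (an assumption; the thread never shows the scheduler config): sched/builtin dispatches strictly in priority order, so once the top pending job cannot start (its required nodes are full), every job behind it waits too and shows Reason=Priority. The backfill scheduler instead lets lower-priority jobs start early as long as they will not delay the expected start of higher-priority ones, but it can only prove that when jobs have finite time limits; the jobs here run with TimeLimit=UNLIMITED, which defeats backfill planning. A hedged slurm.conf sketch, with every value illustrative rather than taken from the poster's setup:

    # slurm.conf fragment -- illustrative values, not the actual config
    SchedulerType=sched/backfill
    # bf_interval: how often backfill runs (seconds);
    # bf_window: how far ahead it plans (minutes)
    SchedulerParameters=bf_interval=30,bf_window=1440
    # Backfill needs run-time estimates: give the partition a finite
    # default and maximum so UNLIMITED jobs no longer block planning
    PartitionName=multiscale DefaultTime=24:00:00 MaxTime=7-00:00:00 State=UP

With backfill enabled and finite time limits, the scheduler could have started the testpock jobs on the idle nodes while the miha jobs waited for ace1/ace2.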
JobId=81534 JobName=RC_f3
   UserId=miha(3056) GroupId=lab(3000)
   Priority=33 Nice=0 Account=lab QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=17:44:07 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2017-03-23T16:53:14 EligibleTime=2017-03-23T16:53:14
   StartTime=2017-03-23T16:53:15 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=multiscale AllocNode:Sid=blu:26225
   ReqNodeList=(null) ExcNodeList=arancio,giallo,loro,oliva,pink,rosa,suo
   NodeList=ace1 BatchHost=ace1
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=11500,node=1,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=11500M MinTmpDiskNode=0
   Features=(null) Gres=gpu:1 Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Power= SICP=0

JobId=84973 JobName=testpock2
   UserId=sdoerr(3041) GroupId=lab(3000)
   Priority=33 Nice=0 Account=lab QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2017-03-24T10:31:20 EligibleTime=2017-03-24T10:31:20
   StartTime=2019-03-24T10:38:10 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=multiscale AllocNode:Sid=loro:15424
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000,node=1,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=4000M MinTmpDiskNode=0
   Features=(null) Gres=gpu:1 Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Power= SICP=0
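A few standard SLURM commands that could help narrow this down (the job ID and partition name are taken from the output above; run on the cluster in question):

    # Which scheduler is configured, builtin or backfill?
    scontrol show config | grep -i SchedulerType
    # The exact pending state and reason for the stuck job
    squeue -j 84973 -o "%i %t %r"
    # Per-node state and allocated/idle CPU counts in the partition
    sinfo -N -p multiscale -o "%N %t %C"

If the first command reports sched/builtin, that alone would explain lower-priority jobs queuing behind a blocked higher-priority one even with idle nodes available.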
