Stefan,

I believe I experienced this as well. Very similar situation at least (and 
still looking into it). Can you provide your conf files?

Best, Jared

From: Roe Zohar <[email protected]>
Reply-To: slurm-dev <[email protected]>
Date: Friday, March 24, 2017 at 6:17 AM
To: slurm-dev <[email protected]>
Subject: [slurm-dev] Re: Priority blocking jobs despite idle machines

I had the same problem. Didn't find a solution.

On Mar 24, 2017 12:59 PM, "Stefan Doerr" <[email protected]> wrote:
OK, so after some investigation it seems that the problem occurs when there are 
miha jobs pending in the queue (I truncated the squeue output before for 
brevity). These miha jobs require the ace1 and ace2 machines, so they are 
pending since those machines are full right now.
For some reason SLURM decides that, because the miha jobs are pending, it 
cannot start my jobs (which don't require specific machines) and leaves mine 
pending as well.
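If I understand the scheduler correctly, this matches how the builtin scheduler behaves: with SchedulerType=sched/builtin, SLURM starts jobs strictly in priority order, so a pending job that cannot start (its required nodes are busy) holds up everything queued behind it, while sched/backfill lets lower-priority jobs start early if they don't delay the blocked job. This is only a guess at the configuration, since it isn't visible in the output I posted, but it's quick to check:

```shell
# Show which scheduler plugin is active. sched/builtin schedules
# strictly in priority order; sched/backfill allows lower-priority
# jobs to start ahead of a blocked higher-priority job.
scontrol show config | grep -i SchedulerType

# If it reports sched/builtin, setting this in slurm.conf and
# reconfiguring may be the fix:
#   SchedulerType=sched/backfill
# then:
#   scontrol reconfigure
```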

Once we cancelled the pending miha jobs (leaving the running ones running), I 
also cancelled my own pending jobs, resubmitted them, and it worked.
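The cancel-and-resubmit workaround can be scripted. A sketch, assuming we only want to cancel the *pending* jobs of one user (the batch script name at the end is a placeholder):

```shell
# List only the pending (PD) jobs of user miha, print bare job IDs
# (-h suppresses the header, "%i" formats just the job ID), and
# cancel them one by one. -r makes xargs do nothing on empty input.
squeue -u miha -t PD -h -o "%i" | xargs -r scancel

# Then resubmit our own jobs, e.g.:
#   sbatch testpock.sbatch
```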

This seems like quite a problematic limitation in SLURM to me.
Any opinions on this?

On Fri, Mar 24, 2017 at 10:43 AM, Stefan Doerr <[email protected]> wrote:
Hi, I was wondering about the following. I have this situation:

             84974 multiscal testpock   sdoerr PD       0:00      1 (Priority)
             84973 multiscal testpock   sdoerr PD       0:00      1 (Priority)
             81538 multiscal    RC_f7     miha  R   17:41:56      1 ace2
             81537 multiscal    RC_f6     miha  R   17:42:00      1 ace2
             81536 multiscal    RC_f5     miha  R   17:42:04      1 ace2
             81535 multiscal    RC_f4     miha  R   17:42:08      1 ace2
             81534 multiscal    RC_f3     miha  R   17:42:12      1 ace1
             81533 multiscal    RC_f2     miha  R   17:42:16      1 ace1
             81532 multiscal    RC_f1     miha  R   17:42:20      1 ace1
             81531 multiscal    RC_f0     miha  R   17:42:24      1 ace1


[sdoerr@xxx Fri10:35 slurmtest]  sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
multiscale*      up   infinite      1 drain* ace3
multiscale*      up   infinite      1  down* giallo
multiscale*      up   infinite      2    mix ace[1-2]
multiscale*      up   infinite      5   idle arancio,loro,oliva,rosa,suo


The miha jobs use node exclusion to run only on the machines with good GPUs 
(ace1, ace2).
As you can see, I have 5 idle machines which could serve my jobs, but my jobs 
are for some reason stuck pending due to "Priority". I am very sure that these 
5 nodes satisfy the hardware requirements for my jobs (they also ran on them 
yesterday).

It's just that, for some reason we have run into before, these node-excluding 
miha jobs seem to get the rest of the queue stuck on Priority. If we cancel 
them, mine go through to the idle machines. However, we cannot figure out the 
cause. I paste below the scontrol show job output for one miha job and one of 
mine.

Many thanks!


JobId=81534 JobName=RC_f3
   UserId=miha(3056) GroupId=lab(3000)
   Priority=33 Nice=0 Account=lab QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=17:44:07 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2017-03-23T16:53:14 EligibleTime=2017-03-23T16:53:14
   StartTime=2017-03-23T16:53:15 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=multiscale AllocNode:Sid=blu:26225
   ReqNodeList=(null) ExcNodeList=arancio,giallo,loro,oliva,pink,rosa,suo
   NodeList=ace1
   BatchHost=ace1
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=11500,node=1,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=11500M MinTmpDiskNode=0
   Features=(null) Gres=gpu:1 Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Power= SICP=0


JobId=84973 JobName=testpock2
   UserId=sdoerr(3041) GroupId=lab(3000)
   Priority=33 Nice=0 Account=lab QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2017-03-24T10:31:20 EligibleTime=2017-03-24T10:31:20
   StartTime=2019-03-24T10:38:10 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=multiscale AllocNode:Sid=loro:15424
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000,node=1,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=4000M MinTmpDiskNode=0
   Features=(null) Gres=gpu:1 Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Power= SICP=0


