Shorten your time specification for this job if possible. Ask your admins :)
2017-03-24 16:32 GMT+01:00 Stefan Doerr &lt;[email protected]&gt;:

> Hm, I don't know how, since only the admins have access to that stuff. I
> could ask them, if you could be a bit more specific :)
>
> On Fri, Mar 24, 2017 at 1:39 PM, Jared David Baker &lt;[email protected]&gt; wrote:
>
>> Stefan,
>>
>> I believe I experienced this as well. Very similar situation at least
>> (and still looking into it). Can you provide your conf files?
>>
>> Best, Jared
>>
>> From: Roe Zohar &lt;[email protected]&gt;
>> Reply-To: slurm-dev &lt;[email protected]&gt;
>> Date: Friday, March 24, 2017 at 6:17 AM
>> To: slurm-dev &lt;[email protected]&gt;
>> Subject: [slurm-dev] Re: Priority blocking jobs despite idle machines
>>
>> I had the same problem. Didn't find a solution.
>>
>> On Mar 24, 2017 12:59 PM, "Stefan Doerr" &lt;[email protected]&gt; wrote:
>>
>> OK, so after some investigation it seems that the problem occurs when
>> there are miha jobs pending in the queue (I truncated the squeue output
>> before for brevity).
>>
>> These miha jobs require the ace1 and ace2 machines, so they are pending
>> while those machines are full right now.
>>
>> For some reason SLURM thinks that, because the miha jobs are pending,
>> it cannot work on my jobs (which don't require specific machines) and
>> puts mine pending as well.
>>
>> Once we cancelled the pending miha jobs (leaving the running ones
>> running), I also cancelled my pending jobs, resubmitted them, and it
>> worked.
>>
>> This seems to me like a quite problematic limitation in SLURM.
>> Any opinions on this?
>>
>> On Fri, Mar 24, 2017 at 10:43 AM, Stefan Doerr &lt;[email protected]&gt; wrote:
>>
>> Hi, I was wondering about the following.
>> I have this situation:
>>
>>   84974 multiscal testpock   sdoerr PD       0:00      1 (Priority)
>>   84973 multiscal testpock   sdoerr PD       0:00      1 (Priority)
>>   81538 multiscal    RC_f7     miha  R   17:41:56      1 ace2
>>   81537 multiscal    RC_f6     miha  R   17:42:00      1 ace2
>>   81536 multiscal    RC_f5     miha  R   17:42:04      1 ace2
>>   81535 multiscal    RC_f4     miha  R   17:42:08      1 ace2
>>   81534 multiscal    RC_f3     miha  R   17:42:12      1 ace1
>>   81533 multiscal    RC_f2     miha  R   17:42:16      1 ace1
>>   81532 multiscal    RC_f1     miha  R   17:42:20      1 ace1
>>   81531 multiscal    RC_f0     miha  R   17:42:24      1 ace1
>>
>> [sdoerr@xxx Fri10:35 slurmtest] sinfo
>> PARTITION   AVAIL  TIMELIMIT  NODES  STATE   NODELIST
>> multiscale*    up   infinite      1  drain*  ace3
>> multiscale*    up   infinite      1  down*   giallo
>> multiscale*    up   infinite      2  mix     ace[1-2]
>> multiscale*    up   infinite      5  idle    arancio,loro,oliva,rosa,suo
>>
>> The miha jobs use node exclusion to run only on the machines with good
>> GPUs (ace1, ace2).
>>
>> As you can see, I have 5 idle machines which could serve my jobs, but
>> my jobs are for some reason stuck pending due to "Priority". I am
>> indeed very sure that these 5 nodes satisfy the hardware requirements
>> for my jobs (I also ran them there yesterday).
>>
>> It's just that, for some reason we have seen before, these
>> node-excluding miha jobs seem to get the rest stuck on priority. If we
>> cancel them, then mine will go through to the idle machines. However,
>> we cannot figure out the cause. I paste below the scontrol show job
>> output for one miha job and one of my jobs.
>>
>> Many thanks!
>> JobId=81534 JobName=RC_f3
>>    UserId=miha(3056) GroupId=lab(3000)
>>    Priority=33 Nice=0 Account=lab QOS=normal
>>    JobState=RUNNING Reason=None Dependency=(null)
>>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>>    RunTime=17:44:07 TimeLimit=UNLIMITED TimeMin=N/A
>>    SubmitTime=2017-03-23T16:53:14 EligibleTime=2017-03-23T16:53:14
>>    StartTime=2017-03-23T16:53:15 EndTime=Unknown
>>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>    Partition=multiscale AllocNode:Sid=blu:26225
>>    ReqNodeList=(null) ExcNodeList=arancio,giallo,loro,oliva,pink,rosa,suo
>>    NodeList=ace1
>>    BatchHost=ace1
>>    NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>    TRES=cpu=1,mem=11500,node=1,gres/gpu=1
>>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>    MinCPUsNode=1 MinMemoryNode=11500M MinTmpDiskNode=0
>>    Features=(null) Gres=gpu:1 Reservation=(null)
>>    Shared=OK Contiguous=0 Licenses=(null) Network=(null)
>>    Power= SICP=0
>>
>> JobId=84973 JobName=testpock2
>>    UserId=sdoerr(3041) GroupId=lab(3000)
>>    Priority=33 Nice=0 Account=lab QOS=normal
>>    JobState=PENDING Reason=Priority Dependency=(null)
>>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>>    RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>>    SubmitTime=2017-03-24T10:31:20 EligibleTime=2017-03-24T10:31:20
>>    StartTime=2019-03-24T10:38:10 EndTime=Unknown
>>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>    Partition=multiscale AllocNode:Sid=loro:15424
>>    ReqNodeList=(null) ExcNodeList=(null)
>>    NodeList=(null)
>>    NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>    TRES=cpu=1,mem=4000,node=1,gres/gpu=1
>>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>    MinCPUsNode=1 MinMemoryNode=4000M MinTmpDiskNode=0
>>    Features=(null) Gres=gpu:1 Reservation=(null)
>>    Shared=OK Contiguous=0 Licenses=(null) Network=(null)
>>    Power= SICP=0
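For anyone hitting this pattern (node-constrained pending jobs blocking lower-priority jobs even though other nodes sit idle): the usual workaround is SLURM's backfill scheduler, which can start a lower-priority job early only if it can prove the job will finish before the blocked higher-priority job's projected start. Note that both jobs above run with TimeLimit=UNLIMITED, which makes that proof impossible and is consistent with the "shorten your time specification" advice at the top of the thread. A sketch of the relevant slurm.conf settings; the parameter values here are illustrative assumptions, not taken from this cluster:

```
# slurm.conf -- illustrative backfill configuration (values are assumptions)
SchedulerType=sched/backfill
# bf_window: how far into the future (minutes) backfill plans;
# bf_interval: how often (seconds) the backfill pass runs
SchedulerParameters=bf_window=2880,bf_interval=30
```

On the submission side, giving jobs a finite wall-clock limit (e.g. `sbatch --time=04:00:00 job.sh`, or a DefaultTime/MaxTime on the partition) is what lets backfill slot them around higher-priority pending jobs.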
