Shorten your time specification for this job if possible. Ask your admins :)
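
For example, assuming your cluster runs the backfill scheduler (SchedulerType=sched/backfill in slurm.conf, which the admins would have to confirm), backfill can only slot your jobs onto the idle nodes if it knows how long they will run, and the jobs below all have TimeLimit=UNLIMITED. A finite limit in the batch script would look roughly like this, with the value itself only a placeholder:

    #SBATCH --time=04:00:00   # any realistic upper bound on the run time, as long as it is not unlimited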

2017-03-24 16:32 GMT+01:00 Stefan Doerr <[email protected]>:

> Hm, I don't know how, since only the admins have access to that stuff. I
> could ask them if you could be a bit more specific :)
>
> On Fri, Mar 24, 2017 at 1:39 PM, Jared David Baker <[email protected]>
> wrote:
>
>> Stefan,
>>
>>
>>
>> I believe I experienced this as well. Very similar situation at least
>> (and still looking into it). Can you provide your conf files?
>>
>>
>>
>> Best, Jared
>>
>>
>>
>> From: Roe Zohar <[email protected]>
>> Reply-To: slurm-dev <[email protected]>
>> Date: Friday, March 24, 2017 at 6:17 AM
>> To: slurm-dev <[email protected]>
>> Subject: [slurm-dev] Re: Priority blocking jobs despite idle machines
>>
>>
>>
>> I had the same problem. Didn't find a solution.
>>
>>
>>
>> On Mar 24, 2017 12:59 PM, "Stefan Doerr" <[email protected]> wrote:
>>
>> OK, so after some investigation it seems the problem occurs when miha's
>> jobs are pending in the queue (I had truncated the squeue output earlier
>> for brevity).
>>
>> These miha jobs require the ace1 and ace2 machines, so they are pending
>> because those machines are full right now.
>>
>> For some reason SLURM decides that, because the miha jobs are pending, it
>> cannot start my jobs (which don't require specific machines), and it keeps
>> mine pending as well.
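>>
>> (As a side note, the start-time estimates SLURM computed for the pending
>> jobs can be listed with something along the lines of
>>
>>     squeue --start -u sdoerr
>>
>> which should show when the scheduler currently expects each pending job to
>> run; compare the StartTime field in the scontrol output further below.)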
>>
>>
>>
>> Once we cancelled the pending miha jobs (leaving the running ones
>> running), I also cancelled my pending jobs, resubmitted them, and they ran.
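>>
>> (For anyone trying to reproduce the workaround, it amounted to roughly the
>> following, with users and script names adjusted of course:
>>
>>     scancel --state=PENDING --user=miha     # cancel only the pending miha jobs
>>     scancel --state=PENDING --user=sdoerr   # cancel my own pending jobs
>>     sbatch testpock.sh                      # resubmit
>>
>> where testpock.sh is just a stand-in for the actual submission script.)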
>>
>>
>>
>> This seems like quite a problematic limitation in SLURM to me.
>>
>> Any opinions on this?
>>
>>
>>
>> On Fri, Mar 24, 2017 at 10:43 AM, Stefan Doerr <[email protected]>
>> wrote:
>>
>> Hi, I was wondering about the following. I have this situation:
>>
>>
>>
>>              84974 multiscal testpock   sdoerr PD       0:00      1 (Priority)
>>
>>              84973 multiscal testpock   sdoerr PD       0:00      1 (Priority)
>>
>>              81538 multiscal    RC_f7     miha  R   17:41:56      1 ace2
>>
>>              81537 multiscal    RC_f6     miha  R   17:42:00      1 ace2
>>
>>              81536 multiscal    RC_f5     miha  R   17:42:04      1 ace2
>>
>>              81535 multiscal    RC_f4     miha  R   17:42:08      1 ace2
>>
>>              81534 multiscal    RC_f3     miha  R   17:42:12      1 ace1
>>
>>              81533 multiscal    RC_f2     miha  R   17:42:16      1 ace1
>>
>>              81532 multiscal    RC_f1     miha  R   17:42:20      1 ace1
>>
>>              81531 multiscal    RC_f0     miha  R   17:42:24      1 ace1
>>
>>
>>
>>
>>
>> [sdoerr@xxx Fri10:35 slurmtest]  sinfo
>>
>> PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>
>> multiscale*      up   infinite      1 drain* ace3
>>
>> multiscale*      up   infinite      1  down* giallo
>>
>> multiscale*      up   infinite      2    mix ace[1-2]
>>
>> multiscale*      up   infinite      5   idle arancio,loro,oliva,rosa,suo
>>
>>
>>
>>
>>
>> The miha jobs use node exclusion to run only on the machines with good
>> GPUs (ace1, ace2).
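>>
>> (For context, that exclusion presumably comes from something like sbatch's
>> --exclude flag, matching the ExcNodeList in the scontrol output below:
>>
>>     sbatch --exclude=arancio,giallo,loro,oliva,pink,rosa,suo --gres=gpu:1 run_rc.sh
>>
>> with run_rc.sh standing in for whatever miha's actual batch script is.)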
>>
>> As you can see, I have 5 idle machines which could serve my jobs, but my
>> jobs are for some reason stuck pending due to "Priority". I am quite sure
>> that these 5 nodes satisfy the hardware requirements of my jobs (they also
>> ran on them yesterday).
>>
>>
>>
>> For some reason, which we have run into before, these node-excluding miha
>> jobs seem to get the rest of the queue stuck on "Priority". If we cancel
>> them, mine go through to the idle machines. However, we cannot figure out
>> what causes this. Below I paste the scontrol show job output for one miha
>> job and one of mine.
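>>
>> (One thing we could still check, in case the cause lies in the scheduler
>> configuration rather than in the jobs themselves, is which scheduler the
>> cluster uses, e.g.
>>
>>     scontrol show config | grep -i SchedulerType
>>
>> since sched/builtin starts jobs strictly in priority order, whereas
>> sched/backfill can start lower-priority jobs on idle nodes, provided the
>> jobs have finite time limits.)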
>>
>>
>>
>> Many thanks!
>>
>>
>>
>>
>>
>> JobId=81534 JobName=RC_f3
>>
>>    UserId=miha(3056) GroupId=lab(3000)
>>
>>    Priority=33 Nice=0 Account=lab QOS=normal
>>
>>    JobState=RUNNING Reason=None Dependency=(null)
>>
>>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>>
>>    RunTime=17:44:07 TimeLimit=UNLIMITED TimeMin=N/A
>>
>>    SubmitTime=2017-03-23T16:53:14 EligibleTime=2017-03-23T16:53:14
>>
>>    StartTime=2017-03-23T16:53:15 EndTime=Unknown
>>
>>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>
>>    Partition=multiscale AllocNode:Sid=blu:26225
>>
>>    ReqNodeList=(null) ExcNodeList=arancio,giallo,loro,oliva,pink,rosa,suo
>>
>>    NodeList=ace1
>>
>>    BatchHost=ace1
>>
>>    NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>
>>    TRES=cpu=1,mem=11500,node=1,gres/gpu=1
>>
>>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>
>>    MinCPUsNode=1 MinMemoryNode=11500M MinTmpDiskNode=0
>>
>>    Features=(null) Gres=gpu:1 Reservation=(null)
>>
>>    Shared=OK Contiguous=0 Licenses=(null) Network=(null)
>>
>>    Power= SICP=0
>>
>>
>>
>>
>>
>> JobId=84973 JobName=testpock2
>>
>>    UserId=sdoerr(3041) GroupId=lab(3000)
>>
>>    Priority=33 Nice=0 Account=lab QOS=normal
>>
>>    JobState=PENDING Reason=Priority Dependency=(null)
>>
>>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>>
>>    RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>>
>>    SubmitTime=2017-03-24T10:31:20 EligibleTime=2017-03-24T10:31:20
>>
>>    StartTime=2019-03-24T10:38:10 EndTime=Unknown
>>
>>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>
>>    Partition=multiscale AllocNode:Sid=loro:15424
>>
>>    ReqNodeList=(null) ExcNodeList=(null)
>>
>>    NodeList=(null)
>>
>>    NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>
>>    TRES=cpu=1,mem=4000,node=1,gres/gpu=1
>>
>>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>
>>    MinCPUsNode=1 MinMemoryNode=4000M MinTmpDiskNode=0
>>
>>    Features=(null) Gres=gpu:1 Reservation=(null)
>>
>>    Shared=OK Contiguous=0 Licenses=(null) Network=(null)
>>
>>    Power= SICP=0
>>
>>
>>
>>
>>
>>
>>
>>
>
