Uwe,

I had thought of asking you about the switches flag, but from what I
could see in the debug output, it looked like there were more than
enough nodes to satisfy the request.
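
As an aside, the --switches request can carry its own fallback: the
option takes an optional maximum wait time, after which the scheduler
gives up on the switch count and starts the job anyway (the job record
below indeed shows Switches=1@1-00:00:00, i.e. a one-day cap).  A
hypothetical example with a made-up script name:

    sbatch -N 18 --switches=1@12:00:00 job.sh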

I'm also a little embarrassed that I mentioned the partition
definition of MaxNodes, alluding to 18 being too many - the job would
have been rejected with a message indicating that the partition limits
had been breached.  But hey, that's what happens when you answer
emails ~20 minutes after waking up!
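
(For what it's worth, the partition limits are easy to confirm; the
partition name here is taken from the job output quoted below:

    scontrol show partition MyPartition

which prints MinNodes/MaxNodes among other fields.)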

John DeSantis

2015-03-13 7:55 GMT-04:00 Uwe Sauter <[email protected]>:
>
> Hi,
>
> thanks for looking into this.
>
> After looking into this further, it occurred to me that the jobs were
> submitted with --switches=1. Since we have a maximum of 18 nodes per
> switch, it is quite likely that no single switch had enough idle
> nodes, even though there were enough idle nodes in total.
>
> So I think it had nothing to do with the GrpNodes or MaxNodes limits
> on that account; the switches=1 restriction simply couldn't be
> satisfied.
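>
> A quick way to double-check this (assuming the topology/tree plugin is
> configured, which --switches relies on) would be to compare the switch
> layout against the currently idle nodes, e.g.:
>
>     scontrol show topology
>     sinfo -t idle -o "%N"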
>
> It would be nice if the job reason were a bit more detailed than it
> is right now, at least for the admin who has to answer to the users.
>
> Regards,
>
>         Uwe
>
>
> On 13.03.2015 at 12:28, John Desantis wrote:
>>
>> Uwe,
>>
>> I didn't see anything out of the ordinary in the logs that would
>> suggest an error.
>>
>> Looking at the "production" account, are there any other active jobs
>> or allocations (by users in that account) occupying a node, which
>> would effectively leave 17 free nodes rather than 18?  What happens
>> if you update the account so that there is no limit on GrpNodes?
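>>
>> For instance, sacctmgr clears a limit when it is set to -1 (account
>> name taken from your earlier mail):
>>
>>     sacctmgr modify account where name=production set GrpNodes=-1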
>>
>> Have you checked all of the free nodes that should be able to host
>> the job for problems?  Do you have any reservations in the system
>> that are holding resources back?  How does the partition
>> configuration look?  Are you specifying MaxNodes?
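>>
>> Both are quick to inspect, e.g.:
>>
>>     scontrol show reservation
>>     scontrol show partition MyPartition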
>>
>> If I were in your shoes, I'd test the account by using -w and
>> --exclusive with 'salloc' on each of the nodes listed in the debug
>> output below.  I'd run it 18 times to test the GrpNodes limit.
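>>
>> Roughly along these lines (a sketch only; the node names are the
>> first three from the debug output, and 'sleep' just holds each
>> allocation open long enough to count them):
>>
>>     for node in n502901 n503001 n503101; do   # ...through all 18 nodes
>>         salloc -A production -N 1 -w $node --exclusive sleep 300 &
>>     done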
>>
>> This is all that I can think of right now.  I'll have another espresso
>> soon enough and will reply if anything else comes to mind.  I hope
>> this helps!
>>
>> John DeSantis
>>
>>
>>
>> 2015-03-12 4:59 GMT-04:00 Uwe Sauter <[email protected]>:
>>>
>>> No one able to give a hint?
>>>
>>> On 10.03.2015 at 17:05, Uwe Sauter wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have an account "production" configured with the limits GrpNodes=18,
>>>> MaxNodes=18, MaxWall=7-00:00:00, an associated user with the limits
>>>> MaxNodes=18, MaxWall=7-00:00:00, and a QoS with Priority=10,
>>>> GraceTime=00:00:00, PreemptMode=cluster, Flags=DenyOnLimit,
>>>> UsageFactor=1.0, MinCPUs=1.
>>>>
>>>> This user submitted a job that is within those limitations:
>>>>
>>>> JobId=14115 JobName=XXX
>>>>    UserId=XXX(XXX) GroupId=XXX(XXX)
>>>>    Priority=1214 Nice=0 Account=production QOS=normal
>>>>    JobState=PENDING Reason=Resources Dependency=(null)
>>>>    Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>>>>    RunTime=00:00:00 TimeLimit=4-00:00:00 TimeMin=N/A
>>>>    SubmitTime=2015-03-10T13:08:56 EligibleTime=2015-03-10T13:08:56
>>>>    StartTime=Unknown EndTime=Unknown
>>>>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>>>    Partition=MyPartition AllocNode:Sid=frontend:15414
>>>>    ReqNodeList=(null) ExcNodeList=(null)
>>>>    NodeList=(null)
>>>>    NumNodes=18-18 NumCPUs=180 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>>>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>>>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>>>>    Features=(null) Gres=(null) Reservation=(null)
>>>>    Shared=0 Contiguous=0 Licenses=(null) Network=(null)
>>>>    Command=(null)
>>>>    WorkDir=/XXX
>>>>    StdErr=/XXX/slurm-14115.out
>>>>    StdIn=/dev/null
>>>>    StdOut=/XXX/slurm-14115.out
>>>>    Switches=1@1-00:00:00
>>>>
>>>> Submitting the same job with a lower node count (e.g. 17) starts it
>>>> immediately under that account. There is a second job with lower
>>>> priority for that account in the queue, and there are enough free
>>>> nodes in the cluster for an 18-node job to run.
>>>>
>>>> How can I debug what's going on and get this job running? Turning on 
>>>> scheduler debugging only shows:
>>>>
>>>> scheduler log:
>>>> [2015-03-10T13:33:11.834] sched: JobId=14115. State=PENDING. Reason=Resources. Priority=1000. Partition=MyPartition.
>>>> [2015-03-10T13:33:11.834] sched: JobId=14116. State=PENDING. Reason=Priority(Priority), Priority=1000, Partition=MyPartition.
>>>>
>>>>
>>>> slurmctld log:
>>>> [2015-03-10T15:01:42.259] Set DebugFlags to Backfill,BackfillMap,SelectType,TraceJobs
>>>> [2015-03-10T15:02:32.508] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>>> [2015-03-10T15:03:02.178] backfill: beginning
>>>> [2015-03-10T15:03:02.178] =========================================
>>>> [2015-03-10T15:03:02.178] Begin:2015-03-10T15:03:02 End:2015-03-11T15:03:02
>>>> Nodes:n[502901,503001,503101,503201,503301,510301,510401,510501,510601,510701,510901,511001,511101,511201,511301,511601,511701,511801,511901,512001,512201,512301,512401,512501,512601,512901,513001,513101,513201,513301,513501,513601,513701,513801,513901,520301,520401,520601,520701,520901,521001,521101,521201,521301,521601,521701,521801,521901,522001,522201,522301,522401,522501,522601,522901,523001,523101,523201,523301,523501,523601,523701,523801,523901]
>>>> [2015-03-10T15:03:02.178] =========================================
>>>> [2015-03-10T15:03:02.178] backfill test for JobID=14115 Prio=1000 Partition=MyPartition
>>>> [2015-03-10T15:03:02.178] Test job 14115 at 2015-03-10T15:03:02 on
>>>> n[502901,503001,503101,503201,503301,510301,510401,510501,510601,510701,510901,511001,511101,511201,511301,511601,511701,511801,511901,512001,512201,512301,512401,512501,512601,512901,513001,513101,513201,513301,513501,513601,513701,513801,513901,520301,520401,520601,520701,520901,521001,521101,521201,521301,521601,521701,521801,521901,522001,522201,522301,522401,522501,522601,522901,523001,523101,523201,523301,523501,523601,523701,523801,523901]
>>>> [2015-03-10T15:03:02.178] backfill test for JobID=14116 Prio=1000 Partition=MyPartition
>>>> [2015-03-10T15:03:02.178] Test job 14116 at 2015-03-10T15:03:02 on
>>>> n[502901,503001,503101,503201,503301,510301,510401,510501,510601,510701,510901,511001,511101,511201,511301,511601,511701,511801,511901,512001,512201,512301,512401,512501,512601,512901,513001,513101,513201,513301,513501,513601,513701,513801,513901,520301,520401,520601,520701,520901,521001,521101,521201,521301,521601,521701,521801,521901,522001,522201,522301,522401,522501,522601,522901,523001,523101,523201,523301,523501,523601,523701,523801,523901]
>>>> [2015-03-10T15:03:02.178] backfill: reached end of job queue
>>>> [2015-03-10T15:03:02.178] backfill: completed testing 2(2) jobs, usec=775
>>>> [2015-03-10T15:04:24.991] Set DebugFlags to none
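>>>>
>>>> (Side note: these DebugFlags can be toggled at run time, e.g.
>>>>
>>>>     scontrol setdebugflags +Backfill
>>>>
>>>> and removed again with "scontrol setdebugflags -Backfill".)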
>>>>
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>>       Uwe