Uwe,

I didn't see anything outstanding in the logs that would suggest an error.

Looking at the "production" account, are there any other active jobs
or allocations (by users in that account) in place using one node
which would technically make the free nodes stand at 17 versus 18?
What happens if you update the account so that there is no limit on
GrpNodes?
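
Off the top of my head, something like this should show any other
consumers in the account and, if need be, clear the limit (untested,
so double-check the syntax; setting a limit to -1 removes it):

    # any other running/pending jobs charging the "production" account?
    squeue -A production -t RUNNING,PENDING

    # which limits are actually in place on the account and its users?
    sacctmgr show assoc where account=production \
        format=Account,User,GrpNodes,MaxNodes,MaxWall

    # lift the GrpNodes limit entirely
    sacctmgr modify account production set GrpNodes=-1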

Have you checked all of the free nodes that should be able to host
the job for issues?  Do you have any reservations in the system that
are holding resources back?  How does the partition configuration
look?  Are you specifying MaxNodes?
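
A few quick checks along those lines:

    # nodes that are down or drained, and the reason recorded for each
    sinfo -R

    # reservations that could be holding nodes back
    scontrol show reservation

    # partition limits -- check MaxNodes, AllowAccounts, etc.
    scontrol show partition MyPartition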

If I were in your shoes, I'd test the account by using -w and
--exclusive with 'salloc' on each of the nodes listed in the debug
output below.  I'd run 18 of those allocations concurrently to
exercise the GrpNodes limit.
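
A rough sketch, taking the first 18 node names from your backfill
output (untested; assumes bash, and that a local 'sleep' is enough
to hold each allocation open):

    # expand the Slurm hostlist into individual node names
    for node in $(scontrol show hostnames \
        'n[502901,503001,503101,503201,503301,510301,510401,510501,510601,510701,510901,511001,511101,511201,511301,511601,511701,511801]')
    do
        # hold a one-node exclusive allocation on each node in parallel
        salloc -A production -w $node --exclusive -N1 sleep 300 &
    done
    wait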

This is all that I can think of right now.  I'll have another espresso
soon enough and will reply if anything else comes to mind.  I hope
this helps!

John DeSantis



2015-03-12 4:59 GMT-04:00 Uwe Sauter <[email protected]>:
>
> No one able to give a hint?
>
> On 10.03.2015 at 17:05, Uwe Sauter wrote:
>>
>> Hi,
>>
>> I have an account "production" configured with the limitations GrpNodes=18,
>> MaxNodes=18, MaxWall=7-00:00:00, an associated user with the
>> limitations MaxNodes=18, MaxWall=7-00:00:00, and a QoS with the limitations
>> Priority=10, GraceTime=00:00:00, PreemptMode=cluster,
>> Flags=DenyOnLimit, UsageFactor=1.0, MinCPUs=1.
>>
>> This user submitted a job that is within those limitations:
>>
>> JobId=14115 JobName=XXX
>>    UserId=XXX(XXX) GroupId=XXX(XXX)
>>    Priority=1214 Nice=0 Account=production QOS=normal
>>    JobState=PENDING Reason=Resources Dependency=(null)
>>    Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>>    RunTime=00:00:00 TimeLimit=4-00:00:00 TimeMin=N/A
>>    SubmitTime=2015-03-10T13:08:56 EligibleTime=2015-03-10T13:08:56
>>    StartTime=Unknown EndTime=Unknown
>>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>    Partition=MyPartition AllocNode:Sid=frontend:15414
>>    ReqNodeList=(null) ExcNodeList=(null)
>>    NodeList=(null)
>>    NumNodes=18-18 NumCPUs=180 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>>    Features=(null) Gres=(null) Reservation=(null)
>>    Shared=0 Contiguous=0 Licenses=(null) Network=(null)
>>    Command=(null)
>>    WorkDir=/XXX
>>    StdErr=/XXX/slurm-14115.out
>>    StdIn=/dev/null
>>    StdOut=/XXX/slurm-14115.out
>>    Switches=1@1-00:00:00
>>
>> Submitting the same job with a lower node count (e.g. 17) starts
>> immediately on that account. There is a second job with
>> lower priority for that account in the queue, and there are enough free
>> nodes in the cluster that an 18-node job could run.
>>
>> How can I debug what's going on and get this job running? Turning on 
>> scheduler debugging only shows:
>>
>> scheduler log:
>> [2015-03-10T13:33:11.834] sched: JobId=14115. State=PENDING. 
>> Reason=Resources. Priority=1000. Partition=MyPartition.
>> [2015-03-10T13:33:11.834] sched: JobId=14116. State=PENDING. 
>> Reason=Priority(Priority), Priority=1000, Partition=MyPartition.
>>
>>
>> slurmctld log:
>> [2015-03-10T15:01:42.259] Set DebugFlags to 
>> Backfill,BackfillMap,SelectType,TraceJobs
>> [2015-03-10T15:02:32.508] 
>> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>> [2015-03-10T15:03:02.178] backfill: beginning
>> [2015-03-10T15:03:02.178] =========================================
>> [2015-03-10T15:03:02.178] Begin:2015-03-10T15:03:02 End:2015-03-11T15:03:02
>> Nodes:n[502901,503001,503101,503201,503301,510301,510401,510501,510601,510701,510901,511001,511101,511201,511301,511601,511701,511801,511901,512001,512201,512301,512401,512501,512601,512901,513001,513101,513201,513301,513501,513601,513701,513801,513901,520301,520401,520601,520701,520901,521001,521101,521201,521301,521601,521701,521801,521901,522001,522201,522301,522401,522501,522601,522901,523001,523101,523201,523301,523501,523601,523701,523801,523901]
>> [2015-03-10T15:03:02.178] =========================================
>> [2015-03-10T15:03:02.178] backfill test for JobID=14115 Prio=1000 
>> Partition=MyPartition
>> [2015-03-10T15:03:02.178] Test job 14115 at 2015-03-10T15:03:02 on
>> n[502901,503001,503101,503201,503301,510301,510401,510501,510601,510701,510901,511001,511101,511201,511301,511601,511701,511801,511901,512001,512201,512301,512401,512501,512601,512901,513001,513101,513201,513301,513501,513601,513701,513801,513901,520301,520401,520601,520701,520901,521001,521101,521201,521301,521601,521701,521801,521901,522001,522201,522301,522401,522501,522601,522901,523001,523101,523201,523301,523501,523601,523701,523801,523901]
>> [2015-03-10T15:03:02.178] backfill test for JobID=14116 Prio=1000 
>> Partition=MyPartition
>> [2015-03-10T15:03:02.178] Test job 14116 at 2015-03-10T15:03:02 on
>> n[502901,503001,503101,503201,503301,510301,510401,510501,510601,510701,510901,511001,511101,511201,511301,511601,511701,511801,511901,512001,512201,512301,512401,512501,512601,512901,513001,513101,513201,513301,513501,513601,513701,513801,513901,520301,520401,520601,520701,520901,521001,521101,521201,521301,521601,521701,521801,521901,522001,522201,522301,522401,522501,522601,522901,523001,523101,523201,523301,523501,523601,523701,523801,523901]
>> [2015-03-10T15:03:02.178] backfill: reached end of job queue
>> [2015-03-10T15:03:02.178] backfill: completed testing 2(2) jobs, usec=775
>> [2015-03-10T15:04:24.991] Set DebugFlags to none
>>
>>
>>
>>
>> Thanks,
>>
>>       Uwe
>>
