Uwe,

I had thought of asking you about the switches flag, but from what I could see in the debug output it looked like there were more than enough nodes to satisfy the request.
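One thing worth noting on the switches front: the Switches=1@1-00:00:00 line in the job record below means the allocation is constrained to a single switch and the scheduler is allowed to hold the job for up to a day waiting for one to free up. If the users can tolerate spanning switches, shortening that wait at submit time should let the job fall back onto whatever idle nodes exist once the timer expires; roughly (the two-hour cap and the script name are just an illustration):

    sbatch -N 18 --switches=1@02:00:00 job.sh

After those two hours the single-switch placement becomes a preference rather than a hard requirement.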
I'm also a little embarrassed that I mentioned the partition definition of MaxNodes, alluding to 18 being too many - the job would have been rejected with a message indicating that the partition limits had been breached. But hey, that's what happens when you answer emails ~20 minutes after waking up!

John DeSantis

2015-03-13 7:55 GMT-04:00 Uwe Sauter <[email protected]>:
>
> Hi,
>
> thanks for looking into this.
>
> After looking further into this it came to mind that the jobs were submitted
> with --switches=1. As we have a max of 18 nodes per
> switch the probability is high that there just wasn't a switch whose nodes
> were idle although enough nodes in total were.
>
> So I think it had nothing to do with the GrpNodes or MaxNodes limitation for
> that account but simply that the restriction of
> switches=1 wasn't satisfied.
>
> It would be nice if the job reason were a bit more detailed than it is
> right now, at least for the admin who has to answer to
> the users.
>
> Regards,
>
> Uwe
>
>
> On 13.03.2015 at 12:28, John Desantis wrote:
>>
>> Uwe,
>>
>> I didn't see anything outstanding in the logs which would suggest an error.
>>
>> Looking at the "production" account, are there any other active jobs
>> or allocations (by users in that account) in place using one node
>> which would technically make the free nodes stand at 17 versus 18?
>> What happens if you update the account so that there is no limit on
>> GrpNodes?
>>
>> Have you checked all of the free nodes which should be able to host the
>> job for issues? Do you have any reservations in the system which are
>> preserving resources? How does the partition configuration look? Are
>> you specifying MaxNodes?
>>
>> If I were in your shoes, I'd test the account by using -w and
>> --exclusive with 'salloc' on each of the nodes listed in the debug
>> output below. I'd run it 18 times to test the GrpNodes limit.
>>
>> This is all that I can think of right now. I'll have another espresso
>> soon enough and will reply if anything else comes to mind. I hope
>> this helps!
>>
>> John DeSantis
>>
>>
>> 2015-03-12 4:59 GMT-04:00 Uwe Sauter <[email protected]>:
>>>
>>> No one able to give a hint?
>>>
>>> On 10.03.2015 at 17:05, Uwe Sauter wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have an account "production" configured with limitations GrpNodes=18,
>>>> MaxNodes=18, MaxWall=7-00:00:00, an associated user with
>>>> limitations MaxNodes=18, MaxWall=7-00:00:00 and a QoS with limitations
>>>> Priority=10, GraceTime=00:00:00, PreemptMode=cluster,
>>>> Flags=DenyOnLimit, UsageFactor=1.0, MinCPUs=1.
>>>>
>>>> This user submitted a job that is within those limitations:
>>>>
>>>> JobId=14115 JobName=XXX
>>>> UserId=XXX(XXX) GroupId=XXX(XXX)
>>>> Priority=1214 Nice=0 Account=production QOS=normal
>>>> JobState=PENDING Reason=Resources Dependency=(null)
>>>> Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>>>> RunTime=00:00:00 TimeLimit=4-00:00:00 TimeMin=N/A
>>>> SubmitTime=2015-03-10T13:08:56 EligibleTime=2015-03-10T13:08:56
>>>> StartTime=Unknown EndTime=Unknown
>>>> PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>>> Partition=MyPartition AllocNode:Sid=frontend:15414
>>>> ReqNodeList=(null) ExcNodeList=(null)
>>>> NodeList=(null)
>>>> NumNodes=18-18 NumCPUs=180 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>>> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>>> MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>>>> Features=(null) Gres=(null) Reservation=(null)
>>>> Shared=0 Contiguous=0 Licenses=(null) Network=(null)
>>>> Command=(null)
>>>> WorkDir=/XXX
>>>> StdErr=/XXX/slurm-14115.out
>>>> StdIn=/dev/null
>>>> StdOut=/XXX/slurm-14115.out
>>>> Switches=1@1-00:00:00
>>>>
>>>> Submitting the same job with a lower node count (e.g. 17) immediately
>>>> starts the job on that account. There is a second job with
>>>> lower priority for that account in the queue and enough free nodes in the
>>>> cluster that an 18-node job is able to run.
>>>>
>>>> How can I debug what's going on and get this job running? Turning on
>>>> scheduler debugging only shows:
>>>>
>>>> scheduler log:
>>>> [2015-03-10T13:33:11.834] sched: JobId=14115. State=PENDING.
>>>> Reason=Resources. Priority=1000. Partition=MyPartition.
>>>> [2015-03-10T13:33:11.834] sched: JobId=14116. State=PENDING.
>>>> Reason=Priority(Priority), Priority=1000, Partition=MyPartition.
>>>>
>>>>
>>>> slurmctld log:
>>>> [2015-03-10T15:01:42.259] Set DebugFlags to
>>>> Backfill,BackfillMap,SelectType,TraceJobs
>>>> [2015-03-10T15:02:32.508]
>>>> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>>> [2015-03-10T15:03:02.178] backfill: beginning
>>>> [2015-03-10T15:03:02.178] =========================================
>>>> [2015-03-10T15:03:02.178] Begin:2015-03-10T15:03:02 End:2015-03-11T15:03:02
>>>> Nodes:n[502901,503001,503101,503201,503301,510301,510401,510501,510601,510701,510901,511001,511101,511201,511301,511601,511701,511801,511901,512001,512201,512301,512401,512501,512601,512901,513001,513101,513201,513301,513501,513601,513701,513801,513901,520301,520401,520601,520701,520901,521001,521101,521201,521301,521601,521701,521801,521901,522001,522201,522301,522401,522501,522601,522901,523001,523101,523201,523301,523501,523601,523701,523801,523901]
>>>> [2015-03-10T15:03:02.178] =========================================
>>>> [2015-03-10T15:03:02.178] backfill test for JobID=14115 Prio=1000
>>>> Partition=MyPartition
>>>> [2015-03-10T15:03:02.178] Test job 14115 at 2015-03-10T15:03:02 on
>>>> n[502901,503001,503101,503201,503301,510301,510401,510501,510601,510701,510901,511001,511101,511201,511301,511601,511701,511801,511901,512001,512201,512301,512401,512501,512601,512901,513001,513101,513201,513301,513501,513601,513701,513801,513901,520301,520401,520601,520701,520901,521001,521101,521201,521301,521601,521701,521801,521901,522001,522201,522301,522401,522501,522601,522901,523001,523101,523201,523301,523501,523601,523701,523801,523901]
>>>> [2015-03-10T15:03:02.178] backfill test for JobID=14116 Prio=1000
>>>> Partition=MyPartition
>>>> [2015-03-10T15:03:02.178] Test job 14116 at 2015-03-10T15:03:02 on
>>>> n[502901,503001,503101,503201,503301,510301,510401,510501,510601,510701,510901,511001,511101,511201,511301,511601,511701,511801,511901,512001,512201,512301,512401,512501,512601,512901,513001,513101,513201,513301,513501,513601,513701,513801,513901,520301,520401,520601,520701,520901,521001,521101,521201,521301,521601,521701,521801,521901,522001,522201,522301,522401,522501,522601,522901,523001,523101,523201,523301,523501,523601,523701,523801,523901]
>>>> [2015-03-10T15:03:02.178] backfill: reached end of job queue
>>>> [2015-03-10T15:03:02.178] backfill: completed testing 2(2) jobs, usec=775
>>>> [2015-03-10T15:04:24.991] Set DebugFlags to none
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Uwe
>>>
>
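P.S. If you want to confirm the single-switch theory the next time a job sits in the queue like this, comparing the idle node list against each leaf switch should make it obvious whether any one switch could have satisfied an 18-node request. A rough sketch, assuming topology/tree is configured:

    scontrol show topology          # lists each switch and the nodes attached to it
    sinfo -t idle -h -o "%N"        # compressed list of currently idle nodes

If no switch's Nodes= range is fully contained in the idle list, the job will keep pending on Resources until the @max-time wait on the switch request expires, even though 18 nodes are free overall.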
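And for the GrpNodes angle, the salloc test I suggested could look something like this (node names taken from the debug output, time limit just an example); the interesting part is whether the 18th allocation is the one that gets refused:

    for n in n502901 n503001 n503101 n503201; do    # extend the list to all 18 candidate nodes
        salloc -A production -w "$n" --exclusive --time=5 sleep 300 &
    done

If you'd rather rule the association limit out entirely, I believe temporarily clearing it with 'sacctmgr modify account where name=production set GrpNodes=-1' (and restoring it afterwards) would also settle the question.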
