No one able to give a hint?
On 10.03.2015 at 17:05, Uwe Sauter wrote:
>
> Hi,
>
> I have an account "production" configured with limitations GrpNodes=18,
> MaxNodes=18, MaxWall=7-00:00:00, an associated user with
> limitations MaxNodes=18, MaxWall=7-00:00:00 and a QoS with limitations
> Priority=10, GraceTime=00:00:00, PreemptMode=cluster,
> Flags=DenyOnLimit, UsageFactor=1.0, MinCPUs=1.
>
> This user submitted a job that is within those limitations:
>
> JobId=14115 JobName=XXX
>    UserId=XXX(XXX) GroupId=XXX(XXX)
>    Priority=1214 Nice=0 Account=production QOS=normal
>    JobState=PENDING Reason=Resources Dependency=(null)
>    Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=4-00:00:00 TimeMin=N/A
>    SubmitTime=2015-03-10T13:08:56 EligibleTime=2015-03-10T13:08:56
>    StartTime=Unknown EndTime=Unknown
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    Partition=MyPartition AllocNode:Sid=frontend:15414
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=(null)
>    NumNodes=18-18 NumCPUs=180 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) Gres=(null) Reservation=(null)
>    Shared=0 Contiguous=0 Licenses=(null) Network=(null)
>    Command=(null)
>    WorkDir=/XXX
>    StdErr=/XXX/slurm-14115.out
>    StdIn=/dev/null
>    StdOut=/XXX/slurm-14115.out
>    Switches=1@1-00:00:00
>
> Submitting the same job with a lower node count (e.g. 17) immediately
> starts the job on that account. There is a second job with lower priority
> for that account in the queue, and there are enough free nodes in the
> cluster for an 18-node job to run.
>
> How can I debug what's going on and get this job running? Turning on
> scheduler debugging only shows:
>
> scheduler log:
> [2015-03-10T13:33:11.834] sched: JobId=14115. State=PENDING. Reason=Resources. Priority=1000. Partition=MyPartition.
> [2015-03-10T13:33:11.834] sched: JobId=14116. State=PENDING. Reason=Priority(Priority), Priority=1000, Partition=MyPartition.
>
>
> slurmctld log:
> [2015-03-10T15:01:42.259] Set DebugFlags to Backfill,BackfillMap,SelectType,TraceJobs
> [2015-03-10T15:02:32.508] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
> [2015-03-10T15:03:02.178] backfill: beginning
> [2015-03-10T15:03:02.178] =========================================
> [2015-03-10T15:03:02.178] Begin:2015-03-10T15:03:02 End:2015-03-11T15:03:02 Nodes:n[502901,503001,503101,503201,503301,510301,510401,510501,510601,510701,510901,511001,511101,511201,511301,511601,511701,511801,511901,512001,512201,512301,512401,512501,512601,512901,513001,513101,513201,513301,513501,513601,513701,513801,513901,520301,520401,520601,520701,520901,521001,521101,521201,521301,521601,521701,521801,521901,522001,522201,522301,522401,522501,522601,522901,523001,523101,523201,523301,523501,523601,523701,523801,523901]
> [2015-03-10T15:03:02.178] =========================================
> [2015-03-10T15:03:02.178] backfill test for JobID=14115 Prio=1000 Partition=MyPartition
> [2015-03-10T15:03:02.178] Test job 14115 at 2015-03-10T15:03:02 on n[502901,503001,503101,503201,503301,510301,510401,510501,510601,510701,510901,511001,511101,511201,511301,511601,511701,511801,511901,512001,512201,512301,512401,512501,512601,512901,513001,513101,513201,513301,513501,513601,513701,513801,513901,520301,520401,520601,520701,520901,521001,521101,521201,521301,521601,521701,521801,521901,522001,522201,522301,522401,522501,522601,522901,523001,523101,523201,523301,523501,523601,523701,523801,523901]
> [2015-03-10T15:03:02.178] backfill test for JobID=14116 Prio=1000 Partition=MyPartition
> [2015-03-10T15:03:02.178] Test job 14116 at 2015-03-10T15:03:02 on n[502901,503001,503101,503201,503301,510301,510401,510501,510601,510701,510901,511001,511101,511201,511301,511601,511701,511801,511901,512001,512201,512301,512401,512501,512601,512901,513001,513101,513201,513301,513501,513601,513701,513801,513901,520301,520401,520601,520701,520901,521001,521101,521201,521301,521601,521701,521801,521901,522001,522201,522301,522401,522501,522601,522901,523001,523101,523201,523301,523501,523601,523701,523801,523901]
> [2015-03-10T15:03:02.178] backfill: reached end of job queue
> [2015-03-10T15:03:02.178] backfill: completed testing 2(2) jobs, usec=775
> [2015-03-10T15:04:24.991] Set DebugFlags to none
>
>
> Thanks,
>
>     Uwe
>
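
For reference, the claim that the job is within the configured limits can be checked mechanically. The following is only a minimal Python sketch under assumptions: the helper names (parse_slurm_time, parse_fields, check_job) and the nodes_in_use parameter are made up for illustration and are not part of Slurm, the input strings are copied by hand from the job record and limits quoted above, and only the D-HH:MM:SS and HH:MM:SS time formats are handled.

```python
def parse_slurm_time(text):
    """Convert 'D-HH:MM:SS' or 'HH:MM:SS' (as printed by scontrol) to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = text.split(":")
    if len(parts) != 3:
        raise ValueError(f"unsupported time format: {text!r}")
    hours, minutes, seconds = (int(p) for p in parts)
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def parse_fields(text):
    """Split whitespace-separated 'Key=Value' tokens into a dict."""
    return dict(token.split("=", 1) for token in text.split() if "=" in token)


def check_job(job, limits, nodes_in_use=0):
    """Return the names of the limits the job would exceed.

    nodes_in_use is a hypothetical input: the number of nodes already
    allocated to running jobs of the same account (relevant for GrpNodes).
    """
    # NumNodes is printed as 'min-max' (here 18-18) or as a single number.
    low, _, high = job["NumNodes"].partition("-")
    nodes = int(high or low)
    problems = []
    if nodes > int(limits["MaxNodes"]):
        problems.append("MaxNodes")
    if nodes_in_use + nodes > int(limits["GrpNodes"]):
        problems.append("GrpNodes")
    if parse_slurm_time(job["TimeLimit"]) > parse_slurm_time(limits["MaxWall"]):
        problems.append("MaxWall")
    return problems


# Values copied from the quoted job record and account limits.
job = parse_fields("NumNodes=18-18 NumCPUs=180 TimeLimit=4-00:00:00 TimeMin=N/A")
limits = parse_fields("GrpNodes=18 MaxNodes=18 MaxWall=7-00:00:00")

print(check_job(job, limits))                  # prints []
print(check_job(job, limits, nodes_in_use=1))  # prints ['GrpNodes']
```

The empty list from the first call agrees with the statement that the job is within its limits. The second call only illustrates that GrpNodes counts the nodes of all running jobs of the account together, so an 18-node job would be held back if the account already had a node allocated; whether that applies here is not visible from the logs above.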
