Hi Lennart,

Thanks for the reply! The job didn't have the reason "DependencyNeverSatisfied", so it really did not make sense. I think the user ended up killing the job.
Regarding the processing of the other jobs for backfill: I keep seeing an odd phenomenon that seems to occur at 12-18 hr intervals. Nodes go idle, and jobs are just sitting in the queue. They'll have a message like "AssocGrpCPURunMinsLimit" (I'm using GrpCPURunMins to limit resource usage). But there's no way that the jobs running are actually reaching the limit I have set. I have bf_continue enabled, yet I still see this odd behavior. What's really frustrating is that I can do "scontrol reconfigure" and the jobs start flowing again immediately, with all nodes being fully allocated.

Even then, jobs are not being backfilled like they should, in my opinion anyway. One particular user has 1000+ jobs in the queue (likely all at the front). These jobs are MPI type, and he is requesting 16 CPUs for each job. His jobs are flexible, so he has tons of jobs running, which is great. But he has zero fairshare where other folks have all of their fairshare, yet their jobs are not starting in front of his as I'd expect. My thought is that backfill tuning is to blame here, but I can't seem to sort it out. Any thoughts?

Best,
Chris

On Mar 1, 2015, at 3:03 AM, Lennart Karlsson <[email protected]> wrote:

> On 02/27/2015 04:25 PM, Christopher B Coffey wrote:
>> Now and then I find that jobs get stuck and it doesn't make sense. In this
>> recent scenario I have one job from a user that has the highest priority,
>> yet it's not starting. The job has a requirement of 2 CPUs and 100 GB of
>> memory. This is available now, yet the job doesn't start. I can create a
>> job with the exact same resource requirements and submit it, and it starts
>> immediately. Here are my scheduling parameters:
>>
>> SchedulerParameters=bf_window=20160,bf_resolution=600,default_queue_depth=12968,bf_max_job_test=13000,bf_max_job_start=100,bf_interval=30,pack_serial_at_end
>>
>> Slurm 14.11.4. While having the backfill debug turned on I see something
>> interesting. Backfill says it tested 9234 jobs, but there are 10268 jobs
>> in the queue. Why didn't backfill test all of the jobs? Maybe this is part
>> of the problem?
>>
>> The only thing special about this user's job was that it was part of a
>> chain of dependent jobs (which have all completed). Is there any way to
>> force a job to start? I've tried many things to get the job to start but
>> it won't: release, requeue, etc. Any help would be great, thanks!
>>
>> Best,
>> Chris
>
> Hi Chris,
>
> A wild guess is that the dependencies for the job are not fulfilled. In
> that case, the "Reason" for not starting is "DependencyNeverSatisfied",
> and the cure for not keeping such jobs in the queue is to include
> "kill_invalid_depend" among the scheduling parameters. (The old default
> behaviour, before version 14.11 I think, was to automatically cancel jobs
> that were lacking the asked-for dependencies.)
>
> Otherwise, please tell us the output of "scontrol show job" for that job,
> to give the readers more information about the patient (i.e. the job).
>
> Sometimes the backfill algorithm does not find time to get to the bottom
> of the waiting queue. There is a quick fix for that problem: a scheduling
> parameter, "bf_continue", that allows the scheduler to continue down the
> queue instead of (as is the most correct behaviour) restarting with an
> inspection of the jobs with the highest priority.
>
> Best wishes,
> -- Lennart Karlsson, UPPMAX, Uppsala University, Sweden
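An aside for anyone else who hits "AssocGrpCPURunMinsLimit": GrpCPURunMins caps the total *remaining CPU-minutes* across an association's running jobs, not the number of running jobs, so the limit can trip even when the job count looks low. A toy sketch of that bookkeeping (the function, job tuples, and the limit value below are illustrative, not Slurm code):

```python
# Illustrative model of GrpCPURunMins accounting: the limit caps the SUM
# over running jobs of (allocated CPUs x minutes of walltime remaining),
# NOT the count of running jobs.

def cpu_run_mins(jobs, now):
    """Total remaining CPU-minutes for running jobs.

    Each job is a (cpus, end_time_in_minutes) tuple; 'now' is the
    current time in minutes on the same clock.
    """
    return sum(cpus * max(0, end - now) for cpus, end in jobs)

# Example: three 16-CPU jobs with 600, 300 and 60 minutes left to run.
jobs = [(16, 600), (16, 300), (16, 60)]
total = cpu_run_mins(jobs, now=0)
print(total)  # 16*600 + 16*300 + 16*60 = 15360

# A new 16-CPU job with a 720-minute time limit would add 16*720 = 11520
# CPU-minutes on top; if that pushes the sum past the association's
# GrpCPURunMins, the job pends with reason AssocGrpCPURunMinsLimit.
limit = 20000  # hypothetical GrpCPURunMins value
print(total + 16 * 720 > limit)  # True: this job would be held
```

One consequence of this accounting: as running jobs burn down their remaining walltime, CPU-minutes are freed continuously, which is why pending jobs can start to trickle in even though nothing has finished.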

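For readers following along, the suggestions in this thread translate into a SchedulerParameters line like the one below. This is only a sketch that reuses the values already posted above (it is not tuning advice): Chris's existing parameters, with the wrapped default_queue_depth value rejoined, plus bf_continue and kill_invalid_depend as discussed.

```
# slurm.conf sketch: parameters from this thread, with bf_continue
# (keep backfill walking down the queue) and kill_invalid_depend
# (cancel jobs whose dependencies can never be satisfied) appended.
SchedulerParameters=bf_window=20160,bf_resolution=600,default_queue_depth=12968,bf_max_job_test=13000,bf_max_job_start=100,bf_interval=30,pack_serial_at_end,bf_continue,kill_invalid_depend
```

After editing slurm.conf, an "scontrol reconfigure" makes the new parameters take effect, as mentioned above.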