Hi,

Now and then I find that jobs get stuck in a way that doesn't make sense. In this recent scenario, I have one job from a user that has the highest priority, yet it's not starting. The job requires 2 CPUs and 100 GB of memory, and both are available right now, yet the job doesn't start. If I create a job with the exact same resource requirements and submit it, it starts immediately.
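For context, these are the standard commands I know of for pulling a pending job's reason and priority breakdown (the job ID 123456 is a placeholder, and sprio assumes the multifactor priority plugin):

    # Show the pending reason and the scheduler's expected start time
    squeue -j 123456 --start

    # Full job record: priority, requested resources, dependencies, etc.
    scontrol show job 123456

    # Break the priority down by component (age, fairshare, ...)
    sprio -j 123456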
Here are my scheduling parameters:

    SchedulerParameters=bf_window=20160,bf_resolution=600,default_queue_depth=12968,bf_max_job_test=13000,bf_max_job_start=100,bf_interval=30,pack_serial_at_end

This is Slurm 14.11.4.

With backfill debugging turned on, I see something interesting: backfill says it tested 9234 jobs, but there are 10268 jobs in the queue. Why didn't backfill test all of the jobs? Maybe that's part of the problem?

The only thing special about this user's job is that it was part of a chain of dependent jobs, all of which have completed. Is there any way to force a job to start? I've tried many things (release, requeue, etc.), but it won't start.

Any help would be great, thanks!

Best,
Chris
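P.S. For concreteness, here's what I mean by release/requeue, plus a manual priority override as one more guess (again, the job ID 123456 is a placeholder):

    # Release the job in case it is held
    scontrol release 123456

    # Requeue the job so the scheduler reconsiders it from scratch
    scontrol requeue 123456

    # Guess: override the priority by hand (the value is arbitrary, just large)
    scontrol update JobId=123456 Priority=100000000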
