Hi,

Now and then I find that jobs get stuck, and it doesn't make sense.  In
this recent scenario I have one job from a user that has the highest
priority, yet it's not starting.  The job requires 2 CPUs and 100GB of
memory, which are available right now, yet the job doesn't start.  If I
create a new job with the exact same resource requirements and submit
it, it starts immediately.
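
For reference, this is roughly how I've been inspecting the pending job
(the job ID here is just a placeholder):

    $ squeue -j 123456 -o "%i %Q %r"    # job id, priority, pending reason
    $ scontrol show job 123456 | grep -E "JobState|Reason|Priority"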

Here are my scheduling parameters:

SchedulerParameters=bf_window=20160,bf_resolution=600,default_queue_depth=12968,bf_max_job_test=13000,bf_max_job_start=100,bf_interval=30,pack_serial_at_end
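
That's from slurm.conf; the values the controller is actually running
with can be double-checked like this:

    $ scontrol show config | grep SchedulerParameters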


We're running Slurm 14.11.4.

With backfill debugging turned on I see something interesting.
Backfill says it tested 9234 jobs, but there are 10268 jobs in the
queue, and with bf_max_job_test=13000 I would expect every one of them
to be considered.  Why didn't backfill test all of the jobs?  Maybe
this is part of the problem?
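
If it helps, I've also been watching the backfill section of sdiag to
see how deep the cycles actually get (the 20-line grab is just a guess
at how long that section is on 14.11):

    $ sdiag | grep -A 20 "Backfilling stats"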

The only thing special about this user's job was that it was part of a
chain of dependent jobs (all of which have completed).

Is there any way to force a job to start?  I've tried many things
(release, requeue … etc.), but it still won't run.
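
For the record, these are the kinds of commands I've tried (job ID is a
placeholder again, and the negative nice needs root or operator
privileges):

    $ scontrol release 123456
    $ scontrol requeue 123456
    $ scontrol update jobid=123456 nice=-10000    # bump its priority even higher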

Any help would be great, thanks!

Best,
Chris
