Hi Lennart,

Thanks for the reply! The job didn't have the reason "DependencyNeverSatisfied", so it really did not make sense. I think the user ended up killing the job.
Regarding the processing of the other jobs for backfill: I keep seeing an odd phenomenon that seems to occur at 12-18 hr intervals. Nodes go idle, and jobs are just sitting in the queue. They'll have a message like "AssocGrpCPURunMinsLimit" (I'm using GrpCPURunMins to limit resource usage). But there's no way that the jobs running are actually reaching the limit I have set. I have bf_continue enabled, yet I still see this odd behavior. What's really frustrating is that I can do "scontrol reconfigure" and the jobs start flowing again immediately, with all nodes being fully allocated.

Even then, jobs are not being backfilled like they should, in my opinion anyway. One particular user has 1000+ jobs in the queue (likely all at the front). These jobs are MPI type, and he is requesting 16 CPUs for each job. His jobs are flexible, so he has tons of jobs running, which is great. But he has zero fairshare where other folks have all of their fairshare, yet their jobs are not starting in front of his as I'd expect. My thought is that backfill tuning is to blame here, but I can't seem to sort it out. Any thoughts?

Best,
Chris

On Mar 1, 2015, at 3:03 AM, Lennart Karlsson <[email protected]> wrote:

> On 02/27/2015 04:25 PM, Christopher B Coffey wrote:
>> Now and then I find that jobs get stuck and it doesn't make sense. In this
>> recent scenario I have one job from a user that has the highest priority,
>> yet it's not starting. The job has a requirement of 2 CPUs and 100 GB of
>> memory. This is available now, yet the job doesn't start. I can create a
>> job with the exact same resource requirements and submit it, and it starts
>> immediately. Here are my scheduling parameters:
>>
>> SchedulerParameters=bf_window=20160,bf_resolution=600,default_queue_depth=12968,bf_max_job_test=13000,bf_max_job_start=100,bf_interval=30,pack_serial_at_end
>>
>> Slurm 14.11.4. While having the backfill debug turned on I see something
>> interesting. Backfill says it tested 9234 jobs, but there are 10268 jobs
>> in the queue. Why didn't backfill test all of the jobs? Maybe this is part
>> of the problem?
>>
>> The only thing special about this user's job was that it was part of a
>> chain of dependent jobs (which have all completed). Is there any way to
>> force a job to start? I've tried many things to get the job to start but
>> it won't: release, requeue, etc. Any help would be great, thanks!
>>
>> Best,
>> Chris
>
> Hi Chris,
>
> A wild guess is that the dependencies for the job are not fulfilled. In
> that case, the "Reason" for not starting is "DependencyNeverSatisfied",
> and the cure for not keeping such jobs in the queue is to include
> "kill_invalid_depend" among the scheduling parameters. (The old default
> behaviour, before version 14.11 I think, was to automatically cancel jobs
> that were lacking the asked-for dependencies.)
>
> Otherwise, please tell us the output of "scontrol show job" for that job,
> to give the readers more information about the patient (i.e. the job).
>
> Sometimes the backfill algorithm does not find time to get to the bottom
> of the waiting queue. There is a quick fix for that problem: a scheduling
> parameter, "bf_continue", that allows the scheduler to continue down the
> queue instead of (as is the most correct behaviour) restarting with an
> inspection of the jobs with the highest priority.
>
> Best wishes,
> -- Lennart Karlsson, UPPMAX, Uppsala University, Sweden
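An aside for anyone else who hits "AssocGrpCPURunMinsLimit": GrpCPURunMins caps the total *remaining CPU-minutes* across an association's running jobs, not the number of running jobs, so the limit can trip even when the job count looks low. A toy sketch of that bookkeeping (the function, job tuples, and the limit value below are illustrative, not Slurm code):

```python
# Illustrative model of GrpCPURunMins accounting: the limit caps the SUM
# over running jobs of (allocated CPUs x minutes of walltime remaining),
# NOT the count of running jobs.

def cpu_run_mins(jobs, now):
    """Total remaining CPU-minutes for running jobs.

    Each job is a (cpus, end_time_in_minutes) tuple; 'now' is the
    current time in minutes on the same clock.
    """
    return sum(cpus * max(0, end - now) for cpus, end in jobs)

# Example: three 16-CPU jobs with 600, 300 and 60 minutes left to run.
jobs = [(16, 600), (16, 300), (16, 60)]
total = cpu_run_mins(jobs, now=0)
print(total)  # 16*600 + 16*300 + 16*60 = 15360

# A new 16-CPU job with a 720-minute time limit would add 16*720 = 11520
# CPU-minutes on top; if that pushes the sum past the association's
# GrpCPURunMins, the job pends with reason AssocGrpCPURunMinsLimit.
limit = 20000  # hypothetical GrpCPURunMins value
print(total + 16 * 720 > limit)  # True: this job would be held
```

One consequence of this accounting: as running jobs burn down their remaining walltime, CPU-minutes are freed continuously, which is why pending jobs can start to trickle in even though nothing has finished.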

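For readers following along, the suggestions in this thread translate into a SchedulerParameters line like the one below. This is only a sketch that reuses the values already posted above (it is not tuning advice): Chris's existing parameters, with the wrapped default_queue_depth value rejoined, plus bf_continue and kill_invalid_depend as discussed.

```
# slurm.conf sketch: parameters from this thread, with bf_continue
# (keep backfill walking down the queue) and kill_invalid_depend
# (cancel jobs whose dependencies can never be satisfied) appended.
SchedulerParameters=bf_window=20160,bf_resolution=600,default_queue_depth=12968,bf_max_job_test=13000,bf_max_job_start=100,bf_interval=30,pack_serial_at_end,bf_continue,kill_invalid_depend
```

After editing slurm.conf, an "scontrol reconfigure" makes the new parameters take effect, as mentioned above.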