On 03/04/2015 09:05 PM, Christopher B Coffey wrote:
Thanks for the reply!  The job didn’t have the reason 
“DependencyNeverSatisfied”.  It really did not make sense.  I think the user 
ended up killing the job.

Regarding the processing of the other jobs for backfill: I keep seeing an odd 
phenomenon that seems to occur at 12-18 hr intervals. Nodes go idle, and jobs 
are just sitting in the queue. They’ll have a reason like 
“AssocGrpCPURunMinsLimit” (I’m using GrpCPURunMins to limit resource usage). 
But there’s no way that the number of jobs running equals the number I have set.

I have bf_continue enabled, yet I still see this odd behavior.
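
For reference, the backfill knobs I’ve been looking at live in 
SchedulerParameters in slurm.conf. The values below are only an illustration 
of the sort of thing I’ve been trying, not my actual settings:

    SchedulerType=sched/backfill
    # bf_continue lets the backfill scheduler pick up where it left off after
    # releasing locks; the other values here are purely illustrative
    SchedulerParameters=bf_continue,bf_interval=30,bf_window=2880,bf_max_job_test=1000,bf_max_job_user=50,bf_resolution=300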

What’s really frustrating is that I can do “scontrol reconfigure” and the jobs 
start flowing again immediately, with all nodes fully allocated. Even then, 
jobs are not being backfilled like they should be, in my opinion anyway.

One particular user has 1000+ jobs in the queue (likely all at the front). 
These jobs are MPI type, and he is requesting 16 CPUs for each job. His jobs 
are flexible, so he has tons of jobs running, which is great. But he has zero 
fairshare while other folks have all of their fairshare, yet their jobs are not 
starting ahead of his as I’d expect. My thought is that backfill tuning is to 
blame here, but I can’t seem to sort it out.
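
In case it’s useful, this is roughly how I’ve been comparing priorities and 
fairshare between his jobs and everyone else’s (the user name below is just a 
placeholder):

    # per-factor priority breakdown (age, fairshare, ...) for pending jobs
    sprio -l

    # fairshare standing for all accounts and users
    sshare -a

    # pending jobs for one user, with their priority and pending reason
    squeue -u someuser -t PD -o "%.10i %.10Q %.20r"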

Any thoughts?

Best,
Chris

Hi Chris,

"AssocGrpCPURunMinsLimit" limits not the number of jobs running, but the number 
of
core minutes that the account may allocate at the same time, counted on running 
jobs
plus the specific job in queue.  But perhaps you already know that.
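
If you want to check what is counted against the limit, I believe something 
along these lines should work (the account name is only an example):

    # the GrpCPURunMins limit stored on the association
    sacctmgr show assoc where account=myaccount format=Account,User,GrpCPURunMins

    # CPUs and time left for the running jobs in that account; the sum of
    # CPUs times remaining minutes is roughly what counts against the limit
    squeue -A myaccount -t R -o "%.10i %.6C %.12L"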


I think I’ve discovered what’s causing the jobs to get stuck, but not the
reason. We have a long partition, and for some reason jobs aren’t getting
backfilled out of it. Plus, Slurm specifies the start times for those
jobs in the long partition as being max wall time from now.
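
For reference, this is roughly what I’ve been looking at, assuming the 
partition is literally named “long” and <jobid> stands for one of the stuck 
jobs:

    # MaxTime and the other limits on the partition
    scontrol show partition long

    # Slurm's expected start times for pending jobs in that partition
    squeue -p long -t PD --start

    # full detail (StartTime, Reason, Priority, ...) for one stuck job
    scontrol show job <jobid>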

Any thoughts?

Sorry, I don't have any good ideas on how to solve this problem.

What is a long partition? A partition with a high MaxTime?

What do you mean by "jobs aren't getting backfilled out of it"? Is it not the 
other way round, that jobs should start within the partition?

Can you send me a copy of your slurm.conf, output from your "date" command, and
a "scontrol show job" on a job that does not start? Perhaps that can give me a
clue...
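
That is, roughly:

    date
    scontrol show job <jobid>    # for one job that stays pending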

Best wishes
-- Lennart Karlsson, UPPMAX, Uppsala University, Sweden
