Re: [gridengine users] Scheduler getting stuck, "Skipping remaining N orders"

William Hay Thu, 08 Feb 2018 07:01:34 -0800

On Wed, Feb 07, 2018 at 02:15:05PM -0800, Joshua Baker-LePain wrote:
> On Wed, 7 Feb 2018 at 12:46am, William Hay wrote
> 
> > IIRC resource quotas and reservations don't always play nicely together.
> > The same error can come about for multiple different reasons so having
> > had this error in the past when the queue is defined as having 0 slots
> > doesn't eliminate RQS as a suspect.
> > 
> > I would set MONITOR=1 in the sched_conf and have a look at the schedule
> > file to see a little more detail about what is going on.
> 
> I've done this and will have a look next time something gets stuck.
> 
> > As a slightly less drastic method than restarting the qmaster you could
> > try reducing the priority (qalter -p)  on the problem job for a scheduling
> > cycle to below the jobs stuck behind it to see if they will start even
> > if the problem job won't.
> 
> IIRC from past episodes, anything *submitted* after the "skipping orders"
> messages start appearing is not assigned a priority -- they sit there in the
> queue at 0.000.  But I can certainly try this if there *is* something with a
> priority score that's stuck behind the problem job.


The 0.0000 you see in qstat is the qmaster thread's idea of the job's
priority.  Part of the order that is skipped is the update of the qmaster
thread's notion of priority for a job.  The scheduler has usually assigned
those jobs a priority lower than that of your problem job, so the order
that would update the qmaster's notion of their priority is among the
ones skipped and the value never changes.  Should circumstances change so
that a later-submitted job has a higher priority than the problem job,
the real priority used by the scheduler thread will become visible.  You
can see a similar phenomenon with errored jobs: the scheduler doesn't
consider them, so the priority isn't updated.
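
To make that concrete, here is a toy Python model of the behaviour
described above.  It is not Grid Engine code: the Job class, the
apply_orders function and the rule that orders are processed in
descending priority and abandoned at the first failure are all
simplifying assumptions made for illustration.

```python
# Toy model only: NOT Grid Engine source code. All names are hypothetical.
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    sched_prio: float        # priority computed by the scheduler thread
    shown_prio: float = 0.0  # qmaster thread's notion (what qstat would show)
    broken: bool = False     # True if the order for this job fails


def apply_orders(jobs):
    """Apply priority updates in descending scheduler priority.

    Assumption: processing stops at the first failing order, so every
    order after it is skipped. Returns the number of skipped orders.
    """
    ordered = sorted(jobs, key=lambda j: j.sched_prio, reverse=True)
    for index, job in enumerate(ordered):
        if job.broken:
            # The failing order is not applied; the rest are skipped.
            return len(ordered) - index - 1
        job.shown_prio = job.sched_prio
    return 0


jobs = [
    Job(1, 0.9),
    Job(2, 0.6, broken=True),  # the problem job
    Job(3, 0.4),               # submitted later, lower priority
    Job(4, 0.3),
]

skipped = apply_orders(jobs)
print("skipped:", skipped, [(j.job_id, j.shown_prio) for j in jobs])

# Drop the problem job below the others (the qalter -p idea) and rerun.
jobs[1].sched_prio = 0.1
skipped = apply_orders(jobs)
print("skipped:", skipped, [(j.job_id, j.shown_prio) for j in jobs])
```

In the first pass jobs 3 and 4 keep a displayed priority of 0.0 even
though the scheduler has a value for them; once the problem job sorts
below them, their real values show up.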

> 
> > I don't know how your cluster is set up but I would try to tweak the
> > config so that larger jobs (that need reservations) don't even consider
> > queue instances that are constrained by RQS.
> 
> Unfortunately, RQSes are pretty integral to our setup.  Our cluster is run as
> a co-op model.  Each lab gets a number of slots (limited via RQS) in our
> high priority queue proportional to their "ownership" of the cluster. Jobs
> in the lower priority queues run niced and have fewer available slots.  I'll
> certainly ask the users to try without reservations to see a) if they can
> still get their jobs through and b) if that keeps the error from cropping
> up.
> 
> Is the lack of fair play between RQSes and reservations a bug or simply a
> side effect of how these 2 systems operate?

Mostly it is that some of the scheduler heuristics can lead to odd scheduling 
decisions when combined with RQS.

William

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
