Re: [gridengine users] Scheduler getting stuck, "Skipping remaining N orders"

Joshua Baker-LePain Wed, 07 Feb 2018 14:17:58 -0800

On Wed, 7 Feb 2018 at 12:46am, William Hay wrote

IIRC resource quotas and reservations don't always play nicely together.
The same error can come about for multiple different reasons so having
had this error in the past when the queue is defined as having 0 slots
doesn't eliminate RQS as a suspect.


I would set MONITOR=1 in the sched_conf and have a look at the schedule
file to see a little more detail about what is going on.


I've done this and will have a look next time something gets stuck.
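
For anyone following along: MONITOR=1 goes on the `params` line of the scheduler configuration (editable with `qconf -msconf`), and the scheduler then appends its decisions to the `schedule` file under the cell's `common` directory. Below is a minimal sketch for pulling one job's records out of that file. The helper name `job_records` is made up for illustration, and the colon-separated record layout in the comment is recalled from sched_conf(5), so check it against your own install.

```shell
#!/bin/sh
# Sketch: show schedule-file records for one job.
# Assumed record layout (colon-separated, recalled from sched_conf(5)):
#   jobid:taskid:state:start_time:duration:level_char:object_name:resource_name:utilization
# "job_records" is a hypothetical helper name, not part of Grid Engine.
job_records() {
    job_id=$1
    # Default location assumes the usual $SGE_ROOT/$SGE_CELL layout.
    schedule_file=${2:-"${SGE_ROOT}/${SGE_CELL:-default}/common/schedule"}
    # Print only the lines whose first field is the requested job ID.
    awk -F: -v id="$job_id" '$1 == id' "$schedule_file"
}
```

Something like `job_records 123456 | grep RESERVING` (job ID hypothetical) would then show whether the problem job is holding a reservation in each scheduling run.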

As a slightly less drastic method than restarting the qmaster you could
try reducing the priority (qalter -p)  on the problem job for a scheduling
cycle to below the jobs stuck behind it to see if they will start even
if the problem job won't.

IIRC from past episodes, anything *submitted* after the "skipping orders"
messages start appearing is not assigned a priority -- they sit there in
the queue at 0.000. But I can certainly try this if there *is* something
with a priority score that's stuck behind the problem job.
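
A rough sketch of the `qalter -p` suggestion above, for reference. `qalter -p` accepts values from -1023 to 1024, and non-admin users can only lower a job's priority. The wrapper name `demote_job`, the `DRY_RUN` variable and the default of -100 are illustrative assumptions, not anything from the thread.

```shell
#!/bin/sh
# Sketch: build (and optionally run) the qalter command that lowers a
# job's POSIX priority. "demote_job" and DRY_RUN are hypothetical names.
demote_job() {
    job_id=$1
    priority=${2:--100}   # assumed default; any value below the waiting jobs works
    # qalter -p only accepts priorities in the range -1023..1024.
    if [ "$priority" -lt -1023 ] || [ "$priority" -gt 1024 ]; then
        echo "priority must be between -1023 and 1024" >&2
        return 1
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only print the command so it can be reviewed first.
        echo "qalter -p $priority $job_id"
    else
        qalter -p "$priority" "$job_id"
    fi
}
```

Running `DRY_RUN=0 demote_job <jobid>` would apply the change; after a scheduling cycle or two, `qalter -p 0 <jobid>` puts the job back at the default priority (an admin may be needed for that, since it is an increase).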

I don't know how your cluster is set up but I would try to tweak the
config so that larger jobs (that need reservations) don't even consider
queue instances that are constrained by RQS.

Unfortunately, RQSes are pretty integral to our setup. Our cluster is run
as a co-op model. Each lab gets a number of slots (limited via RQS) in
our high priority queue proportional to their "ownership" of the cluster.
Jobs in the lower priority queues run niced and have fewer available
slots. I'll certainly ask the users to try without reservations to see a)
if they can still get their jobs through and b) if that keeps the error
from cropping up.
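
To make the setup concrete, here is a minimal sketch of what such a per-lab quota set might look like. It is not the actual configuration from this cluster: the user lists (`@lab_a`, `@lab_b`), the queue name (`high.q`), the slot counts and the helper name `write_lab_quota` are all placeholders.

```shell
#!/bin/sh
# Sketch: write a resource quota set capping each lab's slots in one queue.
# Lab user lists (@lab_a, @lab_b), the queue name (high.q) and the slot
# counts are all hypothetical placeholders.
write_lab_quota() {
    cat > "$1" <<'EOF'
{
   name         lab_slots
   description  "Per-lab slot caps in the high priority queue"
   enabled      TRUE
   limit        users @lab_a queues high.q to slots=64
   limit        users @lab_b queues high.q to slots=32
}
EOF
}

# On a real cluster the file could then be loaded with:
#   qconf -Arqs lab_slots.rqs
```

Writing `users @lab_a` without braces makes the limit apply to the lab as a whole; `users {@lab_a}` would instead apply it to each member separately.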

Is the lack of fair play between RQSes and reservations a bug or simply a
side effect of how these 2 systems operate?


Thanks a bunch for getting back to me.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF