On Wed, 7 Feb 2018 at 12:46am, William Hay wrote:

> IIRC resource quotas and reservations don't always play nicely together.
> The same error can come about for multiple different reasons so having
> had this error in the past when the queue is defined as having 0 slots
> doesn't eliminate RQS as a suspect.
>
> I would set MONITOR=1 in the sched_conf and have a look at the schedule
> file to see a little more detail about what is going on.

I've done this and will have a look next time something gets stuck.
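
For when that time comes, here is a rough sketch of how the schedule file could be picked apart. It assumes the colon-separated layout described in sched_conf(5) (jobid, taskid, state, start_time, duration, level_char, object_name, resource_name, utilization, with a "::::::::" line separating scheduling intervals) and that the file lives at $SGE_ROOT/$SGE_CELL/common/schedule. The job numbers, queue name, and PE name in SAMPLE are made up for illustration, and the helper names are hypothetical.

```python
from collections import namedtuple

# Field order as described in sched_conf(5) for the MONITOR=1 schedule file.
Record = namedtuple(
    "Record",
    "job_id task_id state start_time duration level object_name resource utilization",
)

SEPARATOR = "::::::::"  # separates one scheduling interval from the next

# Hypothetical sample data; real job IDs and object names will differ.
SAMPLE = """\
::::::::
3127:1:RUNNING:1517990000:3600:Q:long.q@node01:slots:1.000000
3130:1:RESERVING:1517993600:7200:P:mpi:slots:16.000000
::::::::
3130:1:RESERVING:1517993600:7200:P:mpi:slots:16.000000
"""


def parse_schedule(lines):
    """Return a list of scheduling runs, each a list of Record tuples."""
    runs, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == SEPARATOR:
            # Close the run collected so far, if any.
            if current:
                runs.append(current)
                current = []
            continue
        fields = line.split(":")
        if len(fields) != 9:
            continue  # ignore lines that do not match the assumed layout
        job, task, state, start, dur, level, obj, res, util = fields
        current.append(
            Record(int(job), int(task), state, int(start), int(dur),
                   level, obj, res, float(util))
        )
    if current:
        runs.append(current)
    return runs


def reservations(run):
    """Records in one scheduling run that describe a resource reservation."""
    return [r for r in run if r.state == "RESERVING"]


if __name__ == "__main__":
    for number, run in enumerate(parse_schedule(SAMPLE.splitlines()), start=1):
        for rec in reservations(run):
            print(number, rec.job_id, rec.object_name, rec.resource, rec.utilization)
```

To use it on the real file, pass an open file handle to parse_schedule() instead of the sample lines. Looking at which jobs show up as RESERVING in the runs just before the "skipping orders" messages start should narrow down which reservation is involved.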

> As a slightly less drastic method than restarting the qmaster you could
> try reducing the priority (qalter -p) on the problem job for a scheduling
> cycle to below the jobs stuck behind it to see if they will start even
> if the problem job won't.

IIRC from past episodes, anything *submitted* after the "skipping orders" messages start appearing is not assigned a priority -- they sit there in the queue at 0.000. But I can certainly try this if there *is* something with a priority score that's stuck behind the problem job.
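
If it comes to that, the temporary demotion could be scripted along these lines. This is only a sketch: the function names, the wait time, and the priority values are assumptions of mine, with the wait meant to cover at least one schedule_interval. The one thing taken from the documentation is that -p accepts values from -1023 to 1024. Raising the priority back afterwards typically needs manager or operator rights.

```python
import subprocess
import time


def qalter_priority_cmd(job_id, priority):
    """Build the argv for `qalter -p <priority> <job_id>`."""
    if not -1023 <= priority <= 1024:
        raise ValueError("priority must be between -1023 and 1024")
    return ["qalter", "-p", str(priority), str(job_id)]


def demote_for_one_cycle(job_id, low=-1023, restore=0, wait_seconds=30,
                         run=subprocess.run):
    """Lower a job's priority, wait, then put the priority back.

    wait_seconds is an assumed value; it should be longer than the
    scheduler's schedule_interval so at least one cycle runs in between.
    The `run` argument exists so the commands can be inspected or faked
    without touching a live qmaster.
    """
    run(qalter_priority_cmd(job_id, low), check=True)
    time.sleep(wait_seconds)
    run(qalter_priority_cmd(job_id, restore), check=True)
```

Calling demote_for_one_cycle(12345) would issue qalter -p -1023 12345, wait 30 seconds, and then issue qalter -p 0 12345, where 12345 stands in for the problem job's ID.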

> I don't know how your cluster is set up but I would try to tweak the
> config so that larger jobs (that need reservations) don't even consider
> queue instances that are constrained by RQS.

Unfortunately, RQSes are pretty integral to our setup. Our cluster is run as a co-op model. Each lab gets a number of slots (limited via RQS) in our high priority queue proportional to their "ownership" of the cluster. Jobs in the lower priority queues run niced and have fewer available slots. I'll certainly ask the users to try without reservations to see a) if they can still get their jobs through and b) if that keeps the error from cropping up.

Is the lack of fair play between RQSes and reservations a bug or simply a side effect of how these 2 systems operate?

Thanks a bunch for getting back to me.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
