On Tue, Feb 06, 2018 at 12:13:24PM -0800, Joshua Baker-LePain wrote:
> I'm back again -- is it obvious that my new cluster just went into
> production? Again, we're running SoGE 8.1.9 on a cluster with nodes of
> several different sizes. We're running into an odd issue where SGE stops
> scheduling jobs despite available slots. The messages file contains many
> instances of messages like this:
>
> 02/06/2018 12:03:41|worker|wynq1|E|not enough (1) free slots in queue
> "ondemand.q@cc-hmid1" for job 142497.1
> 02/06/2018 12:03:41|worker|wynq1|W|Skipping remaining 12 orders
>
> Now, the project that 142497.1 (a 500 slot MPI job) belongs to cannot run in
> the named queue instance -- an RQS limits the usage to 0 slots. Also, if I
> run "qalter -w p" on the job, it reports "verification: found possible
> assignment with 500 slots". But the job will never get scheduled. And
> neither will *any* other jobs. The only way I've found to get things
> flowing again is to stop and restart sgemaster.
>
> Since it's possibly (probably?) related, I should say that I have
> max_reservation set to 1024 in the scheduler config. Also, I've had
> instances of this error in the past where the queue@host instance mentioned
> in the error is actually defined as having 0 slots. So it's not tied to the
> RQS.
>
> Can anyone give me some pointers on how to debug this? Thanks.
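Before changing anything, it may help to see which jobs and queue instances the error is actually naming, and how often. The following is only a rough sketch: the field layout (timestamp|thread|host|level|text) is inferred from the two messages quoted above, and the names NOT_ENOUGH and tally_slot_errors are made up for illustration.

```python
import re
from collections import Counter

# Matches the "not enough free slots" error as it appears in the quoted
# messages. The field layout is an assumption based on those lines only.
NOT_ENOUGH = re.compile(
    r'^(?P<ts>[^|]+)\|(?P<thread>[^|]*)\|(?P<host>[^|]*)\|E\|'
    r'not enough \((?P<slots>\d+)\) free slots in queue '
    r'"(?P<qinstance>[^"]+)" for job (?P<job>\d+)\.(?P<task>\d+)'
)


def tally_slot_errors(lines):
    """Count 'not enough free slots' errors per (job id, queue instance)."""
    counts = Counter()
    for line in lines:
        match = NOT_ENOUGH.match(line.strip())
        if match:
            counts[(match.group("job"), match.group("qinstance"))] += 1
    return counts
```

To use it, open the qmaster messages file (its location is site-specific) and pass the file object to tally_slot_errors(); calling .most_common() on the result lists the most frequent job and queue instance pairs first.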
IIRC resource quotas and reservations don't always play nicely together. The same error can arise for several different reasons, so having seen it in the past on a queue instance defined with 0 slots doesn't eliminate RQS as a suspect.

I would set MONITOR=1 in the scheduler configuration (the params entry, see sched_conf(5)) and have a look at the schedule file to see a little more detail about what is going on.

As a slightly less drastic measure than restarting the qmaster, you could try lowering the priority of the problem job (qalter -p) for a scheduling cycle, to below that of the jobs stuck behind it, to see whether they will start even if the problem job won't.

I don't know how your cluster is set up, but I would try to tweak the config so that larger jobs (those that need reservations) don't even consider queue instances that are constrained by RQS.

William
