[gridengine users] Scheduler getting stuck, "Skipping remaining N orders"

Joshua Baker-LePain Tue, 06 Feb 2018 12:16:57 -0800

I'm back again -- is it obvious that my new cluster just went into production? Again, we're running SoGE 8.1.9 on a cluster with nodes of several different sizes. We're running into an odd issue where SGE stops scheduling jobs despite available slots. The messages file contains many instances of messages like this:


02/06/2018 12:03:41|worker|wynq1|E|not enough (1) free slots in queue "ondemand.q@cc-hmid1" for job 142497.1
02/06/2018 12:03:41|worker|wynq1|W|Skipping remaining 12 orders
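
To see how often this pattern shows up and which queue instances and jobs are involved, the messages file can be tallied with a short script. The following is only a rough sketch: it assumes each message occupies a single line in the pipe-separated layout seen in the two samples above (timestamp|thread|host|level|text), and the function name summarize is made up for illustration.

```python
import re
from collections import Counter

# Assumed line layout, inferred only from the two sample lines above:
#   MM/DD/YYYY HH:MM:SS|thread|host|level|message text
LINE_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|"
    r"(?P<thread>[^|]*)\|(?P<host>[^|]*)\|(?P<level>[^|]*)\|(?P<msg>.*)$"
)
# Patterns for the two message texts quoted in this post.
SLOTS_RE = re.compile(
    r'not enough \((\d+)\) free slots in queue\s+"([^"]+)" for job (\S+)'
)
SKIP_RE = re.compile(r"Skipping remaining (\d+) orders")


def summarize(lines):
    """Count 'not enough free slots' errors per (queue instance, job)
    and add up the number of orders reported as skipped."""
    blocked = Counter()
    skipped = 0
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue  # ignore anything that does not fit the assumed layout
        msg = m.group("msg")
        s = SLOTS_RE.search(msg)
        if s:
            blocked[(s.group(2), s.group(3))] += 1
            continue
        k = SKIP_RE.search(msg)
        if k:
            skipped += int(k.group(1))
    return blocked, skipped
```

Calling summarize(open(path)) on the qmaster messages file (the path depends on how the cell is spooled) returns a Counter keyed by (queue instance, job) plus the total number of skipped orders.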

Now, the project that 142497.1 (a 500 slot MPI job) belongs to cannot run in the named queue instance -- an RQS limits the usage to 0 slots. Also, if I run "qalter -w p" on the job, it reports "verification: found possible assignment with 500 slots". But the job will never get scheduled. And neither will *any* other jobs. The only way I've found to get things flowing again is to stop and restart sgemaster.

Since it's possibly (probably?) related, I should say that I have max_reservation set to 1024 in the scheduler config. Also, I've had instances of this error in the past where the queue@host instance mentioned in the error is actually defined as having 0 slots. So it's not tied to the RQS.


Can anyone give me some pointers on how to debug this?  Thanks.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
