I'm back again -- is it obvious that my new cluster just went into production? Again, we're running SoGE 8.1.9 on a cluster with nodes of several different sizes. We're running into an odd issue where SGE stops scheduling jobs despite available slots. The messages file contains many instances of messages like this:

02/06/2018 12:03:41|worker|wynq1|E|not enough (1) free slots in queue "ondemand.q@cc-hmid1" for job 142497.1
02/06/2018 12:03:41|worker|wynq1|W|Skipping remaining 12 orders
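
One way to see how often this happens, and which queue instances and jobs are involved, is to tally these entries from the messages file. The following is only a rough sketch: it assumes the pipe-delimited layout seen in the excerpt (timestamp|thread|host|level|message), assumes each entry occupies a single line in the file, and the function name tally_slot_errors is made up for illustration.

```python
import re
from collections import Counter

# Matches the error text from the excerpt. The field layout
# (timestamp|thread|host|level|message) is an assumption based on
# the two sample lines, not on any documented format.
SLOT_ERROR = re.compile(
    r'not enough \((\d+)\) free slots in queue\s+"([^"]+)"\s+for job (\S+)'
)


def tally_slot_errors(lines):
    """Count 'not enough free slots' errors per (queue instance, job)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("|", 4)
        # Skip anything that is not a five-field, error-level entry.
        if len(fields) != 5 or fields[3] != "E":
            continue
        match = SLOT_ERROR.search(fields[4])
        if match:
            _slots, queue_instance, job = match.groups()
            counts[(queue_instance, job)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '02/06/2018 12:03:41|worker|wynq1|E|not enough (1) free slots '
        'in queue "ondemand.q@cc-hmid1" for job 142497.1',
        '02/06/2018 12:03:41|worker|wynq1|W|Skipping remaining 12 orders',
    ]
    for (queue_instance, job), n in tally_slot_errors(sample).items():
        print(f"{queue_instance}\t{job}\t{n}")
```

To run it against real data, pass an open handle on the qmaster messages file to tally_slot_errors instead of the sample list. If one job accounts for nearly all of the entries, that would point at that job (or its reservation) as the trigger rather than at the queue instance.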

Now, the project that 142497.1 (a 500 slot MPI job) belongs to cannot run in the named queue instance -- an RQS limits the usage to 0 slots. Also, if I run "qalter -w p" on the job, it reports "verification: found possible assignment with 500 slots". But the job will never get scheduled. And neither will *any* other jobs. The only way I've found to get things flowing again is to stop and restart sgemaster.

Since it's possibly (probably?) related, I should say that I have max_reservation set to 1024 in the scheduler config. Also, I've had instances of this error in the past where the queue@host instance mentioned in the error is actually defined as having 0 slots. So it's not tied to the RQS.

Can anyone give me some pointers on how to debug this?  Thanks.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF