I'm back again -- is it obvious that my new cluster just went into
production? Again, we're running SoGE 8.1.9 on a cluster with nodes of
several different sizes. We're running into an odd issue where SGE stops
scheduling jobs despite available slots. The messages file contains many
instances of messages like this:
02/06/2018 12:03:41|worker|wynq1|E|not enough (1) free slots in queue "ondemand.q@cc-hmid1" for job 142497.1
02/06/2018 12:03:41|worker|wynq1|W|Skipping remaining 12 orders
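In case it helps anyone compare notes, here is a minimal Python sketch for tallying these errors per queue instance and job. It assumes the message text matches the sample quoted above; the function name is hypothetical, and you would need to supply the path to your own qmaster messages file.

```python
import re
import sys
from collections import Counter

# Pattern for the scheduler error quoted above. The message may be wrapped
# across lines (as in an email), so whitespace is normalised before matching.
PATTERN = re.compile(
    r'not enough \((\d+)\) free slots in queue "([^"]+)" for job (\d+\.\d+)'
)


def tally_slot_errors(text):
    """Count 'not enough free slots' errors per (queue instance, job)."""
    flat = " ".join(text.split())
    counts = Counter()
    for _slots, queue, job in PATTERN.findall(flat):
        counts[(queue, job)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # The messages file location varies by install; pass it as an argument.
    with open(sys.argv[1]) as fh:
        for (queue, job), n in tally_slot_errors(fh.read()).most_common():
            print(f"{n:6d}  {queue}  job {job}")
```

Run against the messages file, this should show whether the errors cluster on one job and queue instance or are spread across several.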
Now, the project that 142497.1 (a 500 slot MPI job) belongs to cannot run
in the named queue instance -- an RQS limits the usage to 0 slots. Also,
if I run "qalter -w p" on the job, it reports "verification: found
possible assignment with 500 slots". But the job will never get
scheduled. And neither will *any* other jobs. The only way I've found to
get things flowing again is to stop and restart sgemaster.
Since it's possibly (probably?) related, I should say that I have
max_reservation set to 1024 in the scheduler config. Also, I've had
instances of this error in the past where the queue@host instance
named in the error is actually defined as having 0 slots, so the problem
doesn't appear to be tied to the RQS specifically.
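For anyone who wants to compare scheduler settings, the following is a rough sketch for pulling values such as max_reservation out of the scheduler configuration. It assumes `qconf -ssconf` prints one whitespace-separated key/value pair per line; the helper name is hypothetical.

```python
import subprocess
import sys


def parse_sched_conf(text):
    """Parse 'key value' lines (assumed `qconf -ssconf` layout) into a dict."""
    conf = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


if __name__ == "__main__" and "--live" in sys.argv:
    # Needs a working Grid Engine environment with qconf on the PATH.
    out = subprocess.run(
        ["qconf", "-ssconf"], capture_output=True, text=True, check=True
    ).stdout
    print("max_reservation =", parse_sched_conf(out).get("max_reservation"))
```

The `--live` flag is only a guard for this sketch, so the parser can be imported or tested without a cluster at hand.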
Can anyone give me some pointers on how to debug this? Thanks.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF