On Tue, Feb 06, 2018 at 12:13:24PM -0800, Joshua Baker-LePain wrote:
> I'm back again -- is it obvious that my new cluster just went into
> production?  Again, we're running SoGE 8.1.9 on a cluster with nodes of
> several different sizes.  We're running into an odd issue where SGE stops
> scheduling jobs despite available slots.  The messages file contains many
> instances of messages like this:
> 
> 02/06/2018 12:03:41|worker|wynq1|E|not enough (1) free slots in queue 
> "ondemand.q@cc-hmid1" for job 142497.1
> 02/06/2018 12:03:41|worker|wynq1|W|Skipping remaining 12 orders
> 
> Now, the project that 142497.1 (a 500 slot MPI job) belongs to cannot run in
> the named queue instance -- an RQS limits the usage to 0 slots.  Also, if I
> run "qalter -w p" on the job, it reports "verification: found possible
> assignment with 500 slots".  But the job will never get scheduled.  And
> neither will *any* other jobs.  The only way I've found to get things
> flowing again is to stop and restart sgemaster.
> 
> Since it's possibly (probably?) related, I should say that I have
> max_reservation set to 1024 in the scheduler config.  Also, I've had
> instances of this error in the past where the queue@host instance mentioned
> in the error is actually defined as having 0 slots.  So it's not tied to the
> RQS.
> 
> Can anyone give me some pointers on how to debug this?  Thanks.

IIRC, resource quotas and reservations don't always play nicely together.
The same error can have several different causes, so having seen it in
the past with a queue instance defined as having 0 slots doesn't
eliminate RQS as a suspect.

I would set MONITOR=1 in the params line of the sched_conf (qconf -msconf)
and have a look at the schedule file ($SGE_ROOT/$SGE_CELL/common/schedule)
to see a little more detail about what is going on.
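
A rough sketch of how I'd pull the problem job's records out of that
file once monitoring is on. The function names are made up for
illustration, and the field layout in the comment is as I remember it
from the documentation, so check it against sched_conf(5) for your
version before relying on it.

```shell
# Show the current scheduler "params" line. To turn monitoring on, run
# "qconf -msconf" and make that line include MONITOR=1.
show_sched_params() {
    qconf -ssconf | grep '^params'
}

# Print the schedule-file records for one job.
# Assumed record layout (colon-separated, job id first):
#   jobid:taskid:state:start_time:duration:level_char:object_name:resource_name:utilization
# A line of only colons separates one scheduling run from the next.
# Hypothetical helper; the default path assumes the cell is "default"
# when SGE_CELL is unset.
schedule_for_job() {
    job_id="$1"
    file="${2:-$SGE_ROOT/${SGE_CELL:-default}/common/schedule}"
    awk -F: -v id="$job_id" '$1 == id' "$file"
}
```

For the job in question that would be "schedule_for_job 142497". Records
in state RESERVING should show which queue instances the reservation is
being placed on. The file grows on every scheduling run, so remove
MONITOR=1 again once you're done.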

As a slightly less drastic measure than restarting the qmaster, you could
try lowering the priority of the problem job (qalter -p) below that of the
jobs stuck behind it for a scheduling cycle, to see whether they will
start even if the problem job won't.
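
Something along these lines, as a sketch only. The helper name and the
default of -100 are arbitrary choices of mine; -p accepts values from
-1023 to 1024 with a default of 0, and how much effect it has depends
on weight_priority in your scheduler config.

```shell
# Lower a job's priority so the jobs queued behind it are considered first.
# Hypothetical helper: $1 is the job id, $2 an optional priority
# (defaults to -100, an arbitrary value below the default of 0).
demote_job() {
    qalter -p "${2:--100}" "$1"
}

# Put the job back at the default priority afterwards.
restore_job() {
    qalter -p 0 "$1"
}
```

So "demote_job 142497", wait for a scheduling cycle or two, check qstat,
and then "restore_job 142497". Raising a priority above 0 needs operator
or manager rights, but that isn't an issue when going back to 0.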

I don't know how your cluster is set up, but I would try to tweak the
config so that larger jobs (the ones that need reservations) don't even
consider queue instances that are constrained by RQS.

William  
