In the message dated: Mon, 17 Dec 2012 12:26:31 PST,
The pithy ruminations from Joseph Farran on
<Re: [gridengine users] Restarting Grid Engine makes qstat forget display order>
were:
=> On 12/16/2012 10:15 AM, Dave Love wrote:
=> > I think the answer is not to do that.  Why restart it?
=> >
=> 
=> Since restarting GE server is not harmful and because Murphy always
=> shows up on a Friday night

Except that restarting it is "harmful", at a minimum in the example you
gave (qstat display order changes), as well as preventing submissions
while the server is down, leaving "orphaned" jobs in the queue (i.e.,
jobs that finished while the server was down are not removed from the
list of running jobs), etc.

=> on the eve of a long 3 day weekend, sometimes restarting services
=> (which are safe to restart) is a good thing.
=> 
=> Before I switched to Grid Engine, we were running Torque/PBS and
=> restarting that service nightly made all the difference in the world -
=> yes I know GE is not Torque.
=> 
=> What advice do you have and/or scripts that check for Grid Engine not
=> scheduling jobs and restarting it automatically?  I don't mean
=> checking to make sure sge_qmaster is running, but rather

Monitoring the existence and health of system daemons depends a lot
on your monitoring & configuration system. For example, the advice for
checking the status of SGE will vary if you're using Nagios, cfengine,
etc.
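
Purely as a sketch (not something we run here), a Nagios-style check
might look roughly like the following, assuming the SGE client tools
and SGE_ROOT are available on the monitoring host. The qmaster host
and port are placeholders, and the exact exit behaviour of qping may
vary between SGE versions:

    #!/bin/sh
    # Rough sketch of a Nagios-style SGE health check (not a drop-in plugin).
    # QMASTER_HOST and QMASTER_PORT are placeholders; adjust for your site.
    QMASTER_HOST=qmaster.example.com
    QMASTER_PORT=6444

    # Is sge_qmaster answering at the protocol level?
    if ! qping -info "$QMASTER_HOST" "$QMASTER_PORT" qmaster 1 >/dev/null 2>&1; then
        echo "CRITICAL: sge_qmaster not responding on $QMASTER_HOST:$QMASTER_PORT"
        exit 2
    fi

    # Can a normal client command still talk to it?
    if ! qstat -u '*' >/dev/null 2>&1; then
        echo "CRITICAL: qstat cannot reach the qmaster"
        exit 2
    fi

    echo "OK: qmaster answering qping and qstat"
    exit 0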

=> that the scheduling process is working?

Let's turn the question around...

        Are you having a problem where sge_qmaster is running but the
        scheduling process is not working?

If so, then that's the thing to solve.

In our environment, I've never seen that failure mode. We probably
experience 3-5 restarts of the sge_qmaster annually, which is a mild
irritation, but a much shorter cumulative outage than doing a daily
restart.

=> 
=> I know I can do a simple qrsh with some expected result and check for
=> that, but then I would need a dedicated node for times when all nodes
=> are in use.

Or submit a job with a very high priority, to a queue that's configured to
subordinate (suspend) other jobs, and set a reasonable timer (with periodic
checks to see if the job is waiting in the queue, checks for whether "qalter
-w v" reports that SGE has found a place to run the job, etc.) before
declaring that the wait for results is equivalent to a problem with SGE.
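
As a rough sketch of that approach (illustrative only: "probe.q" is a
made-up name for such a subordinating queue, the priority and timeout
values are arbitrary, and it assumes a qsub recent enough to support
-terse):

    #!/bin/sh
    # Illustrative probe-job sketch: submit a trivial job to a
    # high-priority queue that subordinates the normal queues, then
    # wait a bounded time for it to be scheduled and finish.
    TIMEOUT=300        # seconds to wait before declaring a problem
    JOBID=$(qsub -terse -q probe.q -p 1000 -b y -j y -o /dev/null /bin/true)

    waited=0
    while [ "$waited" -lt "$TIMEOUT" ]; do
        # Once the job no longer shows up in qstat, it was scheduled
        # and has finished (or has been deleted).
        if ! qstat | grep -q "^ *$JOBID "; then
            echo "OK: probe job $JOBID was scheduled and ran"
            exit 0
        fi
        # While waiting, "qalter -w v $JOBID" can be used to ask whether
        # SGE currently sees anywhere the job could run (as noted above).
        sleep 10
        waited=$((waited + 10))
    done

    echo "CRITICAL: probe job $JOBID still pending after ${TIMEOUT}s"
    qdel "$JOBID" >/dev/null 2>&1
    exit 2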

Besides, the "simple qrsh with some expected result" may be a fair
end-to-end test, but it has many failure modes that would not be
resolved by restarting the sge_qmaster. For example, we've seen some of
the following: network down between qmaster & compute nodes, disk full
on a compute node, random memory error on a compute node causing the
qrsh job to segfault, disk full on the sge_qmaster host, directory
services failure on a compute node preventing a remote session from
being established, etc. None of those would be fixed by restarting the
queue master.

Doing an automated end-to-end test of the SGE system (submitting a job,
comparing results to a known quantity) is a good monitoring technique,
but it doesn't pinpoint the cause of a failure well enough to automate
a response.
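
For what it's worth, that basic test is simple enough to script; in
this illustrative sketch the expected string is arbitrary, and "-now
no" just makes qrsh wait for a slot rather than fail immediately when
the cluster is full:

    #!/bin/sh
    # Illustrative end-to-end test: run a trivial command on a compute
    # node via qrsh and compare the output to a known value.
    EXPECTED="sge-e2e-ok"
    RESULT=$(qrsh -now no echo "$EXPECTED" 2>/dev/null)

    if [ "$RESULT" = "$EXPECTED" ]; then
        echo "OK: qrsh round trip returned the expected result"
        exit 0
    else
        echo "CRITICAL: qrsh round trip failed (got: '$RESULT')"
        exit 2
    fi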

Mark



_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
