In the message dated: Mon, 17 Dec 2012 12:26:31 PST,
The pithy ruminations from Joseph Farran on
Re: [gridengine users] Restarting Grid Engine makes qstat forget display order
were:
= On 12/16/2012 10:15 AM, Dave Love wrote:
= > I think the answer is not to do that. Why restart it?
=
=
= Since restarting the GE server is not harmful, and because Murphy always
= shows up on a Friday night on the eve of a long 3-day weekend, sometimes
= restarting services (which are safe to restart) is a good thing.

Except that restarting it is harmful, at a minimum in the example you
gave (qstat display order changes), as well as preventing submissions
while the server is down, leaving orphaned jobs in the queue (i.e.,
jobs that finished while the server was down are not removed from the
list of running jobs), etc.
=
= Before I switched to Grid Engine, we were running Torque/PBS, and restarting
= that service nightly made all the difference in the world - yes, I know GE
= is not Torque.
=
= What advice do you have and/or scripts that check for Grid Engine not
= scheduling jobs and restarting it automatically? I don't mean checking to
= make sure sge_qmaster is running, but rather that the scheduling process
= is working?

Monitoring the existence and health of system daemons depends a lot
on your monitoring configuration system. For example, the advice for
checking the status of SGE will vary if you're using Nagios, cfengine,
etc.
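As a rough illustration of checking the service rather than just the process, here is a minimal Nagios-style probe skeleton. It is a sketch, not a production plugin: `qstat -f` is the real SGE command it would run by default, but the probe command is passed as an argument so the OK/CRITICAL logic can be exercised without a cluster.

```shell
# check_qmaster: Nagios-style liveness probe for the SGE front door (sketch).
# Prints one status line and returns 0 (OK) or 2 (CRITICAL).
check_qmaster() {
    # $1: command to run; defaults to the real `qstat -f` (assumes SGE in PATH)
    cmd=${1:-"qstat -f"}
    if out=$($cmd 2>&1); then
        echo "OK - qmaster answered"
        return 0
    else
        echo "CRITICAL - qstat failed: $out"
        return 2
    fi
}
```

Note this only tells you the qmaster answers queries; it says nothing about whether the scheduling thread is actually placing jobs, which is the harder question below.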
Let's turn the question around...
Are you having a problem where sge_qmaster is running but the
scheduling process is not working?
If so, then that's the thing to solve.
In our environment, I've never seen that failure mode. We probably
experience 3-5 restarts of the sge_qmaster annually--a mild irritation,
but a much shorter cumulative outage than doing a daily restart.
=
= I know I can do a simple qrsh with some expected result and check for that,
= but then I would need a dedicated node for times when all nodes are in use.
Or submit a job with a very high priority, to a queue that's configured to
subordinate (suspend) other jobs, and set a reasonable timer (with periodic
checks to see if the job is waiting in the queue, checks for whether qalter
-w v reports that SGE has found a place to run the job, etc.) before
declaring that the wait for results is equivalent to a problem with SGE.
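The "reasonable timer with periodic checks" above amounts to a bounded polling loop, which might look like the following sketch. The probe is passed in as a command, so the loop itself can be tried anywhere; the commented `qalter -w v` example, the job id, and the grep pattern are assumptions about your setup, not guaranteed SGE output.

```shell
# wait_for: run a probe command every INTERVAL seconds until it succeeds
# or DEADLINE seconds have elapsed. Returns 0 on success, 1 on timeout.
wait_for() {
    deadline=$1; interval=$2; shift 2
    elapsed=0
    while [ "$elapsed" -lt "$deadline" ]; do
        if "$@"; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}

# Example (hypothetical job id and output pattern): give SGE five minutes
# to find a place for the probe job before declaring a problem.
# wait_for 300 15 sh -c 'qalter -w v 1234 2>&1 | grep -q "suitable queue"'
```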
Besides, the simple qrsh with some expected result may be a fair
end-to-end test, but it has many failure modes that would not be
fixed by restarting the sge_qmaster. For example, we've seen some of
the following: network down between the qmaster and compute nodes,
disk full on a compute node, random memory error on a compute node
causing the qrsh job to segfault, disk full on the sge_qmaster,
directory services failure on a compute node preventing a remote
session from being established, etc.
Doing an automated end-to-end test of the SGE system (submitting a job,
comparing results to a known quantity) is a good monitoring technique,
but it doesn't sufficiently pinpoint the cause of any failures to
automate a response.
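An end-to-end probe of that shape might be sketched as below. The real cluster call (`qrsh echo ...`) is left as a comment and the launcher is parameterized, so the compare-against-a-known-value logic is visible and testable without SGE; note that, per the caveat above, a failure here tells you only that the path is broken somewhere, not where.

```shell
# probe_sge: run a trivial command "through the scheduler" and compare its
# output with a known value. Distinguishes wrong/no answer from a good one,
# but not which component failed.
probe_sge() {
    expected="sge-probe-$$"
    # The real check would be:  actual=$(qrsh echo "$expected" 2>/dev/null)
    # Stand-in: run echo through whatever launcher is passed in (or none).
    actual=$("$@" echo "$expected")
    if [ "$actual" = "$expected" ]; then
        echo "end-to-end OK"
        return 0
    fi
    echo "end-to-end FAILED (got: '$actual')"
    return 1
}
```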
Mark
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users