Re: [gridengine users] Restarting Grid Engine makes qstat forget display order

2012-12-17 Thread Dave Love
Joseph Farran jfar...@uci.edu writes:

 Howdy.

 This is a minor issue, but one I'd like to see if there is a fix for.

 I re-start Grid Engine 8.1.2 every day via a cron job.   

 I noticed that the qstat listing changes the display order when GE is
 restarted.

I think the answer is not to do that.  Why restart it?

 Before the restart, jobs are listed in the order submitted ( by job ID ) but
 after the restart, it's kind of random order.

 Is there any way to keep the original qstat display when GE is restarted?

Not without adding code to sort the list.  I don't know if there's a
good reason it's only done for the pending jobs.
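
For anyone who wants the sorted view without patching Grid Engine, here is a
minimal sketch of sorting the listing outside qstat. It assumes the default
plain-text qstat layout, where each job line begins with the numeric job ID
and everything else (column header, separator line) does not. The function
name sort_qstat_by_job_id is hypothetical and not part of Grid Engine.

```python
def sort_qstat_by_job_id(text: str) -> str:
    """Return qstat-style text with the job lines sorted by numeric job ID."""
    header, jobs = [], []
    for line in text.splitlines():
        fields = line.split()
        # Assumption: a job line starts with a numeric job ID; any other
        # line (column header, dashed separator, blank) is kept on top.
        if fields and fields[0].isdigit():
            jobs.append(line)
        else:
            header.append(line)
    # Compare IDs as integers so that 9 sorts before 10.
    jobs.sort(key=lambda line: int(line.split()[0]))
    return "\n".join(header + jobs)
```

Feeding the output of qstat to a small script that prints
sort_qstat_by_job_id(sys.stdin.read()) would give a listing in job ID order
for display only; it does not change anything inside qmaster.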

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Restarting Grid Engine makes qstat forget display order

2012-12-17 Thread Joseph Farran

On 12/16/2012 10:15 AM, Dave Love wrote:

I think the answer is not to do that.  Why restart it?



Since restarting GE server is not harmful and because Murphy always shows up on 
a Friday night on the eve of a long 3 day weekend, sometimes restarting 
services (which are safe to restart) is a good thing.

Before I switched to Grid Engine, we were running Torque/PBS and restarting that
service nightly made all the difference in the world - yes I know GE is not
Torque.

What advice do you have and/or scripts that check for Grid Engine not
scheduling jobs and restarting it automatically?  I don't mean checking to
make sure sge_qmaster is running, but rather that the scheduling process is
working?

I know I can do a simple qrsh with some expected result and check for that, but 
then I would need a dedicated node for times when all nodes are in use.


Re: [gridengine users] Restarting Grid Engine makes qstat forget display order

2012-12-17 Thread bergman


In the message dated: Mon, 17 Dec 2012 12:26:31 PST,
The pithy ruminations from Joseph Farran on 
Re: [gridengine users] Restarting Grid Engine makes qstat forget display order
 were:
= On 12/16/2012 10:15 AM, Dave Love wrote:
=  I think the answer is not to do that.  Why restart it?
= 
= 
= Since restarting GE server is not harmful and because Murphy always shows up
= on a Friday night on the eve of a long 3 day weekend, sometimes restarting
= services (which are safe to restart) is a good thing.

Except that restarting it is harmful, at a minimum in the example you
gave (qstat display order changes), as well as preventing submissions
while the server is down, leaving orphaned jobs in the queue (i.e.,
jobs that finished while the server was down are not removed from the
list of running jobs), etc.

= 
= Before I switched to Grid Engine, we were running Torque/PBS and restarting
= that service nightly made all the difference in the world - yes I know GE is
= not Torque.
= 
= What advice do you have and/or scripts that check for Grid Engine not
= scheduling jobs and restarting it automatically?  I don't mean checking to
= make sure sge_qmaster is running, but rather

Monitoring the existence and health of system daemons depends a lot
on your monitoring & configuration system. For example, the advice for
checking the status of SGE will vary if you're using Nagios, cfengine,
etc.

= that the scheduling process is working?

Let's turn the question around...

Are you having a problem where sge_qmaster is running but the
scheduling process is not working?

If so, then that's the thing to solve.

In our environment, I've never seen that failure mode. We probably
experience 3~5 restarts of the sge_qmaster annually--a mild irritation,
but a much shorter cumulative outage than doing a daily restart.

= 
= I know I can do a simple qrsh with some expected result and check for that,
= but then I would need a dedicated node for times when all nodes are in use.

Or submit a job with a very high priority, to a queue that's configured to
subordinate (suspend) other jobs, and set a reasonable timer (with periodic
checks to see if the job is waiting in the queue, checks for whether
qalter -w v reports that SGE has found a place to run the job, etc.) before
declaring that the wait for results is equivalent to a problem with SGE.
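
The timer part of that check could look roughly like the sketch below. It is
only an illustration: wait_for_probe and its parameters are hypothetical
names, and job_finished stands for whatever site-specific check you use (for
example a wrapper around qstat, or a test for the probe job's output file).
The clock and sleep arguments exist so the loop can be exercised without
waiting in real time.

```python
import time
from typing import Callable


def wait_for_probe(job_finished: Callable[[], bool],
                   timeout_s: float,
                   poll_s: float = 10.0,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll job_finished() until it returns True or timeout_s has elapsed.

    Returns True if the probe job finished in time, False otherwise.
    job_finished is a caller-supplied (hypothetical) check; this function
    itself does not call any Grid Engine command.
    """
    deadline = clock() + timeout_s
    while True:
        if job_finished():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A False return only says that the probe job did not finish within timeout_s.
On its own it does not identify the cause, so it is better suited to raising
an alert than to triggering an automatic restart.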

Besides, the simple qrsh with some expected result may be a fair
end-to-end test, but it has many failure modes that would not be
fixed by restarting the sge_qmaster. For example, we've seen some of
the following: network down between qmaster & compute nodes, disk full
on compute node, random memory error on compute node causing qrsh job
to segfault, disk full on sge_qmaster, directory services failure on
compute node causing failure to establish remote session, etc.

Doing an automated end-to-end test of the SGE system (submitting a job,
comparing results to a known quantity) is a good monitoring technique,
but it doesn't sufficiently pin-point the cause of any failures to
automate a response.

Mark



[gridengine users] Restarting Grid Engine makes qstat forget display order

2012-12-16 Thread Joseph Farran

  
  
Howdy.
  
  This is a minor issue, but one I'd like to see if there is a fix for.
  
  I re-start Grid Engine 8.1.2 every day via a cron job. 
  
  I noticed that the qstat listing changes the display order when GE
  is restarted.
  
  Before the restart, jobs are listed in the order submitted ( by
  job ID ) but after the restart, it's kind of random order.
  
  Is there any way to keep the original qstat display when GE is
  restarted?
  
  Thanks,
  Joseph
  

  
