On Tue, 2013-07-02 at 13:41 +0000, Samir Cury wrote:
> Dear all,
> 
> Our setup is the SGE that comes in a Rocks Roll, in principle already
> automated/OOTB process to deploy it in the headnode/compute nodes with
> their respective roles.
> 
> Since our headnode's motherboard was replaced (in principle only
> affects MAC address change for eth0,eth1), we have been facing some
> problems with our SGE setup, I'd like to share the tests we did so
> far, and if possible get some advice on what other tests can be done
> to find the problem.
 
> [root@t3-local ~]# qstat -f
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> [email protected]        BIP   0/0/8          -NA-     lx26-amd64    au
> ---------------------------------------------------------------------------------
> [email protected]        BIP   0/8/8          0.05     lx26-amd64
> ---------------------------------------------------------------------------------
> [email protected]        BIP   0/8/8          0.09     lx26-amd64
> ---------------------------------------------------------------------------------
> [email protected]        BIP   0/8/8          0.05     lx26-amd64
> ---------------------------------------------------------------------------------
> [email protected]       BIP   0/16/1         -NA-     lx26-amd64    auo
> ---------------------------------------------------------------------------------
> [email protected]       BIP   0/16/1         -NA-     lx26-amd64    auo
> ---------------------------------------------------------------------------------
> [email protected]       BIP   0/16/1         -NA-     lx26-amd64    auo
> ---------------------------------------------------------------------------------
> [email protected]  BIP   0/0/4          0.09     lx26-amd64
> 
> 
The queue instances with 'o' in their state field are not configured to
exist as far as grid engine is concerned and are merely being retained
until the last job running in them finishes.  This is probably not what
you want.

I've seen occasions in the past where the queue instances don't match up
with what is configured in the cluster queue.

The problem may only have manifested now because you've turned off the
qmaster for the first time in (presumably) a long while, and the on-disk
config doesn't quite match what was in memory prior to the outage.

If this is the case you could possibly get them reconfigured by issuing
qconf -mq all.q, making a trivial change (IIRC adding a space at the
end of a line is sufficient), and saving.

It may not help but it shouldn't hurt.

If the queue instances don't lose at least the 'o' state, then examine
the output of qconf -sq all.q | grep '^hostlist' to see whether the
cluster queue indicates they should be there.

Also check qconf -sq all.q | grep '^slots', as you appear to have more
slots in use there than you have configured.

compute-2-4.local is something else though (maybe just sge_execd down).
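For an instance stuck in 'au' with a -NA- load, the usual first step is to check the execd on the node itself. A sketch, assuming passwordless ssh to the node and a Rocks-style install where the init script is named sgeexecd (the exact script name and spool path vary by SGE version, so treat both as assumptions):

```shell
# On the affected node, see whether the execution daemon is running
ssh compute-2-4 'ps -ef | grep [s]ge_execd'

# If it isn't, restart it and watch the spool messages file for errors
ssh compute-2-4 '/etc/init.d/sgeexecd restart'
ssh compute-2-4 'tail -20 $SGE_ROOT/default/spool/compute-2-4/messages'
```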


William

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users