On Tue, 2013-07-02 at 13:41 +0000, Samir Cury wrote:
> Dear all,
>
> Our setup is the SGE that comes in a Rocks Roll, in principle already
> automated/OOTB process to deploy it in the headnode/compute nodes with
> their respective roles.
>
> Since our headnode's motherboard was replaced (in principle only
> affects MAC address change for eth0,eth1), we have been facing some
> problems with our SGE setup, I'd like to share the tests we did so
> far, and if possible get some advice on what other tests can be done
> to find the problem.
>
> [root@t3-local ~]# qstat -f
> queuename                      qtype resv/used/tot. load_avg arch       states
> ---------------------------------------------------------------------------------
> [email protected]      BIP   0/0/8          -NA-     lx26-amd64 au
> ---------------------------------------------------------------------------------
> [email protected]      BIP   0/8/8          0.05     lx26-amd64
> ---------------------------------------------------------------------------------
> [email protected]      BIP   0/8/8          0.09     lx26-amd64
> ---------------------------------------------------------------------------------
> [email protected]      BIP   0/8/8          0.05     lx26-amd64
> ---------------------------------------------------------------------------------
> [email protected]      BIP   0/16/1         -NA-     lx26-amd64 auo
> ---------------------------------------------------------------------------------
> [email protected]      BIP   0/16/1         -NA-     lx26-amd64 auo
> ---------------------------------------------------------------------------------
> [email protected]      BIP   0/16/1         -NA-     lx26-amd64 auo
> ---------------------------------------------------------------------------------
> [email protected]      BIP   0/0/4          0.09     lx26-amd64

The queue instances with 'o' in their state field are not configured to
exist as far as Grid Engine is concerned; they are merely being retained
until the last job running in them finishes. This is probably not what
you want.
I've seen occasions in the past where the queue instances don't match up
with what is configured in the cluster queue. The problem may have
manifested now because you've shut down the qmaster for the first time
in (presumably) a long while, and the on-disk config doesn't quite match
what was in memory before the outage.

If that is the case, you may be able to get them reconfigured by issuing
qconf -mq all.q, making a trivial change (IIRC adding a space at the end
of a line is sufficient), and saving. It may not help, but it shouldn't
hurt.

If the queues don't at least lose the 'o' state, examine the output of
qconf -sq all.q | grep '^hostlist' to see whether the cluster queue says
those instances should exist. Also check
qconf -sq all.q | grep '^slots', since you appear to have more slots in
use there than you have configured.

compute-2-4.local is something else, though (maybe just sge_execd being
down).

William
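For reference, the checks above can be run as a short sequence; this is
a sketch assuming a standard SGE installation with qconf/qstat on the
PATH and a cluster queue named all.q (adjust the queue name to your
site):

```shell
# Nudge the qmaster into rewriting the cluster queue config:
# this opens all.q in $EDITOR; make a trivial change (e.g. a
# trailing space on a line), save, and exit.
qconf -mq all.q

# Check whether the hosts behind the retained ('o') queue
# instances are still listed in the cluster queue's host list.
qconf -sq all.q | grep '^hostlist'

# Compare the configured slot count against what qstat shows
# as in use on each host.
qconf -sq all.q | grep '^slots'

# Re-check the queue instance states afterwards; the 'o'
# entries should disappear once no jobs remain in them.
qstat -f
```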
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
