Hi William,
Thanks for the comments. They helped me find problems and clean up the
system a bit. I realized that all the -31- nodes had been deprecated for a
long time and were just left hanging there as "orphans". We also had jobs
stuck on those entities, which I removed.
One odd behavior I noticed: even when I issue qconf -de for those nodes,
the change only takes effect everywhere once I restart the master.
(compute-2-4 is down, and that is fine.)
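For reference, the cleanup sequence was roughly the following. This is
just a sketch: the -31- host name and job id are placeholders, and the
sgemaster init script name depends on the Rocks/SGE roll version.

  # force-delete the jobs stuck on the orphaned nodes
  qdel -f <job_id>
  # drop the orphan from @allhosts and from the exec host list
  qconf -dattr hostgroup hostlist compute-31-1.local @allhosts
  qconf -de compute-31-1.local
  # only after restarting the qmaster did the removal show up everywhere
  /etc/init.d/sgemaster.t3-local restart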
Still, the main problem might be unrelated: I have hosts that appear in
qhost -q [1], but although their daemons are running fine, they don't show
up in qstat -f [2] or seem to serve slots to any queue, even though they
appear everywhere in the configuration. I will share a bit of it here:
------------- qconf -sq all.q
qname all.q
hostlist @allhosts
slots 1,[compute-2-4.local=8],[compute-3-2.local=8], \
[compute-3-3.local=8],[compute-3-4.local=8], \
[compute-3-6.local=8],[compute-3-5.local=8], \
[compute-3-7.local=8],[compute-3-8.local=8], \
[compute-3-9.local=8],[compute-3-10.local=8], \
[compute-3-12.local=8],[compute-3-11.local=8], \
[t3-higgs.ext.domain=4],[compute-30-1.local=40]
------------- qconf -mhgrp @allhosts
group_name @allhosts
hostlist t3-higgs.ultralight.org compute-3-7.local compute-2-4.local \
compute-3-3.local compute-3-4.local compute-3-6.local \
compute-3-8.local compute-3-9.local compute-3-10.local \
compute-3-11.local compute-3-12.local compute-3-2.local \
compute-2-4.local compute-30-1.local compute-3-5.local
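As a quick cross-check of that configuration I have been comparing the
cluster queue against the resolved group membership, roughly like this
(a sketch; qconf -shgrp_resolved is an SGE 6.2 flag, adjust if your
version differs):

  # which hosts and slots the cluster queue thinks it has
  qconf -sq all.q | grep '^hostlist'
  qconf -sq all.q | grep '^slots'
  # the host group expanded to individual hosts
  qconf -shgrp_resolved @allhosts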
I think it just comes back to the FUTEX timeout; that seems to be the only
difference I've seen between a working and a non-working node. Network
settings seem to be the same on working and non-working nodes. Let me know
if you have clues about what else to check.
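For completeness, this is roughly what I am comparing on a working versus
a non-working node (a sketch; the spool path and the execd port are the
Rocks defaults on our install and may differ elsewhere):

  # is the execd registered with the qmaster at all?
  qconf -sel | grep compute-3-10
  # can the qmaster reach the execd? (6445 is the default execd port)
  qping -info compute-3-10.local 6445 execd 1
  # anything odd in the execd spool messages?
  tail -n 50 /opt/gridengine/default/spool/compute-3-10/messages
  # ask grid engine to explain queue instances in alarm state
  qstat -f -explain a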
Thanks,
Samir
[1] :
[root@compute-3-5 ~]# qhost -q
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
compute-2-2             lx26-amd64      8     -   23.5G       -    4.0G       -
compute-2-4             lx26-amd64      8     -   23.5G       -    4.0G       -
   all.q                BIP   0/0/8         au
compute-3-10            lx26-amd64      8  0.03   23.5G  847.6M    4.0G  196.0K
compute-3-11            lx26-amd64      8  0.04   23.5G  742.7M    4.0G  196.0K
compute-3-12            lx26-amd64      8  0.00   23.5G    1.0G    4.0G  196.0K
compute-3-2             lx26-amd64      8  0.06   23.5G  821.3M    4.0G  196.0K
   all.q                BIP   0/0/8
compute-3-3             lx26-amd64      8  0.00   23.5G  927.4M    4.0G  196.0K
compute-3-4             lx26-amd64      8  0.00   23.5G  617.4M    4.0G   24.6M
compute-3-5             lx26-amd64      8  0.10   23.5G    1.4G    4.0G     0.0
compute-3-6             lx26-amd64     16  0.17   23.5G  869.3M    4.0G  260.0K
compute-3-7             lx26-amd64      8  0.00   23.5G  741.6M    4.0G   39.5M
   all.q                BIP   0/0/8
compute-3-8             lx26-amd64      8  0.00   23.5G  668.8M    4.0G   24.1M
   all.q                BIP   0/0/8
compute-3-9             lx26-amd64      8  0.02   23.5G  670.4M    4.0G  196.0K
compute-30-1            lx26-amd64     80  0.04   62.9G    1.7G    4.0G   38.2M
t3-higgs                lx26-amd64      8  0.00   23.5G    1.3G    4.0G    4.5M
   all.q                BIP   0/0/4
[2] :
[root@compute-3-5 ~]# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
[email protected]            BIP   0/0/8          -NA-     lx26-amd64    au
---------------------------------------------------------------------------------
[email protected]            BIP   0/0/8          0.05     lx26-amd64
---------------------------------------------------------------------------------
[email protected]            BIP   0/0/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
[email protected]            BIP   0/0/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
[email protected]            BIP   0/0/4          0.00     lx26-amd64
On Wed, Jul 3, 2013 at 9:50 AM, William Hay <[email protected]> wrote:
> On Tue, 2013-07-02 at 13:41 +0000, Samir Cury wrote:
> > Dear all,
> >
> > Our setup is the SGE that comes with the Rocks Roll, in principle an
> > already automated, out-of-the-box process that deploys it on the
> > headnode and compute nodes with their respective roles.
> >
> > Since our headnode's motherboard was replaced (which in principle only
> > means a MAC address change for eth0 and eth1), we have been facing some
> > problems with our SGE setup. I'd like to share the tests we did so far
> > and, if possible, get some advice on what other tests can be done to
> > find the problem.
>
> > [root@t3-local ~]# qstat -f
> > queuename                      qtype resv/used/tot. load_avg arch          states
> > ---------------------------------------------------------------------------------
> > [email protected]            BIP   0/0/8          -NA-     lx26-amd64    au
> > ---------------------------------------------------------------------------------
> > [email protected]            BIP   0/8/8          0.05     lx26-amd64
> > ---------------------------------------------------------------------------------
> > [email protected]            BIP   0/8/8          0.09     lx26-amd64
> > ---------------------------------------------------------------------------------
> > [email protected]            BIP   0/8/8          0.05     lx26-amd64
> > ---------------------------------------------------------------------------------
> > [email protected]            BIP   0/16/1         -NA-     lx26-amd64    auo
> > ---------------------------------------------------------------------------------
> > [email protected]            BIP   0/16/1         -NA-     lx26-amd64    auo
> > ---------------------------------------------------------------------------------
> > [email protected]            BIP   0/16/1         -NA-     lx26-amd64    auo
> > ---------------------------------------------------------------------------------
> > [email protected]            BIP   0/0/4          0.09     lx26-amd64
> >
> >
> The queue instances with 'o' in their state field are not configured to
> exist as far as grid engine is concerned and are merely being retained
> until the last job running in them finishes. This is probably not what
> you want.
>
> I've seen occasions in the past where the queue instances don't match up
> with what is configured in the cluster queue.
>
> The problem may have manifested now because you've turned off the
> qmaster for the first time in (presumably) a long while and the on disk
> config doesn't quite match up with what was in memory prior to the
> outage.
>
> If this is the case you could possibly get them reconfigured by issuing
> a qconf -mq all.q making a trivial change (IIRC adding a space at the
> end of a line is sufficient) and saving.
>
> It may not help but it shouldn't hurt.
>
> If the queues don't lose at least the 'o' state then examine the output
> of qconf -sq all.q |grep '^hostlist' to see if the cluster queue
> indicated they should be there.
>
> Also check qconf -sq all.q | grep '^slots', as you appear to have more
> slots in use there than you have 'configured'.
>
> compute-2-4.local is something else though (maybe just sge_execd down).
>
>
> William
>