Hi William,

Thanks for the comments. They helped me find problems and clean up the
system a bit. I realized that all the -31- nodes had been deprecated for a
long time and were just hanging there as "orphans". We also had jobs stuck
on those entries, which I removed.

One odd behavior I noticed: even after I issue qconf -de for those nodes,
the change only takes effect everywhere once I restart the qmaster.
(compute-2-4 is down, and that is fine.)
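
For the record, the cleanup amounted to roughly the following (a dry-run
sketch: the compute-31-* names are placeholders for our deprecated -31-
nodes, and echo prints each command instead of running it, since qconf -de
is destructive):

```shell
# Placeholder names for the deprecated -31- "orphan" nodes.
orphans="compute-31-1.local compute-31-2.local"

# Build the delete-execution-host commands; dry run via echo only.
cleanup_cmds=$(for h in $orphans; do echo "qconf -de $h"; done)
echo "$cleanup_cmds"

# The deletions only took full effect everywhere after restarting the
# qmaster daemon on the headnode.
```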

Still, the main problem might be unrelated: I have hosts that appear in
qhost -q [1], but although their daemons are running fine, they don't show
up in qstat -f [2] or seem to serve slots to any queue, even though they
appear everywhere in the configuration. I will share a bit of it here:

------------- qconf -sq all.q
qname                 all.q
hostlist              @allhosts
slots                 1,[compute-2-4.local=8],[compute-3-2.local=8], \
                      [compute-3-3.local=8],[compute-3-4.local=8], \
                      [compute-3-6.local=8],[compute-3-5.local=8], \
                      [compute-3-7.local=8],[compute-3-8.local=8], \
                      [compute-3-9.local=8],[compute-3-10.local=8], \
                      [compute-3-12.local=8],[compute-3-11.local=8], \
                      [t3-higgs.ext.domain=4],[compute-30-1.local=40]

------------- qconf -mhgrp @allhosts
group_name @allhosts
hostlist t3-higgs.ultralight.org compute-3-7.local compute-2-4.local \
         compute-3-3.local compute-3-4.local compute-3-6.local \
         compute-3-8.local compute-3-9.local compute-3-10.local \
         compute-3-11.local compute-3-12.local compute-3-2.local \
         compute-2-4.local compute-30-1.local compute-3-5.local
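
To make the mismatch concrete, here is a rough sketch of the comparison I
am doing (sample host names inlined so the logic is visible; on the
cluster, the two lists would be parsed out of the live qhost -q and
qstat -f output instead):

```shell
# Sketch: which hosts show up in `qhost -q` but have no queue instance in
# `qstat -f`? Host lists below are inlined sample data.
qhost_hosts="compute-3-2 compute-3-5 compute-3-7"   # hosts qhost -q reports
qstat_hosts="compute-3-2 compute-3-7"               # hosts with a qstat -f queue instance

missing=""
for h in $qhost_hosts; do
    case " $qstat_hosts " in
        *" $h "*) ;;                    # host serves a queue instance
        *) missing="$missing $h" ;;     # known to qhost, absent from qstat
    esac
done
echo "in qhost -q but not qstat -f:$missing"
```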

I think it still comes back to the FUTEX timeout; that seems to be the
only difference I've seen between a working and a non-working node.
Network settings appear to be the same on both. Let me know if you have
any clues about what else to check.
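
What I'm checking next is whether the per-host configuration differs;
roughly like this (a dry run with one working and one non-working node as
examples, echoing the commands rather than running them here):

```shell
working=compute-3-7    # node that serves slots in qstat -f
broken=compute-3-5     # node missing from qstat -f despite a running daemon

# Dry run: echo the qconf/diff commands instead of executing them, since
# qconf is only available on the cluster itself.
plan=$(for host in $working $broken; do
    echo "qconf -sconf $host > /tmp/conf.$host"
done
echo "diff /tmp/conf.$working /tmp/conf.$broken")
echo "$plan"
```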

Thanks,
Samir


[1] :
[root@compute-3-5 ~]# qhost -q
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
compute-2-2             lx26-amd64      8     -   23.5G       -    4.0G       -
compute-2-4             lx26-amd64      8     -   23.5G       -    4.0G       -
   all.q                BIP   0/0/8         au
compute-3-10            lx26-amd64      8  0.03   23.5G  847.6M    4.0G  196.0K
compute-3-11            lx26-amd64      8  0.04   23.5G  742.7M    4.0G  196.0K
compute-3-12            lx26-amd64      8  0.00   23.5G    1.0G    4.0G  196.0K
compute-3-2             lx26-amd64      8  0.06   23.5G  821.3M    4.0G  196.0K
   all.q                BIP   0/0/8
compute-3-3             lx26-amd64      8  0.00   23.5G  927.4M    4.0G  196.0K
compute-3-4             lx26-amd64      8  0.00   23.5G  617.4M    4.0G   24.6M
compute-3-5             lx26-amd64      8  0.10   23.5G    1.4G    4.0G     0.0
compute-3-6             lx26-amd64     16  0.17   23.5G  869.3M    4.0G  260.0K
compute-3-7             lx26-amd64      8  0.00   23.5G  741.6M    4.0G   39.5M
   all.q                BIP   0/0/8
compute-3-8             lx26-amd64      8  0.00   23.5G  668.8M    4.0G   24.1M
   all.q                BIP   0/0/8
compute-3-9             lx26-amd64      8  0.02   23.5G  670.4M    4.0G  196.0K
compute-30-1            lx26-amd64     80  0.04   62.9G    1.7G    4.0G   38.2M
t3-higgs                lx26-amd64      8  0.00   23.5G    1.3G    4.0G    4.5M
   all.q                BIP   0/0/4

[2] :
[root@compute-3-5 ~]# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
[email protected]        BIP   0/0/8          -NA-     lx26-amd64    au
---------------------------------------------------------------------------------
[email protected]        BIP   0/0/8          0.05     lx26-amd64
---------------------------------------------------------------------------------
[email protected]        BIP   0/0/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
[email protected]        BIP   0/0/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
[email protected]  BIP   0/0/4          0.00     lx26-amd64

On Wed, Jul 3, 2013 at 9:50 AM, William Hay <[email protected]> wrote:

> On Tue, 2013-07-02 at 13:41 +0000, Samir Cury wrote:
> > Dear all,
> >
> > Our setup is the SGE that comes in a Rocks Roll, in principle already
> > automated/OOTB process to deploy it in the headnode/compute nodes with
> > their respective roles.
> >
> > Since our headnode's motherboard was replaced (in principle only
> > affects MAC address change for eth0,eth1), we have been facing some
> > problems with our SGE setup, I'd like to share the tests we did so
> > far, and if possible get some advice on what other tests can be done
> > to find the problem.
>
> > [root@t3-local ~]# qstat -f
> > queuename                      qtype resv/used/tot. load_avg arch          states
> > ---------------------------------------------------------------------------------
> > [email protected]        BIP   0/0/8          -NA-     lx26-amd64    au
> > ---------------------------------------------------------------------------------
> > [email protected]        BIP   0/8/8          0.05     lx26-amd64
> > ---------------------------------------------------------------------------------
> > [email protected]        BIP   0/8/8          0.09     lx26-amd64
> > ---------------------------------------------------------------------------------
> > [email protected]        BIP   0/8/8          0.05     lx26-amd64
> > ---------------------------------------------------------------------------------
> > [email protected]       BIP   0/16/1         -NA-     lx26-amd64    auo
> > ---------------------------------------------------------------------------------
> > [email protected]       BIP   0/16/1         -NA-     lx26-amd64    auo
> > ---------------------------------------------------------------------------------
> > [email protected]       BIP   0/16/1         -NA-     lx26-amd64    auo
> > ---------------------------------------------------------------------------------
> > [email protected]  BIP   0/0/4          0.09     lx26-amd64
> >
> >
> The queue instances with 'o' in their state field are not configured to
> exist as far as grid engine is concerned and are merely being retained
> until the last job running in them finishes.  This is probably not what
> you want.
>
> I've seen occasions in the past where the queue instances don't match up
> with what is configured in the cluster queue.
>
> The problem may have manifested now because you've turned off the
> qmaster for the first time in (presumably) a long while and the on disk
> config doesn't quite match up with what was in memory prior to the
> outage.
>
> If this is the case you could possibly get them reconfigured by issuing
> a qconf -mq all.q making a trivial change (IIRC adding a space at the
> end of a line is sufficient) and saving.
>
> It may not help but it shouldn't hurt.
>
> If the queues don't lose at least the 'o' state then examine the output
> of qconf -sq all.q |grep '^hostlist' to see if the cluster queue
> indicated they should be there.
>
> Also qconf -sq all.q|grep '^slots' as you appear to have more slots
> running there than you have 'configured'
>
> compute-2-4.local is something else though (maybe just sge_execd down).
>
>
> William
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
