Hi William,

Thanks for the directions. I tried changing the queue configuration and the
host group configuration, with and without a restart of the master and exec
nodes, but not much changed.

Yes, we're using classic spooling. Looking closer at it:

/opt/gridengine/default/spool/qmaster/qinstances/all.q
[root@t3-local all.q]# ll
total 68
-rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-2-2.local
-rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-2-4.local
-rw-r--r-- 1 sge sge  225 Oct 15  2012 compute-30-1.local
-rw-r--r-- 1 sge sge  224 Jun 16  2012 compute-3-10.local
-rw-r--r-- 1 sge sge  224 Jun 16  2012 compute-3-11.local
-rw-r--r-- 1 sge sge  224 Jun 16  2012 compute-3-12.local
-rw-r--r-- 1 sge sge  227 Sep 27  2012 compute-31-2.local
-rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-2.local
-rw-r--r-- 1 sge sge  223 Nov 20  2012 compute-3-3.local
-rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-4.local
-rw-r--r-- 1 sge sge  223 Jul  5 10:23 compute-3-5.local
-rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-6.local
-rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-7.local
-rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-8.local
-rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-9.local
-rw-r--r-- 1 sge sge 2000 Sep 24  2012 ss
-rw-r--r-- 1 sge sge  229 Jun 16  2012 t3-higgs.ext.domain

It looks good, and the most surprising part is that the only diff between
compute-3-5 (not working) and compute-3-7 (working) is a "version 7" vs
"version 5" attribute. I'm not sure what that is (a file serial number,
maybe), but it doesn't look very meaningful, as other hosts have different
numbers (up to 12).

I tried the obvious: moving the all.q directory to a backup name and
restarting the master to see if it would recreate it correctly. Nope. It
just reported all my hosts as missing. However, if I alter the queue "in
memory", it recreates an empty "all.q" directory.
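Concretely, the sequence I tried looks like this (the real path is the
default from our Rocks roll install; here I demo the moves on a scratch
directory so the snippet is safe to run anywhere):

```shell
# Real path on our master: /opt/gridengine/default/spool/qmaster/qinstances
# Demoed on a scratch directory so running this elsewhere can't hurt anything.
QIDIR="${QIDIR:-$(mktemp -d)/qinstances}"
mkdir -p "$QIDIR/all.q"
touch "$QIDIR/all.q/compute-3-5.local"   # stand-in for a per-node qinstance file

mv "$QIDIR/all.q" "$QIDIR/all.q.bak"     # 1) move the directory aside
# 2) on the real master: restart sge_qmaster (service name varies per install)
# 3) qconf -mq all.q  -> only recreated an empty all.q directory, no node files
```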

Something I realized while trying other procedures is:

[root@t3-local all.q]# qmod -e all.q
Queue instance "[email protected]" is already in the specified state: enabled
Queue instance "[email protected]" is already in the specified state: enabled
Queue instance "[email protected]" is already in the specified state: enabled
Queue instance "[email protected]" is already in the specified state: enabled
Queue instance "[email protected]" is already in the specified state: enabled

Meaning that although the hostgroup @allhosts looks like what we want, qmod
is only considering those five nodes for some reason.

Maybe the question now is: what makes those nodes visible to qstat and
qmod, and how do I include (or force) the others into that list?

To rule out a hostgroup problem, I copied the list from qconf -mhgrp
@allhosts directly into all.q's hostlist, but no luck either.

Any idea how to actually regenerate the all.q files in the spool? That
seems to be the way forward. To summarize:

 * qconf -mq all.q, changing it (removing a useless hostname) and saving
doesn't regenerate the existing files
 * moving the directory and restarting doesn't do anything besides
reporting the hosts as missing
 * moving the directory and then running qconf -mq all.q only recreates the
directory, not the node files.
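If nothing regenerates them in place, the one heavier option I can think of
(sketch only, flags from the qconf man page; it drops queue-level state
such as disabled flags, so it would need a quiet window) is to dump, delete
and re-add the cluster queue:

```shell
qconf -sq all.q > /tmp/all.q.conf   # dump the current cluster queue config
qconf -dq all.q                     # delete the cluster queue (and its qinstances)
qconf -Aq /tmp/all.q.conf           # re-add from the dump, forcing fresh qinstances
```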

Thanks,
Samir

On Fri, Jul 5, 2013 at 5:38 PM, William Hay <[email protected]> wrote:

>
> On 05/07/13 16:11, Samir Cury wrote:
> > Hi William,
> >
> > Thanks for the comments. It helped me to find problems/cleanup a
> > bit the system. I realized that all -31- nodes were deprecated for
> > long and just hanging "orphans" there. We also had jobs stuck on
> > those entities that I removed.
> >
> > One funny behavior I noticed is even if I issue qconf -de for those
> > nodes, it only takes effect everywhere once I restart the master.
> > (compute-2-4 is down, and this is fine)
> >
> > Still, the main problem might be unrelated, I have hosts that
> > appear in qhost -q   [1]
> >
> > But although their daemons are running fine, they don't show up in
> > qstat -f  [2]
> >
> > Or seem to serve slots to any queue, although they appear
> > everywhere in the configuration. I will share a bit of it here :
> >
> > ------------- qconf -sq all.q
> > qname                 all.q
> > hostlist              @allhosts
> > slots                 1,[compute-2-4.local=8],[compute-3-2.local=8], \
> >                       [compute-3-3.local=8],[compute-3-4.local=8], \
> >                       [compute-3-6.local=8],[compute-3-5.local=8], \
> >                       [compute-3-7.local=8],[compute-3-8.local=8], \
> >                       [compute-3-9.local=8],[compute-3-10.local=8], \
> >                       [compute-3-12.local=8],[compute-3-11.local=8], \
> >                       [t3-higgs.ext.domain=4],[compute-30-1.local=40]
> >
> > ------------- qconf -mhgrp @allhosts
> > group_name @allhosts
> > hostlist t3-higgs.ultralight.org compute-3-7.local compute-2-4.local \
> >          compute-3-3.local compute-3-4.local compute-3-6.local \
> >          compute-3-8.local compute-3-9.local compute-3-10.local \
> >          compute-3-11.local compute-3-12.local compute-3-2.local \
> >          compute-2-4.local compute-30-1.local compute-3-5.local
> >
> > I think it just comes back to the FUTEX timeout, seems the only
> > difference I've seen between a working and non-working node. Let me
> > know if you have clues of what else to check. Network settings seem
> > to be the same in a working and non-working node.
> >
> I suspect this is a red herring - an effect rather than a cause.
> > Thanks, Samir
> >
> Your qhost -q command looks like it doesn't think there are queue
> instances there.  For most purposes cluster queues are just a way of
> creating queue instances en-masse.  If they get out of sync then it is
> the qinstances that count.
>
> If you are using classic spool have a look in
> $SGE_ROOT/$SGE_CELL/spool/qinstances/all.q to see if there are files
> named after the nodes there.
>
> If they're missing try making a dummy change to the cluster queue and
> the hostgroup to force re-creation of the qinstances.
>
> If you're not using classic spool this may be different but there's
> probably a similar database of qinstances somewhere in the spool.
>
> William
>
> >
> > [1] : [root@compute-3-5 ~]# qhost -q
> > HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> > -------------------------------------------------------------------------------
> > global                  -               -     -       -       -       -       -
> > compute-2-2             lx26-amd64      8     -   23.5G       -    4.0G       -
> > compute-2-4             lx26-amd64      8     -   23.5G       -    4.0G       -
> >    all.q                BIP   0/0/8         au
> > compute-3-10            lx26-amd64      8  0.03   23.5G  847.6M    4.0G  196.0K
> > compute-3-11            lx26-amd64      8  0.04   23.5G  742.7M    4.0G  196.0K
> > compute-3-12            lx26-amd64      8  0.00   23.5G    1.0G    4.0G  196.0K
> > compute-3-2             lx26-amd64      8  0.06   23.5G  821.3M    4.0G  196.0K
> >    all.q                BIP   0/0/8
> > compute-3-3             lx26-amd64      8  0.00   23.5G  927.4M    4.0G  196.0K
> > compute-3-4             lx26-amd64      8  0.00   23.5G  617.4M    4.0G   24.6M
> > compute-3-5             lx26-amd64      8  0.10   23.5G    1.4G    4.0G     0.0
> > compute-3-6             lx26-amd64     16  0.17   23.5G  869.3M    4.0G  260.0K
> > compute-3-7             lx26-amd64      8  0.00   23.5G  741.6M    4.0G   39.5M
> >    all.q                BIP   0/0/8
> > compute-3-8             lx26-amd64      8  0.00   23.5G  668.8M    4.0G   24.1M
> >    all.q                BIP   0/0/8
> > compute-3-9             lx26-amd64      8  0.02   23.5G  670.4M    4.0G  196.0K
> > compute-30-1            lx26-amd64     80  0.04   62.9G    1.7G    4.0G   38.2M
> > t3-higgs                lx26-amd64      8  0.00   23.5G    1.3G    4.0G    4.5M
> >    all.q                BIP   0/0/4
> >
> > [2] : [root@compute-3-5 ~]# qstat -f
> > queuename                      qtype resv/used/tot. load_avg arch          states
> > ---------------------------------------------------------------------------------
> > [email protected]        BIP   0/0/8          -NA-     lx26-amd64    au
> > ---------------------------------------------------------------------------------
> > [email protected]        BIP   0/0/8          0.05     lx26-amd64
> > ---------------------------------------------------------------------------------
> > [email protected]        BIP   0/0/8          0.00     lx26-amd64
> > ---------------------------------------------------------------------------------
> > [email protected]        BIP   0/0/8          0.00     lx26-amd64
> > ---------------------------------------------------------------------------------
> > [email protected]  BIP   0/0/4          0.00     lx26-amd64
> >
> > On Wed, Jul 3, 2013 at 9:50 AM, William Hay <[email protected]> wrote:
> > On Tue, 2013-07-02 at 13:41 +0000, Samir Cury wrote:
> >> Dear all,
> >>
> >> Our setup is the SGE that comes in a Rocks Roll, in principle
> >> already automated/OOTB process to deploy it in the
> >> headnode/compute nodes with their respective roles.
> >>
> >> Since our headnode's motherboard was replaced (in principle only
> >> affects MAC address change for eth0,eth1), we have been facing
> >> some problems with our SGE setup, I'd like to share the tests we
> >> did so far, and if possible get some advice on what other tests
> >> can be done to find the problem.
> >
> >> [root@t3-local ~]# qstat -f
> >> queuename                      qtype resv/used/tot. load_avg arch          states
> >> ---------------------------------------------------------------------------------
> >> [email protected]        BIP   0/0/8          -NA-     lx26-amd64    au
> >> ---------------------------------------------------------------------------------
> >> [email protected]        BIP   0/8/8          0.05     lx26-amd64
> >> ---------------------------------------------------------------------------------
> >> [email protected]        BIP   0/8/8          0.09     lx26-amd64
> >> ---------------------------------------------------------------------------------
> >> [email protected]        BIP   0/8/8          0.05     lx26-amd64
> >> ---------------------------------------------------------------------------------
> >> [email protected]       BIP   0/16/1         -NA-     lx26-amd64    auo
> >> ---------------------------------------------------------------------------------
> >> [email protected]       BIP   0/16/1         -NA-     lx26-amd64    auo
> >> ---------------------------------------------------------------------------------
> >> [email protected]       BIP   0/16/1         -NA-     lx26-amd64    auo
> >> ---------------------------------------------------------------------------------
> >> [email protected]  BIP   0/0/4          0.09     lx26-amd64
> >>
> >>
> > The queue instances with 'o' in their state field are not
> > configured to exist as far as grid engine is concerned and are
> > merely being retained until the last job running in them finishes.
> > This is probably not what you want.
> >
> > I've seen occasions in the past where the queue instances don't
> > match up with what is configured in the cluster queue.
> >
> > The problem may have manifested now because you've turned off the
> > qmaster for the first time in (presumably) a long while and the on
> > disk config doesn't quite match up with what was in memory prior to
> > the outage.
> >
> > If this is the case you could possibly get them reconfigured by
> > issuing a qconf -mq all.q making a trivial change (IIRC adding a
> > space at the end of a line is sufficient) and saving.
> >
> > It may not help but it shouldn't hurt.
> >
> > If the queues don't lose at least the 'o' state then examine the
> > output of qconf -sq all.q |grep '^hostlist' to see if the cluster
> > queue indicated they should be there.
> >
> > Also qconf -sq all.q|grep '^slots' as you appear to have more
> > slots running there than you have 'configured'
> >
> > compute-2-4.local is something else though (maybe just sge_execd
> > down).
> >
> >
> > William
> >
> >
>
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
