Hi William,

Thanks for the directions. I tried changing the queue configuration and
the host group configuration, with and without restarting the master and
the exec nodes, but not much changes.
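To be concrete, the dummy edits I tried look like this (just a sketch;
both commands open the current configuration in $EDITOR, where I add or
remove a trailing space and save):

  qconf -mq all.q          # dummy edit to the cluster queue
  qconf -mhgrp @allhosts   # dummy edit to the hostgroup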
Yes, we're using the spool. Looking closer at it, in
/opt/gridengine/default/spool/qmaster/qinstances/all.q :

[root@t3-local all.q]# ll
total 68
-rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-2-2.local
-rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-2-4.local
-rw-r--r-- 1 sge sge  225 Oct 15  2012 compute-30-1.local
-rw-r--r-- 1 sge sge  224 Jun 16  2012 compute-3-10.local
-rw-r--r-- 1 sge sge  224 Jun 16  2012 compute-3-11.local
-rw-r--r-- 1 sge sge  224 Jun 16  2012 compute-3-12.local
-rw-r--r-- 1 sge sge  227 Sep 27  2012 compute-31-2.local
-rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-2.local
-rw-r--r-- 1 sge sge  223 Nov 20  2012 compute-3-3.local
-rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-4.local
-rw-r--r-- 1 sge sge  223 Jul  5 10:23 compute-3-5.local
-rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-6.local
-rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-7.local
-rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-8.local
-rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-9.local
-rw-r--r-- 1 sge sge 2000 Sep 24  2012 ss
-rw-r--r-- 1 sge sge  229 Jun 16  2012 t3-higgs.ext.domain

It looks good, and the most surprising thing is that the only diff
between compute-3-5 (not working) and compute-3-7 (working) is the
"version 7" vs "version 5" attribute. I'm not sure what it is (a file
serial number, maybe), but it doesn't look very meaningful, as other
hosts have different numbers (up to 12).

I tried a bit of the obvious: moving the all.q directory to a backup
name and restarting the master to see if it recreates it correctly.
Nope; it only made all my hosts go missing. However, if I alter the
queue "in memory", it recreates an empty all.q directory.

Something I realized while trying other procedures:

[root@t3-local all.q]# qmod -e all.q
Queue instance "[email protected]" is already in the specified state: enabled
Queue instance "[email protected]" is already in the specified state: enabled
Queue instance "[email protected]" is already in the specified state: enabled
Queue instance "[email protected]" is already in the specified state: enabled
Queue instance "[email protected]" is already in the specified state: enabled

So although the hostgroup @allhosts looks like what we want, qmod is
only considering those five nodes for some reason. Maybe the question
now is: what makes qstat and qmod consider only those nodes, and how do
we include (or force) the missing ones into that list?

To rule out a hostgroup problem, I copied the host list from
qconf -mhgrp @allhosts directly into all.q's hostlist, but no luck
either.

Any idea on how to actually regenerate the all.q files in the spool?
That seems to be the way.

Summarizing:
* qconf -mq all.q, changing something (removing a useless hostname) and
  saving, doesn't regenerate the existing files
* moving the directory and restarting doesn't do anything besides
  leaving hosts missing
* moving the directory and then qconf -mq all.q only recreates the
  directory itself, not the per-node files

Thanks,
Samir
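P.S. One more thing I plan to try, in case it nudges the qmaster into
re-spooling the missing qinstances: deleting one of the broken hosts
from the @allhosts hostlist with qconf's attribute commands and adding
it straight back. A minimal sketch, using compute-3-5 as the guinea pig
(untested here, so I can't promise it re-creates the spool files):

  # drop the host from the hostgroup, then re-add it
  qconf -dattr hostgroup hostlist compute-3-5.local @allhosts
  qconf -aattr hostgroup hostlist compute-3-5.local @allhosts

  # check whether its qinstance file reappeared in the spool
  ls -l /opt/gridengine/default/spool/qmaster/qinstances/all.q/compute-3-5.local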
On Fri, Jul 5, 2013 at 5:38 PM, William Hay <[email protected]> wrote:
> On 05/07/13 16:11, Samir Cury wrote:
> > Hi William,
> >
> > Thanks for the comments. It helped me to find problems and clean up
> > the system a bit. I realized that all the -31- nodes had been
> > deprecated for a long time and were just hanging around there as
> > orphans. We also had jobs stuck on those entities, which I removed.
> >
> > One funny behavior I noticed: even if I issue qconf -de for those
> > nodes, it only takes effect everywhere once I restart the master.
> > (compute-2-4 is down, and this is fine.)
> >
> > Still, the main problem might be unrelated. I have hosts that
> > appear in qhost -q [1], but although their daemons are running
> > fine, they don't show up in qstat -f [2], nor do they seem to serve
> > slots to any queue, even though they appear everywhere in the
> > configuration. I will share a bit of it here:
> >
> > ------------- qconf -sq all.q
> > qname     all.q
> > hostlist  @allhosts
> > slots     1,[compute-2-4.local=8],[compute-3-2.local=8], \
> >           [compute-3-3.local=8],[compute-3-4.local=8], \
> >           [compute-3-6.local=8],[compute-3-5.local=8], \
> >           [compute-3-7.local=8],[compute-3-8.local=8], \
> >           [compute-3-9.local=8],[compute-3-10.local=8], \
> >           [compute-3-12.local=8],[compute-3-11.local=8], \
> >           [t3-higgs.ext.domain=4],[compute-30-1.local=40]
> >
> > ------------- qconf -mhgrp @allhosts
> > group_name @allhosts
> > hostlist   t3-higgs.ultralight.org compute-3-7.local compute-2-4.local \
> >            compute-3-3.local compute-3-4.local compute-3-6.local \
> >            compute-3-8.local compute-3-9.local compute-3-10.local \
> >            compute-3-11.local compute-3-12.local compute-3-2.local \
> >            compute-2-4.local compute-30-1.local compute-3-5.local
> >
> > I think it just comes back to the FUTEX timeout; it seems to be the
> > only difference I've seen between a working and a non-working node.
> > Let me know if you have clues about what else to check. Network
> > settings seem to be the same on a working and a non-working node.
>
> I suspect this is a red herring: an effect rather than a cause.
>
> > Thanks, Samir
>
> Your qhost -q output looks like it doesn't think there are queue
> instances there. For most purposes, cluster queues are just a way of
> creating queue instances en masse. If they get out of sync, it is the
> qinstances that count.
>
> If you are using classic spooling, have a look in
> $SGE_ROOT/$SGE_CELL/spool/qinstances/all.q to see if there are files
> named after the nodes there.
>
> If they're missing, try making a dummy change to the cluster queue and
> the hostgroup to force re-creation of the qinstances.
>
> If you're not using classic spooling this may be different, but there
> is probably a similar database of qinstances somewhere in the spool.
>
> William
>
> > [1] :
> > [root@compute-3-5 ~]# qhost -q
> > HOSTNAME      ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> > -------------------------------------------------------------------------------
> > global        -              -     -       -       -       -       -
> > compute-2-2   lx26-amd64     8     -   23.5G       -    4.0G       -
> > compute-2-4   lx26-amd64     8     -   23.5G       -    4.0G       -
> >    all.q      BIP   0/0/8   au
> > compute-3-10  lx26-amd64     8  0.03   23.5G  847.6M    4.0G  196.0K
> > compute-3-11  lx26-amd64     8  0.04   23.5G  742.7M    4.0G  196.0K
> > compute-3-12  lx26-amd64     8  0.00   23.5G    1.0G    4.0G  196.0K
> > compute-3-2   lx26-amd64     8  0.06   23.5G  821.3M    4.0G  196.0K
> >    all.q      BIP   0/0/8
> > compute-3-3   lx26-amd64     8  0.00   23.5G  927.4M    4.0G  196.0K
> > compute-3-4   lx26-amd64     8  0.00   23.5G  617.4M    4.0G   24.6M
> > compute-3-5   lx26-amd64     8  0.10   23.5G    1.4G    4.0G     0.0
> > compute-3-6   lx26-amd64    16  0.17   23.5G  869.3M    4.0G  260.0K
> > compute-3-7   lx26-amd64     8  0.00   23.5G  741.6M    4.0G   39.5M
> >    all.q      BIP   0/0/8
> > compute-3-8   lx26-amd64     8  0.00   23.5G  668.8M    4.0G   24.1M
> >    all.q      BIP   0/0/8
> > compute-3-9   lx26-amd64     8  0.02   23.5G  670.4M    4.0G  196.0K
> > compute-30-1  lx26-amd64    80  0.04   62.9G    1.7G    4.0G   38.2M
> > t3-higgs      lx26-amd64     8  0.00   23.5G    1.3G    4.0G    4.5M
> >    all.q      BIP   0/0/4
> >
> > [2] :
> > [root@compute-3-5 ~]# qstat -f
> > queuename                  qtype resv/used/tot. load_avg arch       states
> > ---------------------------------------------------------------------------------
> > [email protected]   BIP   0/0/8          -NA-     lx26-amd64 au
> > ---------------------------------------------------------------------------------
> > [email protected]   BIP   0/0/8          0.05     lx26-amd64
> > ---------------------------------------------------------------------------------
> > [email protected]   BIP   0/0/8          0.00     lx26-amd64
> > ---------------------------------------------------------------------------------
> > [email protected]   BIP   0/0/8          0.00     lx26-amd64
> > ---------------------------------------------------------------------------------
> > [email protected]  BIP   0/0/4          0.00     lx26-amd64
> > On Wed, Jul 3, 2013 at 9:50 AM, William Hay <[email protected]> wrote:
> > On Tue, 2013-07-02 at 13:41 +0000, Samir Cury wrote:
> >> Dear all,
> >>
> >> Our setup is the SGE that comes in a Rocks Roll, in principle an
> >> already automated, out-of-the-box process to deploy it on the
> >> headnode and compute nodes with their respective roles.
> >>
> >> Since our headnode's motherboard was replaced (which in principle
> >> only implies a MAC address change for eth0 and eth1), we have been
> >> facing some problems with our SGE setup. I'd like to share the
> >> tests we did so far and, if possible, get some advice on what
> >> other tests can be done to find the problem.
> >>
> >> [root@t3-local ~]# qstat -f
> >> queuename                  qtype resv/used/tot. load_avg arch       states
> >> ---------------------------------------------------------------------------------
> >> [email protected]   BIP   0/0/8          -NA-     lx26-amd64 au
> >> ---------------------------------------------------------------------------------
> >> [email protected]   BIP   0/8/8          0.05     lx26-amd64
> >> ---------------------------------------------------------------------------------
> >> [email protected]   BIP   0/8/8          0.09     lx26-amd64
> >> ---------------------------------------------------------------------------------
> >> [email protected]   BIP   0/8/8          0.05     lx26-amd64
> >> ---------------------------------------------------------------------------------
> >> [email protected]  BIP   0/16/1         -NA-     lx26-amd64 auo
> >> ---------------------------------------------------------------------------------
> >> [email protected]  BIP   0/16/1         -NA-     lx26-amd64 auo
> >> ---------------------------------------------------------------------------------
> >> [email protected]  BIP   0/16/1         -NA-     lx26-amd64 auo
> >> ---------------------------------------------------------------------------------
> >> [email protected]  BIP   0/0/4          0.09     lx26-amd64
> >
> > The queue instances with 'o' in their state field are not
> > configured to exist as far as Grid Engine is concerned and are
> > merely being retained until the last job running in them finishes.
> > This is probably not what you want.
> >
> > I've seen occasions in the past where the queue instances don't
> > match up with what is configured in the cluster queue.
> >
> > The problem may have manifested now because you've turned off the
> > qmaster for the first time in (presumably) a long while, and the
> > on-disk config doesn't quite match up with what was in memory prior
> > to the outage.
> >
> > If this is the case, you could possibly get them reconfigured by
> > issuing qconf -mq all.q, making a trivial change (IIRC adding a
> > space at the end of a line is sufficient) and saving.
> >
> > It may not help, but it shouldn't hurt.
> >
> > If the queues don't lose at least the 'o' state, then examine the
> > output of qconf -sq all.q | grep '^hostlist' to see whether the
> > cluster queue indicates they should be there.
> >
> > Also check qconf -sq all.q | grep '^slots', as you appear to have
> > more slots in use there than you have configured.
> >
> > compute-2-4.local is something else, though (maybe just sge_execd
> > being down).
> >
> > William
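P.P.S. For the archives: a scriptable variant of the "trivial change"
trick quoted above, assuming qconf -mattr triggers the same
re-evaluation as saving an interactive qconf -mq edit (I haven't
verified that it does):

  # rewrite all.q's hostlist attribute with its current value
  qconf -mattr queue hostlist @allhosts all.q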
