-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all,

We've just reinstalled a cluster, moving from RHEL5 to RHEL6 and
from xCAT 2.6.9 to 2.8.2.

Reusing the same switch definitions that worked for discovery in xCAT
2.6.9 (and adding the nodes from scratch) we find that we are getting
very unreliable node discovery, often with many nodes discovered as
merri001 or merri015.

Here you can see the first 14 nodes discovered fine, then the next three
are all discovered as merri001:

Aug  6 16:14:23 merri-m xCAT: xcatd: Processing discovery request from 10.6.3.2
Aug  6 16:14:28 merri-m xCAT node discovery: merri001 has been discovered
Aug  6 16:17:28 merri-m xCAT: xcatd: Processing discovery request from 10.6.3.4
Aug  6 16:17:34 merri-m xCAT node discovery: merri002 has been discovered
Aug  6 16:22:19 merri-m xCAT: xcatd: Processing discovery request from 10.6.3.6
Aug  6 16:22:25 merri-m xCAT node discovery: merri003 has been discovered
Aug  6 16:22:50 merri-m xCAT: xcatd: Processing discovery request from 10.6.3.13
Aug  6 16:22:53 merri-m xCAT node discovery: merri004 has been discovered
Aug  6 16:22:59 merri-m xCAT: xcatd: Processing discovery request from 10.6.3.15
Aug  6 16:23:04 merri-m xCAT node discovery: merri005 has been discovered
Aug  6 16:23:09 merri-m xCAT: xcatd: Processing discovery request from 10.6.3.16
Aug  6 16:23:11 merri-m xCAT node discovery: merri007 has been discovered
Aug  6 16:23:11 merri-m xCAT: xcatd: Processing discovery request from 10.6.3.17
Aug  6 16:23:13 merri-m xCAT node discovery: merri006 has been discovered
Aug  6 16:23:14 merri-m xCAT: xcatd: Processing discovery request from 10.6.3.18
Aug  6 16:23:16 merri-m xCAT node discovery: merri008 has been discovered
Aug  6 16:23:16 merri-m xCAT: xcatd: Processing discovery request from 10.6.3.19
Aug  6 16:23:18 merri-m xCAT node discovery: merri010 has been discovered
Aug  6 16:23:22 merri-m xCAT: xcatd: Processing discovery request from 10.6.3.20
Aug  6 16:23:24 merri-m xCAT node discovery: merri009 has been discovered
Aug  6 16:27:40 merri-m xCAT: xcatd: Processing discovery request from 10.6.3.25
Aug  6 16:27:46 merri-m xCAT node discovery: merri011 has been discovered
Aug  6 16:27:46 merri-m xCAT: xcatd: Processing discovery request from 10.6.3.26
Aug  6 16:27:48 merri-m xCAT node discovery: merri012 has been discovered
Aug  6 16:27:55 merri-m xCAT: xcatd: Processing discovery request from 10.6.3.28
Aug  6 16:27:57 merri-m xCAT node discovery: merri013 has been discovered
Aug  6 16:28:01 merri-m xCAT: xcatd: Processing discovery request from 10.6.3.29
Aug  6 16:28:03 merri-m xCAT node discovery: merri014 has been discovered
Aug  6 16:28:10 merri-m xCAT: xcatd: Processing discovery request from 10.6.3.34
Aug  6 16:28:13 merri-m xCAT node discovery: merri001 has been discovered
Aug  6 16:28:16 merri-m xCAT: xcatd: Processing discovery request from 10.6.3.36
Aug  6 16:28:22 merri-m xCAT node discovery: merri001 has been discovered
Aug  6 16:28:26 merri-m xCAT: xcatd: Processing discovery request from 10.6.3.38
Aug  6 16:28:28 merri-m xCAT node discovery: merri001 has been discovered

It's not always the same nodes that are getting misidentified either; I
could handle it if it were consistent.  What is consistent, though, is
that it seems to be merri001 and merri015 which accumulate the
duplicates.
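
For anyone wanting to tally this from their own syslog, here is a rough
Python sketch.  It assumes the two message formats shown above and that
each "has been discovered" line belongs to the most recent "Processing
discovery request" line, which holds for the excerpt above but may not
hold if xcatd interleaves requests.  The function name is made up for
illustration.

```python
import re
from collections import defaultdict

# Patterns for the two syslog message formats shown in the excerpt above.
REQUEST = re.compile(r"Processing discovery request from (\S+)")
DISCOVERED = re.compile(r"node discovery: (\S+) has been discovered")


def duplicate_discoveries(lines):
    """Return {node: [requesting IPs]} for nodes discovered more than once.

    Assumption: each 'discovered' line pairs with the most recent
    'request' line, as in the log excerpt above.
    """
    ips_by_node = defaultdict(list)
    last_ip = None
    for line in lines:
        m = REQUEST.search(line)
        if m:
            last_ip = m.group(1)
            continue
        m = DISCOVERED.search(line)
        if m and last_ip is not None:
            ips_by_node[m.group(1)].append(last_ip)
            last_ip = None
    return {node: ips for node, ips in ips_by_node.items() if len(ips) > 1}
```

Feeding it the lines of /var/log/messages (or wherever xcatd logs on
your system) gives a dictionary of the node names that were claimed by
more than one requesting IP.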

Our switch config is:

#node,switch,port,vlan,interface,comments,disable
"merri","sw03","27",,,,
"merri-v","sw04","27",,,,
"sw02b","sw02","|\D+0*(\d+)|(($1-26))|",,,,
"sw02a","sw02","|\D+0*(\d+)|($1)|",,,,
"sw03a","sw03","|\D+0*(\d+)|(($1)-14)|",,,,
"sw04a","sw04","|\D+0*(\d+)|(($1)-54)|",,,,
"nsd","sw05","|\D+0*(\d+)|(($1)+24)|",,,,
"merri081","sw05","35",,,,
"merri082","sw05","36",,,,
"merri083","sw06","22",,,,
"terri","sw06","16",,,,
"turpin","sw05","17",,,,

We added nodes as:

nodeadd merri001-merri014 groups=all,compute,sw02a,ipmis
nodeadd merri015-merri040 groups=all,compute,sw03a,ipmis
nodeadd merri041-merri054 groups=all,compute,sw02b,ipmis
nodeadd merri055-merri080 groups=all,compute,sw04a,ipmis

The groups sw02a, sw03a, sw02b and sw04a define the switch ports thus:

sw02a:
    objtype=group
    grouptype=static
    members=merri001,merri002,merri003,merri004,merri005,merri006,merri007,merri008,merri009,merri010,merri011,merri012,merri013,merri014
    switch=sw02
    switchport=|\D+0*(\d+)|($1)|

sw02b:
    objtype=group
    grouptype=static
    members=merri041,merri042,merri043,merri044,merri045,merri046,merri047,merri048,merri049,merri050,merri051,merri052,merri053,merri054
    switch=sw02
    switchport=|\D+0*(\d+)|(($1-26))|

sw03a:
    objtype=group
    grouptype=static
    members=merri015,merri016,merri017,merri018,merri019,merri020,merri021,merri022,merri023,merri024,merri025,merri026,merri027,merri028,merri029,merri030,merri031,merri032,merri033,merri034,merri035,merri036,merri037,merri038,merri039,merri040
    switch=sw03
    switchport=|\D+0*(\d+)|(($1)-14)|

sw04a:
    objtype=group
    grouptype=static
    members=merri055,merri056,merri057,merri058,merri059,merri060,merri061,merri062,merri063,merri064,merri065,merri066,merri067,merri068,merri069,merri070,merri071,merri072,merri073,merri074,merri075,merri076,merri077,merri078,merri079,merri080
    switch=sw04
    switchport=|\D+0*(\d+)|(($1)-54)|
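
To make the intended mapping explicit, here is a rough Python emulation
of how these switchport expressions should expand for a given node
name.  It is only a sketch of the "|pattern|replacement|" form used
above (capture the node number, then evaluate the arithmetic in the
replacement), not xCAT's own code, and the function name is made up for
illustration.

```python
import re


def expand_switchport(node, expr):
    """Expand a '|pattern|replacement|' expression for one node name.

    Handles only the arithmetic replacements used in the switch table
    above; this is an illustration, not xCAT's implementation.
    """
    _, pattern, replacement, _ = expr.split("|")
    match = re.search(pattern, node)
    if match is None:
        raise ValueError(f"{node!r} does not match {pattern!r}")
    # Substitute $1, $2, ... with the captured groups.
    filled = re.sub(r"\$(\d+)",
                    lambda m: match.group(int(m.group(1))),
                    replacement)
    # Only digits, whitespace, parentheses and arithmetic operators
    # are allowed before evaluating.
    if not re.fullmatch(r"[\d\s()+\-*/]+", filled):
        raise ValueError(f"unsupported replacement: {filled!r}")
    return int(eval(filled, {"__builtins__": {}}))


SW02A = r"|\D+0*(\d+)|($1)|"
SW02B = r"|\D+0*(\d+)|(($1-26))|"
SW03A = r"|\D+0*(\d+)|(($1)-14)|"
SW04A = r"|\D+0*(\d+)|(($1)-54)|"

print(expand_switchport("merri001", SW02A))  # port on sw02
print(expand_switchport("merri015", SW03A))  # port on sw03
print(expand_switchport("merri041", SW02B))  # port on sw02
print(expand_switchport("merri055", SW04A))  # port on sw04
```

Under this emulation merri001-merri014 land on sw02 ports 1-14,
merri015-merri040 on sw03 ports 1-26, merri041-merri054 on sw02 ports
15-28 and merri055-merri080 on sw04 ports 1-26, so merri001 and
merri015 are the port 1 nodes on sw02 and sw03 respectively.  Whether
that has anything to do with them collecting the duplicates I can't
say.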



Has anyone seen anything like this?

The fact that we've not changed hardware at all, merely RHEL
and xCAT, makes me think it's xCAT related, and the fact that
it's not consistent makes me wonder if it's some odd state-related
problem (or a race condition).

All the best!
Chris
- -- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: [email protected] Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlIAqBgACgkQO2KABBYQAh/pugCeNJw0GImzok7yftoUQezNc8Tl
3/oAn2xA9ywUgi5QYYtH4EqhYJ2IeIx7
=CSDz
-----END PGP SIGNATURE-----
