Digimer wrote:
Hi all,

Starting a new thread from the "Clustered LVM with iptables issue" thread... I've decided to review how I do networking in my cluster entirely. I make zero claims to being great at networks, so I would love some feedback.

I've got three active/passive bonded interfaces: Back-Channel (BCN), Storage (SN) and Internet-Facing (IFN) networks. The IFN is "off limits" to the cluster, as it is dedicated to hosted server traffic only. Previously, I used only the BCN for cman/corosync multicast traffic, with no RRP. A couple of months ago, I had a cluster partition when VM live migration (also on the BCN) congested the network, so I decided to enable RRP using the SN as backup, which has been marginally successful.

Now I want to switch to unicast (<cman transport="udpu">), use RRP with the SN as the backup ring and the BCN as the primary ring, and do a proper iptables firewall. Is this sane?
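For reference, here is a rough sketch of the cluster.conf pieces involved (the cluster name, config_version, host names and the two-node settings are illustrative; fencing and resources are omitted):

====] cluster.conf (sketch)
<cluster name="my-cluster" config_version="42">
	<!-- unicast totem instead of multicast -->
	<cman transport="udpu" expected_votes="1" two_node="1"/>
	<!-- ring mode is set here; can be "active" or "passive" -->
	<totem rrp_mode="passive"/>
	<clusternodes>
		<!-- the node name resolves to the BCN IP (ring 0),
		     the altname resolves to the SN IP (ring 1) -->
		<clusternode name="node1.bcn" nodeid="1">
			<altname name="node1.sn"/>
		</clusternode>
		<clusternode name="node2.bcn" nodeid="2">
			<altname name="node2.sn"/>
		</clusternode>
	</clusternodes>
</cluster>
====

With the altname entries, corosync forms ring 0 on the BCN addresses and ring 1 on the SN addresses, which is what the "adding new UDPU member" lines below show.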
When I stopped iptables entirely and started cman with unicast + RRP, I saw this:

====] Node 1
Sep 11 17:31:24 node1 kernel: DLM (built Aug 10 2015 09:45:36) installed
Sep 11 17:31:24 node1 corosync[2523]: [MAIN ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service.
Sep 11 17:31:24 node1 corosync[2523]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
Sep 11 17:31:24 node1 corosync[2523]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf
Sep 11 17:31:24 node1 corosync[2523]: [MAIN ] Successfully parsed cman config
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] The network interface [10.20.10.1] is now up.
Sep 11 17:31:24 node1 corosync[2523]: [QUORUM] Using quorum provider quorum_cman
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Sep 11 17:31:24 node1 corosync[2523]: [CMAN ] CMAN 3.0.12.1 (built Jul 6 2015 05:30:35) started
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded: corosync configuration service
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded: corosync profile loading service
Sep 11 17:31:24 node1 corosync[2523]: [QUORUM] Using quorum provider quorum_cman
Sep 11 17:31:24 node1 corosync[2523]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Sep 11 17:31:24 node1 corosync[2523]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] adding new UDPU member {10.20.10.1}
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] adding new UDPU member {10.20.10.2}
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] The network interface [10.10.10.1] is now up.
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] adding new UDPU member {10.10.10.1}
Sep 11 17:31:24 node1 corosync[2523]: [TOTEM ] adding new UDPU member {10.10.10.2}
Sep 11 17:31:27 node1 corosync[2523]: [TOTEM ] Incrementing problem counter for seqid 1 iface 10.10.10.1 to [1 of 3]
Sep 11 17:31:27 node1 corosync[2523]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 11 17:31:27 node1 corosync[2523]: [CMAN ] quorum regained, resuming activity
Sep 11 17:31:27 node1 corosync[2523]: [QUORUM] This node is within the primary component and will provide service.
Sep 11 17:31:27 node1 corosync[2523]: [QUORUM] Members[1]: 1
Sep 11 17:31:27 node1 corosync[2523]: [QUORUM] Members[1]: 1
Sep 11 17:31:27 node1 corosync[2523]: [CPG ] chosen downlist: sender r(0) ip(10.20.10.1) r(1) ip(10.10.10.1) ; members(old:0 left:0)
Sep 11 17:31:27 node1 corosync[2523]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 11 17:31:27 node1 corosync[2523]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 11 17:31:27 node1 corosync[2523]: [QUORUM] Members[2]: 1 2
Sep 11 17:31:27 node1 corosync[2523]: [QUORUM] Members[2]: 1 2
Sep 11 17:31:27 node1 corosync[2523]: [CPG ] chosen downlist: sender r(0) ip(10.20.10.1) r(1) ip(10.10.10.1) ; members(old:1 left:0)
Sep 11 17:31:27 node1 corosync[2523]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 11 17:31:29 node1 corosync[2523]: [TOTEM ] ring 1 active with no faults
Sep 11 17:31:29 node1 fenced[2678]: fenced 3.0.12.1 started
Sep 11 17:31:29 node1 dlm_controld[2691]: dlm_controld 3.0.12.1 started
Sep 11 17:31:30 node1 gfs_controld[2755]: gfs_controld 3.0.12.1 started
====

====] Node 2
Sep 11 17:31:23 node2 kernel: DLM (built Aug 10 2015 09:45:36) installed
Sep 11 17:31:23 node2 corosync[2271]: [MAIN ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service.
Sep 11 17:31:23 node2 corosync[2271]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
Sep 11 17:31:23 node2 corosync[2271]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf
Sep 11 17:31:23 node2 corosync[2271]: [MAIN ] Successfully parsed cman config
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] The network interface [10.20.10.2] is now up.
Sep 11 17:31:23 node2 corosync[2271]: [QUORUM] Using quorum provider quorum_cman
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Sep 11 17:31:23 node2 corosync[2271]: [CMAN ] CMAN 3.0.12.1 (built Jul 6 2015 05:30:35) started
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded: corosync configuration service
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded: corosync profile loading service
Sep 11 17:31:23 node2 corosync[2271]: [QUORUM] Using quorum provider quorum_cman
Sep 11 17:31:23 node2 corosync[2271]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Sep 11 17:31:23 node2 corosync[2271]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] adding new UDPU member {10.20.10.1}
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] adding new UDPU member {10.20.10.2}
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] The network interface [10.10.10.2] is now up.
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] adding new UDPU member {10.10.10.1}
Sep 11 17:31:23 node2 corosync[2271]: [TOTEM ] adding new UDPU member {10.10.10.2}
Sep 11 17:31:26 node2 corosync[2271]: [TOTEM ] Incrementing problem counter for seqid 1 iface 10.10.10.2 to [1 of 3]
Sep 11 17:31:26 node2 corosync[2271]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 11 17:31:26 node2 corosync[2271]: [CMAN ] quorum regained, resuming activity
Sep 11 17:31:26 node2 corosync[2271]: [QUORUM] This node is within the primary component and will provide service.
Sep 11 17:31:26 node2 corosync[2271]: [QUORUM] Members[1]: 2
Sep 11 17:31:26 node2 corosync[2271]: [QUORUM] Members[1]: 2
Sep 11 17:31:26 node2 corosync[2271]: [CPG ] chosen downlist: sender r(0) ip(10.20.10.2) r(1) ip(10.10.10.2) ; members(old:0 left:0)
Sep 11 17:31:26 node2 corosync[2271]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 11 17:31:27 node2 corosync[2271]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 11 17:31:27 node2 corosync[2271]: [QUORUM] Members[2]: 1 2
Sep 11 17:31:27 node2 corosync[2271]: [QUORUM] Members[2]: 1 2
Sep 11 17:31:27 node2 corosync[2271]: [CPG ] chosen downlist: sender r(0) ip(10.20.10.1) r(1) ip(10.10.10.1) ; members(old:1 left:0)
Sep 11 17:31:27 node2 corosync[2271]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 11 17:31:28 node2 corosync[2271]: [TOTEM ] ring 1 active with no faults
Sep 11 17:31:28 node2 fenced[2359]: fenced 3.0.12.1 started
Sep 11 17:31:28 node2 dlm_controld[2390]: dlm_controld 3.0.12.1 started
Sep 11 17:31:29 node2 gfs_controld[2442]: gfs_controld 3.0.12.1 started
====

This looked good to me. So I wanted to test RRP by ifdown'ing bcn_bond1 on node 1 only, leaving bcn_bond1 up on node2.
The cluster survived and seemed to fail over to the SN, but I saw this error printed repeatedly:

====] Node 1
Sep 11 17:31:46 node1 kernel: bcn_bond1: Removing slave bcn_link1
Sep 11 17:31:46 node1 kernel: bcn_bond1: Releasing active interface bcn_link1
Sep 11 17:31:46 node1 kernel: bcn_bond1: the permanent HWaddr of bcn_link1 - 52:54:00:b0:e4:c8 - is still in use by bcn_bond1 - set the HWaddr of bcn_link1 to a different address to avoid conflicts
Sep 11 17:31:46 node1 kernel: bcn_bond1: making interface bcn_link2 the new active one
Sep 11 17:31:46 node1 kernel: ICMPv6 NA: someone advertises our address fe80:0000:0000:0000:5054:00ff:feb0:e4c8 on bcn_link1!
Sep 11 17:31:46 node1 kernel: bcn_bond1: Removing slave bcn_link2
Sep 11 17:31:46 node1 kernel: bcn_bond1: Releasing active interface bcn_link2
Sep 11 17:31:48 node1 ntpd[2037]: Deleting interface #7 bcn_link1, fe80::5054:ff:feb0:e4c8#123, interface stats: received=0, sent=0, dropped=0, active_time=48987 secs
Sep 11 17:31:48 node1 ntpd[2037]: Deleting interface #6 bcn_bond1, fe80::5054:ff:feb0:e4c8#123, interface stats: received=0, sent=0, dropped=0, active_time=48987 secs
Sep 11 17:31:48 node1 ntpd[2037]: Deleting interface #3 bcn_bond1, 10.20.10.1#123, interface stats: received=0, sent=0, dropped=0, active_time=48987 secs
Sep 11 17:31:51 node1 corosync[2523]: [TOTEM ] Incrementing problem counter for seqid 677 iface 10.20.10.1 to [1 of 3]
Sep 11 17:31:53 node1 corosync[2523]: [TOTEM ] ring 0 active with no faults
Sep 11 17:31:57 node1 corosync[2523]: [TOTEM ] Incrementing problem counter for seqid 679 iface 10.20.10.1 to [1 of 3]
Sep 11 17:31:59 node1 corosync[2523]: [TOTEM ] ring 0 active with no faults
Sep 11 17:32:04 node1 corosync[2523]: [TOTEM ] Incrementing problem counter for seqid 681 iface 10.20.10.1 to [1 of 3]
Sep 11 17:32:06 node1 corosync[2523]: [TOTEM ] ring 0 active with no faults
Sep 11 17:32:11 node1 corosync[2523]: [TOTEM ] Incrementing problem counter for seqid 683 iface 10.20.10.1 to [1 of 3]
Sep 11 17:32:13 node1 corosync[2523]: [TOTEM ] ring 0 active with no faults
Sep 11 17:32:17 node1 corosync[2523]: [TOTEM ] Incrementing problem counter for seqid 685 iface 10.20.10.1 to [1 of 3]
Sep 11 17:32:19 node1 corosync[2523]: [TOTEM ] ring 0 active with no faults
Sep 11 17:32:24 node1 corosync[2523]: [TOTEM ] Incrementing problem counter for seqid 687 iface 10.20.10.1 to [1 of 3]
Sep 11 17:32:26 node1 corosync[2523]: [TOTEM ] ring 0 active with no faults
Sep 11 17:32:31 node1 corosync[2523]: [TOTEM ] Incrementing problem counter for seqid 689 iface 10.20.10.1 to [1 of 3]
Sep 11 17:32:33 node1 corosync[2523]: [TOTEM ] ring 0 active with no faults
Sep 11 17:32:37 node1 corosync[2523]: [TOTEM ] Incrementing problem counter for seqid 691 iface 10.20.10.1 to [1 of 3]
Sep 11 17:32:39 node1 corosync[2523]: [TOTEM ] ring 0 active with no faults
Sep 11 17:32:44 node1 corosync[2523]: [TOTEM ] Incrementing problem counter for seqid 693 iface 10.20.10.1 to [1 of 3]
Sep 11 17:32:46 node1 corosync[2523]: [TOTEM ] ring 0 active with no faults
====

====] Node 2
Sep 11 17:31:48 node2 corosync[2271]: [TOTEM ] Incrementing problem counter for seqid 676 iface 10.20.10.2 to [1 of 3]
Sep 11 17:31:50 node2 corosync[2271]: [TOTEM ] ring 0 active with no faults
Sep 11 17:31:54 node2 corosync[2271]: [TOTEM ] Incrementing problem counter for seqid 678 iface 10.20.10.2 to [1 of 3]
Sep 11 17:31:56 node2 corosync[2271]: [TOTEM ] ring 0 active with no faults
Sep 11 17:32:01 node2 corosync[2271]: [TOTEM ] Incrementing problem counter for seqid 680 iface 10.20.10.2 to [1 of 3]
Sep 11 17:32:03 node2 corosync[2271]: [TOTEM ] ring 0 active with no faults
Sep 11 17:32:08 node2 corosync[2271]: [TOTEM ] Incrementing problem counter for seqid 682 iface 10.20.10.2 to [1 of 3]
Sep 11 17:32:10 node2 corosync[2271]: [TOTEM ] ring 0 active with no faults
Sep 11 17:32:14 node2 corosync[2271]: [TOTEM ] Incrementing problem counter for seqid 684 iface 10.20.10.2 to [1 of 3]
Sep 11 17:32:16 node2 corosync[2271]: [TOTEM ] ring 0 active with no faults
Sep 11 17:32:21 node2 corosync[2271]: [TOTEM ] Incrementing problem counter for seqid 686 iface 10.20.10.2 to [1 of 3]
Sep 11 17:32:23 node2 corosync[2271]: [TOTEM ] ring 0 active with no faults
Sep 11 17:32:28 node2 corosync[2271]: [TOTEM ] Incrementing problem counter for seqid 688 iface 10.20.10.2 to [1 of 3]
Sep 11 17:32:30 node2 corosync[2271]: [TOTEM ] ring 0 active with no faults
Sep 11 17:32:35 node2 corosync[2271]: [TOTEM ] Incrementing problem counter for seqid 690 iface 10.20.10.2 to [1 of 3]
Sep 11 17:32:37 node2 corosync[2271]: [TOTEM ] ring 0 active with no faults
Sep 11 17:32:41 node2 corosync[2271]: [TOTEM ] Incrementing problem counter for seqid 692 iface 10.20.10.2 to [1 of 3]
Sep 11 17:32:43 node2 corosync[2271]: [TOTEM ] ring 0 active with no faults
Sep 11 17:32:48 node2 corosync[2271]: [TOTEM ] Incrementing problem counter for seqid 694 iface 10.20.10.2 to [1 of 3]
Sep 11 17:32:50 node2 corosync[2271]: [TOTEM ] ring 0 active with no faults
====

When I ifup'ed bcn_bond1 on node1, the messages stopped printing. So, before I even start on iptables, I am curious whether I am doing something incorrect here. Advice?
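For what it's worth, the firewall rules I had in mind look roughly like this (the SN bond name, the /24 masks, and the port choices are my assumptions from the corosync and dlm defaults, so corrections are welcome):

====] iptables (sketch)
# corosync totem traffic (UDP 5404/5405) on the BCN (ring 0) and the SN (ring 1)
iptables -A INPUT -i bcn_bond1 -s 10.20.10.0/24 -p udp -m multiport --dports 5404,5405 -j ACCEPT
iptables -A INPUT -i sn_bond1  -s 10.10.10.0/24 -p udp -m multiport --dports 5404,5405 -j ACCEPT
# dlm between the nodes (TCP 21064) on the BCN
iptables -A INPUT -i bcn_bond1 -s 10.20.10.0/24 -p tcp --dport 21064 -j ACCEPT
====

The idea being to allow the cluster traffic only from the BCN and SN subnets and default-drop everything else.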
Don't do ifdown. Corosync reacts very badly to ifdown (a long-known issue; it's also one of the reasons for knet in a future version).
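If you want to simulate a ring failure for testing, blocking the traffic while the interface and its IP stay up (or pulling the cable) is a closer approximation than ifdown. Just as an example:

====
# on node 1: drop all traffic on the BCN bond without removing the IP
# (-I puts the rules at the top, ahead of any accept rules)
iptables -I INPUT  -i bcn_bond1 -j DROP
iptables -I OUTPUT -o bcn_bond1 -j DROP
# undo with the same rules and -D instead of -I
====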
Also, RRP active mode is not as well tested as passive, so give passive a try.

Honza
Thanks!
