On 10/06/2016 04:16 PM, Digimer wrote:
> On 06/10/16 05:38 AM, Martin Schlegel wrote:
>> Thanks for the confirmation Jan, but this sounds a bit scary to me !
>>
>> Spinning this experiment a bit further ...
>>
>> Would this not also mean that with a passive rrp with 2 rings it only takes 2
>> different nodes that are not able to communicate on different networks at the
>> same time to have all rings marked faulty on _every_ node ... therefore all
>> cluster members losing quorum immediately, even though n-2 cluster members
>> are technically able to send and receive heartbeat messages through both rings ?
>>
>> I really hope the answer is no and the cluster still somehow has a quorum in
>> this case.
>>
>> Regards,
>> Martin Schlegel
>>
>>
>>> Jan Friesse <[email protected]> wrote on 5 October 2016 at 09:01:
>>>
>>> Martin,
>>>
>>>> Hello all,
>>>>
>>>> I am trying to understand the following 2 Corosync heartbeat ring failure
>>>> scenarios I have been testing, and I hope somebody can explain why this
>>>> makes any sense.
>>>>
>>>> Consider the following cluster:
>>>>
>>>> * 3x nodes: A, B and C
>>>> * 2x NICs for each node
>>>> * Corosync 2.3.5 configured with "rrp_mode: passive" and
>>>>   udpu transport with ring id 0 and 1 on each node.
>>>> * On each node "corosync-cfgtool -s" shows:
>>>>   [...] ring 0 active with no faults
>>>>   [...] ring 1 active with no faults
>>>>
>>>> Consider the following scenarios:
>>>>
>>>> 1. On node A only, block all communication on the first NIC configured with
>>>>    ring id 0
>>>> 2. On node A only, block all communication on all NICs configured with
>>>>    ring id 0 and 1
>>>>
>>>> The result of the above scenarios is as follows:
>>>>
>>>> 1. Nodes A, B and C (!) display the following ring status:
>>>>    [...] Marking ringid 0 interface <IP-Address> FAULTY
>>>>    [...] ring 1 active with no faults
>>>> 2. Node A is shown as OFFLINE - B and C display the following ring status:
>>>>    [...] ring 0 active with no faults
>>>>    [...] ring 1 active with no faults
>>>>
>>>> Questions:
>>>> 1. Is this the expected outcome ?
>>>
>>> Yes
>>>
>>>> 2. In experiment 1, B and C can still communicate with each other over both
>>>>    NICs, so why are B and C not displaying a "no faults" status for ring id
>>>>    0 and 1, just like in experiment 2,
>>>
>>> Because this is how RRP works. RRP marks the whole ring as failed, so every
>>> node sees that ring as failed.
>>>
>>>> when node A is completely unreachable ?
>>>
>>> Because it's a different scenario. In scenario 1 there is a 3-node
>>> membership where one node has one failed ring -> the whole ring is
>>> failed. In scenario 2 there is a 2-node membership where both rings
>>> work as expected. Node A is completely unreachable and it's not in the
>>> membership.
>>>
>>> Regards,
>>> Honza
>
> Have you considered using active/passive bonded interfaces? If you did,
> you would be able to fail links in any order on the nodes and corosync
> would not know the difference.

Still an interesting point I hadn't been aware of so far - although I knew the
bits, I probably hadn't thought about them enough until now...
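For anyone reading along, the kind of setup being discussed would look roughly
like the corosync.conf fragment below - just a sketch with made-up networks and
node addresses, not the actual configuration from Martin's experiment:

    totem {
        version: 2
        transport: udpu
        rrp_mode: passive

        interface {
            # ring 0 (hypothetical network)
            ringnumber: 0
            bindnetaddr: 192.168.1.0
            mcastport: 5405
        }
        interface {
            # ring 1 (hypothetical network)
            ringnumber: 1
            bindnetaddr: 192.168.2.0
            mcastport: 5405
        }
    }

    nodelist {
        node {
            nodeid: 1
            ring0_addr: 192.168.1.11
            ring1_addr: 192.168.2.11
        }
        node {
            nodeid: 2
            ring0_addr: 192.168.1.12
            ring1_addr: 192.168.2.12
        }
        node {
            nodeid: 3
            ring0_addr: 192.168.1.13
            ring1_addr: 192.168.2.13
        }
    }

    quorum {
        provider: corosync_votequorum
    }

With rrp_mode: passive, corosync sends on one ring at a time and falls back to
the other, which is exactly why a single node's bad link gets the whole ring
marked faulty everywhere, as Honza explained above.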
Usually one - at least I did, so far - would rather think that having awareness
of redundancy/clustering as high up in the protocol/application stack as
possible would open up possibilities for more appropriate reactions.
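P.S. To make Digimer's bonding suggestion concrete: with an active-backup bond
the link failover happens below corosync, which then only ever sees the bond
device. A rough sketch with iproute2 (interface names and the address are made
up for illustration, not taken from the thread):

    # create an active-backup bond with 100 ms link monitoring
    ip link add bond0 type bond mode active-backup miimon 100

    # enslave the two physical NICs (they must be down first)
    ip link set eth0 down
    ip link set eth0 master bond0
    ip link set eth1 down
    ip link set eth1 master bond0

    # bring the bond up and give it the address corosync binds to
    ip link set bond0 up
    ip addr add 192.168.1.11/24 dev bond0

Corosync would then be configured with a single ring on bond0, and a failed
physical link never becomes visible at the RRP level - at the cost of the
awareness I mentioned above.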
