>>> Martin Schlegel <[email protected]> wrote on 06.10.2016 at 11:38 in message
<1736253685.165937.28a72b84-a091-48c4-83f9-74a8bbde1a18.open-xchange@email.1und1.de>:
> Thanks for the confirmation, Jan, but this sounds a bit scary to me!
>
> Spinning this experiment a bit further ...
>
> Would this not also mean that with a passive rrp with 2 rings it only takes
> 2 different nodes that are unable to communicate on different networks at
> the same time to have all rings marked faulty on _every_ node ... and
> therefore all cluster members losing quorum immediately, even though n-2
> cluster members are technically able to send and receive heartbeat messages
> through both rings?
>
> I really hope the answer is no and the cluster still somehow has quorum in
> this case.
>
> Regards,
> Martin Schlegel
>
>
>> Jan Friesse <[email protected]> wrote on 5 October 2016 at 09:01:
>>
>> Martin,
>>
>> > Hello all,
>> >
>> > I am trying to understand the following 2 Corosync heartbeat ring
>> > failure scenarios I have been testing, and I hope somebody can explain
>> > why the results make sense.
>> >
>> > Consider the following cluster:
>> >
>> > * 3x nodes: A, B and C
>> > * 2x NICs for each node
>> > * Corosync 2.3.5 configured with "rrp_mode: passive" and
>> >   udpu transport with ring id 0 and 1 on each node.
>> > * On each node "corosync-cfgtool -s" shows:
>> >   [...] ring 0 active with no faults
>> >   [...] ring 1 active with no faults
>> >
>> > Consider the following scenarios:
>> >
>> > 1. On node A only, block all communication on the first NIC, configured
>> >    with ring id 0
>> > 2. On node A only, block all communication on both NICs, configured with
>> >    ring id 0 and 1
>> >
>> > The result of the above scenarios is as follows:
>> >
>> > 1. Nodes A, B and C (!) display the following ring status:
>> >    [...] Marking ringid 0 interface <IP-Address> FAULTY
>> >    [...] ring 1 active with no faults
>> > 2. Node A is shown as OFFLINE - B and C display the following ring status:
>> >    [...] ring 0 active with no faults
>> >    [...] ring 1 active with no faults
>> >
>> > Questions:
>> > 1. Is this the expected outcome?
>>
>> Yes
>>
>> > 2. In experiment 1, B and C can still communicate with each other over
>> >    both NICs, so why are B and C not displaying a "no faults" status for
>> >    ring id 0 and 1, just like in experiment 2,
>>
>> Because this is how RRP works. RRP marks the whole ring as failed, so
>> every node sees that ring as failed.
>>
>> > when node A is completely unreachable?
>>
>> Because it is a different scenario. In scenario 1 there is a 3-node
>> membership in which one node has one failed ring -> the whole ring is
>> marked failed. In scenario 2 there is a 2-node membership in which both
>> rings work as expected. Node A is completely unreachable and is not in
>> the membership.
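
For reference, a minimal corosync.conf sketch matching the setup described in
the quoted message could look roughly like the following. The cluster name,
network numbers and node addresses are illustrative assumptions, not values
taken from the original posts:

    totem {
        version: 2
        cluster_name: testcluster       # assumed name, not from the thread
        transport: udpu                 # unicast UDP, as described above
        rrp_mode: passive               # redundant ring protocol, passive mode

        # One interface section per ring; bindnetaddr values are assumptions.
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.1.0
            mcastport: 5405
        }
        interface {
            ringnumber: 1
            bindnetaddr: 192.168.2.0
            mcastport: 5405
        }
    }

    nodelist {
        node {
            nodeid: 1
            ring0_addr: 192.168.1.11    # node A on ring 0 (assumed address)
            ring1_addr: 192.168.2.11    # node A on ring 1 (assumed address)
        }
        # nodes B and C follow the same pattern with nodeid 2 and 3
    }

    quorum {
        provider: corosync_votequorum
    }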
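One way to reproduce the two blocking scenarios on node A is with iptables.
The interface names eth0 (ring 0) and eth1 (ring 1) are assumptions; substitute
the real NIC names on the test node:

    # Scenario 1: block all traffic on the ring 0 NIC only (run on node A)
    iptables -A INPUT  -i eth0 -j DROP
    iptables -A OUTPUT -o eth0 -j DROP

    # Scenario 2: additionally block the ring 1 NIC, isolating node A completely
    iptables -A INPUT  -i eth1 -j DROP
    iptables -A OUTPUT -o eth1 -j DROP

    # Clean up afterwards (flushes the whole INPUT/OUTPUT chains)
    iptables -F INPUT
    iptables -F OUTPUT

After scenario 1, "corosync-cfgtool -s" on every node should report ring 0 as
FAULTY, as observed above; after scenario 2, node A simply drops out of the
membership and both rings stay fault-free on B and C.
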
Did you ever wonder why it's named "ring"? ;-) Our rings used to fail on
network load periodically.

Regards,
Ulrich

>>
>> Regards,
>>   Honza
>>
>> > Regards,
>> > Martin Schlegel
>> >

_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
