I have some questions about HA clustering, failover, and split-brain handling.

Suppose I have set up a 3-node cluster with:

A = master
B = slave 1
C = slave 2

Also suppose they are all part of the same group and are set up to offer 
replication-based HA.

Scenario 1
========
Say,

B starts up and finds A
B becomes the designated backup for A
C starts up, and tries to find a live server in this group
C figures that A already has a designated backup, which is B
C keeps waiting until the network topology is changed


Q1: At this point, will the transaction logs replicate from A to C?

Now let’s say

Node A (the current master) fails
B becomes the new master

Q2: At this point, will C become the new backup for B, assuming A remains in 
a failed state?

Q3: If the answer to Q2 is yes, B will start replicating its journals to C; is 
that correct?


Scenario 2 (split brain detection case)
=============================
Say,

B detects a transient network failure with A
B wants to figure out if it needs to take over and be the new master
B starts a quorum voting process

The manual says this in the ‘High Availability and Failover’ section: 

"Specifically, the backup will become active when it loses connection to its 
live server. This can be problematic because this can also happen because of a 
temporary network problem. In order to address this issue, the backup will try 
to determine whether it still can connect to the other servers in the cluster. 
If it can connect to more than half the servers, it will become active, if more 
than half the servers also disappeared with the live, the backup will wait and 
try reconnecting with the live. This avoids a split brain situation."

Q4: At this point, which nodes are expected to participate in quorum voting? 
All of A, B and C? Or only A and C (with B excluding itself from the set)? I 
read "half the servers" as meaning that B includes itself in the quorum 
voting. Is that the case?
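
To make the two readings in Q4 concrete, here is a minimal sketch of the 
arithmetic. It is my own illustration and not the broker's actual code: the 
function name more_than_half and the assumption that B can still reach C are 
hypothetical.

```python
def more_than_half(reachable, total):
    """Strict majority: True only when reachable > total / 2."""
    return 2 * reachable > total


# Scenario 2: B has lost its connection to A.
cluster = {"A", "B", "C"}
reachable_from_b = {"C"}  # assumption: B can still reach C

# Reading 1: B counts itself, so the set is the full cluster (A, B, C).
reading_1 = more_than_half(len(reachable_from_b | {"B"}), len(cluster))

# Reading 2: B excludes itself, so the set is only the other servers (A, C).
others = cluster - {"B"}
reading_2 = more_than_half(len(reachable_from_b & others), len(others))

print(reading_1)  # True: 2 of 3 is more than half
print(reading_2)  # False: 1 of 2 is exactly half, not more
```

Under the first reading B would become active; under the second it would 
wait and keep trying to reconnect to A. I do not know which of these, if 
either, matches the real behavior, hence the question.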

However, in the ‘Avoiding Network Isolation’ section, the manual says this:

"Quorum voting is used by both the live and the backup to decide what to do if 
a replication connection is disconnected. Basically the server will request 
each live server in the cluster to vote as to whether it thinks the server it 
is replicating to or from is still alive. This being the case the minimum 
number of live/backup pairs needed is 3."

Q5: This implies only the live servers participate in quorum voting. Is that 
correct?

Q6: If the answer to Q5 is yes, then how does the split brain detection (as 
described in the quoted text right before Q4) work?

Q7: The text implies that, in order to avoid split brain, a cluster needs at 
least 3 live/backup PAIRS. To me that implies at least 6 broker instances in 
such a cluster, which is hard to believe. I suspect (though I may be wrong) 
that it actually means 3 broker instances, assuming scenarios 1 and 2 as 
described earlier are valid. Can you please clarify?

I would appreciate it if someone could clarify these points.

Thanks,
Anindya Haldar
Oracle Marketing Cloud
