Re: Questions on HA cluster and split brain

Anindya Haldar Wed, 13 Jun 2018 18:24:37 -0700

BTW, these are questions related to Artemis 2.4.0, which is what we are 
evaluating right now for our solution.



> On Jun 13, 2018, at 5:52 PM, Anindya Haldar <anindya.hal...@oracle.com> wrote:
> 
> I have some questions related to the HA cluster, failover and split-brain 
> cases.
> 
> Suppose I have set up a 3 node cluster with:
> 
> A = master
> B = slave 1
> C = slave 2
> 
> Also suppose they are all part of the same group, and are set up to offer 
> replication-based HA.
> 
> Scenario 1
> ========
> Say,
> 
> B starts up and finds A
> B becomes the designated backup for A
> C starts up, and tries to find a live server in this group
> C figures that A already has a designated backup, which is B
> C keeps waiting until the network topology is changed
> 
> 
> Q1: At this point, will the transaction logs replicate from A to C?
> 
> Now let’s say
> 
> Node A (the current master) fails
> B becomes the new master
> 
> Q2: At this point, will C become the new backup for B, assuming A remains 
> in a failed state?
> 
> Q3: If the answer to Q2 is yes, B will start replicating its journals to C; 
> is that correct?
> 
> 
> Scenario 2 (split brain detection case)
> =============================
> Say,
> 
> B detects a transient network failure with A
> B wants to figure out if it needs to take over and be the new master
> B starts a quorum voting process
> 
> The manual says this in the ‘High Availability and Failover’ section: 
> 
> "Specifically, the backup will become active when it loses connection to its 
> live server. This can be problematic because this can also happen because of 
> a temporary network problem. In order to address this issue, the backup will 
> try to determine whether it still can connect to the other servers in the 
> cluster. If it can connect to more than half the servers, it will become 
> active, if more than half the servers also disappeared with the live, the 
> backup will wait and try reconnecting with the live. This avoids a split 
> brain situation."
> 
> Q4: At this point, which nodes are expected to participate in quorum voting? 
> All of A, B and C? Or A and C only (B excludes itself from the set)? When it 
> says "half the servers", I read it in a way that B includes itself in the 
> quorum voting. Is that the case?
> 
> Whereas in the ‘Avoiding Network Isolation’ section, the manual says this:
> 
> "Quorum voting is used by both the live and the backup to decide what to do 
> if a replication connection is disconnected. Basically the server will 
> request each live server in the cluster to vote as to whether it thinks the 
> server it is replicating to or from is still alive. This being the case the 
> minimum number of live/backup pairs needed is 3."
> 
> Q5: This implies only the live servers participate in quorum voting. Is that 
> correct?
> 
> Q6: If the answer to Q5 is yes, then how does the split brain detection (as 
> described in the quoted text right before Q4) work?
> 
> Q7: The text implies that in order to avoid split brain, a cluster needs at 
> least 3 live/backup PAIRS. To me that implies at least 6 broker instances are 
> needed in such a cluster; but that is kind of hard to believe, and I feel (I 
> may be wrong) it actually means 3 broker instances, assuming scenarios 1 and 
> 2 as described earlier are valid ones. Can you please clarify?
> 
> I would appreciate it if someone could offer clarity on these questions.
> 
> Thanks,
> Anindya Haldar
> Oracle Marketing Cloud
>
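
To make Q4 concrete, here is a small sketch of the two ways I can read "more than half the servers" for the 3-node case in Scenario 2. This is only my own arithmetic, not Artemis code: the function name and its parameters are hypothetical, and I am not claiming either reading is what the broker actually implements.

```python
def backup_would_activate(cluster_size, reachable_others, count_self):
    """Apply the manual's 'more than half the servers' rule under one reading.

    cluster_size     -- total brokers in the cluster (A, B and C gives 3)
    reachable_others -- brokers other than itself that the backup can still reach
    count_self       -- True if the backup counts itself among "the servers"
    """
    if count_self:
        total = cluster_size
        reachable = reachable_others + 1
    else:
        total = cluster_size - 1
        reachable = reachable_others
    # "More than half" is taken to mean a strict majority.
    return 2 * reachable > total


# Scenario 2: B has lost its connection to A but can still reach C.
print(backup_would_activate(3, 1, count_self=True))   # reading 1: B counts itself
print(backup_would_activate(3, 1, count_self=False))  # reading 2: B excludes itself
```

Under the first reading B sees 2 of 3 servers and would take over. Under the second it sees exactly 1 of the 2 other servers, which is not more than half, so it would keep waiting. That difference is why the answer to Q4 matters for a 3-node cluster.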
