BTW, these are questions related to Artemis 2.4.0, which is what we are evaluating right now for our solution.
> On Jun 13, 2018, at 5:52 PM, Anindya Haldar <anindya.hal...@oracle.com> wrote:
>
> I have some questions related to the HA cluster, failover and split-brain
> cases.
>
> Suppose I have set up a 3-node cluster with:
>
> A = master
> B = slave 1
> C = slave 2
>
> Also suppose they are all part of the same group, and are set up to offer
> replication-based HA.
>
> Scenario 1
> ========
> Say,
>
> B starts up and finds A
> B becomes the designated backup for A
> C starts up, and tries to find a live server in this group
> C figures that A already has a designated backup, which is B
> C keeps waiting until the network topology is changed
>
> Q1: At this point, will the transaction logs replicate from A to C?
>
> Now let's say
>
> Node A (the current master) fails
> B becomes the new master
>
> Q2: At this point, will C become the new backup for B, assuming A remains
> in a failed state?
>
> Q3: If the answer to Q2 is yes, B will start replicating its journals to C;
> is that correct?
>
> Scenario 2 (split brain detection case)
> =============================
> Say,
>
> B detects a transient network failure with A
> B wants to figure out if it needs to take over and be the new master
> B starts a quorum voting process
>
> The manual says this in the 'High Availability and Failover' section:
>
> "Specifically, the backup will become active when it loses connection to its
> live server. This can be problematic because this can also happen because of
> a temporary network problem. In order to address this issue, the backup will
> try to determine whether it still can connect to the other servers in the
> cluster. If it can connect to more than half the servers, it will become
> active, if more than half the servers also disappeared with the live, the
> backup will wait and try reconnecting with the live. This avoids a split
> brain situation."
>
> Q4: At this point, which nodes are expected to participate in quorum voting?
> All of A, B and C? Or A and C only (B excludes itself from the set)? When it
> says "half the servers", I read it in a way that B includes itself in the
> quorum voting. Is that the case?
>
> Whereas in the 'Avoiding Network Isolation' section, the manual says this:
>
> "Quorum voting is used by both the live and the backup to decide what to do
> if a replication connection is disconnected. Basically the server will
> request each live server in the cluster to vote as to whether it thinks the
> server it is replicating to or from is still alive. This being the case the
> minimum number of live/backup pairs needed is 3."
>
> Q5: This implies only the live servers participate in quorum voting. Is that
> correct?
>
> Q6: If the answer to Q5 is yes, then how does the split brain detection (as
> described in the quoted text right before Q4) work?
>
> Q7: The text implies that in order to avoid split brain, a cluster needs at
> least 3 live/backup PAIRS. To me that implies at least 6 broker instances are
> needed in such a cluster; but that is kind of hard to believe, and I feel (I
> may be wrong) it actually means 3 broker instances, assuming scenarios 1 and
> 2 as described earlier are valid ones. Can you please clarify?
>
> Would appreciate it if someone can offer clarity on these questions.
>
> Thanks,
> Anindya Haldar
> Oracle Marketing Cloud
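
To make the ambiguity behind Q4 concrete, here is a minimal sketch of the "more than half the servers" rule as I read it from the manual. It is not Artemis code and does not claim to show how the broker actually counts votes; the function names and the `count_self` flag are hypothetical and exist only to contrast the two readings (B counts itself as one of the servers, or B only counts the other servers).

```python
def has_majority(reachable: int, total: int) -> bool:
    """Return True when strictly more than half of `total` is reachable."""
    return 2 * reachable > total


def backup_may_activate(cluster, backup, unreachable, count_self):
    """Apply the 'more than half the servers' rule to one backup node.

    Hypothetical helper, not an Artemis API.
    cluster     -- names of all brokers in the group, e.g. {"A", "B", "C"}
    backup      -- the backup that lost its replication connection
    unreachable -- brokers the backup can no longer connect to
    count_self  -- True: the backup counts itself among the servers;
                   False: only the other servers are counted
    """
    voters = set(cluster) if count_self else set(cluster) - {backup}
    reachable = voters - set(unreachable)
    return has_majority(len(reachable), len(voters))


if __name__ == "__main__":
    cluster = {"A", "B", "C"}
    # Scenario 2: B loses its connection to A but can still reach C.
    # Reading 1 (B counts itself): 2 of 3 reachable.
    print(backup_may_activate(cluster, "B", {"A"}, count_self=True))   # True
    # Reading 2 (B excludes itself): 1 of 2 reachable, not more than half.
    print(backup_may_activate(cluster, "B", {"A"}, count_self=False))  # False
```

With three brokers the two readings disagree in exactly the case described in Scenario 2, which is why I would like to know which set of servers the broker really uses.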