Broker-J 7.1.1

I have the following HA node setup:

Datacenter 1: node 1 (master)
Datacenter 2: node 2 (replica), node 3 (replica)
All nodes are set to sync because data preservation and consistency are critical for this use case (latency is not a high priority). Ping between the two datacenters, which are geographically near each other, averages 2 ms.

What I see periodically are timeouts between the nodes, which always recover approximately one second later.

Log on node 1:

2019-04-10 20:35:24,898 WARN [Group-Change-Learner:fix_test:RCO_TEST_HA1] (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Timeout whilst determining state for node 'STO2_TEST_HA' from group fix_test
2019-04-10 20:35:24,899 INFO [Group-Change-Learner:fix_test:RCO_TEST_HA1] (q.m.h.left) - [grp(/fix_test)/vhn(/RCO_TEST_HA1)] [grp(/fix_test)] HA-1006 : Left : Node : 'STO2_TEST_HA' (node2:5060)
2019-04-10 20:35:24,899 INFO [Group-Change-Learner:fix_test:RCO_TEST_HA1] (q.m.h.role_changed) - [grp(/fix_test)/vhn(/RCO_TEST_HA1)] [grp(/fix_test)] HA-1010 : Role change reported: Node : 'STO2_TEST_HA' (node2:5060) : from 'MASTER' to 'UNREACHABLE'
2019-04-10 20:35:25,907 WARN [Group-Change-Learner:fix_test:RCO_TEST_HA1] (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Node 'STO2_TEST_HA' from group fix_test is responding again.
2019-04-10 20:35:25,907 INFO [Group-Change-Learner:fix_test:RCO_TEST_HA1] (q.m.h.joined) - [grp(/fix_test)/vhn(/RCO_TEST_HA1)] [grp(/fix_test)] HA-1005 : Joined : Node : 'STO2_TEST_HA' (node2:5060)
2019-04-10 20:35:25,907 INFO [Group-Change-Learner:fix_test:RCO_TEST_HA1] (q.m.h.role_changed) - [grp(/fix_test)/vhn(/RCO_TEST_HA1)] [grp(/fix_test)] HA-1010 : Role change reported: Node : 'STO2_TEST_HA' (node2:5060) : from 'UNREACHABLE' to 'MASTER'

Sample log entry on node 2:

2019-04-11 06:15:20,769 WARN [Group-Change-Learner:fix_test:STO2_TEST_HA] (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Timeout whilst determining state for node 'RCO_TEST_HA1' from group fix_test
2019-04-11 06:15:20,775 INFO [Group-Change-Learner:fix_test:STO2_TEST_HA] (q.m.h.left) - [grp(/fix_test)/vhn(/STO2_TEST_HA)] [grp(/fix_test)] HA-1006 : Left : Node : 'RCO_TEST_HA1' (node1:5000)
2019-04-11 06:15:20,775 INFO [Group-Change-Learner:fix_test:STO2_TEST_HA] (q.m.h.role_changed) - [grp(/fix_test)/vhn(/STO2_TEST_HA)] [grp(/fix_test)] HA-1010 : Role change reported: Node : 'RCO_TEST_HA1' (node1:5000) : from 'REPLICA' to 'UNREACHABLE'
2019-04-11 06:15:21,782 WARN [Group-Change-Learner:fix_test:STO2_TEST_HA] (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Node 'RCO_TEST_HA1' from group fix_test is responding again.
2019-04-11 06:15:21,784 INFO [Group-Change-Learner:fix_test:STO2_TEST_HA] (q.m.h.joined) - [grp(/fix_test)/vhn(/STO2_TEST_HA)] [grp(/fix_test)] HA-1005 : Joined : Node : 'RCO_TEST_HA1' (node1:5000)
2019-04-11 06:15:21,784 INFO [Group-Change-Learner:fix_test:STO2_TEST_HA] (q.m.h.role_changed) - [grp(/fix_test)/vhn(/STO2_TEST_HA)] [grp(/fix_test)] HA-1010 : Role change reported: Node : 'RCO_TEST_HA1' (node1:5000) : from 'UNREACHABLE' to 'REPLICA'

Log on node 3 is the same. This warning repeats roughly every 30 minutes, though not at a consistent interval.
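For anyone who wants to check the same pattern in their own logs, here is a rough sketch of how the gap between each "Timeout whilst determining state" warning and the matching "is responding again" warning could be measured. It assumes one log entry per line in the format shown above; the function name `recovery_gaps` and the regular expressions are made up for illustration and are not part of Broker-J.

```python
import re
from datetime import datetime

# Timestamp at the start of a log entry, e.g. "2019-04-10 20:35:24,898".
TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"

# Assumed message formats, copied from the WARN entries in the excerpts above.
TIMEOUT = re.compile(
    TS + r" WARN .*?Timeout whilst determining state for node '([^']+)'"
)
RECOVER = re.compile(
    TS + r" WARN .*?Node '([^']+)' from group \S+ is responding again"
)


def parse_ts(text):
    """Parse a log timestamp with millisecond precision."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def recovery_gaps(lines):
    """Return (node, timeout_time, seconds_until_recovery) for each event."""
    pending = {}  # node name -> time of the unanswered timeout warning
    gaps = []
    for line in lines:
        match = TIMEOUT.search(line)
        if match:
            pending[match.group(2)] = parse_ts(match.group(1))
            continue
        match = RECOVER.search(line)
        if match and match.group(2) in pending:
            started = pending.pop(match.group(2))
            elapsed = (parse_ts(match.group(1)) - started).total_seconds()
            gaps.append((match.group(2), started, elapsed))
    return gaps
```

Feeding it the lines of a broker log file (for example `recovery_gaps(open(path))`, where `path` is wherever the log lives) gives one tuple per timeout, which makes it easy to see both the recovery time and the spacing between events.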
The warnings don't seem to cause any harm. I assume latency spikes between the datacenters are causing them, so is there a way to tune replication to be a bit less sensitive? It's obvious that replication recovers successfully 1 second after a problem is detected; that pattern is very consistent, going all the way back to January. Here's one such example from January, also recovering 1 second later:

2019-01-23 18:00:38,495 WARN [Group-Change-Learner:fix_test:STO_TEST_HA] (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Timeout whilst determining state for node 'TEST_HA' from group fix_test
2019-01-23 18:00:38,495 INFO [Group-Change-Learner:fix_test:STO_TEST_HA] (q.m.h.left) - [grp(/fix_test)/vhn(/STO_TEST_HA)] [grp(/fix_test)] HA-1006 : Left : Node : 'TEST_HA' (testnode1:5050)
2019-01-23 18:00:38,496 INFO [Group-Change-Learner:fix_test:STO_TEST_HA] (q.m.h.role_changed) - [grp(/fix_test)/vhn(/STO_TEST_HA)] [grp(/fix_test)] HA-1010 : Role change reported: Node : 'TEST_TEST_HA' (testnode1:5050) : from 'REPLICA' to 'UNREACHABLE'
2019-01-23 18:00:39,505 WARN [Group-Change-Learner:fix_test:STO_TEST_HA] (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Node 'TEST_HA' from group fix_test is responding again.
2019-01-23 18:00:39,506 INFO [Group-Change-Learner:fix_test:STO_TEST_HA] (q.m.h.joined) - [grp(/fix_test)/vhn(/STO_TEST_HA)] [grp(/fix_test)] HA-1005 : Joined : Node : 'TEST_HA' (testnode1:5050)
2019-01-23 18:00:39,506 INFO [Group-Change-Learner:fix_test:STO_TEST_HA] (q.m.h.role_changed) - [grp(/fix_test)/vhn(/STO_TEST_HA)] [grp(/fix_test)] HA-1010 : Role change reported: Node : 'TEST_HA' (testnode1:5050) : from 'UNREACHABLE' to 'REPLICA'

It might be that HA is a bit over-sensitive to latency spikes, so I wonder if I could make it more tolerant (maybe wait an extra second?).

Thanks!
