Hi, we set up an Artemis cluster with three broker instances on AWS, each in a different availability zone (A, B, C). Only one of the broker instances is configured as master, because Artemis message redistribution does not take message filters into account (https://activemq.apache.org/components/artemis/documentation/latest/clusters.html#redistribution-and-filters-selectors) and we use/need message filters.
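For context, our queues use filters roughly like the following (the address, queue, and property names here are made up for illustration); with more than one live broker, redistribution would ignore such filters:

  <addresses>
    <address name="orders">
      <anycast>
        <!-- only messages whose "region" property equals 'EU' land in this queue -->
        <queue name="orders.eu">
          <filter string="region = 'EU'"/>
        </queue>
      </anycast>
    </address>
  </addresses>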
Last week we experienced a network outage in one AWS availability zone. When network connectivity was restored, we ended up with two of the broker instances claiming to be active/live. I tried to replicate this by manually blocking all TCP connections between the servers; this is the behavior I see:

1) I start all three broker instances.
   => The broker in AZ A reports "live", the broker in AZ B reports "backup server", and the broker in AZ C reports "stopped".

2) I cut the network connection between the server in AZ A and the other servers, but leave the broker process in AZ A running.
   => Broker B is promoted from backup to live, and broker C starts as its backup.
   => Broker A still thinks it is live, which is not a problem for us, since no client can reach it.

3) I re-enable TCP connections between the server in AZ A and the servers in the other AZs.
   => Broker A and broker B both permanently stay "live".

How can we achieve that after step 3 either broker A or broker B shuts down?
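One idea we had, but have not tested yet, is Artemis's broker-level network health check, so that broker A deactivates itself while it is isolated instead of staying live. A sketch of what we would add to broker.xml (the address 10.0.0.1 is just a placeholder for something stable to ping, e.g. the VPC gateway):

  <!-- ping the listed address every 10s; if it cannot be reached,
       the broker deactivates itself and reactivates once the network is back -->
  <network-check-period>10000</network-check-period>
  <network-check-timeout>1000</network-check-timeout>
  <network-check-list>10.0.0.1</network-check-list>

However, we are not sure this resolves the situation after step 3, where both brokers are already live once connectivity returns.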
Here is the cluster configuration we are currently using:

Broker Cluster Config in AZ A:
------------------------------
  <connectors>
    <connector name="local-node-connector">tcp://broker-a:61617</connector>
    <connector name="remote-node-connector-0">tcp://broker-b:61617</connector>
    <connector name="remote-node-connector-1">tcp://broker-c:61617</connector>
  </connectors>

  <cluster-connections>
    <cluster-connection name="cluster1">
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <connector-ref>local-node-connector</connector-ref>
      <static-connectors allow-direct-connections-only="true">
        <connector-ref>remote-node-connector-0</connector-ref>
        <connector-ref>remote-node-connector-1</connector-ref>
      </static-connectors>
    </cluster-connection>
  </cluster-connections>

  <ha-policy>
    <replication>
      <master>
        <cluster-name>cluster1</cluster-name>
        <check-for-live-server>true</check-for-live-server>
        <vote-on-replication-failure>true</vote-on-replication-failure>
      </master>
    </replication>
  </ha-policy>

Broker Cluster Config in AZ B:
------------------------------
  <connectors>
    <connector name="local-node-connector">tcp://broker-b:61617</connector>
    <connector name="remote-node-connector-0">tcp://broker-a:61617</connector>
    <connector name="remote-node-connector-1">tcp://broker-c:61617</connector>
  </connectors>

  <cluster-connections>
    <cluster-connection name="cluster1">
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <connector-ref>local-node-connector</connector-ref>
      <static-connectors allow-direct-connections-only="true">
        <connector-ref>remote-node-connector-0</connector-ref>
        <connector-ref>remote-node-connector-1</connector-ref>
      </static-connectors>
    </cluster-connection>
  </cluster-connections>

  <ha-policy>
    <replication>
      <slave>
        <cluster-name>cluster1</cluster-name>
        <allow-failback>true</allow-failback>
        <restart-backup>true</restart-backup>
        <quorum-vote-wait>15</quorum-vote-wait>
        <vote-retries>12</vote-retries>
        <vote-retry-wait>5000</vote-retry-wait>
      </slave>
    </replication>
  </ha-policy>

Broker Cluster Config in AZ C:
------------------------------
  <connectors>
    <connector name="local-node-connector">tcp://broker-c:61617</connector>
    <connector name="remote-node-connector-0">tcp://broker-a:61617</connector>
    <connector name="remote-node-connector-1">tcp://broker-b:61617</connector>
  </connectors>

  <cluster-connections>
    <cluster-connection name="cluster1">
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <connector-ref>local-node-connector</connector-ref>
      <static-connectors allow-direct-connections-only="true">
        <connector-ref>remote-node-connector-0</connector-ref>
        <connector-ref>remote-node-connector-1</connector-ref>
      </static-connectors>
    </cluster-connection>
  </cluster-connections>

  <ha-policy>
    <replication>
      <slave>
        <cluster-name>cluster1</cluster-name>
        <allow-failback>true</allow-failback>
        <restart-backup>true</restart-backup>
        <quorum-vote-wait>15</quorum-vote-wait>
        <vote-retries>12</vote-retries>
        <vote-retry-wait>5000</vote-retry-wait>
      </slave>
    </replication>
  </ha-policy>

Thanks for any help,
Jo