I have four HBase clusters (A, B, C, D) with replication between them. I've
configured each cluster to replicate to all other clusters in an attempt to
have a hot-hot, eventually-consistent-across-all-clusters setup. One nice
property of this configuration is that any individual cluster can
completely fail without any impact on the others. If the failed cluster
comes back up, replication eventually catches up. No manual intervention is
needed during the failure or subsequent recovery.

However, reading the docs and making observations, it seems this setup
results in each replication wal entry arriving at each remote cluster 5
times!

Here's how a wal entry originating on A arrives at D 5 times:

  A -> D
  A -> B -> D
  A -> C -> D
  A -> B -> C -> D
  A -> C -> B -> D

In other words, my replication topology choice combined with HBase's
record-hops-and-don't-revisit replication loop prevention strategy leads to
a lot of network and compute resource waste.

I understand I could use a different topology. A ring for example or maybe
slightly better:

  A <-> B <-> C <-> D

But alternate topologies seem to lead to failure scenarios that require
manual intervention (ie new peer connections at failure time and sync-up
jobs).

Is there a way I can maintain the property where no manual intervention is
needed when an HBase cluster goes away but also eliminate all these
redundant replications?

For example, is there a way to place a limit on the number of hops a wal
entry can make?

Thanks!

Whitney

Reply via email to