We have faced issues with serial replication when a region server in either
cluster hits a hardware failure, typically a memory failure from my
understanding. I have not been able to spend enough time reproducing it
reliably to identify the root cause, so I can't say exactly what triggers it.

The issue could be that your serial replication has gotten into a deadlock
among the region servers: none of them can make progress because an older
sequence ID has not been replicated yet, and that older sequence ID is itself
not at the front of the line to be replicated.

Quick fix: disable serial replication temporarily so that out-of-order
shipping is allowed and the queues get unblocked. This can result in some
inconsistencies between the clusters, which can be fixed with the SyncTable
utility since your setup is active-passive. A sketch of the commands follows.

Another fix: delete the replication barriers for each region in hbase:meta.
Same consequence as above; a sketch is below.
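For reference, a hedged sketch of inspecting and clearing the barriers from
the hbase shell. They live in the 'rep_barrier' column family of hbase:meta;
the region row keys and qualifiers are whatever the scan prints, so verify
them before deleting anything:

  # List the replication barriers currently recorded in hbase:meta:
  scan 'hbase:meta', {COLUMNS => ['rep_barrier']}

  # For each region row printed above, remove its barrier cell(s); the row
  # key and qualifier here are placeholders taken from the scan output:
  deleteall 'hbase:meta', '<region-row-key-from-scan>', 'rep_barrier:<qualifier-from-scan>'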


On Sun, Dec 12, 2021, 2:24 PM Hamado Dene <hamadod...@yahoo.com.invalid>
wrote:

> I'm using hbase 2.2.6 with hadoop 2.8.5. Yes, my serial replication is
> enabled. This is my peer configuration:
>
>
>
> |
> | Peer Id | Cluster Key | Endpoint | State | IsSerial | Bandwidth |
> ReplicateAll | Namespaces | Exclude Namespaces | Table Cfs | Exclude Table
> Cfs |
> | replicav1 | acv-db10-hn,acv-db11-hn,acv-db12-hn:2181:/hbase |  | ENABLED
> | true | UNLIMITED | true
>
>  |
>
>     Il domenica 12 dicembre 2021, 09:39:44 CET, Mallikarjun <
> mallik.v.ar...@gmail.com> ha scritto:
>
>  Which version of hbase are you using? Is your replication serial enabled?
>
> ---
> Mallikarjun
>
>
> On Sun, Dec 12, 2021 at 1:54 PM Hamado Dene <hamadod...@yahoo.com.invalid>
> wrote:
>
> > Hi Hbase community,
> >
> > On our production installation we have two hbase clusters in two
> different
> > datacenters.The primary datacenter replicates the data to the secondary
> > datacenter.When we create the tables, we first create on the secondary
> > datacenter and then on the primary and then we set replication scope to 1
> > on the primary.The peer pointing to quorum zk of the secondary cluster is
> > configured on the primary.
> > Initially, replication worked fine and data was replicated.We have
> > recently noticed that some tables are empty in the secondary datacenter.
> So
> > most likely the data is no longer replicated. I'm seeing lines like this
> in
> > the logs:
> >
> >
> > Recovered source for cluster/machine(s) replicav1: Total replicated edits: 0, current progress:walGroup [db11%2C16020%2C1637849866921]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/oldWALs/db11-hd%2C16020%2C1637849866921.1637849874263 at position: -1
> > Recovered source for cluster/machine(s) replicav1: Total replicated edits: 0, current progress:walGroup [db09%2C16020%2C1637589840862]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/oldWALs/db09-hd%2C16020%2C1637589840862.1637589846870 at position: -1
> > Recovered source for cluster/machine(s) replicav1: Total replicated edits: 0, current progress:walGroup [db13%2C16020%2C1635424806449]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/oldWALs/db13%2C16020%2C1635424806449.1635424812985 at position: -1
> >
> >
> >
> > 2021-12-12 09:13:47,148 INFO  [rzv-db09-hd:16020Replication Statistics #0] regionserver.Replication: Normal source for cluster replicav1: Total replicated edits: 0, current progress:walGroup [db09%2C16020%2C1638791923537]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/WALs/rzv-db09-hd.rozzano.diennea.lan,16020,1638791923537/rzv-db09-hd.rozzano.diennea.lan%2C16020%2C1638791923537.1638791930213 at position: -1
> > Recovered source for cluster/machine(s) replicav1: Total replicated edits: 0, current progress:walGroup [db09%2C16020%2C1634401671527]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db09-hd.rozzano.diennea.lan%2C16020%2C1634401671527.1634401679218 at position: -1
> > Recovered source for cluster/machine(s) replicav1: Total replicated edits: 0, current progress:walGroup [db10%2C16020%2C1637585899997]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db10-hd.rozzano.diennea.lan%2C16020%2C1637585899997.1637585906625 at position: -1
> >
> >
> >
> > 2021-12-12 08:24:58,561 WARN  [regionserver/rzv-db12-hd:16020.logRoller] regionserver.ReplicationSource: WAL group db12%2C16020%2C1638790692057 queue size: 187 exceeds value of replication.source.log.queue.warn: 2
> >
> > Do you have any info on what could be the problem?
> >
> > Thanks
>
