Thanks everyone for the info. We will apply the fix mentioned by Mallik, and we plan to re-enable serial mode when we upgrade HBase.
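
For reference, here is a minimal sketch of how the peers' enabled and serial flags could be checked afterwards through the Java Admin API (treat it as an illustration, not a tested procedure):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.replication.ReplicationPeerDescription;

    public class CheckPeerSerial {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                // Print every replication peer with its enabled and serial flags.
                for (ReplicationPeerDescription peer : admin.listReplicationPeers()) {
                    System.out.println(peer.getPeerId()
                        + " enabled=" + peer.isEnabled()
                        + " serial=" + peer.getPeerConfig().isSerial());
                }
            }
        }
    }
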
On Monday, December 13, 2021, 07:14:43 CET, Mallikarjun <mallik.v.ar...@gmail.com> wrote:
 
 Thanks Duo

I will apply this patch and verify it against the issue I mentioned above.


On Sun, Dec 12, 2021, 8:06 PM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:

> We have fixed several replication-related issues which may cause data loss, for example this one:
>
> https://issues.apache.org/jira/browse/HBASE-26482
>
> For serial replication, if we miss some WAL files, it usually causes replication to be stuck...
>
> On Sunday, December 12, 2021 at 18:19, Mallikarjun <mallik.v.ar...@gmail.com> wrote:
>
> > SyncTable is to be run manually, when you think there may be inconsistencies between the two clusters, and only for a specific time period.
> >
> > As soon as you disable serial replication, it should start replicating from the time it was stuck. You can build dashboards from the JMX metrics generated by the HMaster to keep track of this, and set up alerts as well.
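> >
> > As a rough illustration (the hostname below is a placeholder, 16030 is assumed to be the default region server info port, and metric names can vary a bit between versions), the replication source metrics can also be pulled from the /jmx servlet and fed into such a dashboard:
> >
> >     import java.net.URI;
> >     import java.net.http.HttpClient;
> >     import java.net.http.HttpRequest;
> >     import java.net.http.HttpResponse;
> >
> >     public class ReplicationMetricsProbe {
> >         public static void main(String[] args) throws Exception {
> >             // Placeholder host; 16030 is the default region server info port.
> >             String url = "http://rzv-db09-hd:16030/jmx"
> >                 + "?qry=Hadoop:service=HBase,name=RegionServer,sub=Replication";
> >             HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
> >             HttpResponse<String> response = HttpClient.newHttpClient()
> >                 .send(request, HttpResponse.BodyHandlers.ofString());
> >             // The JSON includes metrics such as source.sizeOfLogQueue and
> >             // source.ageOfLastShippedOp, which are good candidates for alerting.
> >             System.out.println(response.body());
> >         }
> >     }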
> >
> >
> >
> > On Sun, Dec 12, 2021, 3:33 PM Hamado Dene <hamadod...@yahoo.com.invalid> wrote:
> >
> > > Ok, perfect. How often should this sync run? I guess in this case you have to automate it somehow, correct?
> > > Since I will have to disable serial mode, do I first have to align the tables manually, or will the region servers start replicating from where they were blocked the moment I disable serial mode?
> > >
> > >
> > > On Sunday, December 12, 2021, 10:55:05 CET, Mallikarjun <mallik.v.ar...@gmail.com> wrote:
> > >
> > >  https://hbase.apache.org/book.html#hashtable.synctable
> > >
> > > Use it to copy the differences between the tables for a specific time period.
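> > >
> > > A minimal sketch of driving the two MapReduce jobs from Java (the table name, hash output directory, time window and ZooKeeper quorum below are placeholders, and in practice the HashTable step is run against the source cluster while the SyncTable step is run against the target, so this only illustrates the arguments; the book link above has the full usage):
> > >
> > >     import org.apache.hadoop.conf.Configuration;
> > >     import org.apache.hadoop.hbase.HBaseConfiguration;
> > >     import org.apache.hadoop.hbase.mapreduce.HashTable;
> > >     import org.apache.hadoop.hbase.mapreduce.SyncTable;
> > >     import org.apache.hadoop.util.ToolRunner;
> > >
> > >     public class SyncTableRunner {
> > >         public static void main(String[] args) throws Exception {
> > >             Configuration conf = HBaseConfiguration.create();
> > >
> > >             // Step 1 (source cluster): hash the table, limited to the suspect
> > >             // time window via --starttime/--endtime (epoch millis, placeholders).
> > >             ToolRunner.run(conf, new HashTable(conf), new String[] {
> > >                 "--starttime=1638316800000",
> > >                 "--endtime=1639353600000",
> > >                 "myTable",                                // placeholder table name
> > >                 "hdfs://source-nn/hashes/myTable"         // placeholder hash output dir
> > >             });
> > >
> > >             // Step 2 (target cluster): compare against the hashes and repair the
> > >             // differences; switch --dryrun to false to actually write.
> > >             ToolRunner.run(conf, new SyncTable(conf), new String[] {
> > >                 "--dryrun=true",
> > >                 "--sourcezkcluster=source-zk1,source-zk2,source-zk3:2181:/hbase",
> > >                 "hdfs://source-nn/hashes/myTable",        // hash dir produced in step 1
> > >                 "myTable",                                // source table
> > >                 "myTable"                                 // target table
> > >             });
> > >         }
> > >     }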
> > >
> > > On Sun, Dec 12, 2021, 3:12 PM Hamado Dene <hamadod...@yahoo.com.invalid> wrote:
> > >
> > > > Interesting, thank you very much for the info. I'll try disabling serial replication. As for the "sync table utility", what do you mean? I am new to HBase and not yet familiar with all of the HBase tools.
> > > >
> > > >
> > > >
> > > > On Sunday, December 12, 2021, 10:15:01 CET, Mallikarjun <mallik.v.ar...@gmail.com> wrote:
> > > >
> > > > We have faced issues with serial replication when a region server in either cluster suffers a hardware failure, typically a memory failure from my understanding. I could not spend enough time reproducing it reliably to identify the root cause, so I don't know what causes it.
> > > >
> > > > The issue could be that your serial replication has deadlocked among the region servers: they cannot make any progress because an older sequence ID has not been replicated, and that older sequence ID is not at the front of the line to be able to replicate itself.
> > > >
> > > > Quick fix: disable serial replication temporarily so that out-of-order shipping is allowed and replication is unblocked. This can result in some inconsistencies between the clusters, which can be fixed with the SyncTable utility since your setup is active-passive (see the sketch below).
> > > >
> > > > Another fix: delete the replication barriers for each region in hbase:meta. Same consequence as above.
> > > >
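> > > > For the quick fix, here is a minimal sketch of flipping the serial flag off (and later back on by passing true) through the Java Admin API; the peer id "replicav1" is the one that appears further down this thread, and this is an illustration rather than a tested recipe:
> > > >
> > > >     import org.apache.hadoop.conf.Configuration;
> > > >     import org.apache.hadoop.hbase.HBaseConfiguration;
> > > >     import org.apache.hadoop.hbase.client.Admin;
> > > >     import org.apache.hadoop.hbase.client.Connection;
> > > >     import org.apache.hadoop.hbase.client.ConnectionFactory;
> > > >     import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;
> > > >
> > > >     public class ToggleSerialReplication {
> > > >         public static void main(String[] args) throws Exception {
> > > >             Configuration conf = HBaseConfiguration.create();
> > > >             try (Connection conn = ConnectionFactory.createConnection(conf);
> > > >                  Admin admin = conn.getAdmin()) {
> > > >                 String peerId = "replicav1"; // peer id from this thread
> > > >                 boolean serial = false;      // false = allow out-of-order shipping for now
> > > >                 ReplicationPeerConfig current = admin.getReplicationPeerConfig(peerId);
> > > >                 ReplicationPeerConfig updated =
> > > >                     ReplicationPeerConfig.newBuilder(current).setSerial(serial).build();
> > > >                 admin.updateReplicationPeerConfig(peerId, updated);
> > > >             }
> > > >         }
> > > >     }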
> > > >
> > > > On Sun, Dec 12, 2021, 2:24 PM Hamado Dene <hamadod...@yahoo.com.invalid> wrote:
> > > >
> > > > > I'm using HBase 2.2.6 with Hadoop 2.8.5. Yes, serial replication is enabled. This is my peer configuration:
> > > > >
> > > > > Peer Id: replicav1
> > > > > Cluster Key: acv-db10-hn,acv-db11-hn,acv-db12-hn:2181:/hbase
> > > > > Endpoint:
> > > > > State: ENABLED
> > > > > IsSerial: true
> > > > > Bandwidth: UNLIMITED
> > > > > ReplicateAll: true
> > > > > Namespaces:
> > > > > Exclude Namespaces:
> > > > > Table Cfs:
> > > > > Exclude Table Cfs:
> > > > >
> > > > > On Sunday, December 12, 2021, 09:39:44 CET, Mallikarjun <mallik.v.ar...@gmail.com> wrote:
> > > > >
> > > > > Which version of HBase are you using? Is serial replication enabled?
> > > > >
> > > > > ---
> > > > > Mallikarjun
> > > > >
> > > > >
> > > > > On Sun, Dec 12, 2021 at 1:54 PM Hamado Dene <hamadod...@yahoo.com.invalid> wrote:
> > > > >
> > > > > > Hi HBase community,
> > > > > >
> > > > > > On our production installation we have two HBase clusters in two different datacenters. The primary datacenter replicates its data to the secondary datacenter. When we create tables, we create them first on the secondary datacenter and then on the primary, and then we set the replication scope to 1 on the primary. The peer pointing to the ZooKeeper quorum of the secondary cluster is configured on the primary.
> > > > > >
> > > > > > Initially, replication worked fine and data was replicated. We have recently noticed that some tables are empty in the secondary datacenter, so most likely the data is no longer being replicated. I'm seeing lines like this in the logs:
> > > > > >
> > > > > >
> > > > > > Recovered source for cluster/machine(s) replicav1: Total replicated edits: 0, current progress:walGroup [db11%2C16020%2C1637849866921]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/oldWALs/db11-hd%2C16020%2C1637849866921.1637849874263 at position: -1
> > > > > > Recovered source for cluster/machine(s) replicav1: Total replicated edits: 0, current progress:walGroup [db09%2C16020%2C1637589840862]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/oldWALs/db09-hd%2C16020%2C1637589840862.1637589846870 at position: -1
> > > > > > Recovered source for cluster/machine(s) replicav1: Total replicated edits: 0, current progress:walGroup [db13%2C16020%2C1635424806449]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/oldWALs/db13%2C16020%2C1635424806449.1635424812985 at position: -1
> > > > > >
> > > > > >
> > > > > >
> > > > > > 2021-12-12 09:13:47,148 INFO  [rzv-db09-hd:16020Replication Statistics #0] regionserver.Replication: Normal source for cluster replicav1: Total replicated edits: 0, current progress:walGroup [db09%2C16020%2C1638791923537]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/WALs/rzv-db09-hd.rozzano.diennea.lan,16020,1638791923537/rzv-db09-hd.rozzano.diennea.lan%2C16020%2C1638791923537.1638791930213 at position: -1
> > > > > > Recovered source for cluster/machine(s) replicav1: Total replicated edits: 0, current progress:walGroup [db09%2C16020%2C1634401671527]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db09-hd.rozzano.diennea.lan%2C16020%2C1634401671527.1634401679218 at position: -1
> > > > > > Recovered source for cluster/machine(s) replicav1: Total replicated edits: 0, current progress:walGroup [db10%2C16020%2C1637585899997]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db10-hd.rozzano.diennea.lan%2C16020%2C1637585899997.1637585906625 at position: -1
> > > > > >
> > > > > >
> > > > > >
> > > > > > 2021-12-12 08:24:58,561 WARN  [regionserver/rzv-db12-hd:16020.logRoller] regionserver.ReplicationSource: WAL group db12%2C16020%2C1638790692057 queue size: 187 exceeds value of replication.source.log.queue.warn: 2
> > > > > >
> > > > > > Do you have any info on what could be the problem?
> > > > > >
> > > > > > Thanks
> > > > >
> > > >
> > >
> >
>
  
