Thanks, Duo. I will apply this patch and verify whether it fixes the issue I mentioned above.
On Sun, Dec 12, 2021, 8:06 PM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:

> We have fixed several replication-related issues which may cause data loss,
> for example, this one:
>
> https://issues.apache.org/jira/browse/HBASE-26482
>
> For serial replication, if we miss some WAL files, it usually causes
> replication to get stuck...
>
> Mallikarjun <mallik.v.ar...@gmail.com> wrote on Sun, Dec 12, 2021 at 18:19:
>
> > Sync table is to be run manually, and only for a specific time period,
> > when you think there may be inconsistencies between the 2 clusters.
> >
> > As soon as you disable serial replication, it should start replicating
> > from the point where it was stuck. You can build dashboards from the JMX
> > metrics generated by hmaster to track this, and set up alerts as well.
> >
> > On Sun, Dec 12, 2021, 3:33 PM Hamado Dene <hamadod...@yahoo.com.invalid>
> > wrote:
> >
> > > Ok, perfect. How often should this sync run? I guess in this case you
> > > have to automate it somehow, correct?
> > > Since I will have to disable serial mode, do I first have to align the
> > > tables manually, or will the regionservers start replicating from where
> > > they were blocked the moment I disable serial mode?
> > >
> > > On Sunday, December 12, 2021, 10:55:05 CET, Mallikarjun
> > > <mallik.v.ar...@gmail.com> wrote:
> > >
> > > https://hbase.apache.org/book.html#hashtable.synctable
> > >
> > > To copy the difference between tables for a specific time period.
> > >
> > > On Sun, Dec 12, 2021, 3:12 PM Hamado Dene <hamadod...@yahoo.com.invalid>
> > > wrote:
> > >
> > > > Interesting, thank you very much for the info. I'll try to disable
> > > > serial replication. As for the "sync table utility", what do you mean?
> > > > I am new to HBase and not yet familiar with all the HBase tools.
> > > >
> > > > On Sunday, December 12, 2021, 10:15:01 CET, Mallikarjun
> > > > <mallik.v.ar...@gmail.com> wrote:
> > > >
> > > > We have faced issues with serial replication when one of the region
> > > > servers of either cluster suffers a hardware failure, typically memory,
> > > > from my understanding. I could not spend enough time to reproduce it
> > > > reliably and identify the root cause, so I don't know why it happens.
> > > >
> > > > The issue could be that your serial replication has gotten into a
> > > > deadlock among the region servers, which are not able to make any
> > > > progress because an older sequence ID has not been replicated, and
> > > > that older sequence ID is not at the front of the line to be able to
> > > > replicate itself.
> > > >
> > > > Quick fix: disable serial replication temporarily so that out-of-order
> > > > replication is allowed, which unblocks the replication. This can
> > > > result in some inconsistencies between the clusters, which can be
> > > > fixed using the sync table utility since your setup is active-passive.
> > > >
> > > > Another fix: delete the barriers for each region in hbase:meta. Same
> > > > consequence as above.
> > > >
> > > > On Sun, Dec 12, 2021, 2:24 PM Hamado Dene <hamadod...@yahoo.com.invalid>
> > > > wrote:
> > > >
> > > > > I'm using HBase 2.2.6 with Hadoop 2.8.5. Yes, serial replication is
> > > > > enabled. This is my peer configuration:
> > > > >
> > > > > | Peer Id | Cluster Key | Endpoint | State | IsSerial | Bandwidth | ReplicateAll | Namespaces | Exclude Namespaces | Table Cfs | Exclude Table Cfs |
> > > > > | replicav1 | acv-db10-hn,acv-db11-hn,acv-db12-hn:2181:/hbase | | ENABLED | true | UNLIMITED | true |
> > > > >
> > > > > On Sunday, December 12, 2021, 09:39:44 CET, Mallikarjun
> > > > > <mallik.v.ar...@gmail.com> wrote:
> > > > >
> > > > > Which version of HBase are you using? Is serial replication enabled?
> > > > >
> > > > > ---
> > > > > Mallikarjun
> > > > >
> > > > > On Sun, Dec 12, 2021 at 1:54 PM Hamado Dene
> > > > > <hamadod...@yahoo.com.invalid> wrote:
> > > > >
> > > > > > Hi HBase community,
> > > > > >
> > > > > > On our production installation we have two HBase clusters in two
> > > > > > different datacenters. The primary datacenter replicates the data
> > > > > > to the secondary datacenter. When we create tables, we first create
> > > > > > them on the secondary datacenter and then on the primary, and then
> > > > > > we set the replication scope to 1 on the primary. The peer pointing
> > > > > > to the ZooKeeper quorum of the secondary cluster is configured on
> > > > > > the primary.
> > > > > > Initially, replication worked fine and data was replicated. We have
> > > > > > recently noticed that some tables are empty in the secondary
> > > > > > datacenter, so most likely the data is no longer being replicated.
> > > > > > I'm seeing lines like this in the logs:
> > > > > >
> > > > > > Recovered source for cluster/machine(s) replicav1: Total replicated edits: 0, current progress:walGroup [db11%2C16020%2C1637849866921]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/oldWALs/db11-hd%2C16020%2C1637849866921.1637849874263 at position: -1
> > > > > > Recovered source for cluster/machine(s) replicav1: Total replicated edits: 0, current progress:walGroup [db09%2C16020%2C1637589840862]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/oldWALs/db09-hd%2C16020%2C1637589840862.1637589846870 at position: -1
> > > > > > Recovered source for cluster/machine(s) replicav1: Total replicated edits: 0, current progress:walGroup [db13%2C16020%2C1635424806449]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/oldWALs/db13%2C16020%2C1635424806449.1635424812985 at position: -1
> > > > > >
> > > > > > 2021-12-12 09:13:47,148 INFO [rzv-db09-hd:16020Replication Statistics #0] regionserver.Replication: Normal source for cluster replicav1: Total replicated edits: 0, current progress:walGroup [db09%2C16020%2C1638791923537]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/WALs/rzv-db09-hd.rozzano.diennea.lan,16020,1638791923537/rzv-db09-hd.rozzano.diennea.lan%2C16020%2C1638791923537.1638791930213 at position: -1
> > > > > > Recovered source for cluster/machine(s) replicav1: Total replicated edits: 0, current progress:walGroup [db09%2C16020%2C1634401671527]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db09-hd.rozzano.diennea.lan%2C16020%2C1634401671527.1634401679218 at position: -1
> > > > > > Recovered source for cluster/machine(s) replicav1: Total replicated edits: 0, current progress:walGroup [db10%2C16020%2C1637585899997]: currently replicating from: hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db10-hd.rozzano.diennea.lan%2C16020%2C1637585899997.1637585906625 at position: -1
> > > > > >
> > > > > > 2021-12-12 08:24:58,561 WARN [regionserver/rzv-db12-hd:16020.logRoller] regionserver.ReplicationSource: WAL group db12%2C16020%2C1638790692057 queue size: 187 exceeds value of replication.source.log.queue.warn: 2
> > > > > >
> > > > > > Do you have any info on what could be the problem?
> > > > > >
> > > > > > Thanks
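
For anyone who wants to scan regionserver logs for the symptom described in the original message (replication sources reporting 0 replicated edits at position -1), here is a minimal Python sketch. It is not an HBase tool. The regular expression is only modelled on the status lines quoted in this thread, so other HBase versions may print them differently, and the function name `find_idle_sources` and the returned field names are made up for illustration. It expects raw log text without mail quote markers.

```python
import re

# Pattern modelled on the replication status lines quoted in this thread.
# Assumption: other HBase versions may format these lines differently.
STATUS_RE = re.compile(
    r"(?P<kind>Normal|Recovered) source for cluster(?:/machine\(s\))? "
    r"(?P<peer>\S+): Total replicated edits: (?P<edits>\d+), "
    r"current progress:\s*walGroup \[(?P<group>[^\]]+)\]: "
    r"currently replicating from: (?P<wal>\S+) at position: (?P<pos>-?\d+)"
)


def find_idle_sources(log_text):
    """Return one dict per status entry with 0 replicated edits at position -1."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    idle = []
    for match in STATUS_RE.finditer(flat):
        edits = int(match.group("edits"))
        position = int(match.group("pos"))
        if edits == 0 and position == -1:
            idle.append(
                {
                    "kind": match.group("kind"),      # Normal or Recovered
                    "peer": match.group("peer"),      # replication peer id
                    "wal_group": match.group("group"),
                    "wal": match.group("wal"),        # WAL file being read
                }
            )
    return idle
```

Calling `find_idle_sources(open(path).read())` on a regionserver log returns one entry per status report that shows no progress. A source that has simply not been asked to ship anything yet can look the same, so this is a hint to investigate rather than proof that replication is stuck.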