Hi HBase community:

We have been testing cyclic replication for one week. The basic functionality 
seems to work as described in the documentation. However, once we increased 
the write workload, replication started to miss data (i.e. some data is not 
replicated to the other cluster). We have narrowed the problem down to a 
scenario that reproduces it quite consistently:

-----------------------------
Setup:
- We have set up 2 clusters (cluster A and cluster B) of identical size in 
terms of number of nodes and configuration: 3 regionservers sit on top of 3 
datanodes.
- Cyclic replication is enabled.

- We use YCSB to generate load on HBase. The workload is very similar to 
workloada:

recordcount=200000
operationcount=200000
workload=com.yahoo.ycsb.workloads.CoreWorkload
fieldcount=1
fieldlength=25000
 
readallfields=true
writeallfields=true
 
readproportion=0
updateproportion=1
scanproportion=0
insertproportion=0
 
requestdistribution=uniform
 
- Records are inserted into Cluster A. After the benchmark is done, we wait 
until all data has been replicated to Cluster B and then run the verifyrep 
MapReduce job for validation.
- Data is deleted from both tables (truncate 'tablename') before a new 
experiment is started.
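
For reference, the validation step can be sketched roughly as below. This 
assumes "verifyrep" refers to the stock VerifyReplication MapReduce job; the 
peer id "1" and the table name "usertable" are placeholders, not values taken 
from our setup. The helper only builds the command line and does not run it.

```python
import shlex

# Class name of the stock replication verification job shipped with HBase.
VERIFY_CLASS = "org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication"


def verifyrep_command(peer_id, table, start_ms=None, stop_ms=None, families=None):
    """Build the argv for a VerifyReplication run.

    peer_id and table are placeholders supplied by the caller. The optional
    time range (epoch milliseconds) limits the comparison to one benchmark run.
    """
    argv = ["hbase", VERIFY_CLASS]
    if start_ms is not None:
        argv.append(f"--starttime={start_ms}")
    if stop_ms is not None:
        argv.append(f"--stoptime={stop_ms}")
    if families:
        argv.append("--families=" + ",".join(families))
    # Positional arguments come last: replication peer id, then table name.
    argv += [str(peer_id), table]
    return argv


if __name__ == "__main__":
    # Print a copy-pasteable command line (placeholder peer id and table).
    print(shlex.join(verifyrep_command("1", "usertable")))
```

The job reports its result through the GOODROWS and BADROWS counters, so a 
non-zero BADROWS count is what we treat as missing or mismatched data.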

Scenario:
When we increase the number of threads until it maxes out the throughput of 
the cluster, some data goes missing in Cluster B (total count != 200000) 
although cluster A clearly has all of it. This happens even with region 
splitting disabled in both clusters (it happens more often when region splits 
occur). To get more control over what is happening, we then disabled the load 
balancer so that the region responsible for the replicated data would not 
relocate to another regionserver during the benchmark. The situation improved 
a lot: we did not see any missing data in 5 consecutive runs. Finally, we 
moved the region from one regionserver to another during the benchmark to see 
whether the problem would reappear, and it did.
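
The last two steps (balancer off, then a region move in the middle of a run) 
can be reproduced with something like the sketch below. It only renders text 
to pipe into `hbase shell`; the encoded region name and the server name are 
hypothetical placeholders and need to be replaced with real values, e.g. from 
the master UI.

```python
def region_move_script(encoded_region, target_server, disable_balancer=True):
    """Render HBase shell commands that pin the balancer and move one region.

    encoded_region: encoded region name (placeholder supplied by the caller).
    target_server:  destination in 'host,port,startcode' form (placeholder).
    """
    lines = []
    if disable_balancer:
        # Keep the balancer from relocating regions on its own during the run.
        lines.append("balance_switch false")
    # Force the region onto the chosen regionserver.
    lines.append(f"move '{encoded_region}', '{target_server}'")
    lines.append("exit")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values only; pipe the output into `hbase shell`.
    print(region_move_script("ENCODED_REGIONNAME",
                             "host187.example.com,60020,1289493121758"), end="")
```

Running the move while YCSB is still writing at full throughput is what 
brings the missing rows back for us.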

We believe that the issue could be related to region splitting and load 
balancing during intensive writes; the HBase replication strategy does not yet 
seem to cover those corner cases.

Can someone take a look at it and suggest some ways to work around this?

Thanks~

Jerry
