Jerry:
Currently TestReplicationPeer and TestReplication don't cover a load-balancing scenario. If you can write a test where the load balancer reassigns some regions, that would help us pinpoint the problem.
Cheers

On Fri, Apr 20, 2012 at 12:34 PM, lars hofhansl <[email protected]> wrote:

> Hi Jerry,
>
> which version of HBase are you using?
>
> You are not using cyclic backup, that needs >2 clusters. I assume you're
> just replicating from one cluster to another, right?
>
> There is never data loss in Cluster A?
>
> -- Lars
>
>
> ----- Original Message -----
> From: Jerry Lam <[email protected]>
> To: [email protected]
> Cc:
> Sent: Friday, April 20, 2012 5:38 AM
> Subject: HBase Cyclic Replication Issue: some data are missing in the
> replication for intensive write
>
> Hi HBase community:
>
> We have been testing cyclic replication for 1 week. The basic
> functionality seems to work as described in the documentation. However,
> when we started to increase the write workload, replication started to miss
> data (i.e. some data are not replicated to the other cluster). We have
> narrowed it down to a scenario in which we can reproduce the problem quite
> consistently, and here it is:
>
> -----------------------------
> Setup:
> - We have set up 2 clusters (cluster A and cluster B) with identical size in
> terms of number of nodes and configuration; 3 regionservers sit on top of 3
> datanodes.
> - Cyclic replication is enabled.
>
> - We use YCSB to generate load on HBase; the workload is very similar to
> workloada:
>
> recordcount=200000
> operationcount=200000
> workload=com.yahoo.ycsb.workloads.CoreWorkload
> fieldcount=1
> fieldlength=25000
>
> readallfields=true
> writeallfields=true
>
> readproportion=0
> updateproportion=1
> scanproportion=0
> insertproportion=0
>
> requestdistribution=uniform
>
> - Records are inserted into Cluster A. After the benchmark is done, we wait
> until all data are replicated to Cluster B and then use the verifyrep
> mapreduce job for validation.
> - Data are deleted from both tables (truncate 'tablename') before a new
> experiment is started.
>
> Scenario:
> When we increase the number of threads until it maxes out the throughput of
> the cluster, we see that some data are missing in Cluster B (total count !=
> 200000) although cluster A clearly has them all. This happens even though
> we disabled region splitting in both clusters (it happens more often when
> region splits occur). To have more control over what is happening, we then
> decided to disable the load balancer so the region (which is responsible
> for the replicating data) will not relocate to another regionserver during
> the benchmark. The situation improves a lot. We don't see any missing data
> in 5 consecutive runs. Finally, we decided to move the region from one
> regionserver to another during the benchmark to see if the problem would
> reappear, and it did.
>
> We believe that the issue could be related to region splitting and load
> balancing during intensive writes; the HBase replication strategy doesn't
> yet cover those corner cases.
>
> Can someone take a look at it and suggest some ways to work around this?
>
> Thanks~
>
> Jerry
>
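For anyone trying to reproduce this, it may help to know which rows are missing in Cluster B, not only that the total count differs from 200000. The following is a minimal sketch, not part of HBase or of the verifyrep job: it assumes the row keys of both tables have already been exported to plain Python lists by some other means, and the function names `diff_row_keys` and `summarize_replication` are hypothetical.

```python
def diff_row_keys(source_keys, replica_keys):
    """Return (missing, unexpected) row keys, both sorted.

    missing    = keys present in the source cluster but not in the replica
    unexpected = keys present in the replica but not in the source
    """
    source = set(source_keys)
    replica = set(replica_keys)
    return sorted(source - replica), sorted(replica - source)


def summarize_replication(source_keys, replica_keys, expected_count):
    """Build a small report comparing both clusters to the expected row count."""
    missing, unexpected = diff_row_keys(source_keys, replica_keys)
    source_count = len(set(source_keys))
    replica_count = len(set(replica_keys))
    source_complete = source_count == expected_count
    return {
        "source_count": source_count,
        "replica_count": replica_count,
        # Cluster A holds every expected row.
        "source_complete": source_complete,
        # Cluster B matches a complete Cluster A exactly.
        "replica_complete": source_complete and not missing and not unexpected,
        "missing_in_replica": missing,
        "unexpected_in_replica": unexpected,
    }


if __name__ == "__main__":
    # Hypothetical keys for illustration only; real YCSB keys look different.
    cluster_a = ["row-0001", "row-0002", "row-0003"]
    cluster_b = ["row-0001", "row-0003"]
    print(summarize_replication(cluster_a, cluster_b, expected_count=3))
```

Comparing the missing keys against the regions that were moved during the run could show whether the lost rows cluster around the reassigned regions, which is the hypothesis in the message above.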
