Gabriel Reid created HBASE-7634:
-----------------------------------

             Summary: Replication handling of changes to peer clusters is 
inefficient
                 Key: HBASE-7634
                 URL: https://issues.apache.org/jira/browse/HBASE-7634
             Project: HBase
          Issue Type: Bug
          Components: Replication
    Affects Versions: 0.96.0
            Reporter: Gabriel Reid
         Attachments: HBASE-7634.patch

The handling of changes to the region servers in a replication peer 
cluster is currently quite inefficient. The list of region servers that are 
being replicated to is only updated if a large number of issues are 
encountered while replicating.

As a result, it can take quite a while to recognize that a number of the 
region servers in a peer cluster are no longer available. A potentially bigger 
problem is that if a replication peer cluster is started with a small number of 
region servers, and more region servers are added after replication has 
started, the additional region servers will never be used for replication 
(unless there are failures on the in-use region servers).



Part of the current issue is that the retry code in ReplicationSource#shipEdits 
checks a randomly-chosen replication peer region server (in 
ReplicationSource#isSlaveDown) to see if it is up after a replication write has 
failed on a different randomly-chosen replication peer. If the checked peer is 
not seen as down, another randomly-chosen peer is used for writing.
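
The mismatch can be illustrated with a small model. This is only a sketch of 
the behaviour described above, not the actual ReplicationSource code: the class 
RetryCheckSketch, its methods, and the server names are hypothetical, and the 
set of live servers stands in for whatever a real liveness probe would return.

```java
import java.util.Set;

// Hypothetical model of the retry check described above; not HBase code.
final class RetryCheckSketch {

    /** Returns true if the given server is not in the set of live servers. */
    static boolean isDown(String server, Set<String> liveServers) {
        return !liveServers.contains(server);
    }

    /**
     * Models the described behaviour: after a write to failedServer fails,
     * a separately chosen server (probedServer) is checked to decide whether
     * the peer should be treated as down.
     */
    static boolean peerLooksDown(String failedServer,
                                 String probedServer,
                                 Set<String> liveServers) {
        // failedServer is deliberately ignored here. That is the problem
        // being illustrated: the verdict depends only on the probed server.
        return isDown(probedServer, liveServers);
    }
}
```

With "rs1" down and "rs2" up, peerLooksDown("rs1", "rs2", live) returns false, 
so the failure on "rs1" does not count against it and it remains a candidate 
for the next randomly-chosen write.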



A second part of the issue is that changes to the list of region servers in a 
peer cluster are not detected at all, and are only picked up if a certain 
number of failures have occurred when trying to ship edits.
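
One possible direction, sketched here under assumptions rather than taken from 
the attached patch, is to refresh the list whenever a change notification 
arrives instead of waiting for failures to accumulate. The class 
PeerServerListSketch and its methods are hypothetical, and the source of the 
notification (for example a ZooKeeper watch on the peer cluster) is assumed 
and not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

// Hypothetical sketch; class and method names are not taken from HBase.
final class PeerServerListSketch {
    private List<String> servers = Collections.emptyList();
    private int refreshCount = 0;

    /**
     * Called with the peer cluster's current region servers whenever a
     * change notification arrives. Returns true if the cached list was
     * replaced, false if membership was unchanged.
     */
    synchronized boolean onServerListChanged(List<String> current) {
        // Compare membership only, so a reordered list is not a change.
        if (new HashSet<>(servers).equals(new HashSet<>(current))) {
            return false;
        }
        servers = Collections.unmodifiableList(new ArrayList<>(current));
        refreshCount++;
        return true;
    }

    /** Servers currently eligible as replication targets. */
    synchronized List<String> getServers() {
        return servers;
    }

    /** Number of times the cached list has been replaced. */
    synchronized int getRefreshCount() {
        return refreshCount;
    }
}
```

In this model, region servers added to the peer cluster after replication has 
started become eligible as soon as the notification is delivered, independent 
of how many shipEdits failures have occurred.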


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
