[jira] [Commented] (HBASE-12386) Replication gets stuck following a transient zookeeper error to remote peer cluster
[ https://issues.apache.org/jira/browse/HBASE-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426028#comment-16426028 ]

stack commented on HBASE-12386:
-------------------------------

[~apurtell] No worries sir. I have done 10 for any one by anyone else. Just noting these facts in issue as I try to align JIRA and git for branch-2.

> Replication gets stuck following a transient zookeeper error to remote peer cluster
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-12386
>                 URL: https://issues.apache.org/jira/browse/HBASE-12386
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 0.98.7
>            Reporter: Adrian Muraru
>            Assignee: Adrian Muraru
>            Priority: Major
>             Fix For: 0.98.8, 0.99.2, 2.0.0
>
>         Attachments: HBASE-12386-0.98.patch, HBASE-12386.patch
>
>
> Following a transient ZK error replication gets stuck and remote peers are never updated.
> Source region servers are reporting continuously the following error in logs:
> "No replication sinks are available"

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (HBASE-12386) Replication gets stuck following a transient zookeeper error to remote peer cluster
[ https://issues.apache.org/jira/browse/HBASE-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426009#comment-16426009 ]

Andrew Purtell commented on HBASE-12386:
----------------------------------------

Sorry about that.
[jira] [Commented] (HBASE-12386) Replication gets stuck following a transient zookeeper error to remote peer cluster
[ https://issues.apache.org/jira/browse/HBASE-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16425901#comment-16425901 ]

stack commented on HBASE-12386:
-------------------------------

Committed w/o a JIRA ID

commit 0505072c5182841ad1a28d798527c69bcc3348f0
Author: Adrian Muraru
Date:   Thu Oct 30 23:50:02 2014 +0200

    Replication gets stuck following a transient zookeeper error to remote peer cluster

    Signed-off-by: Andrew Purtell
[jira] [Commented] (HBASE-12386) Replication gets stuck following a transient zookeeper error to remote peer cluster
[ https://issues.apache.org/jira/browse/HBASE-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190767#comment-14190767 ]

Adrian Muraru commented on HBASE-12386:
---------------------------------------

Looking at the code, it seems that once the lookup of the remote ZK peers fails, the refresh timestamp is updated and the returned list of RS peers is empty.
org.apache.hadoop.hbase.replication.regionserver.ReplicationSinkManager then does not retry the lookup on the next poll, because the following condition is not met:
{code:java}
if (endpoint.getLastRegionServerUpdate() > this.lastUpdateToPeers) {
  LOG.info("Current list of sinks is out of date, updating");
  chooseSinks();
}
{code}

A fix would be to force a refresh when the list of peers is empty:
{code:java}
if (replicationPeers.getTimestampOfLastChangeToPeer(peerClusterId) > this.lastUpdateToPeers || sinks.isEmpty()) {
  LOG.info("Current list of sinks is out of date or empty, updating");
  chooseSinks();
}
{code}

Note that this does not reproduce in 0.94, where it seems the refresh does happen in this case.
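To make the failure mode described above concrete, here is a minimal, self-contained sketch. It is a toy model, not the real ReplicationSinkManager: the class name SinkRefreshSketch, the logical clock, the peerLookup supplier standing in for the ZooKeeper lookup, and the refreshWhenEmpty switch are all assumptions made for illustration. It assumes, as the comment reports, that the refresh timestamp is bumped even when the lookup returns nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Toy model of the sink-refresh check; all names here are hypothetical. */
class SinkRefreshSketch {
    private final Supplier<List<String>> peerLookup; // stand-in for the remote ZK lookup
    private final boolean refreshWhenEmpty;          // true models the proposed condition
    private List<String> sinks = new ArrayList<String>();
    private long clock = 1L;                         // logical clock instead of wall time
    private final long lastPeerChange = 1L;          // stand-in for the "last change to peer" timestamp
    private long lastUpdateToPeers = 0L;

    SinkRefreshSketch(Supplier<List<String>> peerLookup, boolean refreshWhenEmpty) {
        this.peerLookup = peerLookup;
        this.refreshWhenEmpty = refreshWhenEmpty;
    }

    /** Refreshes the sink list; the timestamp advances even if the lookup came back empty. */
    private void chooseSinks() {
        sinks = new ArrayList<String>(peerLookup.get());
        lastUpdateToPeers = ++clock;
    }

    /** Applies either the original staleness check or the one extended with the emptiness test. */
    List<String> getSinks() {
        boolean outOfDate = lastPeerChange > lastUpdateToPeers;
        if (outOfDate || (refreshWhenEmpty && sinks.isEmpty())) {
            chooseSinks();
        }
        return sinks;
    }

    /** Polls twice against a lookup that fails (returns nothing) on its first call only. */
    static List<String> pollTwice(boolean refreshWhenEmpty) {
        final int[] lookups = {0};
        Supplier<List<String>> flaky = () -> {
            if (lookups[0]++ == 0) {
                return new ArrayList<String>();      // transient error: no peers seen
            }
            return Arrays.asList("rs1", "rs2");
        };
        SinkRefreshSketch manager = new SinkRefreshSketch(flaky, refreshWhenEmpty);
        manager.getSinks();                          // first poll hits the transient error
        return manager.getSinks();                   // second poll
    }

    public static void main(String[] args) {
        System.out.println("original check: " + pollTwice(false));
        System.out.println("patched check:  " + pollTwice(true));
    }
}
```

In this model, pollTwice(false) still returns an empty list on the second poll because the timestamp comparison alone never fires again, while pollTwice(true) returns both region servers because the empty list forces another lookup.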
[jira] [Commented] (HBASE-12386) Replication gets stuck following a transient zookeeper error to remote peer cluster
[ https://issues.apache.org/jira/browse/HBASE-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190926#comment-14190926 ]

Ted Yu commented on HBASE-12386:
--------------------------------

{code}
+    if (endpoint.getLastRegionServerUpdate() > this.lastUpdateToPeers || sinks.isEmpty()) {
+      LOG.info("Current list of sinks is out of date or empty, updating");
{code}
It would be helpful if the condition (list out of date or empty) were stated clearly in the log message.
[jira] [Commented] (HBASE-12386) Replication gets stuck following a transient zookeeper error to remote peer cluster
[ https://issues.apache.org/jira/browse/HBASE-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190960#comment-14190960 ]

Lars Hofhansl commented on HBASE-12386:
---------------------------------------

{{Current list of sinks is out of date or empty, updating}} seems clear enough to me.

+1 on patch.

One thing we have to think through is what happens when the slave cluster is down for a bit. We'd choose sinks again on each call. I think that's OK, especially since we recently dialed down the retry interval to 5mins after a bit.

Also, we can still be in a bad situation where RegionServers die and restart at the slave cluster: we could go down to a single RS at the peers before we try to choose sinks again. That's for another issue.
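The cost raised above, choosing sinks again on each call while the slave cluster is down, can be counted with a small sketch. This is a hypothetical model that keeps only the emptiness half of the patched condition; PeerDownSketch, setPeerUp, lookupCount, and lookupsAfter are illustrative names that do not exist in HBase.

```java
import java.util.ArrayList;
import java.util.List;

/** Counts peer lookups while the peer is down and after it returns (hypothetical model). */
class PeerDownSketch {
    private List<String> sinks = new ArrayList<String>();
    private boolean peerUp = false;
    private int lookups = 0;

    void setPeerUp(boolean up) {
        peerUp = up;
    }

    int lookupCount() {
        return lookups;
    }

    /** One lookup per call; yields a single sink only when the peer is reachable. */
    private void chooseSinks() {
        lookups++;
        sinks = new ArrayList<String>();
        if (peerUp) {
            sinks.add("rs1");
        }
    }

    /** Models only the sinks.isEmpty() trigger of the patched condition. */
    List<String> getSinks() {
        if (sinks.isEmpty()) {
            chooseSinks();
        }
        return sinks;
    }

    /** Total lookups after polling while the peer is down, then while it is up. */
    static int lookupsAfter(int pollsWhileDown, int pollsWhileUp) {
        PeerDownSketch manager = new PeerDownSketch();
        for (int i = 0; i < pollsWhileDown; i++) {
            manager.getSinks();
        }
        manager.setPeerUp(true);
        for (int i = 0; i < pollsWhileUp; i++) {
            manager.getSinks();
        }
        return manager.lookupCount();
    }
}
```

In this model every poll during the outage costs one lookup, and a single further lookup after the peer returns is enough to stop the retries, so the extra work is bounded by how often the caller polls.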
[jira] [Commented] (HBASE-12386) Replication gets stuck following a transient zookeeper error to remote peer cluster
[ https://issues.apache.org/jira/browse/HBASE-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191048#comment-14191048 ]

Hadoop QA commented on HBASE-12386:
-----------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12678323/HBASE-12386.patch
  against trunk revision .
  ATTACHMENT ID: 12678323

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

    {color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100

    {color:green}+1 site{color}. The mvn site goal succeeds with this patch.

    {color:red}-1 core tests{color}. The patch failed these unit tests:
                       org.apache.hadoop.hbase.coprocessor.TestCoprocessorHConnection

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/11528//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/checkstyle-aggregate.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/11528//console

This message is automatically generated.