[jira] [Commented] (HBASE-12386) Replication gets stuck following a transient zookeeper error to remote peer cluster

2018-04-04 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426028#comment-16426028
 ] 

stack commented on HBASE-12386:
---

[~apurtell] No worries, sir. I have done ten of these for every one done by 
anyone else. Just noting these facts in the issue as I try to align JIRA and 
git for branch-2.

> Replication gets stuck following a transient zookeeper error to remote peer 
> cluster
> ---
>
> Key: HBASE-12386
> URL: https://issues.apache.org/jira/browse/HBASE-12386
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 0.98.7
> Reporter: Adrian Muraru
> Assignee: Adrian Muraru
> Priority: Major
> Fix For: 0.98.8, 0.99.2, 2.0.0
>
> Attachments: HBASE-12386-0.98.patch, HBASE-12386.patch
>
>
> Following a transient ZK error, replication gets stuck and remote peers are 
> never updated.
> Source region servers continuously report the following error in their logs:
> "No replication sinks are available"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-12386) Replication gets stuck following a transient zookeeper error to remote peer cluster

2018-04-04 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426009#comment-16426009
 ] 

Andrew Purtell commented on HBASE-12386:
---

Sorry about that.



[jira] [Commented] (HBASE-12386) Replication gets stuck following a transient zookeeper error to remote peer cluster

2018-04-04 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16425901#comment-16425901
 ] 

stack commented on HBASE-12386:
---

Committed w/o a JIRA ID

commit 0505072c5182841ad1a28d798527c69bcc3348f0
Author: Adrian Muraru 
Date:   Thu Oct 30 23:50:02 2014 +0200

    Replication gets stuck following a transient zookeeper error to remote peer cluster

    Signed-off-by: Andrew Purtell 





[jira] [Commented] (HBASE-12386) Replication gets stuck following a transient zookeeper error to remote peer cluster

2014-10-30 Thread Adrian Muraru (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190767#comment-14190767
 ] 

Adrian Muraru commented on HBASE-12386:
---

Looking at the code, it seems that once the lookup of the remote ZK peers 
fails, the refresh timestamp is still updated and the returned list of RS 
peers is empty.

On the next poll, 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSinkManager does 
not retry the lookup because the following condition is not met:
{code:java}
if (endpoint.getLastRegionServerUpdate() > this.lastUpdateToPeers) {
  LOG.info("Current list of sinks is out of date, updating");
  chooseSinks();
}
{code}

A fix would be to force a refresh when the list of peers is empty:
{code:java}
if (replicationPeers.getTimestampOfLastChangeToPeer(peerClusterId) > this.lastUpdateToPeers
    || sinks.isEmpty()) {
  LOG.info("Current list of sinks is out of date or empty, updating");
  chooseSinks();
}
{code}

Note that this does not reproduce in 0.94, where the refresh does seem to 
happen in this case.
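
To make the failure mode above concrete, here is a minimal, self-contained 
sketch. It is not the HBase code: SinkRefreshSketch, peerLookup, 
refreshWhenEmpty and the logical clock are hypothetical names invented for 
illustration, and the remote ZK lookup is simulated by a supplier that throws 
once and then succeeds. It only assumes the behaviour described above: a 
failed lookup leaves an empty sink list while the "last update" timestamp 
still advances.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for the sink selection logic; NOT the HBase class.
class SinkRefreshSketch {
  private final Supplier<List<String>> peerLookup; // simulates the remote ZK lookup
  private final boolean refreshWhenEmpty;          // true = condition with the proposed fix
  private List<String> sinks = Collections.emptyList();
  private long clock = 0L;                         // logical clock, keeps the demo deterministic
  private long lastPeerChange = 0L;                // when the peer's RS list last changed
  private long lastUpdateToPeers = 0L;             // when sinks were last chosen

  SinkRefreshSketch(Supplier<List<String>> peerLookup, boolean refreshWhenEmpty) {
    this.peerLookup = peerLookup;
    this.refreshWhenEmpty = refreshWhenEmpty;
  }

  // Record that the peer cluster's list of region servers changed.
  void peerChanged() {
    lastPeerChange = ++clock;
  }

  // Refresh if the list is out of date, or (with the fix) if it is empty.
  List<String> getSinks() {
    boolean outOfDate = lastPeerChange > lastUpdateToPeers;
    if (outOfDate || (refreshWhenEmpty && sinks.isEmpty())) {
      chooseSinks();
    }
    return sinks;
  }

  private void chooseSinks() {
    List<String> found;
    try {
      found = peerLookup.get();
    } catch (RuntimeException e) {
      found = Collections.emptyList(); // a failed lookup yields an empty list
    }
    sinks = new ArrayList<>(found);
    lastUpdateToPeers = ++clock;       // the timestamp advances even after a failure
  }

  // One transient lookup failure followed by a second poll; returns the sink count.
  static int sinksAfterTransientError(boolean refreshWhenEmpty) {
    AtomicInteger calls = new AtomicInteger();
    Supplier<List<String>> flakyLookup = () -> {
      if (calls.getAndIncrement() == 0) {
        throw new IllegalStateException("simulated transient ZK error");
      }
      return Arrays.asList("rs1", "rs2");
    };
    SinkRefreshSketch manager = new SinkRefreshSketch(flakyLookup, refreshWhenEmpty);
    manager.peerChanged();
    manager.getSinks();               // first poll: lookup fails, sinks stay empty
    return manager.getSinks().size(); // second poll: refreshed only with the fix
  }

  public static void main(String[] args) {
    System.out.println("original condition: " + sinksAfterTransientError(false) + " sinks");
    System.out.println("with isEmpty check: " + sinksAfterTransientError(true) + " sinks");
  }
}
```

With the original condition the second poll returns 0 sinks, because the 
timestamp comparison no longer fires; with the extra isEmpty check it returns 
the 2 simulated region servers.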




[jira] [Commented] (HBASE-12386) Replication gets stuck following a transient zookeeper error to remote peer cluster

2014-10-30 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190926#comment-14190926
 ] 

Ted Yu commented on HBASE-12386:
---

{code}
+if (endpoint.getLastRegionServerUpdate() > this.lastUpdateToPeers || sinks.isEmpty()) {
+  LOG.info("Current list of sinks is out of date or empty, updating");
{code}
It would be helpful if the condition (list out of date or empty) were stated 
clearly in the log message.



[jira] [Commented] (HBASE-12386) Replication gets stuck following a transient zookeeper error to remote peer cluster

2014-10-30 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190960#comment-14190960
 ] 

Lars Hofhansl commented on HBASE-12386:
---

{{Current list of sinks is out of date or empty, updating}} seems clear 
enough to me.

+1 on patch.

One thing we have to think through is what happens when the slave cluster is 
down for a bit. We'd choose sinks again on each call. I think that's OK, 
especially since we recently dialed the retry interval down to 5 minutes 
after a bit.

Also, we can still end up in a bad situation where RegionServers die and 
restart at the slave cluster: we could go down to a single RS at the peers 
before we try to choose sinks again. That's for another issue.
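
As a rough illustration of the first point, the sketch below counts how many 
lookups the patched condition triggers while the peer cluster stays down. 
EmptySinkPollingSketch and its members are hypothetical names, not HBase 
code; the lookup always returns an empty list to simulate the outage, and 
the caller's retry interval is not modeled, so each loop iteration stands 
for one call that actually reaches the sink manager.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical model of the patched refresh condition while the peer is down.
class EmptySinkPollingSketch {
  private List<String> sinks = Collections.emptyList();
  private long clock = 0L;             // logical clock instead of wall time
  private long lastPeerChange = 0L;    // never changes in this scenario
  private long lastUpdateToPeers = 0L;
  int lookups = 0;                     // number of simulated remote lookups

  // The peer cluster is down, so every lookup comes back empty.
  private List<String> lookupPeers() {
    lookups++;
    return Collections.emptyList();
  }

  List<String> getSinks() {
    // Same shape as the patched condition: out of date OR empty.
    if (lastPeerChange > lastUpdateToPeers || sinks.isEmpty()) {
      sinks = lookupPeers();
      lastUpdateToPeers = ++clock;
    }
    return sinks;
  }

  // Poll the given number of times and report how many lookups were made.
  static int lookupsWhilePeerDown(int polls) {
    EmptySinkPollingSketch manager = new EmptySinkPollingSketch();
    for (int i = 0; i < polls; i++) {
      manager.getSinks();
    }
    return manager.lookups;
  }

  public static void main(String[] args) {
    System.out.println("lookups for 5 polls: " + lookupsWhilePeerDown(5));
  }
}
```

In this model every poll costs one lookup for as long as the list stays 
empty, which is the trade-off accepted above in exchange for recovering 
automatically once the peer comes back.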



[jira] [Commented] (HBASE-12386) Replication gets stuck following a transient zookeeper error to remote peer cluster

2014-10-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191048#comment-14191048
 ] 

Hadoop QA commented on HBASE-12386:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12678323/HBASE-12386.patch
  against trunk revision .
  ATTACHMENT ID: 12678323

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 checkstyle{color}.  The applied patch does not increase the 
total number of checkstyle errors

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

{color:red}-1 core tests{color}.  The patch failed these unit tests:
  org.apache.hadoop.hbase.coprocessor.TestCoprocessorHConnection

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11528//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11528//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11528//console

This message is automatically generated.
