[jira] [Commented] (HBASE-20561) The way we stop a ReplicationSource may cause the RS down

2018-06-14 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512980#comment-16512980
 ] 

Hudson commented on HBASE-20561:


Results for branch master
[build #365 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/365/]: (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/master/365//General_Nightly_Build_Report/]

(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/master/365//JDK8_Nightly_Build_Report_(Hadoop2)/]

(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/master/365//JDK8_Nightly_Build_Report_(Hadoop3)/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> The way we stop a ReplicationSource may cause the RS down
> -
>
> Key: HBASE-20561
> URL: https://issues.apache.org/jira/browse/HBASE-20561
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Reporter: Duo Zhang
> Assignee: Guanghao Zhang
> Priority: Major
> Attachments: HBASE-20561.master.001.patch, HBASE-20561.master.002.patch, HBASE-20561.master.003.patch, HBASE-20561.master.004.patch, HBASE-20561.master.005.patch
>
>
> See this:
> https://builds.apache.org/job/HBASE-Flaky-Tests/31125/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.replication.multiwal.TestReplicationKillMasterRSCompressedWithMultipleAsyncWAL-output.txt
> {noformat}
> 2018-05-09 15:07:00,887 INFO  [RS_REFRESH_PEER-regionserver/asf916:0-1] regionserver.RefreshPeerCallable(52): Received a peer change event, peerId=2, type=REMOVE_PEER
> 2018-05-09 15:07:00,890 INFO  [RS_REFRESH_PEER-regionserver/asf916:0-1] regionserver.ReplicationSource(485): Closing source 2-asf916.gq1.ygridcore.net,36287,1525878368395 because: Replication stream was removed by a user
> 2018-05-09 15:07:00,892 DEBUG [ReplicationExecutor-0.replicationSource,2-asf916.gq1.ygridcore.net,36287,1525878368395.replicationSource.shipperasf916.gq1.ygridcore.net%2C36287%2C1525878368395.asf916.gq1.ygridcore.net%2C36287%2C1525878368395.regiongroup-0,2-asf916.gq1.ygridcore.net,36287,1525878368395] zookeeper.ZKWatcher(617): regionserver:34308-0x163456ff2490004, quorum=localhost:60149, baseZNode=/1 Received InterruptedException, will interrupt current thread and rethrow a SystemErrorException
> java.lang.InterruptedException
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Object.wait(Object.java:502)
>   at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1406)
>   at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:871)
>   at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:166)
>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1231)
>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1220)
>   at org.apache.hadoop.hbase.replication.ZKReplicationQueueStorage.removeWAL(ZKReplicationQueueStorage.java:198)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.lambda$cleanOldLogs$8(ReplicationSourceManager.java:526)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.abortWhenFail(ReplicationSourceManager.java:454)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceManager.java:526)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceManager.java:506)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:489)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.updateLogPosition(ReplicationSourceShipper.java:231)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.shipEdits(ReplicationSourceShipper.java:133)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:103)
> 2018-05-09 15:07:00,892 DEBUG [ReplicationExecutor-0.replicationSource,2-asf916.gq1.ygridcore.net,36287,1525878368395.replicationSource.shipperasf916.gq1.ygridcore.net%2C36287%2C1525878368395.asf916.gq1.ygridcore.net%2C36287%2C1525878368395.regiongroup-1,2-asf916.gq1.ygridcore.net,36287,1525878368395] zookeeper.ZKWatcher(617):

[jira] [Commented] (HBASE-20561) The way we stop a ReplicationSource may cause the RS down

2018-06-13 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511353#comment-16511353
 ] 

Hudson commented on HBASE-20561:


Results for branch branch-2
[build #858 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/858/]: (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/858//General_Nightly_Build_Report/]

(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/858//JDK8_Nightly_Build_Report_(Hadoop2)/]

(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/858//JDK8_Nightly_Build_Report_(Hadoop3)/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.



[jira] [Commented] (HBASE-20561) The way we stop a ReplicationSource may cause the RS down

2018-06-13 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510861#comment-16510861
 ] 

Hadoop QA commented on HBASE-20561:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 2m 27s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 21s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 9s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 48s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 15s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 15s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 9s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 8m 54s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 42s{color} | {color:green} hbase-zookeeper in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}114m 11s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 29s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}156m 12s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-20561 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12927596/HBASE-20561.master.005.patch |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 6b776059eb73 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 8648af07d4 |
| maven | version: 

[jira] [Commented] (HBASE-20561) The way we stop a ReplicationSource may cause the RS down

2018-06-13 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510670#comment-16510670
 ] 

Duo Zhang commented on HBASE-20561:
---

OK. +1.

> The way we stop a ReplicationSource may cause the RS down
> -
>
> Key: HBASE-20561
> URL: https://issues.apache.org/jira/browse/HBASE-20561
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Reporter: Duo Zhang
> Assignee: Guanghao Zhang
> Priority: Major
> Attachments: HBASE-20561.master.001.patch, HBASE-20561.master.002.patch, HBASE-20561.master.003.patch, HBASE-20561.master.004.patch
>
>
> See this:
> https://builds.apache.org/job/HBASE-Flaky-Tests/31125/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.replication.multiwal.TestReplicationKillMasterRSCompressedWithMultipleAsyncWAL-output.txt
> {noformat}
> 2018-05-09 15:07:00,887 INFO  [RS_REFRESH_PEER-regionserver/asf916:0-1] regionserver.RefreshPeerCallable(52): Received a peer change event, peerId=2, type=REMOVE_PEER
> 2018-05-09 15:07:00,890 INFO  [RS_REFRESH_PEER-regionserver/asf916:0-1] regionserver.ReplicationSource(485): Closing source 2-asf916.gq1.ygridcore.net,36287,1525878368395 because: Replication stream was removed by a user
> 2018-05-09 15:07:00,892 DEBUG [ReplicationExecutor-0.replicationSource,2-asf916.gq1.ygridcore.net,36287,1525878368395.replicationSource.shipperasf916.gq1.ygridcore.net%2C36287%2C1525878368395.asf916.gq1.ygridcore.net%2C36287%2C1525878368395.regiongroup-0,2-asf916.gq1.ygridcore.net,36287,1525878368395] zookeeper.ZKWatcher(617): regionserver:34308-0x163456ff2490004, quorum=localhost:60149, baseZNode=/1 Received InterruptedException, will interrupt current thread and rethrow a SystemErrorException
> java.lang.InterruptedException
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Object.wait(Object.java:502)
>   at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1406)
>   at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:871)
>   at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:166)
>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1231)
>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1220)
>   at org.apache.hadoop.hbase.replication.ZKReplicationQueueStorage.removeWAL(ZKReplicationQueueStorage.java:198)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.lambda$cleanOldLogs$8(ReplicationSourceManager.java:526)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.abortWhenFail(ReplicationSourceManager.java:454)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceManager.java:526)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceManager.java:506)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:489)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.updateLogPosition(ReplicationSourceShipper.java:231)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.shipEdits(ReplicationSourceShipper.java:133)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:103)
> 2018-05-09 15:07:00,892 DEBUG [ReplicationExecutor-0.replicationSource,2-asf916.gq1.ygridcore.net,36287,1525878368395.replicationSource.shipperasf916.gq1.ygridcore.net%2C36287%2C1525878368395.asf916.gq1.ygridcore.net%2C36287%2C1525878368395.regiongroup-1,2-asf916.gq1.ygridcore.net,36287,1525878368395] zookeeper.ZKWatcher(617): regionserver:34308-0x163456ff2490004, quorum=localhost:60149, baseZNode=/1 Received InterruptedException, will interrupt current thread and rethrow a SystemErrorException
> java.lang.InterruptedException
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Object.wait(Object.java:502)
>   at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1406)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:990)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:910)
>   at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:663)
>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1690)
>   at org.apache.hadoop.hbase.replication.ZKReplicationQueueStorage.setWALPosition(ZKReplicationQueueStorage.java:246)
>   at 
> 

[jira] [Commented] (HBASE-20561) The way we stop a ReplicationSource may cause the RS down

2018-06-12 Thread Guanghao Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510583#comment-16510583
 ] 

Guanghao Zhang commented on HBASE-20561:


This case only happens when refreshing the peer source. Because there is a lock 
between remove peer and refresh peer, there is no need to use 
interruptOrAbortWhenFail when removing a peer.
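
The difference between the two handlers can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the patch: `QueueOperation`, `QueueOperationException`, the `aborted` flag and the use of `IllegalStateException` are hypothetical stand-ins invented for the example; only the names `abortWhenFail` and `interruptOrAbortWhenFail` come from this thread. The idea shown is that a failure whose cause chain contains an `InterruptedException` (as in the stack trace quoted in the issue) stops the worker thread instead of aborting the whole region server.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class InterruptOrAbortSketch {

  /** Hypothetical checked failure raised by a replication queue operation. */
  public static class QueueOperationException extends Exception {
    public QueueOperationException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Hypothetical queue-storage operation, e.g. removing a WAL entry. */
  public interface QueueOperation {
    void exec() throws QueueOperationException;
  }

  /** Stand-in for "the region server was asked to abort". */
  public static final AtomicBoolean aborted = new AtomicBoolean(false);

  /** Walks the cause chain looking for an InterruptedException. */
  static boolean causedByInterrupt(Throwable failure) {
    for (Throwable t = failure; t != null; t = t.getCause()) {
      if (t instanceof InterruptedException) {
        return true;
      }
    }
    return false;
  }

  /** Old behaviour: every failure aborts, including a plain interrupt. */
  public static void abortWhenFail(QueueOperation op) {
    try {
      op.exec();
    } catch (QueueOperationException e) {
      aborted.set(true);
    }
  }

  /** Sketched behaviour: an interrupt only stops the caller; other failures abort. */
  public static void interruptOrAbortWhenFail(QueueOperation op) {
    try {
      op.exec();
    } catch (QueueOperationException e) {
      if (causedByInterrupt(e)) {
        // Assumption: an unchecked exception is enough to unwind the worker thread.
        throw new IllegalStateException(
            "Thread interrupted; the replication source may have been terminated", e);
      }
      aborted.set(true);
    }
  }

  public static void main(String[] args) {
    QueueOperation interruptedDelete = () -> {
      throw new QueueOperationException("delete failed", new InterruptedException());
    };
    try {
      interruptOrAbortWhenFail(interruptedDelete);
    } catch (IllegalStateException e) {
      System.out.println("worker stopped: " + e.getMessage());
    }
    System.out.println("aborted after interrupt: " + aborted.get());
    abortWhenFail(interruptedDelete);
    System.out.println("aborted with old handler: " + aborted.get());
  }
}
```

In this sketch the interrupt flag is not restored by the handler, on the assumption that the lower layer has already re-interrupted the thread, as the ZKWatcher log message quoted in the issue suggests.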


[jira] [Commented] (HBASE-20561) The way we stop a ReplicationSource may cause the RS down

2018-06-12 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510464#comment-16510464
 ] 

Duo Zhang commented on HBASE-20561:
---

Where do we still call the old abortWhenFail method? Is it better to change 
them all to interruptOrAbortWhenFail?


[jira] [Commented] (HBASE-20561) The way we stop a ReplicationSource may cause the RS down

2018-06-12 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509891#comment-16509891
 ] 

Hadoop QA commented on HBASE-20561:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 11s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 36s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 2s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 20s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 47s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 16s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 41s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 13s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 5s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 9s{color} | {color:red} hbase-server: The patch generated 1 new + 4 unchanged - 0 fixed = 5 total (was 4) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 49s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 9m 58s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 43s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 44s{color} | {color:green} hbase-zookeeper in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}121m 43s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 38s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}166m 28s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-20561 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12927467/HBASE-20561.master.004.patch |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux fb97ef742f8f 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| 

[jira] [Commented] (HBASE-20561) The way we stop a ReplicationSource may cause the RS down

2018-05-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480590#comment-16480590
 ] 

Hadoop QA commented on HBASE-20561:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 32s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 43s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  8m 24s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m 40s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  3m 19s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  6m  1s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 45s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 51s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 11s{color} | {color:green} The patch hbase-zookeeper passed checkstyle {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 12s{color} | {color:green} hbase-replication: The patch generated 0 new + 2 unchanged - 1 fixed = 2 total (was 3) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m  9s{color} | {color:red} hbase-server: The patch generated 1 new + 4 unchanged - 0 fixed = 5 total (was 4) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 48s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 14m 40s{color} | {color:green} Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m  3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 52s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 43s{color} | {color:green} hbase-zookeeper in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 22s{color} | {color:green} hbase-replication in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}102m 24s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 57s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}164m 45s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:d8b550f 

[jira] [Commented] (HBASE-20561) The way we stop a ReplicationSource may cause the RS down

2018-05-18 Thread Guanghao Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480418#comment-16480418
 ] 

Guanghao Zhang commented on HBASE-20561:


Added a 003 patch. We only need to handle the InterruptedException for 
logPositionAndCleanOldLogs, cleanOldLogs and cleanUpHFileRefs. If any other 
operation throws InterruptedException, terminating the source is not the 
cause, so those operations keep the old logic and abort the server directly.
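
The split described above can be sketched roughly as follows. This is only an illustration and not the code in the 003 patch: the class InterruptTolerantOps, the StorageOp interface, the Outcome enum and both helper methods are hypothetical names made up for the example. It assumes the InterruptedException may arrive wrapped as the cause of another exception.

```java
// Hypothetical sketch, not the HBase implementation. All names are illustrative.
class InterruptTolerantOps {
  /** A storage operation that may fail with any exception. */
  interface StorageOp {
    void run() throws Exception;
  }

  enum Outcome { OK, SOURCE_STOPPED, ABORT }

  /**
   * For the cleanup-style operations (assumption: the three listed above).
   * An interrupt anywhere in the cause chain is treated as "the source is
   * being stopped", so the caller does not abort the server.
   */
  static Outcome runInterruptTolerant(StorageOp op) {
    try {
      op.run();
      return Outcome.OK;
    } catch (Exception e) {
      return hasInterruptedCause(e) ? Outcome.SOURCE_STOPPED : Outcome.ABORT;
    }
  }

  /** For every other operation: old behaviour, any failure means abort. */
  static Outcome runAbortOnFailure(StorageOp op) {
    try {
      op.run();
      return Outcome.OK;
    } catch (Exception e) {
      return Outcome.ABORT;
    }
  }

  /** Walks the cause chain looking for an InterruptedException. */
  static boolean hasInterruptedCause(Throwable t) {
    for (Throwable c = t; c != null; c = c.getCause()) {
      if (c instanceof InterruptedException) {
        return true;
      }
    }
    return false;
  }
}
```

A caller would map SOURCE_STOPPED to "exit the shipper thread quietly" and ABORT to the existing abort path.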


[jira] [Commented] (HBASE-20561) The way we stop a ReplicationSource may cause the RS down

2018-05-17 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480097#comment-16480097
 ] 

Duo Zhang commented on HBASE-20561:
---

OK, I checked the code: the SystemErrorException is thrown by us in ZKWatcher. 
So shall we call initCause to preserve the InterruptedException?
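
As a rough illustration of the initCause idea (not the actual ZKWatcher code), the sketch below uses a hypothetical stand-in exception, WrappedSystemError, in place of the real SystemErrorException. It restores the interrupt flag, as the log message in the description says ZKWatcher does, and attaches the original InterruptedException as the cause so that callers can recognise it later.

```java
// Hypothetical stand-in for the exception that gets rethrown; not a real HBase/ZooKeeper class.
class WrappedSystemError extends Exception {
  WrappedSystemError(String message) {
    super(message);
  }
}

class InitCauseSketch {
  /** Restores the interrupt flag and wraps the interrupt, keeping it as the cause. */
  static WrappedSystemError wrap(InterruptedException ie) {
    Thread.currentThread().interrupt();
    WrappedSystemError e = new WrappedSystemError("interrupted during a storage call");
    // initCause may be called once, and only if no cause was set by the constructor.
    e.initCause(ie);
    return e;
  }

  /** Lets a caller tell an interrupt-induced failure from a genuine system error. */
  static boolean causedByInterrupt(Throwable t) {
    return t.getCause() instanceof InterruptedException;
  }
}
```

With the cause preserved, the caller no longer has to treat every such exception as an interrupt; it can check the cause and fall back to the abort path otherwise.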


[jira] [Commented] (HBASE-20561) The way we stop a ReplicationSource may cause the RS down

2018-05-17 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479013#comment-16479013
 ] 

Duo Zhang commented on HBASE-20561:
---

Much better now. My only concern: is it safe to convert every 
SystemErrorException to an InterruptedException?

Also, maybe we could add a graceful wait interval before interrupting the 
thread when stopping a ReplicationSource?
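
A graceful wait of this kind could look roughly like the sketch below. It is an assumption-laden illustration, not ReplicationSource code: the Worker class, requestStop and the stop helper are hypothetical, and the sleep stands in for one unit of shipping work. The grace period should be positive, because Thread.join(0) waits forever.

```java
// Hypothetical sketch of "ask first, interrupt only after a grace period".
class GracefulStopSketch {
  /** Worker that checks a stop flag between units of work. */
  static class Worker extends Thread {
    private volatile boolean running = true;
    volatile boolean wasInterrupted = false;

    void requestStop() {
      running = false;
    }

    @Override
    public void run() {
      while (running) {
        try {
          Thread.sleep(10); // stands in for one unit of work
        } catch (InterruptedException e) {
          wasInterrupted = true;
          return;
        }
      }
    }
  }

  /**
   * Returns true if the worker exited on its own within graceMillis.
   * Otherwise interrupts it, waits for it to finish and returns false.
   * Also returns false if the calling thread itself is interrupted while waiting.
   */
  static boolean stop(Worker worker, long graceMillis) {
    worker.requestStop();
    try {
      worker.join(graceMillis);
      if (!worker.isAlive()) {
        return true;
      }
      worker.interrupt();
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // keep the caller's interrupt visible
    }
    return false;
  }
}
```

The point of the grace period is that a worker in the middle of a storage call gets a chance to finish it, so the interrupt (and the exception handling it triggers) becomes the exception rather than the rule.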


[jira] [Commented] (HBASE-20561) The way we stop a ReplicationSource may cause the RS down

2018-05-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478904#comment-16478904
 ] 

Hadoop QA commented on HBASE-20561:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 15s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  1s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 12s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 41s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m  0s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 25s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 56s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 36s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 42s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m  1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m  1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 11s{color} | {color:green} hbase-replication: The patch generated 0 new + 2 unchanged - 1 fixed = 2 total (was 3) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m 11s{color} | {color:red} hbase-server: The patch generated 1 new + 4 unchanged - 0 fixed = 5 total (was 4) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 50s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 14m 42s{color} | {color:green} Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 42s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 21s{color} | {color:green} hbase-replication in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}106m 51s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 38s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}156m 35s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:d8b550f |
| JIRA Issue | HBASE-20561 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12923878/HBASE-20561.master.002.patch |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 749921cc6653 

[jira] [Commented] (HBASE-20561) The way we stop a ReplicationSource may cause the RS down

2018-05-16 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478375#comment-16478375
 ] 

Duo Zhang commented on HBASE-20561:
---

{quote}
Checked code. ZK impl will restore the flag.
{quote}

But we may have other implementations? I mean that depending on this flag is 
not reliable...
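
To make the concern concrete, here is a small plain-Java sketch with no HBase or ZooKeeper code involved; the class and method names are made up for the example. When a blocking call throws InterruptedException, the thread's interrupt flag is cleared, so a caller that checks the flag afterwards only sees the interrupt if the layer that caught the exception restored it. An implementation that swallows the exception leaves the caller with nothing to check.

```java
// Hypothetical sketch showing why checking the interrupt flag depends on the callee.
class InterruptFlagSketch {
  /** An implementation that catches the interrupt and does not restore the flag. */
  static boolean flagSeenAfterSwallow() {
    Thread.currentThread().interrupt();
    try {
      Thread.sleep(1000); // throws at once because the flag is set, and clears it
    } catch (InterruptedException e) {
      // swallowed: the interrupt is lost for everyone up the stack
    }
    return Thread.interrupted(); // reads and clears the flag
  }

  /** An implementation that restores the flag before returning. */
  static boolean flagSeenAfterRestore() {
    Thread.currentThread().interrupt();
    try {
      Thread.sleep(1000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // put the flag back for the caller
    }
    return Thread.interrupted(); // reads and clears the flag
  }
}
```

Only the second variant lets the caller observe the interrupt through the flag, which is why carrying the InterruptedException explicitly (for example as an exception cause) is less fragile than relying on each storage implementation to restore the flag.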


[jira] [Commented] (HBASE-20561) The way we stop a ReplicationSource may cause the RS down

2018-05-16 Thread Guanghao Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478368#comment-16478368
 ] 

Guanghao Zhang commented on HBASE-20561:


{quote}Check KeeperException in ReplicationSourceManager is a bit strange
{quote}
Maybe throw a special ReplicationException from the storage layer?
{quote}Not sure whether the zookeeper implementation will restore the 
interrupted flag..
{quote}
I checked the code; the ZK implementation does restore the flag.
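
A minimal sketch of the "special ReplicationException" idea, under stated assumptions. The `ReplicationException` class here is only a stand-in so the snippet compiles on its own, and `InterruptedReplicationException`, `BackendCall`, `StorageSketch` and `ManagerSketch` are hypothetical names, not HBase classes. The `abortWhenFail` method borrows its name from the stack trace in the issue description but is not the real implementation. The point is only that the manager can tell "the source is being stopped" apart from "the storage really failed" without seeing a KeeperException.

```java
// Stand-in for HBase's ReplicationException so that this sketch compiles alone.
class ReplicationException extends Exception {
  ReplicationException(String msg, Throwable cause) {
    super(msg, cause);
  }
}

// Hypothetical subclass: the storage call was interrupted, the storage did not fail.
class InterruptedReplicationException extends ReplicationException {
  InterruptedReplicationException(String msg, Throwable cause) {
    super(msg, cause);
  }
}

// Hypothetical stand-in for one backend (for example ZooKeeper) operation.
interface BackendCall {
  void run() throws Exception;
}

// Hypothetical storage layer: hides the backend exception types from callers.
final class StorageSketch {
  static void call(BackendCall op) throws ReplicationException {
    try {
      op.run();
    } catch (InterruptedException e) {
      // Re-assert the flag so that callers further up can still observe it.
      Thread.currentThread().interrupt();
      throw new InterruptedReplicationException("storage call interrupted", e);
    } catch (Exception e) {
      throw new ReplicationException("storage call failed", e);
    }
  }
}

// Hypothetical manager: aborts only on a real storage failure.
final class ManagerSketch {
  boolean aborted = false;

  void abortWhenFail(BackendCall op) {
    try {
      StorageSketch.call(op);
    } catch (InterruptedReplicationException e) {
      // The source is being stopped; this is not a reason to bring the RS down.
    } catch (ReplicationException e) {
      aborted = true; // stands in for aborting the region server
    }
  }
}
```

With this shape, ReplicationSourceManager would only ever need to know about ReplicationException and its subclass, which keeps the storage interface opaque.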


[jira] [Commented] (HBASE-20561) The way we stop a ReplicationSource may cause the RS down

2018-05-16 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478323#comment-16478323
 ] 

Duo Zhang commented on HBASE-20561:
---

Checking for KeeperException in ReplicationSourceManager is a bit strange: we 
have a storage interface layer that hides the implementation details, which is 
why we use ReplicationException instead of KeeperException.

Also, is it safe to rely on Thread.isInterrupted? I am not sure whether the 
ZooKeeper implementation restores the interrupted flag...
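
To illustrate why the question about the interrupted flag matters, here is a small self-contained sketch. It uses only the plain JDK; `InterruptFlagDemo` and its methods are hypothetical names for illustration and have nothing to do with the HBase or ZooKeeper code. A blocking call that throws InterruptedException clears the thread's flag, so a later Thread.isInterrupted check only sees the interrupt if the layer that caught the exception called Thread.currentThread().interrupt() again.

```java
final class InterruptFlagDemo {

  /** Flag as seen by the caller when the catching layer swallows the interrupt. */
  static boolean flagAfterSwallow() {
    return runInterrupted(false);
  }

  /** Flag as seen by the caller when the catching layer restores the interrupt. */
  static boolean flagAfterRestore() {
    return runInterrupted(true);
  }

  private static boolean runInterrupted(boolean restore) {
    boolean[] observed = new boolean[1];
    Thread worker = new Thread(() -> {
      try {
        Thread.sleep(60_000); // stands in for a blocking storage call
      } catch (InterruptedException e) {
        // Throwing InterruptedException cleared the flag; optionally set it again.
        if (restore) {
          Thread.currentThread().interrupt();
        }
      }
      // This is what a later isInterrupted() check in the same thread would see.
      observed[0] = Thread.currentThread().isInterrupted();
    });
    worker.start();
    worker.interrupt();
    try {
      worker.join(); // join() makes the write to observed[0] visible here
    } catch (InterruptedException e) {
      throw new IllegalStateException("demo thread itself was interrupted", e);
    }
    return observed[0];
  }

  public static void main(String[] args) {
    System.out.println("swallowed=" + flagAfterSwallow()
        + " restored=" + flagAfterRestore());
  }
}
```

So a Thread.isInterrupted check in ReplicationSourceManager is only as reliable as the restore behaviour of every layer underneath it.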


[jira] [Commented] (HBASE-20561) The way we stop a ReplicationSource may cause the RS down

2018-05-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477488#comment-16477488
 ] 

Hadoop QA commented on HBASE-20561:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 14s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 21s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m  9s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 59s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  7m 30s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 30s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 17s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  8m 32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m  3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m  3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m  0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  7m 36s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 20m 52s{color} | {color:green} Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m  7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 42s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}133m 36s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 20s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}203m 45s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:d8b550f |
| JIRA Issue | HBASE-20561 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12923650/HBASE-20561.master.001.patch |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 791bb4539274 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / ab53329cb3 |
| maven | version: Apache Maven 3.5.3 (3383c37e1f9e9b3bc3df5050c29c8aff9f295297; 2018-02-24T19:49:05Z) |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC3 |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/12837/testReport/ |
| Max. process+thread count | 4099 (vs. ulimit of 1) |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/12837/console |
| Powered