[jira] [Commented] (HBASE-18111) Replication stuck when cluster connection is closed

2017-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034088#comment-16034088
 ] 

Hudson commented on HBASE-18111:


FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #3119 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/3119/])
HBASE-18111 Replication stuck when cluster connection is closed (apurtell: rev 
549171465db83600dbdf6d0b5af01c3ad30dc550)
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/HBaseInterClusterReplicationEndpoint.java


> Replication stuck when cluster connection is closed
> ---
>
> Key: HBASE-18111
> URL: https://issues.apache.org/jira/browse/HBASE-18111
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.0.0, 1.4.0, 1.3.1, 1.2.5, 0.98.24, 1.1.10
>Reporter: Guanghao Zhang
>Assignee: Guanghao Zhang
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-18111.patch, HBASE-18111-v1.patch, 
> HBASE-18111-v2.patch
>
>
> Log:
> {code}
> 2017-05-24,03:01:25,603 ERROR [regionserver13700-SendThread(hostxxx:11000)] 
> org.apache.zookeeper.ClientCnxn: SASL authentication with Zookeeper Quorum 
> member failed: javax.security.sasl.SaslException: An error: 
> (java.security.PrivilegedActionException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Connection reset)]) occurred when evaluating Zookeeper 
> Quorum Member's  received SASL token. Zookeeper Client will go to AUTH_FAILED 
> state.
> 2017-05-24,03:01:25,615 FATAL [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: 
> hconnection-0x1148dd9b-0x35b6b4d4ca999c6, 
> quorum=10.108.37.30:11000,10.108.38.30:11000,10.108.39.30:11000,10.108.84.25:11000,10.108.84.32:11000,
>  baseZNode=/hbase/c3prc-xiaomi98 hconnection-0x1148dd9b-0x35b6b4d4ca999c6 
> received auth failed from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = 
> AuthFailed
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:425)
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:333)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2017-05-24,03:01:25,615 INFO [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: Closing zookeeper 
> sessionid=0x35b6b4d4ca999c6
> 2017-05-24,03:01:25,623 WARN [regionserver13700.replicationSource,800] 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint:
>  Replicate edites to peer cluster failed.
> java.io.IOException: Call to hostxxx/10.136.22.6:24600 failed on local 
> exception: java.io.IOException: Connection closed
> {code}
> jstack
> {code}
>  java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.sleepForRetries(HBaseInterClusterReplicationEndpoint.java:127)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:199)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:905)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:492)
> {code}
> The cluster connection was aborted when the ZookeeperWatcher receive a 
> AuthFailed event. Then the HBaseInterClusterReplicationEndpoint's replicate() 
> method will stuck in a while loop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-18111) Replication stuck when cluster connection is closed

2017-06-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034047#comment-16034047
 ] 

Hudson commented on HBASE-18111:


SUCCESS: Integrated in Jenkins build HBase-1.4 #756 (See 
[https://builds.apache.org/job/HBase-1.4/756/])
HBASE-18111 Replication stuck when cluster connection is closed (apurtell: rev 
b66a478e73b59e0335de89038475409d916ea164)
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/HBaseInterClusterReplicationEndpoint.java


> Replication stuck when cluster connection is closed
> ---
>
> Key: HBASE-18111
> URL: https://issues.apache.org/jira/browse/HBASE-18111
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.0.0, 1.4.0, 1.3.1, 1.2.5, 0.98.24, 1.1.10
>Reporter: Guanghao Zhang
>Assignee: Guanghao Zhang
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-18111.patch, HBASE-18111-v1.patch, 
> HBASE-18111-v2.patch
>
>
> Log:
> {code}
> 2017-05-24,03:01:25,603 ERROR [regionserver13700-SendThread(hostxxx:11000)] 
> org.apache.zookeeper.ClientCnxn: SASL authentication with Zookeeper Quorum 
> member failed: javax.security.sasl.SaslException: An error: 
> (java.security.PrivilegedActionException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Connection reset)]) occurred when evaluating Zookeeper 
> Quorum Member's  received SASL token. Zookeeper Client will go to AUTH_FAILED 
> state.
> 2017-05-24,03:01:25,615 FATAL [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: 
> hconnection-0x1148dd9b-0x35b6b4d4ca999c6, 
> quorum=10.108.37.30:11000,10.108.38.30:11000,10.108.39.30:11000,10.108.84.25:11000,10.108.84.32:11000,
>  baseZNode=/hbase/c3prc-xiaomi98 hconnection-0x1148dd9b-0x35b6b4d4ca999c6 
> received auth failed from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = 
> AuthFailed
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:425)
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:333)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2017-05-24,03:01:25,615 INFO [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: Closing zookeeper 
> sessionid=0x35b6b4d4ca999c6
> 2017-05-24,03:01:25,623 WARN [regionserver13700.replicationSource,800] 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint:
>  Replicate edites to peer cluster failed.
> java.io.IOException: Call to hostxxx/10.136.22.6:24600 failed on local 
> exception: java.io.IOException: Connection closed
> {code}
> jstack
> {code}
>  java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.sleepForRetries(HBaseInterClusterReplicationEndpoint.java:127)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:199)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:905)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:492)
> {code}
> The cluster connection was aborted when the ZookeeperWatcher receive a 
> AuthFailed event. Then the HBaseInterClusterReplicationEndpoint's replicate() 
> method will stuck in a while loop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-18111) Replication stuck when cluster connection is closed

2017-06-01 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1602#comment-1602
 ] 

Andrew Purtell commented on HBASE-18111:


Failure is unrelated. Unless objection I will commit the v2 patch to master and 
branch-1 shortly. 

> Replication stuck when cluster connection is closed
> ---
>
> Key: HBASE-18111
> URL: https://issues.apache.org/jira/browse/HBASE-18111
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.0.0, 1.4.0, 1.3.1, 1.2.5, 0.98.24, 1.1.10
>Reporter: Guanghao Zhang
>Assignee: Guanghao Zhang
> Attachments: HBASE-18111.patch, HBASE-18111-v1.patch, 
> HBASE-18111-v2.patch
>
>
> Log:
> {code}
> 2017-05-24,03:01:25,603 ERROR [regionserver13700-SendThread(hostxxx:11000)] 
> org.apache.zookeeper.ClientCnxn: SASL authentication with Zookeeper Quorum 
> member failed: javax.security.sasl.SaslException: An error: 
> (java.security.PrivilegedActionException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Connection reset)]) occurred when evaluating Zookeeper 
> Quorum Member's  received SASL token. Zookeeper Client will go to AUTH_FAILED 
> state.
> 2017-05-24,03:01:25,615 FATAL [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: 
> hconnection-0x1148dd9b-0x35b6b4d4ca999c6, 
> quorum=10.108.37.30:11000,10.108.38.30:11000,10.108.39.30:11000,10.108.84.25:11000,10.108.84.32:11000,
>  baseZNode=/hbase/c3prc-xiaomi98 hconnection-0x1148dd9b-0x35b6b4d4ca999c6 
> received auth failed from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = 
> AuthFailed
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:425)
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:333)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2017-05-24,03:01:25,615 INFO [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: Closing zookeeper 
> sessionid=0x35b6b4d4ca999c6
> 2017-05-24,03:01:25,623 WARN [regionserver13700.replicationSource,800] 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint:
>  Replicate edites to peer cluster failed.
> java.io.IOException: Call to hostxxx/10.136.22.6:24600 failed on local 
> exception: java.io.IOException: Connection closed
> {code}
> jstack
> {code}
>  java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.sleepForRetries(HBaseInterClusterReplicationEndpoint.java:127)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:199)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:905)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:492)
> {code}
> The cluster connection was aborted when the ZookeeperWatcher receive a 
> AuthFailed event. Then the HBaseInterClusterReplicationEndpoint's replicate() 
> method will stuck in a while loop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-18111) Replication stuck when cluster connection is closed

2017-06-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032546#comment-16032546
 ] 

Hadoop QA commented on HBASE-18111:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 25s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 
0s {color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
41s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 22s 
{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
30s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
26s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
15s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 53s 
{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
31s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 18s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 18s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
26s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
26s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
58m 44s {color} | {color:green} Patch does not cause any errors with Hadoop 
2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha2. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
35s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 54s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 209m 11s 
{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
30s {color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 294m 39s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hbase.master.procedure.TestMasterProcedureWalLease |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.03.0-ce Server=17.03.0-ce Image:yetus/hbase:757bf37 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12870710/HBASE-18111-v2.patch |
| JIRA Issue | HBASE-18111 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  hadoopcheck  
hbaseanti  checkstyle  compile  |
| uname | Linux 85aed907a4c7 4.8.3-std-1 #1 SMP Fri Oct 21 11:15:43 UTC 2016 
x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / db8ce05 |
| Default Java | 1.8.0_131 |
| findbugs | v3.0.0 |
| unit | 
https://builds.apache.org/job/PreCommit-HBASE-Build/7031/artifact/patchprocess/patch-unit-hbase-server.txt
 |
| unit test logs |  
https://builds.apache.org/job/PreCommit-HBASE-Build/7031/artifact/patchprocess/patch-unit-hbase-server.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/7031/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/7031/console |
| Powered by | Apache Yetus 0.3.0   http://yetus.apache.org |


This message was automatically generated.



> Replication 

[jira] [Commented] (HBASE-18111) Replication stuck when cluster connection is closed

2017-05-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024313#comment-16024313
 ] 

Hadoop QA commented on HBASE-18111:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 27s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 
0s {color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
22s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 32s 
{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
30s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
30s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
38s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 0s 
{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
43s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 33s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 33s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
30s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
30s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
63m 26s {color} | {color:green} Patch does not cause any errors with Hadoop 
2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha2. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
57s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 3s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 211m 15s 
{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 1m 
9s {color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 304m 45s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.regionserver.TestRegionReplicaFailover |
|   | hadoop.hbase.client.TestAsyncBalancerAdminApi |
|   | hadoop.hbase.client.TestAsyncTableAdminApi |
|   | hadoop.hbase.regionserver.TestPerColumnFamilyFlush |
|   | hadoop.hbase.security.access.TestCoprocessorWhitelistMasterObserver |
|   | hadoop.hbase.regionserver.TestCorruptedRegionStoreFile |
|   | hadoop.hbase.client.TestAsyncSnapshotAdminApi |
| Timed out junit tests | org.apache.hadoop.hbase.client.TestFromClientSide |
|   | org.apache.hadoop.hbase.replication.regionserver.TestWALEntryStream |
|   | org.apache.hadoop.hbase.snapshot.TestMobExportSnapshot |
|   | org.apache.hadoop.hbase.TestIOFencing |
|   | org.apache.hadoop.hbase.filter.TestFuzzyRowFilterEndToEnd |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.03.0-ce Server=17.03.0-ce Image:yetus/hbase:757bf37 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12869774/HBASE-18111.patch |
| JIRA Issue | HBASE-18111 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  hadoopcheck  
hbaseanti  checkstyle  compile  |
| uname | Linux c7011ddf95c4 4.8.3-std-1 #1 SMP Fri Oct 21 11:15:43 UTC 2016 
x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh
 |
| git 

[jira] [Commented] (HBASE-18111) Replication stuck when cluster connection is closed

2017-05-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024265#comment-16024265
 ] 

Hadoop QA commented on HBASE-18111:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 
0s {color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
44s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 45s 
{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
55s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
16s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 
11s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s 
{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
51s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 44s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 44s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
0s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
18s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
35m 56s {color} | {color:green} Patch does not cause any errors with Hadoop 
2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha2. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 
16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 123m 9s 
{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
22s {color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 174m 4s {color} 
| {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:757bf37 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12869791/HBASE-18111-v1.patch |
| JIRA Issue | HBASE-18111 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  hadoopcheck  
hbaseanti  checkstyle  compile  |
| uname | Linux 1f4189d9d510 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 
15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / 8b75e9e |
| Default Java | 1.8.0_131 |
| findbugs | v3.0.0 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/6937/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/6937/console |
| Powered by | Apache Yetus 0.3.0   http://yetus.apache.org |


This message was automatically generated.



> Replication stuck when cluster connection is closed
> ---
>
> Key: HBASE-18111
> URL: https://issues.apache.org/jira/browse/HBASE-18111
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.0.0, 1.4.0, 1.3.1, 1.2.5, 0.98.24, 1.1.10
>Reporter: 

[jira] [Commented] (HBASE-18111) Replication stuck when cluster connection is closed

2017-05-24 Thread Guanghao Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024129#comment-16024129
 ] 

Guanghao Zhang commented on HBASE-18111:


bq. Have you taken look at ZOOKEEPER-2785?
It maybe a reason. HBaseInterClusterReplicationEndpoint is a hbase client and 
write entries to peer cluster. We should handle the connection close case no 
matter what reason lead it? And now the replication will stuck in the while 
loop. You have to restart the RS and let another RS help to replicate the 
log..

> Replication stuck when cluster connection is closed
> ---
>
> Key: HBASE-18111
> URL: https://issues.apache.org/jira/browse/HBASE-18111
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.0.0, 1.4.0, 1.3.1, 1.2.5, 0.98.24, 1.1.10
>Reporter: Guanghao Zhang
>Assignee: Guanghao Zhang
> Attachments: HBASE-18111.patch
>
>
> Log:
> {code}
> 2017-05-24,03:01:25,603 ERROR [regionserver13700-SendThread(hostxxx:11000)] 
> org.apache.zookeeper.ClientCnxn: SASL authentication with Zookeeper Quorum 
> member failed: javax.security.sasl.SaslException: An error: 
> (java.security.PrivilegedActionException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Connection reset)]) occurred when evaluating Zookeeper 
> Quorum Member's  received SASL token. Zookeeper Client will go to AUTH_FAILED 
> state.
> 2017-05-24,03:01:25,615 FATAL [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: 
> hconnection-0x1148dd9b-0x35b6b4d4ca999c6, 
> quorum=10.108.37.30:11000,10.108.38.30:11000,10.108.39.30:11000,10.108.84.25:11000,10.108.84.32:11000,
>  baseZNode=/hbase/c3prc-xiaomi98 hconnection-0x1148dd9b-0x35b6b4d4ca999c6 
> received auth failed from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = 
> AuthFailed
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:425)
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:333)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2017-05-24,03:01:25,615 INFO [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: Closing zookeeper 
> sessionid=0x35b6b4d4ca999c6
> 2017-05-24,03:01:25,623 WARN [regionserver13700.replicationSource,800] 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint:
>  Replicate edites to peer cluster failed.
> java.io.IOException: Call to hostxxx/10.136.22.6:24600 failed on local 
> exception: java.io.IOException: Connection closed
> {code}
> jstack
> {code}
>  java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.sleepForRetries(HBaseInterClusterReplicationEndpoint.java:127)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:199)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:905)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:492)
> {code}
> The cluster connection was aborted when the ZookeeperWatcher receive a 
> AuthFailed event. Then the HBaseInterClusterReplicationEndpoint's replicate() 
> method will stuck in a while loop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-18111) Replication stuck when cluster connection is closed

2017-05-24 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024100#comment-16024100
 ] 

Ted Yu commented on HBASE-18111:


Have you taken look at ZOOKEEPER-2785 ?

It was fixed recently.

> Replication stuck when cluster connection is closed
> ---
>
> Key: HBASE-18111
> URL: https://issues.apache.org/jira/browse/HBASE-18111
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.0.0, 1.4.0, 1.3.1, 1.2.5, 0.98.24, 1.1.10
>Reporter: Guanghao Zhang
>Assignee: Guanghao Zhang
> Attachments: HBASE-18111.patch
>
>
> Log:
> {code}
> 2017-05-24,03:01:25,603 ERROR [regionserver13700-SendThread(hostxxx:11000)] 
> org.apache.zookeeper.ClientCnxn: SASL authentication with Zookeeper Quorum 
> member failed: javax.security.sasl.SaslException: An error: 
> (java.security.PrivilegedActionException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Connection reset)]) occurred when evaluating Zookeeper 
> Quorum Member's  received SASL token. Zookeeper Client will go to AUTH_FAILED 
> state.
> 2017-05-24,03:01:25,615 FATAL [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: 
> hconnection-0x1148dd9b-0x35b6b4d4ca999c6, 
> quorum=10.108.37.30:11000,10.108.38.30:11000,10.108.39.30:11000,10.108.84.25:11000,10.108.84.32:11000,
>  baseZNode=/hbase/c3prc-xiaomi98 hconnection-0x1148dd9b-0x35b6b4d4ca999c6 
> received auth failed from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = 
> AuthFailed
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:425)
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:333)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2017-05-24,03:01:25,615 INFO [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: Closing zookeeper 
> sessionid=0x35b6b4d4ca999c6
> 2017-05-24,03:01:25,623 WARN [regionserver13700.replicationSource,800] 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint:
>  Replicate edites to peer cluster failed.
> java.io.IOException: Call to hostxxx/10.136.22.6:24600 failed on local 
> exception: java.io.IOException: Connection closed
> {code}
> jstack
> {code}
>  java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.sleepForRetries(HBaseInterClusterReplicationEndpoint.java:127)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:199)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:905)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:492)
> {code}
> The cluster connection was aborted when the ZookeeperWatcher receive a 
> AuthFailed event. Then the HBaseInterClusterReplicationEndpoint's replicate() 
> method will stuck in a while loop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-18111) Replication stuck when cluster connection is closed

2017-05-24 Thread Guanghao Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024096#comment-16024096
 ] 

Guanghao Zhang commented on HBASE-18111:


bq.  but there is no need to close null connection.
Got it. Thanks.
bq. is there some action necessary for the reconnection to succeed ?
You mean kerberos relogin? In our use case, it happened when adjust network. I 
didn't know the real reason about the AuthFailed. But the unstable network may 
lead to it. There are no problem about our regionserver and only replication 
stuck. So I thought the reconnect will work for this case.

> Replication stuck when cluster connection is closed
> ---
>
> Key: HBASE-18111
> URL: https://issues.apache.org/jira/browse/HBASE-18111
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.0.0, 1.4.0, 1.3.1, 1.2.5, 0.98.24, 1.1.10
>Reporter: Guanghao Zhang
>Assignee: Guanghao Zhang
> Attachments: HBASE-18111.patch
>
>
> Log:
> {code}
> 2017-05-24,03:01:25,603 ERROR [regionserver13700-SendThread(hostxxx:11000)] 
> org.apache.zookeeper.ClientCnxn: SASL authentication with Zookeeper Quorum 
> member failed: javax.security.sasl.SaslException: An error: 
> (java.security.PrivilegedActionException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Connection reset)]) occurred when evaluating Zookeeper 
> Quorum Member's  received SASL token. Zookeeper Client will go to AUTH_FAILED 
> state.
> 2017-05-24,03:01:25,615 FATAL [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: 
> hconnection-0x1148dd9b-0x35b6b4d4ca999c6, 
> quorum=10.108.37.30:11000,10.108.38.30:11000,10.108.39.30:11000,10.108.84.25:11000,10.108.84.32:11000,
>  baseZNode=/hbase/c3prc-xiaomi98 hconnection-0x1148dd9b-0x35b6b4d4ca999c6 
> received auth failed from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = 
> AuthFailed
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:425)
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:333)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2017-05-24,03:01:25,615 INFO [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: Closing zookeeper 
> sessionid=0x35b6b4d4ca999c6
> 2017-05-24,03:01:25,623 WARN [regionserver13700.replicationSource,800] 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint:
>  Replicate edites to peer cluster failed.
> java.io.IOException: Call to hostxxx/10.136.22.6:24600 failed on local 
> exception: java.io.IOException: Connection closed
> {code}
> jstack
> {code}
>  java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.sleepForRetries(HBaseInterClusterReplicationEndpoint.java:127)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:199)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:905)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:492)
> {code}
> The cluster connection was aborted when the ZookeeperWatcher receive a 
> AuthFailed event. Then the HBaseInterClusterReplicationEndpoint's replicate() 
> method will stuck in a while loop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-18111) Replication stuck when cluster connection is closed

2017-05-24 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024074#comment-16024074
 ] 

Ted Yu commented on HBASE-18111:


bq. The IOUtils.closeQuietly will handle the null case.

I know that - but there is no need to close null connection.

For reconnection, after AuthFailed event, is there some action necessary for 
the reconnection to succeed ?

Thanks

> Replication stuck when cluster connection is closed
> ---
>
> Key: HBASE-18111
> URL: https://issues.apache.org/jira/browse/HBASE-18111
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.0.0, 1.4.0, 1.3.1, 1.2.5, 0.98.24, 1.1.10
>Reporter: Guanghao Zhang
>Assignee: Guanghao Zhang
> Attachments: HBASE-18111.patch
>
>
> Log:
> {code}
> 2017-05-24,03:01:25,603 ERROR [regionserver13700-SendThread(hostxxx:11000)] 
> org.apache.zookeeper.ClientCnxn: SASL authentication with Zookeeper Quorum 
> member failed: javax.security.sasl.SaslException: An error: 
> (java.security.PrivilegedActionException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Connection reset)]) occurred when evaluating Zookeeper 
> Quorum Member's  received SASL token. Zookeeper Client will go to AUTH_FAILED 
> state.
> 2017-05-24,03:01:25,615 FATAL [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: 
> hconnection-0x1148dd9b-0x35b6b4d4ca999c6, 
> quorum=10.108.37.30:11000,10.108.38.30:11000,10.108.39.30:11000,10.108.84.25:11000,10.108.84.32:11000,
>  baseZNode=/hbase/c3prc-xiaomi98 hconnection-0x1148dd9b-0x35b6b4d4ca999c6 
> received auth failed from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = 
> AuthFailed
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:425)
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:333)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2017-05-24,03:01:25,615 INFO [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: Closing zookeeper 
> sessionid=0x35b6b4d4ca999c6
> 2017-05-24,03:01:25,623 WARN [regionserver13700.replicationSource,800] 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint:
>  Replicate edites to peer cluster failed.
> java.io.IOException: Call to hostxxx/10.136.22.6:24600 failed on local 
> exception: java.io.IOException: Connection closed
> {code}
> jstack
> {code}
>  java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.sleepForRetries(HBaseInterClusterReplicationEndpoint.java:127)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:199)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:905)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:492)
> {code}
> The cluster connection was aborted when the ZookeeperWatcher receive a 
> AuthFailed event. Then the HBaseInterClusterReplicationEndpoint's replicate() 
> method will stuck in a while loop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-18111) Replication stuck when cluster connection is closed

2017-05-24 Thread Guanghao Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024056#comment-16024056
 ] 

Guanghao Zhang commented on HBASE-18111:


bq. When IOE is encountered, connection would be null.
The IOUtils.closeQuietly will handle the null case.
bq. How does the reconnect method tell caller whether reconnection is 
successful ?
There are a while loop outer layer. So if reconnect failed, it will replicate 
failed and continue to reconnect. So reconnectToPeerCluster() method doesn't 
have to reconnect succeed. Thanks.

> Replication stuck when cluster connection is closed
> ---
>
> Key: HBASE-18111
> URL: https://issues.apache.org/jira/browse/HBASE-18111
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.0.0, 1.4.0, 1.3.1, 1.2.5, 0.98.24, 1.1.10
>Reporter: Guanghao Zhang
>Assignee: Guanghao Zhang
> Attachments: HBASE-18111.patch
>
>
> Log:
> {code}
> 2017-05-24,03:01:25,603 ERROR [regionserver13700-SendThread(hostxxx:11000)] 
> org.apache.zookeeper.ClientCnxn: SASL authentication with Zookeeper Quorum 
> member failed: javax.security.sasl.SaslException: An error: 
> (java.security.PrivilegedActionException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Connection reset)]) occurred when evaluating Zookeeper 
> Quorum Member's  received SASL token. Zookeeper Client will go to AUTH_FAILED 
> state.
> 2017-05-24,03:01:25,615 FATAL [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: 
> hconnection-0x1148dd9b-0x35b6b4d4ca999c6, 
> quorum=10.108.37.30:11000,10.108.38.30:11000,10.108.39.30:11000,10.108.84.25:11000,10.108.84.32:11000,
>  baseZNode=/hbase/c3prc-xiaomi98 hconnection-0x1148dd9b-0x35b6b4d4ca999c6 
> received auth failed from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = 
> AuthFailed
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:425)
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:333)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2017-05-24,03:01:25,615 INFO [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: Closing zookeeper 
> sessionid=0x35b6b4d4ca999c6
> 2017-05-24,03:01:25,623 WARN [regionserver13700.replicationSource,800] 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint:
>  Replicate edites to peer cluster failed.
> java.io.IOException: Call to hostxxx/10.136.22.6:24600 failed on local 
> exception: java.io.IOException: Connection closed
> {code}
> jstack
> {code}
>  java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.sleepForRetries(HBaseInterClusterReplicationEndpoint.java:127)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:199)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:905)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:492)
> {code}
> The cluster connection was aborted when the ZookeeperWatcher receive a 
> AuthFailed event. Then the HBaseInterClusterReplicationEndpoint's replicate() 
> method will stuck in a while loop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HBASE-18111) Replication stuck when cluster connection is closed

2017-05-24 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024038#comment-16024038
 ] 

Ted Yu commented on HBASE-18111:


{code}
194   IOUtils.closeQuietly(connection);
{code}
When IOE is encountered, connection would be null. Why is the above needed ?

How does the reconnect method tell caller whether reconnection is successful ?

> Replication stuck when cluster connection is closed
> ---
>
> Key: HBASE-18111
> URL: https://issues.apache.org/jira/browse/HBASE-18111
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.0.0, 1.4.0, 1.3.1, 1.2.5, 0.98.24, 1.1.10
>Reporter: Guanghao Zhang
>Assignee: Guanghao Zhang
> Attachments: HBASE-18111.patch
>
>
> Log:
> {code}
> 2017-05-24,03:01:25,603 ERROR [regionserver13700-SendThread(hostxxx:11000)] 
> org.apache.zookeeper.ClientCnxn: SASL authentication with Zookeeper Quorum 
> member failed: javax.security.sasl.SaslException: An error: 
> (java.security.PrivilegedActionException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Connection reset)]) occurred when evaluating Zookeeper 
> Quorum Member's  received SASL token. Zookeeper Client will go to AUTH_FAILED 
> state.
> 2017-05-24,03:01:25,615 FATAL [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: 
> hconnection-0x1148dd9b-0x35b6b4d4ca999c6, 
> quorum=10.108.37.30:11000,10.108.38.30:11000,10.108.39.30:11000,10.108.84.25:11000,10.108.84.32:11000,
>  baseZNode=/hbase/c3prc-xiaomi98 hconnection-0x1148dd9b-0x35b6b4d4ca999c6 
> received auth failed from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = 
> AuthFailed
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:425)
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:333)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2017-05-24,03:01:25,615 INFO [regionserver13700-EventThread] 
> org.apache.hadoop.hbase.client.HConnectionImplementation: Closing zookeeper 
> sessionid=0x35b6b4d4ca999c6
> 2017-05-24,03:01:25,623 WARN [regionserver13700.replicationSource,800] 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint:
>  Replicate edites to peer cluster failed.
> java.io.IOException: Call to hostxxx/10.136.22.6:24600 failed on local 
> exception: java.io.IOException: Connection closed
> {code}
> jstack
> {code}
>  java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.sleepForRetries(HBaseInterClusterReplicationEndpoint.java:127)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:199)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:905)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:492)
> {code}
> The cluster connection was aborted when the ZookeeperWatcher receive a 
> AuthFailed event. Then the HBaseInterClusterReplicationEndpoint's replicate() 
> method will stuck in a while loop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)