[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902094#comment-15902094 ]

Hudson commented on HBASE-17718:

FAILURE: Integrated in Jenkins build HBase-1.4 #661 (See [https://builds.apache.org/job/HBase-1.4/661/])
HBASE-17718 Difference between RS's servername and its ephemeral node (stack: rev ca5b8a44a490143536625502d5d88918f338f562)
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenInitializing.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestServerCrashProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerListener.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/zookeeper/DrainingServerTracker.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestMasterProcedureEvents.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentListener.java

> Difference between RS's servername and its ephemeral node cause SSH stop working
>
> Key: HBASE-17718
> URL: https://issues.apache.org/jira/browse/HBASE-17718
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0, 1.2.4, 1.1.8
> Reporter: Allan Yang
> Assignee: Allan Yang
> Fix For: 2.0.0, 1.4.0
>
> Attachments: 0001-HBASE-17718-amendment.patch, HBASE-17718.branch-1.001.patch, HBASE-17718.branch-1.002.patch, HBASE-17718.branch-1.003.patch, HBASE-17718.master.001.patch, HBASE-17718.master.002.patch, HBASE-17718.master.003.patch
>
> After HBASE-9593, the RS puts up an ephemeral node in ZK before reporting for duty. But if the hosts config (/etc/hosts) differs between the master and the RS, the RS's serverName can differ from the one stored in the ephemeral ZK node. The email mentioned in HBASE-13753 (http://mail-archives.apache.org/mod_mbox/hbase-user/201505.mbox/%3CCANZDn9ueFEEuZMx=pZdmtLsdGLyZz=rrm1N6EQvLswYc1z-H=g...@mail.gmail.com%3E) describes exactly what happened in our production environment.
> What the email didn't point out is that the difference between the serverName in the RS and the one in the ZK node can cause SSH to stop working, as we can see from the code in {{RegionServerTracker}}:
> {code}
>   @Override
>   public void nodeDeleted(String path) {
>     if (path.startsWith(watcher.rsZNode)) {
>       String serverName = ZKUtil.getNodeName(path);
>       LOG.info("RegionServer ephemeral node deleted, processing expiration [" +
>         serverName + "]");
>       ServerName sn = ServerName.parseServerName(serverName);
>       if (!serverManager.isServerOnline(sn)) {
>         LOG.warn(serverName.toString() + " is not online or isn't known to the master."+
>           "The latter could be caused by a DNS misconfiguration.");
>         return;
>       }
>       remove(sn);
>       this.serverManager.expireServer(sn);
>     }
>   }
> {code}
> The server will not be processed by SSH/ServerCrashProcedure, and the regions on this server will not be assigned again until the master restarts or fails over.
> I know HBASE-9593 was meant to fix the case where an RS reports for duty and crashes before it can put up a ZK node. That is a very rare case (and controllable: just fix the bug that makes the RS crash). But the issue I mentioned can happen more often (and is uncontrollable; it can't be fixed in HBase, since it is due to DNS, hosts config, etc.) and has more severe consequences.
> So here I offer some solutions to discuss:
> 1. Revert HBASE-9593 from all branches; Andrew Purtell has reverted it in branch-0.98.
> 2. Abort the RS if the master returns a different name; otherwise SSH can't work properly.
> 3. The master accepts whatever servername the RS reports and doesn't change it.
> 4. Correct the ZK node if the master returns another name (idea from Ted Yu).

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
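The failure mode in the issue description can be reproduced in miniature. What follows is a minimal sketch, not HBase code: `OnlineServers` and `handleNodeDeleted` are hypothetical stand-ins for `ServerManager` and `RegionServerTracker.nodeDeleted`, and the host names are made-up examples. It assumes the master keys its online-server set by the name it resolved itself, while the ephemeral node carries the name the RS resolved.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the name mismatch; every name here is hypothetical, not an HBase API. */
class ServerNameMismatchDemo {

  /** Stand-in for the master's bookkeeping of online servers. */
  static final class OnlineServers {
    private final Set<String> online = new HashSet<>();
    private final Set<String> expired = new HashSet<>();

    void register(String serverName) { online.add(serverName); }

    boolean isServerOnline(String serverName) { return online.contains(serverName); }

    void expireServer(String serverName) {
      online.remove(serverName);
      expired.add(serverName);
    }

    boolean wasExpired(String serverName) { return expired.contains(serverName); }
  }

  /**
   * Mirrors the control flow of the quoted nodeDeleted() snippet: a name the
   * master does not know is logged and skipped, so no crash handling starts.
   * Returns true only if expiration was actually triggered.
   */
  static boolean handleNodeDeleted(OnlineServers servers, String znodeName) {
    if (!servers.isServerOnline(znodeName)) {
      System.out.println(znodeName + " is not online or isn't known to the master.");
      return false;
    }
    servers.expireServer(znodeName);
    return true;
  }

  public static void main(String[] args) {
    OnlineServers servers = new OnlineServers();
    // Name the master resolved for the RS (assumed example value).
    String masterView = "rs1.example.com,16020,1488800000000";
    // Name the RS resolved for itself and used for its ephemeral znode.
    String znodeName = "rs1,16020,1488800000000";
    servers.register(masterView);

    // The RS dies and its znode disappears, but the master cannot match the name.
    System.out.println("expired: " + handleNodeDeleted(servers, znodeName));
    System.out.println("still considered online: " + servers.isServerOnline(masterView));
  }
}
```

In this toy model the deleted znode never triggers expiration, and the master keeps treating the dead server as online, which corresponds to the regions staying unassigned until a master restart or failover.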
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901952#comment-15901952 ]

Hudson commented on HBASE-17718:

SUCCESS: Integrated in Jenkins build HBase-Trunk_matrix #2636 (See [https://builds.apache.org/job/HBase-Trunk_matrix/2636/])
HBASE-17718 Difference between RS's servername and its ephemeral node (stack: rev 6a57050c24100438508199c9856b95be7024803a)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerListener.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenInitializing.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/zookeeper/DrainingServerTracker.java
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901724#comment-15901724 ]

Hadoop QA commented on HBASE-17718:
---

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 25s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 4 new or modified test files. |
| +1 | mvninstall | 1m 38s | branch-1 passed |
| +1 | compile | 0m 31s | branch-1 passed with JDK v1.8.0_121 |
| +1 | compile | 0m 33s | branch-1 passed with JDK v1.7.0_80 |
| +1 | checkstyle | 0m 55s | branch-1 passed |
| +1 | mvneclipse | 0m 16s | branch-1 passed |
| -1 | findbugs | 2m 0s | hbase-server in branch-1 has 2 extant Findbugs warnings. |
| +1 | javadoc | 0m 27s | branch-1 passed with JDK v1.8.0_121 |
| +1 | javadoc | 0m 35s | branch-1 passed with JDK v1.7.0_80 |
| +1 | mvninstall | 0m 42s | the patch passed |
| +1 | compile | 0m 34s | the patch passed with JDK v1.8.0_121 |
| +1 | javac | 0m 34s | the patch passed |
| +1 | compile | 0m 35s | the patch passed with JDK v1.7.0_80 |
| +1 | javac | 0m 35s | the patch passed |
| +1 | checkstyle | 0m 56s | the patch passed |
| +1 | mvneclipse | 0m 17s | the patch passed |
| -1 | whitespace | 0m 0s | The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| +1 | hadoopcheck | 16m 48s | The patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. |
| +1 | hbaseprotoc | 0m 18s | the patch passed |
| +1 | findbugs | 2m 34s | the patch passed |
| +1 | javadoc | 0m 32s | the patch passed with JDK v1.8.0_121 |
| +1 | javadoc | 0m 39s | the patch passed with JDK v1.7.0_80 |
| +1 | unit | 98m 0s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 17s | The patch does not generate ASF License warnings. |
| | | 130m 2s | |

|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:e01ee2f |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12856830/HBASE-17718.branch-1.003.patch |
| JIRA Issue | HBASE-17718 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux bc67b739598a 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/hbase.sh |
| git revision | branch-1 / f34709e |
| Default Java | 1.7.0_80 |
| Multi-JDK
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901498#comment-15901498 ]

stack commented on HBASE-17718:
---

Pushed the amendment. Here comes new branch-1 patch that includes the master amendment (Test is pretty different since master and branch-1 profiles are so different -- master carrying hbase:meta messes us up).
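Proposal 2 from the issue description (abort the RS if the master returns a different name) could be sketched roughly as below. This is an illustration only and is not taken from any of the attached patches: `StartupNameCheck`, `Decision`, and `hostOf` are hypothetical names, and the only HBase fact it relies on is that a server name is written as `host,port,startcode`. It compares host parts case-insensitively, which is itself an assumption about how strict such a check should be.

```java
/** Hypothetical sketch of proposal 2; not the committed HBASE-17718 change. */
class StartupNameCheck {

  /** Outcome of comparing the RS's own name with the name the master handed back. */
  enum Decision { PROCEED, ABORT }

  /**
   * Compares the host part of the name used for the ephemeral znode with the
   * host part of the name the master returned. A mismatch means the master
   * would later fail to match the deleted znode, so the sketch aborts early.
   */
  static Decision check(String znodeServerName, String masterAssignedName) {
    String znodeHost = hostOf(znodeServerName);
    String masterHost = hostOf(masterAssignedName);
    return znodeHost.equalsIgnoreCase(masterHost) ? Decision.PROCEED : Decision.ABORT;
  }

  /** Extracts the host from a "host,port,startcode" string. */
  static String hostOf(String serverName) {
    int comma = serverName.indexOf(',');
    if (comma <= 0) {
      throw new IllegalArgumentException("Expected host,port,startcode but got: " + serverName);
    }
    return serverName.substring(0, comma);
  }

  public static void main(String[] args) {
    // Example values only; the start code is arbitrary.
    System.out.println(check("rs1,16020,1488800000000", "rs1.example.com,16020,1488800000000"));
    System.out.println(check("rs1,16020,1488800000000", "rs1,16020,1488800000000"));
  }
}
```

The trade-off in such a check is that a DNS or hosts misconfiguration turns into an immediate, visible startup failure instead of a silent loss of crash handling later.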
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901417#comment-15901417 ]

Hadoop QA commented on HBASE-17718:
---

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 18s | Docker mode activated. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 3m 20s | master passed |
| +1 | compile | 0m 37s | master passed |
| +1 | checkstyle | 0m 46s | master passed |
| +1 | mvneclipse | 0m 15s | master passed |
| +1 | findbugs | 1m 43s | master passed |
| +1 | javadoc | 0m 26s | master passed |
| +1 | mvninstall | 0m 40s | the patch passed |
| +1 | compile | 0m 36s | the patch passed |
| +1 | javac | 0m 36s | the patch passed |
| +1 | checkstyle | 0m 47s | the patch passed |
| +1 | mvneclipse | 0m 14s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | hadoopcheck | 26m 56s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha2. |
| +1 | findbugs | 1m 49s | the patch passed |
| +1 | javadoc | 0m 26s | the patch passed |
| +1 | unit | 96m 26s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 16s | The patch does not generate ASF License warnings. |
| | | 135m 54s | |

|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:8d52d23 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12856805/0001-HBASE-17718-amendment.patch |
| JIRA Issue | HBASE-17718 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 173b0a91416a 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 58c7619 |
| Default Java | 1.8.0_121 |
| findbugs | v3.0.0 |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/6007/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/6007/console |
| Powered by | Apache Yetus 0.3.0 http://yetus.apache.org |

This message was automatically generated.

> Difference between RS's servername and its ephemeral node cause SSH stop working
>
> Key: HBASE-17718
> URL: https://issues.apache.org/jira/browse/HBASE-17718
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0, 1.2.4, 1.1.8
> Reporter: Allan Yang
> Assignee: Allan Yang
> Attachments:
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900598#comment-15900598 ]

Hadoop QA commented on HBASE-17718:
---

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 22s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
| +1 | mvninstall | 1m 46s | branch-1 passed |
| +1 | compile | 0m 32s | branch-1 passed with JDK v1.8.0_121 |
| +1 | compile | 0m 30s | branch-1 passed with JDK v1.7.0_80 |
| +1 | checkstyle | 0m 49s | branch-1 passed |
| +1 | mvneclipse | 0m 13s | branch-1 passed |
| -1 | findbugs | 1m 43s | hbase-server in branch-1 has 2 extant Findbugs warnings. |
| +1 | javadoc | 0m 24s | branch-1 passed with JDK v1.8.0_121 |
| +1 | javadoc | 0m 29s | branch-1 passed with JDK v1.7.0_80 |
| +1 | mvninstall | 0m 34s | the patch passed |
| +1 | compile | 0m 27s | the patch passed with JDK v1.8.0_121 |
| +1 | javac | 0m 27s | the patch passed |
| +1 | compile | 0m 30s | the patch passed with JDK v1.7.0_80 |
| +1 | javac | 0m 30s | the patch passed |
| +1 | checkstyle | 0m 48s | the patch passed |
| +1 | mvneclipse | 0m 14s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | hadoopcheck | 14m 26s | The patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. |
| +1 | hbaseprotoc | 0m 15s | the patch passed |
| +1 | findbugs | 2m 3s | the patch passed |
| +1 | javadoc | 0m 24s | the patch passed with JDK v1.8.0_121 |
| +1 | javadoc | 0m 34s | the patch passed with JDK v1.7.0_80 |
| +1 | unit | 85m 38s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 17s | The patch does not generate ASF License warnings. |
| | | 113m 27s | |

|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:e01ee2f |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12856710/HBASE-17718.branch-1.002.patch |
| JIRA Issue | HBASE-17718 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux afa629beddc0 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/hbase.sh |
| git revision | branch-1 / 5f63093 |
| Default Java | 1.7.0_80 |
| Multi-JDK versions |
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898452#comment-15898452 ]

Hadoop QA commented on HBASE-17718:
---

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 39s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
| +1 | mvninstall | 2m 23s | branch-1 passed |
| +1 | compile | 0m 43s | branch-1 passed with JDK v1.8.0_121 |
| +1 | compile | 0m 43s | branch-1 passed with JDK v1.7.0_80 |
| +1 | checkstyle | 1m 6s | branch-1 passed |
| +1 | mvneclipse | 0m 20s | branch-1 passed |
| -1 | findbugs | 2m 22s | hbase-server in branch-1 has 2 extant Findbugs warnings. |
| +1 | javadoc | 0m 30s | branch-1 passed with JDK v1.8.0_121 |
| +1 | javadoc | 0m 38s | branch-1 passed with JDK v1.7.0_80 |
| +1 | mvninstall | 0m 47s | the patch passed |
| +1 | compile | 0m 38s | the patch passed with JDK v1.8.0_121 |
| +1 | javac | 0m 38s | the patch passed |
| +1 | compile | 0m 38s | the patch passed with JDK v1.7.0_80 |
| +1 | javac | 0m 38s | the patch passed |
| +1 | checkstyle | 0m 55s | the patch passed |
| +1 | mvneclipse | 0m 17s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | hadoopcheck | 14m 52s | The patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. |
| +1 | hbaseprotoc | 0m 16s | the patch passed |
| +1 | findbugs | 2m 9s | the patch passed |
| +1 | javadoc | 0m 25s | the patch passed with JDK v1.8.0_121 |
| +1 | javadoc | 0m 34s | the patch passed with JDK v1.7.0_80 |
| -1 | unit | 88m 56s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 20s | The patch does not generate ASF License warnings. |
| | | 120m 47s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.regionserver.TestRSKilledWhenInitializing |

|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:e01ee2f |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12856354/HBASE-17718.branch-1.001.patch |
| JIRA Issue | HBASE-17718 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 5903e64f5e01 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/hbase.sh |
| git revision |
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898440#comment-15898440 ]

Hudson commented on HBASE-17718:

SUCCESS: Integrated in Jenkins build HBase-Trunk_matrix #2625 (See [https://builds.apache.org/job/HBase-Trunk_matrix/2625/])
HBASE-17718 Difference between RS's servername and its ephemeral node (stack: rev 7fa7156f2cfbc6a5c9a50739d0ea4a3d5ce4c6ce)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestMasterProcedureEvents.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSKilledWhenInitializing.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestServerCrashProcedure.java

> Difference between RS's servername and its ephemeral node cause SSH stop working
>
> Key: HBASE-17718
> URL: https://issues.apache.org/jira/browse/HBASE-17718
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0, 1.2.4, 1.1.8
> Reporter: Allan Yang
> Assignee: Allan Yang
> Attachments: HBASE-17718.branch-1.001.patch, HBASE-17718.master.001.patch, HBASE-17718.master.002.patch, HBASE-17718.master.003.patch
>
> After HBASE-9593, the RS puts up an ephemeral node in ZK before reporting for
> duty. But if the hosts config (/etc/hosts) differs between master and RS, the
> RS's serverName can be different from the one stored in the ephemeral zk node.
> The email mentioned in HBASE-13753
> (http://mail-archives.apache.org/mod_mbox/hbase-user/201505.mbox/%3CCANZDn9ueFEEuZMx=pZdmtLsdGLyZz=rrm1N6EQvLswYc1z-H=g...@mail.gmail.com%3E)
> describes exactly what happened in our production env.
> What the email didn't point out is that the difference between the serverName
> in the RS and in the zk node can cause SSH to stop working, as we can see from
> the code in {{RegionServerTracker}}:
> {code}
>   @Override
>   public void nodeDeleted(String path) {
>     if (path.startsWith(watcher.rsZNode)) {
>       String serverName = ZKUtil.getNodeName(path);
>       LOG.info("RegionServer ephemeral node deleted, processing expiration [" +
>         serverName + "]");
>       ServerName sn = ServerName.parseServerName(serverName);
>       if (!serverManager.isServerOnline(sn)) {
>         LOG.warn(serverName.toString() + " is not online or isn't known to the master."+
>          "The latter could be caused by a DNS misconfiguration.");
>         return;
>       }
>       remove(sn);
>       this.serverManager.expireServer(sn);
>     }
>   }
> {code}
> The server will not be processed by SSH/ServerCrashProcedure. The regions on
> this server will not be assigned again until master restart or failover.
> I know HBASE-9593 was meant to fix the case where the RS reports for duty and
> crashes before it can put up a zk node. That is a very rare case (and
> controllable: just fix the bug that makes the RS crash). But the issue I
> mentioned can happen more often (and is uncontrollable: it can't be fixed in
> HBase, because it is due to DNS, hosts config, etc.) and has more severe
> consequences.
> So here I offer some solutions to discuss:
> 1. Revert HBASE-9593 from all branches. Andrew Purtell has reverted it in
> branch-0.98.
> 2. Abort the RS if the master returns a different name, otherwise SSH can't
> work properly.
> 3. Master accepts whatever servername is reported by the RS and doesn't change it.
> 4. Correct the zk node if the master returns another name (idea from Ted Yu).

-- This message was sent by Atlassian JIRA (v6.3.15#6346)
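Editor's sketch of the failure described in the issue above. This is a toy model, not HBase code: `ToyMaster`, its methods, and the two server-name strings are hypothetical, and the online-server list is reduced to a set of strings. It only illustrates why a znode name that differs from the name the master registered makes the `nodeDeleted` path return early.

```java
import java.util.HashSet;
import java.util.Set;

public class ServerNameMismatchSketch {

    /** Toy stand-in for the master's online-server bookkeeping (not HBase's ServerManager). */
    public static final class ToyMaster {
        private final Set<String> onlineServers = new HashSet<>();

        /** The master records the RS under the name the master itself resolved. */
        public void reportForDuty(String nameResolvedByMaster) {
            onlineServers.add(nameResolvedByMaster);
        }

        public boolean isOnline(String serverName) {
            return onlineServers.contains(serverName);
        }

        /**
         * Mirrors the shape of the quoted nodeDeleted(): a name the master does
         * not know is skipped, so no crash handling is started for it.
         * Returns true only if the server was actually expired.
         */
        public boolean nodeDeleted(String znodeName) {
            if (!isOnline(znodeName)) {
                return false; // "is not online or isn't known to the master"
            }
            onlineServers.remove(znodeName);
            return true;
        }
    }

    public static void main(String[] args) {
        ToyMaster master = new ToyMaster();
        // Hypothetical names: the RS wrote its znode with one hostname,
        // the master resolved a different one for the same process.
        String znodeName = "rs1.example.com,16020,1488500000000";
        String masterName = "rs1,16020,1488500000000";

        master.reportForDuty(masterName);
        boolean expired = master.nodeDeleted(znodeName);

        // The server is never expired and stays in the online list.
        System.out.println("expired=" + expired + ", stillOnline=" + master.isOnline(masterName));
    }
}
```

In this model the dead server stays in the online set until something else removes it, which matches the report that regions are not reassigned until master restart or failover.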
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898089#comment-15898089 ]

Hadoop QA commented on HBASE-17718:
---
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 26s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s {color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 50s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 53s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 57s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 46s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 44s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 44s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 52s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 31m 8s {color} | {color:green} Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha2. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 4s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 106m 52s {color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 152m 33s {color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:8d52d23 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12856326/HBASE-17718.master.003.patch |
| JIRA Issue | HBASE-17718 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 50a544df6f78 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / d2349c6 |
| Default Java | 1.8.0_121 |
| findbugs | v3.0.0 |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/5968/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/5968/console |
| Powered by | Apache Yetus 0.3.0 http://yetus.apache.org |

This message was automatically generated.

> Difference between RS's servername and its ephemeral node cause SSH stop working
>
> Key: HBASE-17718
> URL: https://issues.apache.org/jira/browse/HBASE-17718
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0, 1.2.4, 1.1.8
> Reporter: Allan Yang
> Assignee: Allan Yang
> Attachments:
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897764#comment-15897764 ]

stack commented on HBASE-17718:
---
.003 includes fix for Jerry comment and fix for failing tests (added loads of comments around how the wait on regionserver checkins is supposed to work... it can get confusing... hopefully didn't make it more so).

> Difference between RS's servername and its ephemeral node cause SSH stop working
>
> Key: HBASE-17718
> URL: https://issues.apache.org/jira/browse/HBASE-17718
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0, 1.2.4, 1.1.8
> Reporter: Allan Yang
> Assignee: Allan Yang
> Attachments: HBASE-17718.master.001.patch, HBASE-17718.master.002.patch, HBASE-17718.master.003.patch
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897535#comment-15897535 ]

stack commented on HBASE-17718:
---
Thank you for review [~jerryhe]. Let me fix the rb comment and look at this test too...

> Difference between RS's servername and its ephemeral node cause SSH stop working
>
> Key: HBASE-17718
> URL: https://issues.apache.org/jira/browse/HBASE-17718
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0, 1.2.4, 1.1.8
> Reporter: Allan Yang
> Assignee: Allan Yang
> Attachments: HBASE-17718.master.001.patch, HBASE-17718.master.002.patch
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896022#comment-15896022 ]

Jerry He commented on HBASE-17718:
--
LGTM on patch 002. Minor comment on RB.

> Difference between RS's servername and its ephemeral node cause SSH stop working
>
> Key: HBASE-17718
> URL: https://issues.apache.org/jira/browse/HBASE-17718
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0, 1.2.4, 1.1.8
> Reporter: Allan Yang
> Assignee: Allan Yang
> Attachments: HBASE-17718.master.001.patch, HBASE-17718.master.002.patch
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895992#comment-15895992 ]

Hadoop QA commented on HBASE-17718:
---
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 21s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s {color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 41s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 50s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 17s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 57s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 48s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 42s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 42s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 50s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 17s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 26m 48s {color} | {color:green} Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha2. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 50s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 96m 59s {color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 16s {color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 137m 33s {color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.regionserver.TestRSKilledWhenInitializing |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:8d52d23 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12856077/HBASE-17718.master.002.patch |
| JIRA Issue | HBASE-17718 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 9bcb30be3104 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh |
| git revision | master / 6bb5938 |
| Default Java | 1.8.0_121 |
| findbugs | v3.0.0 |
| unit | https://builds.apache.org/job/PreCommit-HBASE-Build/5954/artifact/patchprocess/patch-unit-hbase-server.txt |
| unit test logs | https://builds.apache.org/job/PreCommit-HBASE-Build/5954/artifact/patchprocess/patch-unit-hbase-server.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/5954/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/5954/console |
| Powered by | Apache Yetus 0.3.0 http://yetus.apache.org |

This message was automatically generated.

> Difference between RS's servername and its ephemeral node cause SSH stop working
>
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895905#comment-15895905 ]

stack commented on HBASE-17718:
---
.002 fixes test failures. The patch exposed sloppiness in our accounting around Master startup waiting on regionservers to report in. We were passing the wait threshold early, but before the patch things would 'work' because of early registration by the RS up in zk. Now the math is fixed: if the Master hosts only the hbase:meta region, add one so that we wait on an RS to check in, to which we can assign user-space regions.

> Difference between RS's servername and its ephemeral node cause SSH stop working
>
> Key: HBASE-17718
> URL: https://issues.apache.org/jira/browse/HBASE-17718
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0, 1.2.4, 1.1.8
> Reporter: Allan Yang
> Assignee: Allan Yang
> Attachments: HBASE-17718.master.001.patch, HBASE-17718.master.002.patch
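Editor's sketch of the "add one" accounting described in the .002 comment above. The class, method names, and parameters are hypothetical and are not taken from the patch; the sketch also assumes that the master counts itself among the checked-in servers, which the comment implies but does not state.

```java
public class StartupWaitSketch {

    /**
     * Number of checked-in servers the master waits for before assigning regions.
     * Assumption: the master counts itself as one checked-in server. If it hosts
     * only hbase:meta, it cannot take user-space regions, so one more server is
     * required than the configured minimum.
     */
    public static int minServersToWaitFor(int configuredMin, boolean masterHostsOnlyMeta) {
        int min = Math.max(configuredMin, 1);
        return masterHostsOnlyMeta ? min + 1 : min;
    }

    /** True once enough servers (master included) have checked in. */
    public static boolean canStopWaiting(int checkedIn, int configuredMin,
                                         boolean masterHostsOnlyMeta) {
        return checkedIn >= minServersToWaitFor(configuredMin, masterHostsOnlyMeta);
    }

    public static void main(String[] args) {
        // Only the master has checked in and it hosts only hbase:meta: keep waiting.
        System.out.println(canStopWaiting(1, 1, true));
        // One regionserver has also checked in: user-space regions have a home.
        System.out.println(canStopWaiting(2, 1, true));
    }
}
```

Without the extra one, the first case would pass the threshold with no regionserver available, which is the early pass the comment describes.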
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895628#comment-15895628 ]

Hadoop QA commented on HBASE-17718:
---
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 17s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s {color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 14s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 49s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 46s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 42s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 37s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 37s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 45s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 28m 7s {color} | {color:green} Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha2. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 55s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 107m 32s {color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 15s {color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 148m 25s {color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.regionserver.TestEncryptionKeyRotation |
| Timed out junit tests | org.apache.hadoop.hbase.master.TestMasterOperationsForRegionReplicas |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:8d52d23 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12856042/HBASE-17718.master.001.patch |
| JIRA Issue | HBASE-17718 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 6f3b90262e99 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 404a288 |
| Default Java | 1.8.0_121 |
| findbugs | v3.0.0 |
| whitespace | https://builds.apache.org/job/PreCommit-HBASE-Build/5949/artifact/patchprocess/whitespace-eol.txt |
| unit | https://builds.apache.org/job/PreCommit-HBASE-Build/5949/artifact/patchprocess/patch-unit-hbase-server.txt |
| unit test logs | https://builds.apache.org/job/PreCommit-HBASE-Build/5949/artifact/patchprocess/patch-unit-hbase-server.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/5949/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console output |
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895590#comment-15895590 ]

stack commented on HBASE-17718:
---
First cut at a patch. It reverts the fix for the problem reported over in HBASE-9593 and instead addresses that problem with a check for the ephemeral znode on ConnectionException when trying to open a region; if none is found, we clean up the bad online server. This is simple. The alternatives were all messy, for example a provisional state in the Master, where a server moves from provisionally registered to actually registered on receipt of a confirming heartbeat or after a check of zk state finds a corresponding ephemeral znode for the reporting regionserver. They also take us away from the actual fix, which is HBASE-17733. Let's see how this patch does. branch-1 will be a bit different.

> Difference between RS's servername and its ephemeral node cause SSH stop working
>
> Key: HBASE-17718
> URL: https://issues.apache.org/jira/browse/HBASE-17718
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0, 1.2.4, 1.1.8
> Reporter: Allan Yang
> Assignee: Allan Yang
> Attachments: HBASE-17718.master.001.patch
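Editor's sketch of the approach described in the first-cut comment above: when connecting to a server fails during a region open, check whether that server has an ephemeral znode, and drop it from the online list if it does not. Everything here is hypothetical (`StaleServerCleanupSketch`, `onConnectFailure`, and the `Predicate` standing in for a ZooKeeper existence check); it is not the code from the attached patch.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

public class StaleServerCleanupSketch {
    private final Set<String> onlineServers = new HashSet<>();
    // Stand-in for "does an ephemeral znode exist for this server name?"
    private final Predicate<String> hasEphemeralZnode;

    public StaleServerCleanupSketch(Predicate<String> hasEphemeralZnode) {
        this.hasEphemeralZnode = hasEphemeralZnode;
    }

    /** The server reported for duty, so the master lists it as online. */
    public void register(String serverName) {
        onlineServers.add(serverName);
    }

    public boolean isOnline(String serverName) {
        return onlineServers.contains(serverName);
    }

    /**
     * Called when opening a region on serverName fails with a connection error.
     * Returns true if the server was found to have no znode and was cleaned up.
     */
    public boolean onConnectFailure(String serverName) {
        if (!onlineServers.contains(serverName)) {
            return false; // already gone, nothing to clean up
        }
        if (hasEphemeralZnode.test(serverName)) {
            return false; // znode present: leave expiry to the normal znode-deleted path
        }
        onlineServers.remove(serverName); // no znode will ever be deleted for it
        return true;
    }

    public static void main(String[] args) {
        Set<String> znodes = new HashSet<>();
        StaleServerCleanupSketch master = new StaleServerCleanupSketch(znodes::contains);

        // The server reported for duty but died before writing its znode.
        master.register("rs2,16020,1488500000000");
        System.out.println(master.onConnectFailure("rs2,16020,1488500000000"));
        System.out.println(master.isOnline("rs2,16020,1488500000000"));
    }
}
```

The design point in the comment is that this check is lazy: nothing changes at registration time, and the stale entry is only noticed when the master actually tries to use the server.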
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893823#comment-15893823 ]

stack commented on HBASE-17718:
-------------------------------

Looking at this now [~allan163]. Yes, revert of HBASE-9593. In the old days, all kinds of mess were possible when RS and Master did not agree on naming, so we just let the master be in charge of how servers are named in a cluster. HBASE-9593 mangled this. HBASE-13753 was supposed to fix it but was just left to sit.

But I'd like to put in place an alternate solution for the problem reported by HBASE-9593. It is a real possibility. The heartbeat registers the server, but we need the evaporation of the znode for the server to be removed from the online list -- and if we fail to write the znode after the heartbeat that reports-for-duty, the removal never happens (my 'There is also something odd...' is actually incorrect on reexamination).

I'll be back. HBASE-9593 has a nice test that I can reuse.
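The gap described in this comment (a server registered by heartbeat whose znode was never written, so nothing ever removes it) is what the first-cut patch in the comment at the top of this thread covers by checking for the ephemeral znode when a connection to the server fails. The following is only a rough sketch of that idea under stated assumptions, not the patch itself: `StaleServerCleanupSketch`, `onConnectFailure`, and the `znodeExists` predicate are hypothetical names, and the real ZooKeeper lookup is replaced by an in-memory predicate.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch: expire an "online" server that has no ephemeral znode
// once a connection attempt to it fails.
final class StaleServerCleanupSketch {
    private final Set<String> onlineServers = new HashSet<>();
    // Stand-in for a ZooKeeper existence check on the server's znode.
    private final Predicate<String> znodeExists;

    StaleServerCleanupSketch(Predicate<String> znodeExists) {
        this.znodeExists = znodeExists;
    }

    void register(String serverName) {
        onlineServers.add(serverName);
    }

    boolean isOnline(String serverName) {
        return onlineServers.contains(serverName);
    }

    // Called after connecting to the server failed (for example while trying
    // to open a region on it). Returns true if the server was cleaned up.
    boolean onConnectFailure(String serverName) {
        if (!onlineServers.contains(serverName)) {
            return false; // unknown server, nothing to clean up
        }
        if (znodeExists.test(serverName)) {
            // The znode is still there; its eventual deletion drives the
            // normal expiration path, so leave the server alone here.
            return false;
        }
        // No znode will ever be deleted for this server: remove it now.
        onlineServers.remove(serverName);
        return true;
    }
}
```

The design point in this sketch is that the cleanup is lazy: nothing changes in the registration path, and the extra check only runs when the master already has evidence (a failed connection) that the server may be gone.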
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893694#comment-15893694 ]

Allan Yang commented on HBASE-17718:
------------------------------------

So you suggest we revert HBASE-9593, [~stack]? If so, please go ahead and help me resolve this issue. Thank you, sir!
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893632#comment-15893632 ]

stack commented on HBASE-17718:
-------------------------------

#1, because HBASE-9593 is wrong (registering in zk with the local name rather than the master-provided name BEFORE we've talked to the master to get what our cluster name is supposed to be).

There is also something odd about HBASE-9593, looking at it again now. A regionserver is being 'registered' via a reading of the zk data, not via heartbeat, so it must be something like a master that has joined an already-running cluster after a master crash. It looks like an extremely rare case where an ephemeral node has not evaporated yet and in the meantime a master crashes and then a backup master joins the cluster.

Looks like we need a more rigorous accounting of cluster servers when a backup master joins a running cluster. It can read candidate servers by looking in zk, but it should wait on a heartbeat before adding a candidate regionserver to its cluster set. We can open a new issue to do this so that we prevent a version of HBASE-9593 arising again post revert.
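The accounting proposed in this comment (read candidates from zk, but only count a regionserver as part of the cluster once it heartbeats) can be sketched as a two-set structure. This is a hypothetical illustration, not code from HBase or from any patch on this issue; `ProvisionalServerSetSketch` and its method names are invented for the example.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: servers seen only in zk are candidates; a heartbeat
// promotes a server to the confirmed cluster set.
final class ProvisionalServerSetSketch {
    private final Set<String> candidates = new HashSet<>();
    private final Set<String> confirmed = new HashSet<>();

    // A newly active master reads the regionserver znode names from zk.
    void loadCandidatesFromZk(Collection<String> znodeNames) {
        for (String name : znodeNames) {
            if (!confirmed.contains(name)) {
                candidates.add(name);
            }
        }
    }

    // A heartbeat is the only thing that makes a server count as online.
    void onHeartbeat(String serverName) {
        candidates.remove(serverName);
        confirmed.add(serverName);
    }

    boolean isCandidate(String serverName) {
        return candidates.contains(serverName);
    }

    boolean isOnline(String serverName) {
        return confirmed.contains(serverName);
    }
}
```

Under this scheme a stale znode left behind by a dead regionserver would leave that server stuck as a candidate rather than being treated as a live member; how long to wait before discarding a candidate is a policy question the sketch deliberately leaves open.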
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893574#comment-15893574 ]

stack commented on HBASE-17718:
-------------------------------

We should not be using the local name. We need to use the name the master tells us to use. Writing zk with the local name is an error. Even if it is there for only the blink of an eye, the master might notice it and get confused, thinking it a legit server.
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893526#comment-15893526 ]

Allan Yang commented on HBASE-17718:
------------------------------------

{quote}
Another option is for region server to correct the server name in znode.
This can be done via zookeeper multi: drop previous znode and create new znode in one call.
{quote}
It is also a good choice, but I'm afraid deleting the old znode will trigger a server shutdown handling event, and the old servername will show as a dead server in the master's web UI until restart. But I will add this one as the 4th solution.
Let's vote on these solutions, so I can prepare a patch.
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892046#comment-15892046 ]

Ted Yu commented on HBASE-17718:
--------------------------------

Another option is for region server to correct the server name in znode.
This can be done via zookeeper multi: drop previous znode and create new znode in one call.
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891801#comment-15891801 ]

Allan Yang commented on HBASE-17718:
------------------------------------

We observed this issue only recently, so we don't have a good solution yet. What we did to quickly clear the malfunction online was also to make sure that /etc/hosts is clean. I also think we'd better solve this at the code level rather than rely on env config.
[jira] [Commented] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working
[ https://issues.apache.org/jira/browse/HBASE-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891796#comment-15891796 ]

Yu Li commented on HBASE-17718:
-------------------------------

We also observed a similar issue online, and our way of resolving it was to make sure there was no difference in /etc/hosts. But I agree that it's better to resolve it at the code level.

Before discussing which solution to choose, would you mind sharing which approach you're using online and why? Thanks.