[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13657552#comment-13657552 ]

stack commented on HBASE-8537:
------------------------------

Looks great. Sweet test. +1

Should the below not throw a YouAreDeadException instead?

    +LOG.info("Server serverName=" + serverName +
    +  " rejected; we already have " + existingServer.toString() +
    +  " registered with same hostname and port");
    +return false;

Dead region server pulled in from ZK
------------------------------------

            Key: HBASE-8537
            URL: https://issues.apache.org/jira/browse/HBASE-8537
        Project: HBase
     Issue Type: Bug
     Components: master
       Reporter: Jimmy Xiang
       Assignee: Jimmy Xiang
       Priority: Minor
    Attachments: trunk-8537.patch, trunk-8537_v2.patch, trunk-8537_v3.patch

When a cluster restarts quickly after a crash, the master still pulls in the dead region server from zk, even though a new region server has already reported in.

{noformat}
2013-05-12 18:32:52,996 INFO [IPC Server handler 6 on 36000] org.apache.hadoop.hbase.master.ServerManager: Registering server=a1217.halxg.cloudera.com,36020,1368408767773
2013-05-12 18:32:54,653 INFO [master-a1220.halxg.cloudera.com,36000,1368408767520] org.apache.hadoop.hbase.master.HMaster: Registering server found up in zk but who has not yet reported in: a1217.halxg.cloudera.com,36020,1368378273768
2013-05-12 18:32:54,653 INFO [master-a1220.halxg.cloudera.com,36000,1368408767520] org.apache.hadoop.hbase.master.ServerManager: Registering server=a1217.halxg.cloudera.com,36020,1368378273768
{noformat}

We should not pull in the second region server instance from zk. It is actually dead. We can figure this out from the hostname and the port: we can assume no two region server instances can be alive on the same host and the same port. To be more cautious, we can check the timestamp as well. The live one should be the one with the later timestamp, not the one pulled in from zk.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
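The check proposed in the description can be sketched roughly as follows. This is a minimal illustration, not the attached patch: `ServerId` and `LiveServers` are hypothetical names, and the only assumption taken from the log excerpt is that a server name has the form `hostname,port,startcode`, with the start code acting as a startup timestamp.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a region server name of the form "host,port,startcode". */
final class ServerId {
    final String host;
    final int port;
    final long startCode;

    ServerId(String host, int port, long startCode) {
        this.host = host;
        this.port = port;
        this.startCode = startCode;
    }

    /** Parses a name as it appears in the master log, e.g. "host.example.com,36020,1368408767773". */
    static ServerId parse(String name) {
        String[] parts = name.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected host,port,startcode: " + name);
        }
        return new ServerId(parts[0], Integer.parseInt(parts[1]), Long.parseLong(parts[2]));
    }

    String hostAndPort() {
        return host + ":" + port;
    }
}

/** Hypothetical registry that keeps at most one live instance per host and port. */
final class LiveServers {
    private final Map<String, ServerId> byHostAndPort = new HashMap<>();

    /**
     * Records the candidate unless an instance with the same host and port
     * and an equal or later start code is already registered.
     */
    boolean register(ServerId candidate) {
        ServerId existing = byHostAndPort.get(candidate.hostAndPort());
        if (existing != null && existing.startCode >= candidate.startCode) {
            return false; // candidate is already known, or is an older and presumably dead instance
        }
        byHostAndPort.put(candidate.hostAndPort(), candidate);
        return true;
    }

    ServerId get(String host, int port) {
        return byHostAndPort.get(host + ":" + port);
    }
}
```

With the two names from the log, registering `a1217.halxg.cloudera.com,36020,1368408767773` first and then offering `a1217.halxg.cloudera.com,36020,1368378273768` from zk leaves the first one in place, because the second has the smaller start code.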
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13657571#comment-13657571 ]

Jean-Daniel Cryans commented on HBASE-8537:
-------------------------------------------

+1, thanks Jimmy!
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13657576#comment-13657576 ]

Jimmy Xiang commented on HBASE-8537:
------------------------------------

YouAreDeadException is used when the server is already in the dead server list (per checkIsDead).

For now, I am not sure when this case would happen. Suppose a server name is created by the master: then the master should already know it, or a dead master used to know it, in which case the start code should be old. To me, this case only happens if there is some time skew. Should we log a warning instead?

The difference from the existing logic is that if the start code is larger, the new server will be recorded without throwing a PleaseHoldException.
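The cases discussed in this comment can be laid out as a small decision function. This is a sketch of the behaviour as described in the thread, not the code in trunk-8537_v3.patch: `Outcome`, `Registration`, and `decide` are hypothetical names, the dead server list is modeled as a plain set of name strings, and treating an equal start code as a rejection is an assumption made here for simplicity.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical outcomes for a server that reports in or is found in ZooKeeper. */
enum Outcome {
    DEAD,      // full name is in the dead server list (the YouAreDeadException case in the thread)
    REJECTED,  // same host and port already registered with an equal or later start code
    REPLACED,  // same host and port registered with an older start code; the new server is recorded
    RECORDED   // nothing registered for this host and port yet
}

final class Registration {
    private Registration() {}

    /**
     * @param deadNames       full names ("host,port,startcode") already marked dead
     * @param liveStartCodes  start code of the registered server, keyed by "host:port"
     */
    static Outcome decide(String host, int port, long startCode,
                          Set<String> deadNames, Map<String, Long> liveStartCodes) {
        if (deadNames.contains(host + "," + port + "," + startCode)) {
            return Outcome.DEAD;
        }
        Long existing = liveStartCodes.get(host + ":" + port);
        if (existing == null) {
            return Outcome.RECORDED;
        }
        // Assumption: an equal start code means the same instance, so it is not recorded again.
        return existing >= startCode ? Outcome.REJECTED : Outcome.REPLACED;
    }
}
```

The function is deliberately pure: it only classifies the request, so the REJECTED branch (the one the quoted `LOG.info` line belongs to) can be compared with the REPLACED branch without any master state involved.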
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13657601#comment-13657601 ]

Hadoop QA commented on HBASE-8537:
----------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12583211/trunk-8537_v3.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.

    {color:green}+1 hadoop1.0{color}. The patch compiles against the hadoop 1.0 profile.

    {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile.

    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100

    {color:green}+1 site{color}. The mvn site goal succeeds with this patch.

    {color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5680//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5680//console

This message is automatically generated.
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13657620#comment-13657620 ]

stack commented on HBASE-8537:
------------------------------

Lets go w/ the patch as is Jimmy. +1
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13657766#comment-13657766 ]

Hudson commented on HBASE-8537:
-------------------------------

Integrated in hbase-0.95 #193 (See [https://builds.apache.org/job/hbase-0.95/193/])
    HBASE-8537 Dead region server pulled in from ZK (Revision 1482636)

     Result = SUCCESS
jxiang :
Files :
* /hbase/branches/0.95/hbase-client/src/main/java/org/apache/hadoop/hbase/ServerName.java
* /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* /hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterNoCluster.java

Dead region server pulled in from ZK
------------------------------------

            Key: HBASE-8537
        Fix For: 0.98.0, 0.95.1
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13657852#comment-13657852 ]

Hudson commented on HBASE-8537:
-------------------------------

Integrated in HBase-TRUNK #4118 (See [https://builds.apache.org/job/HBase-TRUNK/4118/])
    HBASE-8537 Dead region server pulled in from ZK (Revision 1482635)

     Result = FAILURE
jxiang :
Files :
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ServerName.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterNoCluster.java
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13657855#comment-13657855 ]

Hudson commented on HBASE-8537:
-------------------------------

Integrated in hbase-0.95-on-hadoop2 #99 (See [https://builds.apache.org/job/hbase-0.95-on-hadoop2/99/])
    HBASE-8537 Dead region server pulled in from ZK (Revision 1482636)

     Result = FAILURE
jxiang :
Files :
* /hbase/branches/0.95/hbase-client/src/main/java/org/apache/hadoop/hbase/ServerName.java
* /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* /hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterNoCluster.java
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13657874#comment-13657874 ]

Hudson commented on HBASE-8537:
-------------------------------

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #530 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/530/])
    HBASE-8537 Dead region server pulled in from ZK (Revision 1482635)

     Result = FAILURE
jxiang :
Files :
* /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ServerName.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterNoCluster.java
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656152#comment-13656152 ]

Jimmy Xiang commented on HBASE-8537:
------------------------------------

One side effect (not faked :) as in the movie I watched last weekend) is that AM tries to assign regions to that dead one, but gets zk events from the new one. It confuses the AM.

Dead region server pulled in from ZK
------------------------------------

                 Key: HBASE-8537
    Affects Versions: 0.98.0
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656161#comment-13656161 ]

Jean-Daniel Cryans commented on HBASE-8537:
-------------------------------------------

Interesting, I assume you got this on trunk since you marked it affects 0.98? I would not expect 0.94 to have this issue.
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656181#comment-13656181 ]

Jimmy Xiang commented on HBASE-8537:
------------------------------------

Yes, I got this on trunk. There is no trunk in affected versions. Should we leave it blank?
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656187#comment-13656187 ]

Jean-Daniel Cryans commented on HBASE-8537:
-------------------------------------------

I think the understanding is that 0.98 is trunk at the moment, I just wanted to verify you really were on trunk. So I would consider this bug a regression.
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656188#comment-13656188 ]

Jimmy Xiang commented on HBASE-8537:
------------------------------------

0.94 has this issue too. However, the 0.94 AM doesn't check the rs timestamp at all, so it doesn't care and isn't confused.
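One way to picture the difference described here is to compare a strict match on the full server name with a loose match that ignores the start code. This is an illustration only, not AssignmentManager code: `ServerNameMatch` and both method names are hypothetical, and the names are assumed to be plain `host,port,startcode` strings as in the logs.

```java
/** Hypothetical helpers contrasting two ways of matching server names. */
final class ServerNameMatch {
    private ServerNameMatch() {}

    /** Strict match: host, port and start code must all agree. */
    static boolean sameInstance(String a, String b) {
        return a.equals(b);
    }

    /** Loose match: ignores the start code, i.e. the last comma-separated field. */
    static boolean sameHostAndPort(String a, String b) {
        return stripStartCode(a).equals(stripStartCode(b));
    }

    private static String stripStartCode(String name) {
        int lastComma = name.lastIndexOf(',');
        return lastComma < 0 ? name : name.substring(0, lastComma);
    }
}
```

Under the strict match, an event from `a1217.halxg.cloudera.com,36020,1368408767773` does not correspond to a region planned for `a1217.halxg.cloudera.com,36020,1368378273768`, which is the kind of mismatch that would confuse a component that does check the timestamp. Under the loose match the two names are treated as the same server.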
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656206#comment-13656206 ]
Jean-Daniel Cryans commented on HBASE-8537:
I'm not sure [~jxiang], here's what happens when I test it locally:
{noformat}
2013-05-13 11:18:36,471 INFO org.apache.hadoop.hbase.master.ServerManager: Server serverName=172.21.3.117,60020,1368469116206 rejected; we already have 172.21.3.117,60020,1368469063154 registered with same hostname and port
2013-05-13 11:18:36,471 INFO org.apache.hadoop.hbase.master.ServerManager: Triggering server recovery; existingServer 172.21.3.117,60020,1368469063154 looks stale, new server:172.21.3.117,60020,1368469116206
2013-05-13 11:18:36,472 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: based on AM, current region=-ROOT-,,0.70236052 is on server=172.21.3.117,60020,1368469063154 server being checked: 172.21.3.117,60020,1368469063154
2013-05-13 11:18:36,473 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: based on AM, current region=.META.,,1.1028785192 is on server=172.21.3.117,60020,1368469063154 server being checked: 172.21.3.117,60020,1368469063154
2013-05-13 11:18:36,474 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=172.21.3.117,60020,1368469063154 to dead servers, submitted shutdown handler to be executed, root=true, meta=true
{noformat}
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656247#comment-13656247 ]
Jimmy Xiang commented on HBASE-8537:
[~jdcryans], the code touched by this patch is very old, and in 0.94 too. In your test, the new region server instance is rejected actually, right? It should be fixed.

Dead region server pulled in from ZK
Key: HBASE-8537
URL: https://issues.apache.org/jira/browse/HBASE-8537
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.98.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
Attachments: trunk-8537.patch
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656253#comment-13656253 ]
Jimmy Xiang commented on HBASE-8537:
bq. It should be fixed.
I think it is fine, since the new region server will get a PleaseHoldException.
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656256#comment-13656256 ]
Jean-Daniel Cryans commented on HBASE-8537:
bq. In your test, the new region server instance is rejected actually, right? It should be fixed.
No, it does the right thing. The master figures the old region server is dead since it's coming back (from the dead!) so as you can see it triggers a SSH (ServerManager: Added=172.21.3.117,60020,1368469063154 to dead servers). This is the rest of the log:
{noformat}
2013-05-13 11:18:36,474 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for 172.21.3.117,60020,1368469063154
2013-05-13 11:18:36,477 DEBUG org.apache.hadoop.hbase.master.MasterFileSystem: Renamed region directory: file:/tmp/hbase-jdcryans/hbase/.logs/172.21.3.117,60020,1368469063154-splitting
2013-05-13 11:18:36,477 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [172.21.3.117,60020,1368469063154]
2013-05-13 11:18:36,479 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of logs to split
2013-05-13 11:18:36,480 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting logs in [file:/tmp/hbase-jdcryans/hbase/.logs/172.21.3.117,60020,1368469063154-splitting]
2013-05-13 11:18:36,485 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/file%3A%2Ftmp%2Fhbase-jdcryans%2Fhbase%2F.logs%2F172.21.3.117%2C60020%2C1368469063154-splitting%2F172.21.3.117%252C60020%252C1368469063154.1368469068703
2013-05-13 11:18:36,486 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/file%3A%2Ftmp%2Fhbase-jdcryans%2Fhbase%2F.logs%2F172.21.3.117%2C60020%2C1368469063154-splitting%2F172.21.3.117%252C60020%252C1368469063154.1368469068703 ver = 0
2013-05-13 11:18:37,419 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 1
2013-05-13 11:18:38,420 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 1
2013-05-13 11:18:39,421 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 1
2013-05-13 11:18:39,480 DEBUG org.apache.hadoop.hbase.master.ServerManager: STARTUP: Server 172.21.3.117,60020,1368469116206 came back up, removed it from the dead servers list
2013-05-13 11:18:39,480 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=172.21.3.117,60020,1368469116206
{noformat}
In this case I just killed -9 the region server, not the whole cluster.
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656265#comment-13656265 ]
Jimmy Xiang commented on HBASE-8537:
I think I figured out this issue. Master calls initializeZKBasedSystemTrackers after ServerManager is created. During this period, if a region server is reported in, it will be added to the online regions and the reported issue will be triggered. For your test, the region server checked in AFTER initializeZKBasedSystemTrackers is called, i.e. after the regionserver tracker started.
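The ordering described in this comment can be replayed with a toy model. This is a sketch written for illustration, not HBase code: `StartupRaceDemo` and its methods are hypothetical, and the only behaviour assumed is that both registration paths add a server name without checking for an existing entry on the same host and port.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy model of a master's online-server list during startup. */
final class StartupRaceDemo {
    private final List<String> online = new ArrayList<>();

    /** Registers a server name unconditionally, which is the assumed source of the bug. */
    void recordWithoutCheck(String serverName) {
        online.add(serverName);
    }

    /** Strips the trailing start code, leaving host,port. */
    static String hostAndPort(String serverName) {
        return serverName.substring(0, serverName.lastIndexOf(','));
    }

    /** Counts how many online entries share the given host,port. */
    long countOnlineAt(String hostPort) {
        return online.stream().filter(s -> hostAndPort(s).equals(hostPort)).count();
    }

    /** Replays the two registrations from the log in the order they happened. */
    static StartupRaceDemo replay() {
        StartupRaceDemo master = new StartupRaceDemo();
        // Step 1: the restarted region server reports in before the zk trackers are started.
        master.recordWithoutCheck("a1217.halxg.cloudera.com,36020,1368408767773");
        // Step 2: the trackers start and the stale znode of the dead instance is registered too.
        master.recordWithoutCheck("a1217.halxg.cloudera.com,36020,1368378273768");
        return master;
    }
}
```

After `replay()`, `countOnlineAt("a1217.halxg.cloudera.com,36020")` is 2, which mirrors the two "Registering server=" lines in the reported log: one live instance and one dead one on the same host and port.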
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656266#comment-13656266 ]
Jimmy Xiang commented on HBASE-8537:
The right fix is to change the initialization order, if it is doable?
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656280#comment-13656280 ]
Jimmy Xiang commented on HBASE-8537:
Region server tracker requires a server manager. It's too radical to change the initialization order.
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656304#comment-13656304 ]
Jean-Daniel Cryans commented on HBASE-8537:
What about calling SM.regionServerReport instead of SM.recordNewServer when going through the region servers in ZK? It will do the proper checks although we might need to add a new condition in SM.checkAlreadySameHostPort to see if the one we're passing is the older one (compared to my previous case that I posted where the new one is the newer one).
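The extra condition suggested here distinguishes two cases: the incoming name is newer than the registered one (the kill -9 test posted earlier), or it is older (the stale znode in this issue). A minimal sketch of that decision follows. It is an assumption-laden illustration, not the attached patch and not the real `checkAlreadySameHostPort`: `ZkRegistrationCheck`, `Decision`, and `decide` are hypothetical names, and server names are assumed to have the `host,port,startcode` form from the logs.

```java
/** Hypothetical decision helper; not part of HBase. */
final class ZkRegistrationCheck {
    enum Decision { REGISTER, SKIP_STALE_CANDIDATE, EXPIRE_STALE_EXISTING }

    /** Strips the trailing start code, leaving host,port. */
    static String hostAndPort(String serverName) {
        return serverName.substring(0, serverName.lastIndexOf(','));
    }

    /** Reads the trailing start code as a long. */
    static long startCode(String serverName) {
        return Long.parseLong(serverName.substring(serverName.lastIndexOf(',') + 1));
    }

    /**
     * Decides what to do with a candidate name given the currently online names.
     * The first online entry on the same host and port settles the outcome.
     */
    static Decision decide(String candidate, String... online) {
        for (String existing : online) {
            if (!hostAndPort(existing).equals(hostAndPort(candidate))) {
                continue; // different host or port, not a conflict
            }
            // Same host and port: whichever has the smaller start code is treated as stale.
            return startCode(existing) < startCode(candidate)
                    ? Decision.EXPIRE_STALE_EXISTING
                    : Decision.SKIP_STALE_CANDIDATE;
        }
        return Decision.REGISTER;
    }
}
```

With the names from the two logs in this thread, a candidate 172.21.3.117,60020,1368469116206 against an online 172.21.3.117,60020,1368469063154 yields `EXPIRE_STALE_EXISTING`, while a candidate a1217.halxg.cloudera.com,36020,1368378273768 against an online a1217.halxg.cloudera.com,36020,1368408767773 yields `SKIP_STALE_CANDIDATE`.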
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656336#comment-13656336 ]
Jimmy Xiang commented on HBASE-8537:
I have considered this, which can reuse some shared logic. However, I chose a relatively safer route. OK, let me post a new patch soon.
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656522#comment-13656522 ]
Jean-Daniel Cryans commented on HBASE-8537:
Actually I meant regionServerStartup, not report. I think your change works but it lifts code that currently is only used in ServerManager like findServerWithSameHostnamePort and does the same sort of check as checkAlreadySameHostPort().

Dead region server pulled in from ZK
Key: HBASE-8537
URL: https://issues.apache.org/jira/browse/HBASE-8537
Project: HBase
Issue Type: Bug
Components: master
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
Attachments: trunk-8537.patch, trunk-8537_v2.patch
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656535#comment-13656535 ]
Jimmy Xiang commented on HBASE-8537:
That's my concern too. But checkAlreadySameHostPort throws a PleaseHoldException, which I'd like not to touch, at least for now. (We could do some enhancement here for MTTR.) That's why we went with the current patch. findServerWithSameHostnamePort is a utility in ServerName, which should be fine to use?
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656552#comment-13656552 ]
Jean-Daniel Cryans commented on HBASE-8537:
bq. findServerWithSameHostnamePort is a utility in ServerName, which should be fine to use?
Yes, but I'd rather keep its usage in ServerManager; the way I see it, your patch is leaking a bit of SM functionality into HMaster. What about putting all of this into a new method in SM?
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656673#comment-13656673 ]
Jimmy Xiang commented on HBASE-8537:
We can handle this as a separate issue, for example, when we do the enhancement for MTTR as I mentioned before. For now, I think v2 is very clean.
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656676#comment-13656676 ]
Jean-Daniel Cryans commented on HBASE-8537:
Ok, last thing then, do you think it's possible to write a unit test to make sure we don't break this in the future?
[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
[ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656824#comment-13656824 ]

stack commented on HBASE-8537:
--
Yeah, SM is leaking over into Master... and you have to ask SM for the list of online servers to call findServerWithSameHostnamePort. A bit of copy and paste going on too. Testing for this case would be awkward to rig.
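For readers following the thread, the check under discussion (match on hostname and port, then compare start timestamps) might look roughly like the sketch below. This is not the code from the attached patches: ServerId is a hypothetical stand-in for HBase's ServerName, shouldRegisterFromZk is an invented name, and only findServerWithSameHostnamePort is a name taken from the comments above (its real signature is not shown in this thread, so the one here is an assumption). Accepting the ZK entry when the already-registered server is older is also an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class StaleServerCheck {

    /** Hypothetical stand-in for a server identity: host, port, start timestamp. */
    public static final class ServerId {
        public final String host;
        public final int port;
        public final long startCode;

        public ServerId(String host, int port, long startCode) {
            this.host = host;
            this.port = port;
            this.startCode = startCode;
        }

        public boolean sameHostAndPort(ServerId other) {
            return host.equals(other.host) && port == other.port;
        }
    }

    /** Returns the online server sharing host and port with candidate, or null. */
    public static ServerId findServerWithSameHostnamePort(List<ServerId> online, ServerId candidate) {
        for (ServerId s : online) {
            if (s.sameHostAndPort(candidate)) {
                return s;
            }
        }
        return null;
    }

    /**
     * Decides whether a server found in ZK should be registered.
     * Assumption: an entry is stale if a server on the same host and port
     * has already reported in with an equal or later start timestamp.
     */
    public static boolean shouldRegisterFromZk(List<ServerId> online, ServerId fromZk) {
        ServerId existing = findServerWithSameHostnamePort(online, fromZk);
        return existing == null || existing.startCode < fromZk.startCode;
    }

    public static void main(String[] args) {
        List<ServerId> online = new ArrayList<>();
        // The instance that reported in after the restart (values from the issue's log).
        online.add(new ServerId("a1217.halxg.cloudera.com", 36020, 1368408767773L));
        // The older instance still visible in ZK.
        ServerId fromZk = new ServerId("a1217.halxg.cloudera.com", 36020, 1368378273768L);
        System.out.println(shouldRegisterFromZk(online, fromZk)); // prints "false"
    }
}
```

Keeping the decision in a small method that takes the online list as an argument is one way a unit test could exercise this case without rigging a full cluster restart.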