[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-14 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13657552#comment-13657552
 ] 

stack commented on HBASE-8537:
--

Looks great. Sweet test.  +1

Should the below not throw a YouAreDeadException instead?

+LOG.info("Server serverName=" + serverName +
+  " rejected; we already have " + existingServer.toString() +
+  " registered with same hostname and port");
+return false;



 Dead region server pulled in from ZK
 

 Key: HBASE-8537
 URL: https://issues.apache.org/jira/browse/HBASE-8537
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: trunk-8537.patch, trunk-8537_v2.patch, 
 trunk-8537_v3.patch


 When a cluster restarts quickly after a crash, the master still pulls in the 
 dead region server instance from ZK, even though the new region server 
 instance has already reported in.
 {noformat}
 2013-05-12 18:32:52,996 INFO  [IPC Server handler 6 on 36000] 
 org.apache.hadoop.hbase.master.ServerManager: Registering 
 server=a1217.halxg.cloudera.com,36020,1368408767773
 
 2013-05-12 18:32:54,653 INFO  
 [master-a1220.halxg.cloudera.com,36000,1368408767520] 
 org.apache.hadoop.hbase.master.HMaster: Registering server found up in zk but 
 who has not yet reported in: a1217.halxg.cloudera.com,36020,1368378273768
 2013-05-12 18:32:54,653 INFO  
 [master-a1220.halxg.cloudera.com,36000,1368408767520] 
 org.apache.hadoop.hbase.master.ServerManager: Registering 
 server=a1217.halxg.cloudera.com,36020,1368378273768
 {noformat}
 We should not pull in the second region server instance from ZK; it is 
 actually dead.  We can figure this out from the hostname and the port: we can 
 assume that no two live region server instances share the same host and 
 port.  To be more cautious, we can check the timestamp as well.  The live one 
 should be the one with the later timestamp, not the one pulled in from ZK.
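
 The server names in the log above follow a hostname,port,startcode layout, 
 and the last field looks like a start time in epoch milliseconds.  A minimal 
 sketch of the proposed comparison is below.  ParsedServerName and its methods 
 are hypothetical names made up for illustration; this is not HBase's own 
 ServerName class and not the code in the attached patches.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not HBase's ServerName class. */
final class ParsedServerName {
    final String hostname;
    final int port;
    final long startCode;

    private ParsedServerName(String hostname, int port, long startCode) {
        this.hostname = hostname;
        this.port = port;
        this.startCode = startCode;
    }

    /** Parses "hostname,port,startcode" as it appears in the log lines. */
    static ParsedServerName parse(String s) {
        String[] parts = s.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected hostname,port,startcode: " + s);
        }
        return new ParsedServerName(
            parts[0].toLowerCase(Locale.ROOT),
            Integer.parseInt(parts[1]),
            Long.parseLong(parts[2]));
    }

    /** True if both names refer to the same host and port. */
    boolean sameHostAndPort(ParsedServerName other) {
        return hostname.equals(other.hostname) && port == other.port;
    }

    /** True if this instance shares host and port with other but started earlier. */
    boolean isStaleComparedTo(ParsedServerName other) {
        return sameHostAndPort(other) && startCode < other.startCode;
    }
}

class StaleServerDemo {
    public static void main(String[] args) {
        // The instance that reported in, and the one found in ZK (from the log).
        ParsedServerName reported =
            ParsedServerName.parse("a1217.halxg.cloudera.com,36020,1368408767773");
        ParsedServerName fromZk =
            ParsedServerName.parse("a1217.halxg.cloudera.com,36020,1368378273768");
        System.out.println("ZK entry is stale: " + fromZk.isStaleComparedTo(reported));
    }
}
```

 With the two names from the log, the entry found in ZK has the smaller start 
 code, so under this rule it is the one that should not be registered.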

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-14 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13657571#comment-13657571
 ] 

Jean-Daniel Cryans commented on HBASE-8537:
---

+1, thanks Jimmy!



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-14 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13657576#comment-13657576
 ] 

Jimmy Xiang commented on HBASE-8537:


YouAreDeadException is used when the server is already in the dead server list 
(per checkIsDead).

For now, I am not sure when this would happen.  Suppose a server name is 
created by the master: then the master should already know it, or a dead 
master used to know it, in which case the start code should be old.  To me, 
this case only happens if there is some time skew.  Should we log a warning 
instead?
The difference from the existing logic is that if the start code is larger, 
the new server will be recorded without throwing a PleaseHoldException.
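
The behaviour described in this comment can be sketched roughly as follows.  
This is an illustration under assumptions, not the code from the patch: 
OnlineServers, its map keyed by host and port, and the method names are all 
hypothetical.  Only the rule itself comes from the discussion (reject a 
registration when a newer instance on the same hostname and port is already 
known, otherwise record the new one).  Accepting an equal start code is an 
assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry for illustration; not the ServerManager from the patch. */
final class OnlineServers {
    // Key: "hostname:port"; value: start code of the instance currently recorded.
    private final Map<String, Long> startCodeByHostPort = new HashMap<>();

    /**
     * Records the server unless an instance with a later start code is already
     * known on the same hostname and port.
     *
     * @return true if the server was recorded, false if it was rejected
     */
    synchronized boolean checkAndRecord(String hostname, int port, long startCode) {
        String key = hostname + ":" + port;
        Long existing = startCodeByHostPort.get(key);
        if (existing != null && existing > startCode) {
            // A newer instance is already registered, so the incoming one is stale.
            return false;
        }
        // Nothing known yet, or the incoming instance is at least as new
        // (assumption: an equal start code is treated as the same server).
        startCodeByHostPort.put(key, startCode);
        return true;
    }

    /** Returns the recorded start code for the host and port, or null if none. */
    synchronized Long recordedStartCode(String hostname, int port) {
        return startCodeByHostPort.get(hostname + ":" + port);
    }
}

class OnlineServersDemo {
    public static void main(String[] args) {
        OnlineServers servers = new OnlineServers();
        // The live instance reports in first, then the stale name is read from ZK.
        boolean live = servers.checkAndRecord("a1217.halxg.cloudera.com", 36020, 1368408767773L);
        boolean stale = servers.checkAndRecord("a1217.halxg.cloudera.com", 36020, 1368378273768L);
        System.out.println("live recorded: " + live + ", stale recorded: " + stale);
    }
}
```

A larger start code replaces the recorded entry and returns true, which matches 
the "recorded without throwing a PleaseHoldException" case above; a smaller one 
is rejected, which is the case the log message in the patch reports.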



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13657601#comment-13657601
 ] 

Hadoop QA commented on HBASE-8537:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12583211/trunk-8537_v3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified tests.

{color:green}+1 hadoop1.0{color}.  The patch compiles against the hadoop 
1.0 profile.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5680//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5680//console

This message is automatically generated.



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-14 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13657620#comment-13657620
 ] 

stack commented on HBASE-8537:
--

Let's go w/ the patch as is, Jimmy. +1



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13657766#comment-13657766
 ] 

Hudson commented on HBASE-8537:
---

Integrated in hbase-0.95 #193 (See 
[https://builds.apache.org/job/hbase-0.95/193/])
HBASE-8537 Dead region server pulled in from ZK (Revision 1482636)

 Result = SUCCESS
jxiang : 
Files : 
* 
/hbase/branches/0.95/hbase-client/src/main/java/org/apache/hadoop/hbase/ServerName.java
* 
/hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* 
/hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* 
/hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterNoCluster.java


 Dead region server pulled in from ZK
 

 Key: HBASE-8537
 URL: https://issues.apache.org/jira/browse/HBASE-8537
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Fix For: 0.98.0, 0.95.1

 Attachments: trunk-8537.patch, trunk-8537_v2.patch, 
 trunk-8537_v3.patch




[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13657852#comment-13657852
 ] 

Hudson commented on HBASE-8537:
---

Integrated in HBase-TRUNK #4118 (See 
[https://builds.apache.org/job/HBase-TRUNK/4118/])
HBASE-8537 Dead region server pulled in from ZK (Revision 1482635)

 Result = FAILURE
jxiang : 
Files : 
* 
/hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ServerName.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterNoCluster.java




[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13657855#comment-13657855
 ] 

Hudson commented on HBASE-8537:
---

Integrated in hbase-0.95-on-hadoop2 #99 (See 
[https://builds.apache.org/job/hbase-0.95-on-hadoop2/99/])
HBASE-8537 Dead region server pulled in from ZK (Revision 1482636)

 Result = FAILURE
jxiang : 
Files : 
* 
/hbase/branches/0.95/hbase-client/src/main/java/org/apache/hadoop/hbase/ServerName.java
* 
/hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* 
/hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* 
/hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterNoCluster.java




[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13657874#comment-13657874
 ] 

Hudson commented on HBASE-8537:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #530 (See 
[https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/530/])
HBASE-8537 Dead region server pulled in from ZK (Revision 1482635)

 Result = FAILURE
jxiang : 
Files : 
* 
/hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ServerName.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterNoCluster.java




[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656152#comment-13656152
 ] 

Jimmy Xiang commented on HBASE-8537:


One side effect (not faked :) as in the movie I watched last weekend) is that 
AM tries to assign regions to that dead one, but gets zk events from the new 
one.  It confuses the AM.

 Dead region server pulled in from ZK
 

 Key: HBASE-8537
 URL: https://issues.apache.org/jira/browse/HBASE-8537
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.98.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656161#comment-13656161
 ] 

Jean-Daniel Cryans commented on HBASE-8537:
---

Interesting, I assume you got this on trunk since you marked it affects 0.98? I 
would not expect 0.94 to have this issue.



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656181#comment-13656181
 ] 

Jimmy Xiang commented on HBASE-8537:


Yes, I got this on trunk. There is no trunk in affected versions.  Should we 
leave it blank?



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656187#comment-13656187
 ] 

Jean-Daniel Cryans commented on HBASE-8537:
---

I think the understanding is that 0.98 is trunk at the moment, I just wanted to 
verify you really were on trunk.

So I would consider this bug a regression.

 Dead region server pulled in from ZK
 

 Key: HBASE-8537
 URL: https://issues.apache.org/jira/browse/HBASE-8537
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.98.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor

 When a cluster restarts quickly after it's crashed, although a new region 
 server is reported in, the master still pulls in the dead region server from 
 the zk.
 {noformat}
 2013-05-12 18:32:52,996 INFO  [IPC Server handler 6 on 36000] 
 org.apache.hadoop.hbase.master.ServerManager: Registering 
 server=a1217.halxg.cloudera.com,36020,1368408767773
 
 2013-05-12 18:32:54,653 INFO  
 [master-a1220.halxg.cloudera.com,36000,1368408767520] 
 org.apache.hadoop.hbase.master.HMaster: Registering server found up in zk but 
 who has not yet reported in: a1217.halxg.cloudera.com,36020,1368378273768
 2013-05-12 18:32:54,653 INFO  
 [master-a1220.halxg.cloudera.com,36000,1368408767520] 
 org.apache.hadoop.hbase.master.ServerManager: Registering 
 server=a1217.halxg.cloudera.com,36020,1368378273768
 {noformat}
 We should not pull in the second region server instance from zk.  It is 
 actually dead.  We can figure this out by the hostname, and the port.  We can 
 assume no two region server instances can be alive on the same host, the same 
 port.  To be more cautious, we can check the timestamp as well.  The live one 
 should be that with the late timestamp, not pulled in from zk.



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656188#comment-13656188
 ] 

Jimmy Xiang commented on HBASE-8537:


0.94 has this issue too.  However, the 0.94 AM doesn't check the rs timestamp 
at all, so it doesn't care and isn't confused.



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656206#comment-13656206
 ] 

Jean-Daniel Cryans commented on HBASE-8537:
---

I'm not sure [~jxiang], here's what happens when I test it locally:

{noformat}
2013-05-13 11:18:36,471 INFO org.apache.hadoop.hbase.master.ServerManager: 
Server serverName=172.21.3.117,60020,1368469116206 rejected; we already have 
172.21.3.117,60020,1368469063154 registered with same hostname and port
2013-05-13 11:18:36,471 INFO org.apache.hadoop.hbase.master.ServerManager: 
Triggering server recovery; existingServer 172.21.3.117,60020,1368469063154 
looks stale, new server:172.21.3.117,60020,1368469116206
2013-05-13 11:18:36,472 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
based on AM, current region=-ROOT-,,0.70236052 is on 
server=172.21.3.117,60020,1368469063154 server being checked: 
172.21.3.117,60020,1368469063154
2013-05-13 11:18:36,473 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
based on AM, current region=.META.,,1.1028785192 is on 
server=172.21.3.117,60020,1368469063154 server being checked: 
172.21.3.117,60020,1368469063154
2013-05-13 11:18:36,474 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
Added=172.21.3.117,60020,1368469063154 to dead servers, submitted shutdown 
handler to be executed, root=true, meta=true
{noformat}
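
The server names in these log lines have the form host,port,startcode, and the two instances above differ only in the last field. A minimal sketch of pulling the fields apart and comparing them, assuming the start code is the instance's start timestamp; ServerId is a hypothetical helper for illustration, not HBase's own ServerName class:

```java
// Hypothetical helper for illustration only; not HBase's ServerName.
final class ServerId {
    final String host;
    final int port;
    final long startCode; // assumed to be the start timestamp of the instance

    ServerId(String host, int port, long startCode) {
        this.host = host;
        this.port = port;
        this.startCode = startCode;
    }

    // Parses the "host,port,startcode" form seen in the log lines.
    static ServerId parse(String s) {
        String[] parts = s.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected host,port,startcode: " + s);
        }
        return new ServerId(parts[0], Integer.parseInt(parts[1]), Long.parseLong(parts[2]));
    }

    // Two instances collide when they claim the same host and port.
    boolean sameHostAndPort(ServerId other) {
        return host.equals(other.host) && port == other.port;
    }

    // The instance with the larger start code was started later.
    boolean startedAfter(ServerId other) {
        return startCode > other.startCode;
    }
}
```

With the two names from the log, 172.21.3.117,60020,1368469063154 and 172.21.3.117,60020,1368469116206 share host and port, and the second one started later.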




[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656247#comment-13656247
 ] 

Jimmy Xiang commented on HBASE-8537:


[~jdcryans], the code touched by this patch is very old, and in 0.94 too.  In 
your test, the new region server instance is rejected actually, right? It 
should be fixed.

 Dead region server pulled in from ZK
 

 Key: HBASE-8537
 URL: https://issues.apache.org/jira/browse/HBASE-8537
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.98.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: trunk-8537.patch




[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656253#comment-13656253
 ] 

Jimmy Xiang commented on HBASE-8537:


bq. It should be fixed.
I think it is fine, since the new region server will get a PleaseHoldException.



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656256#comment-13656256
 ] 

Jean-Daniel Cryans commented on HBASE-8537:
---

bq. In your test, the new region server instance is rejected actually, right? 
It should be fixed.

No, it does the right thing. The master figures the old region server is dead 
since a new instance is coming back (from the dead!), so as you can see it 
triggers an SSH, i.e. a ServerShutdownHandler (ServerManager: 
Added=172.21.3.117,60020,1368469063154 to dead servers). This is the rest of 
the log:

{noformat}
2013-05-13 11:18:36,474 INFO 
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs 
for 172.21.3.117,60020,1368469063154
2013-05-13 11:18:36,477 DEBUG org.apache.hadoop.hbase.master.MasterFileSystem: 
Renamed region directory: 
file:/tmp/hbase-jdcryans/hbase/.logs/172.21.3.117,60020,1368469063154-splitting
2013-05-13 11:18:36,477 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
dead splitlog workers [172.21.3.117,60020,1368469063154]
2013-05-13 11:18:36,479 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: 
Scheduling batch of logs to split
2013-05-13 11:18:36,480 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
started splitting logs in 
[file:/tmp/hbase-jdcryans/hbase/.logs/172.21.3.117,60020,1368469063154-splitting]
2013-05-13 11:18:36,485 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: 
put up splitlog task at znode 
/hbase/splitlog/file%3A%2Ftmp%2Fhbase-jdcryans%2Fhbase%2F.logs%2F172.21.3.117%2C60020%2C1368469063154-splitting%2F172.21.3.117%252C60020%252C1368469063154.1368469068703
2013-05-13 11:18:36,486 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: 
task not yet acquired 
/hbase/splitlog/file%3A%2Ftmp%2Fhbase-jdcryans%2Fhbase%2F.logs%2F172.21.3.117%2C60020%2C1368469063154-splitting%2F172.21.3.117%252C60020%252C1368469063154.1368469068703
 ver = 0
2013-05-13 11:18:37,419 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: 
total tasks = 1 unassigned = 1
2013-05-13 11:18:38,420 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: 
total tasks = 1 unassigned = 1
2013-05-13 11:18:39,421 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: 
total tasks = 1 unassigned = 1
2013-05-13 11:18:39,480 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
STARTUP: Server 172.21.3.117,60020,1368469116206 came back up, removed it from 
the dead servers list
2013-05-13 11:18:39,480 INFO org.apache.hadoop.hbase.master.ServerManager: 
Registering server=172.21.3.117,60020,1368469116206
{noformat}

In this case I just killed -9 the region server, not the whole cluster.



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656265#comment-13656265
 ] 

Jimmy Xiang commented on HBASE-8537:


I think I figured out this issue.  The master calls 
initializeZKBasedSystemTrackers after the ServerManager is created.  If a 
region server reports in during this window, it is added to the online servers 
and the reported issue is triggered.  In your test, the region server checked 
in AFTER initializeZKBasedSystemTrackers was called, i.e. after the 
regionserver tracker had started.




[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656266#comment-13656266
 ] 

Jimmy Xiang commented on HBASE-8537:


The right fix would be to change the initialization order, if that is doable.



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656280#comment-13656280
 ] 

Jimmy Xiang commented on HBASE-8537:


Region server tracker requires a server manager.  It's too radical to change 
the initialization order.



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656304#comment-13656304
 ] 

Jean-Daniel Cryans commented on HBASE-8537:
---

What about calling SM.regionServerReport instead of SM.recordNewServer when 
going through the region servers in ZK? It will do the proper checks although 
we might need to add a new condition in SM.checkAlreadySameHostPort to see if 
the one we're passing is the older one (compared to my previous case that I 
posted where the new one is the newer one).
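
To make the two cases concrete, here is a rough sketch of a registration check that handles both: the candidate being the older instance (the server pulled in from ZK) and the candidate being the newer one (the restart case from the earlier log). OnlineServers, Id and registerUnlessStale are hypothetical names for illustration; this is not the ServerManager code or the attached patch, and it assumes the start code is a start timestamp.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the master's list of online region servers.
final class OnlineServers {
    // Minimal server identity: host, port, and start timestamp.
    static final class Id {
        final String host;
        final int port;
        final long startCode;

        Id(String host, int port, long startCode) {
            this.host = host;
            this.port = port;
            this.startCode = startCode;
        }

        boolean sameHostAndPort(Id other) {
            return host.equals(other.host) && port == other.port;
        }
    }

    private final List<Id> online = new ArrayList<>();

    // Returns true if the candidate was registered, false if it was rejected
    // because an instance on the same host and port is at least as new.
    boolean registerUnlessStale(Id candidate) {
        for (int i = 0; i < online.size(); i++) {
            Id existing = online.get(i);
            if (!existing.sameHostAndPort(candidate)) {
                continue;
            }
            if (existing.startCode >= candidate.startCode) {
                return false; // candidate is the older (or the same) instance
            }
            // Existing entry is the stale one; a real master would also
            // schedule recovery for it before dropping it.
            online.remove(i);
            break;
        }
        online.add(candidate);
        return true;
    }

    int size() {
        return online.size();
    }
}
```

With the names from the issue description, registering a1217.halxg.cloudera.com,36020,1368408767773 first and then offering the ZK entry with start code 1368378273768 leaves only the live instance registered. In the opposite order, the newer instance replaces the older one.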



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656336#comment-13656336
 ] 

Jimmy Xiang commented on HBASE-8537:


I have considered this, which can reuse some shared logic.  However, I chose a 
relatively safer route.  OK, let me post a new patch soon.



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656522#comment-13656522
 ] 

Jean-Daniel Cryans commented on HBASE-8537:
---

Actually I meant regionServerStartup, not report.

I think your change works but it lifts code that currently is only used in 
ServerManager like findServerWithSameHostnamePort and does the same sort of 
check as checkAlreadySameHostPort().

 Dead region server pulled in from ZK
 

 Key: HBASE-8537
 URL: https://issues.apache.org/jira/browse/HBASE-8537
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: trunk-8537.patch, trunk-8537_v2.patch




[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656535#comment-13656535
 ] 

Jimmy Xiang commented on HBASE-8537:


That's my concern too. But checkAlreadySameHostPort throws a 
PleaseHoldException, which I'd like not to touch, at least for now. (We could 
do some enhancement here for MTTR.)  That's why we went with the current patch. 
findServerWithSameHostnamePort is a utility in ServerName, which should be fine 
to use?



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656552#comment-13656552
 ] 

Jean-Daniel Cryans commented on HBASE-8537:
---

bq. findServerWithSameHostnamePort is a utility in ServerName, which should be 
fine to use?

Yes, but I'd rather keep its usage in ServerManager; the way I see it, your 
patch is leaking a bit of SM functionality into HMaster. What about putting 
all of this into a new method in SM?



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656673#comment-13656673
 ] 

Jimmy Xiang commented on HBASE-8537:


We can handle this in a separate issue, for example, when we do the 
enhancement for MTTR that I mentioned before. For now, I think v2 is very clean.



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656676#comment-13656676
 ] 

Jean-Daniel Cryans commented on HBASE-8537:
---

Ok, last thing then, do you think it's possible to write a unit test to make 
sure we don't break this in the future?



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

2013-05-13 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656824#comment-13656824
 ] 

stack commented on HBASE-8537:
--

Yeah, SM is leaking over into Master... and you have to ask SM for the list of 
online servers to call findServerWithSameHostnamePort.  A bit of copy and paste 
going on too.

Testing for this case would be awkward to rig.
