[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK


[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13657766#comment-13657766
 ] 

Hudson commented on HBASE-8537:
---

Integrated in hbase-0.95 #193 (See 
[https://builds.apache.org/job/hbase-0.95/193/])
HBASE-8537 Dead region server pulled in from ZK (Revision 1482636)

 Result = SUCCESS
jxiang : 
Files : 
* 
/hbase/branches/0.95/hbase-client/src/main/java/org/apache/hadoop/hbase/ServerName.java
* 
/hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* 
/hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* 
/hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterNoCluster.java


 Dead region server pulled in from ZK
 

 Key: HBASE-8537
 URL: https://issues.apache.org/jira/browse/HBASE-8537
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Fix For: 0.98.0, 0.95.1

 Attachments: trunk-8537.patch, trunk-8537_v2.patch, 
 trunk-8537_v3.patch


 When a cluster restarts quickly after a crash, the master still pulls in the 
 dead region server from zk, even though a new region server has already 
 reported in.
 {noformat}
 2013-05-12 18:32:52,996 INFO  [IPC Server handler 6 on 36000] 
 org.apache.hadoop.hbase.master.ServerManager: Registering 
 server=a1217.halxg.cloudera.com,36020,1368408767773
 
 2013-05-12 18:32:54,653 INFO  
 [master-a1220.halxg.cloudera.com,36000,1368408767520] 
 org.apache.hadoop.hbase.master.HMaster: Registering server found up in zk but 
 who has not yet reported in: a1217.halxg.cloudera.com,36020,1368378273768
 2013-05-12 18:32:54,653 INFO  
 [master-a1220.halxg.cloudera.com,36000,1368408767520] 
 org.apache.hadoop.hbase.master.ServerManager: Registering 
 server=a1217.halxg.cloudera.com,36020,1368378273768
 {noformat}
 We should not pull in the second region server instance from zk; it is 
 actually dead.  We can detect this from the hostname and the port, assuming 
 that no two live region server instances can share the same host and the same 
 port.  To be more cautious, we can check the timestamp as well: the live one 
 should be the one with the later timestamp, not the one pulled in from zk.
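
 The check proposed above can be sketched in Java as follows. This is only a
 minimal illustration, not the committed patch: the class ServerId and its
 methods are hypothetical names, and the only thing assumed from HBase is the
 "hostname,port,startcode" format visible in the log lines above.

```java
/** Hypothetical stand-in for a server identity of the form "host,port,startcode". */
final class ServerId {
    final String host;
    final int port;
    final long startCode; // start timestamp, as it appears in the log lines

    ServerId(String host, int port, long startCode) {
        this.host = host;
        this.port = port;
        this.startCode = startCode;
    }

    /** Parses "host,port,startcode"; throws IllegalArgumentException on bad input. */
    static ServerId parse(String s) {
        String[] parts = s.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected host,port,startcode: " + s);
        }
        return new ServerId(parts[0], Integer.parseInt(parts[1]), Long.parseLong(parts[2]));
    }

    boolean sameHostAndPort(ServerId other) {
        return host.equals(other.host) && port == other.port;
    }

    /** True if 'fromZk' is an older instance of the already registered server. */
    static boolean isStale(ServerId fromZk, ServerId registered) {
        return fromZk.sameHostAndPort(registered) && fromZk.startCode < registered.startCode;
    }
}
```

 With the two names from the log, the instance found in zk (start code
 1368378273768) is reported as stale relative to the one that reported in
 (start code 1368408767773), so it would be skipped.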

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK


[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13657852#comment-13657852
 ] 

Hudson commented on HBASE-8537:
---

Integrated in HBase-TRUNK #4118 (See 
[https://builds.apache.org/job/HBase-TRUNK/4118/])
HBASE-8537 Dead region server pulled in from ZK (Revision 1482635)

 Result = FAILURE
jxiang : 
Files : 
* 
/hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ServerName.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterNoCluster.java



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK


[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13657855#comment-13657855
 ] 

Hudson commented on HBASE-8537:
---

Integrated in hbase-0.95-on-hadoop2 #99 (See 
[https://builds.apache.org/job/hbase-0.95-on-hadoop2/99/])
HBASE-8537 Dead region server pulled in from ZK (Revision 1482636)

 Result = FAILURE
jxiang : 
Files : 
* 
/hbase/branches/0.95/hbase-client/src/main/java/org/apache/hadoop/hbase/ServerName.java
* 
/hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* 
/hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* 
/hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterNoCluster.java



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK


[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13657874#comment-13657874
 ] 

Hudson commented on HBASE-8537:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #530 (See 
[https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/530/])
HBASE-8537 Dead region server pulled in from ZK (Revision 1482635)

 Result = FAILURE
jxiang : 
Files : 
* 
/hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ServerName.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterNoCluster.java



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK


[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656152#comment-13656152
 ] 

Jimmy Xiang commented on HBASE-8537:


One side effect (not faked :) as in the movie I watched last weekend) is that 
the AM (AssignmentManager) tries to assign regions to the dead server, but 
gets zk events from the new one, which confuses the AM.

 Dead region server pulled in from ZK
 

 Key: HBASE-8537
 URL: https://issues.apache.org/jira/browse/HBASE-8537
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.98.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor


[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

[
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656161#comment-13656161
]

Jean-Daniel Cryans commented on HBASE-8537:
---

Interesting, I assume you got this on trunk since you marked it affects 0.98? I
would not expect 0.94 to have this issue.


[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

[
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656181#comment-13656181
]

Jimmy Xiang commented on HBASE-8537:

Yes, I got this on trunk. There is no trunk in affected versions. Should we
leave it blank?


[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK


[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656187#comment-13656187
 ] 

Jean-Daniel Cryans commented on HBASE-8537:
---

I think the understanding is that 0.98 is trunk at the moment; I just wanted to 
verify that you really were on trunk.

So I would consider this bug a regression.


[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

[
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656188#comment-13656188
]

Jimmy Xiang commented on HBASE-8537:

0.94 has this issue too. However, the 0.94 AM doesn't check the rs timestamp
at all, so it doesn't care and isn't confused.


[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK


[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656206#comment-13656206
 ] 

Jean-Daniel Cryans commented on HBASE-8537:
---

I'm not sure [~jxiang], here's what happens when I test it locally:

{noformat}
2013-05-13 11:18:36,471 INFO org.apache.hadoop.hbase.master.ServerManager: 
Server serverName=172.21.3.117,60020,1368469116206 rejected; we already have 
172.21.3.117,60020,1368469063154 registered with same hostname and port
2013-05-13 11:18:36,471 INFO org.apache.hadoop.hbase.master.ServerManager: 
Triggering server recovery; existingServer 172.21.3.117,60020,1368469063154 
looks stale, new server:172.21.3.117,60020,1368469116206
2013-05-13 11:18:36,472 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
based on AM, current region=-ROOT-,,0.70236052 is on 
server=172.21.3.117,60020,1368469063154 server being checked: 
172.21.3.117,60020,1368469063154
2013-05-13 11:18:36,473 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
based on AM, current region=.META.,,1.1028785192 is on 
server=172.21.3.117,60020,1368469063154 server being checked: 
172.21.3.117,60020,1368469063154
2013-05-13 11:18:36,474 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
Added=172.21.3.117,60020,1368469063154 to dead servers, submitted shutdown 
handler to be executed, root=true, meta=true
{noformat}



[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

[
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656247#comment-13656247
]

Jimmy Xiang commented on HBASE-8537:

[~jdcryans], the code touched by this patch is very old, and in 0.94 too. In
your test, the new region server instance is rejected actually, right? It
should be fixed.


[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

[
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656253#comment-13656253
]

Jimmy Xiang commented on HBASE-8537:

bq. It should be fixed.
I think it is fine, since the new region server will get a PleaseHoldException.


[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK


[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656256#comment-13656256
 ] 

Jean-Daniel Cryans commented on HBASE-8537:
---

bq. In your test, the new region server instance is rejected actually, right? 
It should be fixed.

No, it does the right thing. The master figures the old region server is dead 
since a new instance is coming back (from the dead!), so as you can see it 
triggers an SSH, i.e. a ServerShutdownHandler (ServerManager: 
Added=172.21.3.117,60020,1368469063154 to dead servers). This is the rest of 
the log:

{noformat}
2013-05-13 11:18:36,474 INFO 
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs 
for 172.21.3.117,60020,1368469063154
2013-05-13 11:18:36,477 DEBUG org.apache.hadoop.hbase.master.MasterFileSystem: 
Renamed region directory: 
file:/tmp/hbase-jdcryans/hbase/.logs/172.21.3.117,60020,1368469063154-splitting
2013-05-13 11:18:36,477 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
dead splitlog workers [172.21.3.117,60020,1368469063154]
2013-05-13 11:18:36,479 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: 
Scheduling batch of logs to split
2013-05-13 11:18:36,480 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
started splitting logs in 
[file:/tmp/hbase-jdcryans/hbase/.logs/172.21.3.117,60020,1368469063154-splitting]
2013-05-13 11:18:36,485 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: 
put up splitlog task at znode 
/hbase/splitlog/file%3A%2Ftmp%2Fhbase-jdcryans%2Fhbase%2F.logs%2F172.21.3.117%2C60020%2C1368469063154-splitting%2F172.21.3.117%252C60020%252C1368469063154.1368469068703
2013-05-13 11:18:36,486 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: 
task not yet acquired 
/hbase/splitlog/file%3A%2Ftmp%2Fhbase-jdcryans%2Fhbase%2F.logs%2F172.21.3.117%2C60020%2C1368469063154-splitting%2F172.21.3.117%252C60020%252C1368469063154.1368469068703
 ver = 0
2013-05-13 11:18:37,419 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: 
total tasks = 1 unassigned = 1
2013-05-13 11:18:38,420 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: 
total tasks = 1 unassigned = 1
2013-05-13 11:18:39,421 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: 
total tasks = 1 unassigned = 1
2013-05-13 11:18:39,480 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
STARTUP: Server 172.21.3.117,60020,1368469116206 came back up, removed it from 
the dead servers list
2013-05-13 11:18:39,480 INFO org.apache.hadoop.hbase.master.ServerManager: 
Registering server=172.21.3.117,60020,1368469116206
{noformat}

In this case I just killed -9 the region server, not the whole cluster.


[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

[
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656265#comment-13656265
]

Jimmy Xiang commented on HBASE-8537:

I think I figured out this issue. The master calls
initializeZKBasedSystemTrackers after the ServerManager is created. If a region
server reports in during this window, it is added to the online servers and the
reported issue is triggered. In your test, the region server checked in AFTER
initializeZKBasedSystemTrackers was called, i.e. after the regionserver tracker
had started.
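
The ordering problem described here can be illustrated with a small,
self-contained Java sketch. It does not use any HBase classes:
StartupOrderDemo, reportIn and pullInFromZk are hypothetical names that only
mimic the sequence in the reported log, under the assumption that the old
instance's znode is still present in zk when the trackers start.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical model of the startup window; not HBase code. */
final class StartupOrderDemo {
    private final Set<String> onlineServers = new LinkedHashSet<>();

    /** A region server reports in over RPC. */
    void reportIn(String serverName) {
        onlineServers.add(serverName);
    }

    /** Unchecked variant: registers every name found in zk that is not yet online. */
    void pullInFromZk(Set<String> znodes) {
        for (String name : znodes) {
            if (!onlineServers.contains(name)) {
                onlineServers.add(name); // a stale instance slips in here
            }
        }
    }

    Set<String> online() {
        return onlineServers;
    }

    /** Replays the two registrations seen in the reported log. */
    static Set<String> replayReportedSequence() {
        StartupOrderDemo master = new StartupOrderDemo();
        // 1. The new instance reports in before the ZK-based trackers are initialized.
        master.reportIn("a1217.halxg.cloudera.com,36020,1368408767773");
        // 2. The trackers start (assumed: the old znode is still listed in zk).
        Set<String> znodes = new LinkedHashSet<>();
        znodes.add("a1217.halxg.cloudera.com,36020,1368378273768");
        master.pullInFromZk(znodes);
        return master.online();
    }
}
```

In this sketch replayReportedSequence() ends up with both names online for the
same host and port, because the full names differ only in the start code and
the membership test compares full names.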


[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK


[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656266#comment-13656266
 ] 

Jimmy Xiang commented on HBASE-8537:


The right fix is to change the initialization order, if it is doable?


[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

[
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656280#comment-13656280
]

Jimmy Xiang commented on HBASE-8537:

Region server tracker requires a server manager. It's too radical to change
the initialization order.


[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK


[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656304#comment-13656304
 ] 

Jean-Daniel Cryans commented on HBASE-8537:
---

What about calling SM.regionServerReport instead of SM.recordNewServer when 
going through the region servers in ZK? It will do the proper checks although 
we might need to add a new condition in SM.checkAlreadySameHostPort to see if 
the one we're passing is the older one (compared to my previous case that I 
posted where the new one is the newer one).
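
As a rough sketch of the two cases distinguished above (the incoming instance
is older than the registered one, versus the incoming instance is newer), the
following Java class shows one way such a check could be structured.
CheckedRegistry, Outcome and register are hypothetical names, not the
ServerManager API, and the sketch is simplified: it replaces a stale entry
immediately, whereas in the log earlier in this thread the master first
triggers server recovery for the stale instance.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry keyed by "host:port"; illustrates the checks only. */
final class CheckedRegistry {
    enum Outcome { REGISTERED, IGNORED_NOT_NEWER, REPLACED_STALE }

    // host:port -> start code of the instance currently considered live
    private final Map<String, Long> liveByHostPort = new HashMap<>();

    Outcome register(String host, int port, long startCode) {
        String key = host + ":" + port;
        Long existing = liveByHostPort.get(key);
        if (existing == null) {
            liveByHostPort.put(key, startCode);
            return Outcome.REGISTERED;
        }
        if (startCode <= existing) {
            // Incoming instance is not newer, e.g. a leftover znode: ignore it.
            return Outcome.IGNORED_NOT_NEWER;
        }
        // Incoming instance is newer: the existing one looks stale and is replaced.
        liveByHostPort.put(key, startCode);
        return Outcome.REPLACED_STALE;
    }

    /** Start code of the live instance on host:port, or null if none is registered. */
    Long liveStartCode(String host, int port) {
        return liveByHostPort.get(host + ":" + port);
    }
}
```

Registering 36020/1368408767773 first and then offering 36020/1368378273768
from zk yields IGNORED_NOT_NEWER, which is the new condition; registering
60020/1368469063154 and then 60020/1368469116206 yields REPLACED_STALE, which
corresponds to the earlier kill -9 case.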


[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

[
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656336#comment-13656336
]

Jimmy Xiang commented on HBASE-8537:

I have considered this, which can reuse some shared logic. However, I chose a
relatively safer route. OK, let me post a new patch soon.


[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK


[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656522#comment-13656522
 ] 

Jean-Daniel Cryans commented on HBASE-8537:
---

Actually I meant regionServerStartup, not report.

I think your change works, but it lifts code that is currently only used in 
ServerManager, like findServerWithSameHostnamePort, and does the same sort of 
check as checkAlreadySameHostPort().

 Dead region server pulled in from ZK
 

 Key: HBASE-8537
 URL: https://issues.apache.org/jira/browse/HBASE-8537
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: trunk-8537.patch, trunk-8537_v2.patch


 When a cluster restarts quickly after it's crashed, although a new region 
 server is reported in, the master still pulls in the dead region server from 
 the zk.
 {noformat}
 2013-05-12 18:32:52,996 INFO  [IPC Server handler 6 on 36000] 
 org.apache.hadoop.hbase.master.ServerManager: Registering 
 server=a1217.halxg.cloudera.com,36020,1368408767773
 
 2013-05-12 18:32:54,653 INFO  
 [master-a1220.halxg.cloudera.com,36000,1368408767520] 
 org.apache.hadoop.hbase.master.HMaster: Registering server found up in zk but 
 who has not yet reported in: a1217.halxg.cloudera.com,36020,1368378273768
 2013-05-12 18:32:54,653 INFO  
 [master-a1220.halxg.cloudera.com,36000,1368408767520] 
 org.apache.hadoop.hbase.master.ServerManager: Registering 
 server=a1217.halxg.cloudera.com,36020,1368378273768
 {noformat}
 We should not pull in the second region server instance from ZK; it is 
 actually dead. We can detect this from the hostname and the port, assuming 
 that no two region server instances can be alive on the same host and port. 
 To be more cautious, we can check the timestamp as well: the live instance 
 should be the one with the later timestamp, not the one pulled in from ZK.
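
 The check described above can be sketched roughly as follows. This is only 
 an illustration under stated assumptions, not the attached patch: the class 
 DeadServerCheck, the nested Server type, and the method isStale are 
 hypothetical names and are not part of the HBase API. The sketch treats a 
 server found in ZK as stale when an already registered server has the same 
 host and port and a later start timestamp.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the HBase ServerName/ServerManager API. */
final class DeadServerCheck {

    /** Minimal stand-in for a server identity: host, port, and start timestamp. */
    static final class Server {
        final String host;
        final int port;
        final long startCode;

        Server(String host, int port, long startCode) {
            this.host = host;
            this.port = port;
            this.startCode = startCode;
        }

        /** True when both servers share the same host (case-insensitive) and port. */
        boolean sameHostAndPort(Server other) {
            return host.equalsIgnoreCase(other.host) && port == other.port;
        }

        @Override
        public String toString() {
            return host + "," + port + "," + startCode;
        }
    }

    /**
     * Returns true if the server read from ZK should not be registered because
     * an online server with the same host and port has a later start timestamp.
     */
    static boolean isStale(List<Server> online, Server fromZk) {
        for (Server s : online) {
            if (s.sameHostAndPort(fromZk) && s.startCode > fromZk.startCode) {
                return true;
            }
        }
        return false;
    }
}
```

 With the two instances from the log above, the instance with start code 
 1368378273768 would be reported as stale once the instance with start code 
 1368408767773 has registered.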

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK

[
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656535#comment-13656535
]

Jimmy Xiang commented on HBASE-8537:

That's my concern too. But checkAlreadySameHostPort throws a
PleaseHoldException, which I'd rather not touch, at least for now. (We could
do some enhancement here for MTTR.) That's why we went with the current patch.
findServerWithSameHostnamePort is a utility in ServerName, which should be fine
to use?

[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK


[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656552#comment-13656552
 ] 

Jean-Daniel Cryans commented on HBASE-8537:
---

bq. findServerWithSameHostnamePort is a utility in ServerName, which should be 
fine to use?

Yes, but I'd rather keep its usage in ServerManager; the way I see it, your 
patch is leaking a bit of SM functionality into HMaster. What about putting 
all of this into a new method in SM?

[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK


[ 
https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656673#comment-13656673
 ] 

Jimmy Xiang commented on HBASE-8537:


We can handle this as a separate issue, for example, when we do the 
enhancement for MTTR that I mentioned before. For now, I think v2 is very 
clean.

[jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK