[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-20 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173025#comment-13173025
 ] 

Hudson commented on HBASE-5063:
---

Integrated in HBase-TRUNK-security #38 (See 
[https://builds.apache.org/job/HBase-TRUNK-security/38/])
HBASE-5063 RegionServers fail to report to backup HMaster after primary 
goes down

stack : 
Files : 
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java


 RegionServers fail to report to backup HMaster after primary goes down.
 ---

 Key: HBASE-5063
 URL: https://issues.apache.org/jira/browse/HBASE-5063
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Jonathan Hsieh
Assignee: Jonathan Hsieh
Priority: Critical
 Fix For: 0.92.0

 Attachments: HBASE-5063.patch, hbase-5063.v2.0.92.patch, 
 hbase-5063.v2.trunk.patch


 # Set up a cluster with two HMasters.
 # Observe that HM1 is up and that all RSs are in the RegionServer list on the 
 web page.
 # Kill (not even -9) the active HMaster.
 # Wait for ZK to time out (default 3 minutes).
 # Observe that HM2 is now active.  Tables may show up, but RegionServers never 
 report on the web page.  Existing connections are fine.  New connections cannot 
 find regionservers.
 Note: 
 * If we start a new HM1 in the same place and kill HM2, the cluster 
 functions normally again after recovery.  This seems to indicate that 
 regionservers are stuck trying to talk to the old HM1.
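
 The suspected failure mode in the note above can be illustrated with a small, 
 self-contained Java sketch.  It is not the HRegionServer code and not the 
 attached patch: the class, the methods, and the in-memory "active master" 
 record are all hypothetical stand-ins, assuming only that a reporter which 
 keeps using the address it resolved at startup will never reach a newly 
 active master, while one that looks the address up again after a failed 
 report will.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical illustration only; none of these names come from HBase. */
public class MasterFailoverSketch {

    /** Stand-in for the ZooKeeper-backed record of which master is active. */
    public static final AtomicReference<String> activeMaster =
            new AtomicReference<>("HM1");

    /** Stand-in for a report RPC: it succeeds only against the active master. */
    public static boolean report(String target) {
        return target.equals(activeMaster.get());
    }

    /** Suspected buggy pattern: keep reporting to the address resolved once. */
    public static int reportWithCachedAddress(String cached, int attempts) {
        int successes = 0;
        for (int i = 0; i < attempts; i++) {
            if (report(cached)) {
                successes++;
            }
        }
        return successes;
    }

    /** Alternative pattern: after a failed report, look up the master again. */
    public static int reportWithReResolve(String cached, int attempts) {
        int successes = 0;
        String target = cached;
        for (int i = 0; i < attempts; i++) {
            if (report(target)) {
                successes++;
            } else {
                target = activeMaster.get(); // re-resolve the active master
            }
        }
        return successes;
    }

    public static void main(String[] args) {
        String cached = activeMaster.get(); // resolved while HM1 was active
        activeMaster.set("HM2");            // HM1 is killed, HM2 takes over
        System.out.println("cached address:  " + reportWithCachedAddress(cached, 3));
        System.out.println("with re-resolve: " + reportWithReResolve(cached, 3));
        activeMaster.set("HM1");            // a new HM1 comes back in the same place
        System.out.println("cached, HM1 back: " + reportWithCachedAddress(cached, 3));
    }
}
```

 In this toy model the cached-address reporter only succeeds again once a 
 master reappears at the old HM1 address, which matches the recovery behaviour 
 described in the note; whether the real regionserver behaves this way is 
 exactly what the issue sets out to determine.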

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-20 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173063#comment-13173063
 ] 

Hudson commented on HBASE-5063:
---

Integrated in HBase-TRUNK #2562 (See 
[https://builds.apache.org/job/HBase-TRUNK/2562/])
HBASE-5063 RegionServers fail to report to backup HMaster after primary 
goes down

stack : 
Files : 
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java






[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-20 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173498#comment-13173498
 ] 

Hudson commented on HBASE-5063:
---

Integrated in HBase-0.92-security #46 (See 
[https://builds.apache.org/job/HBase-0.92-security/46/])
HBASE-5063 RegionServers fail to report to backup HMaster after primary 
goes down

stack : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* 
/hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* 
/hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java






[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-19 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172545#comment-13172545
 ] 

Zhihong Yu commented on HBASE-5063:
---

@Jonathan:
What do you think of Lars' comments?





[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-19 Thread Jonathan Hsieh (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172599#comment-13172599
 ] 

Jonathan Hsieh commented on HBASE-5063:
---

I think it is valid and will address it.  I'd like to write a unit test that 
captures this issue as well (it is odd that TestMasterFailover does not).







[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-19 Thread Lars Hofhansl (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172784#comment-13172784
 ] 

Lars Hofhansl commented on HBASE-5063:
--

@Jon:
If it helps I'm happy to express my comment as a small patch. I have not 
thought through all the implications, though.
Up to you Jon. I'll perfectly understand if you prefer to write more tests and 
then come up with a tested patch without distractions. :)






[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-19 Thread Jonathan Hsieh (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172832#comment-13172832
 ] 

Jonathan Hsieh commented on HBASE-5063:
---

@Lars

I got tied up with something this morning and just started looking at this 
again.  It will take a significant amount of work to make this testable, so I'm 
going to punt on writing a test, if that is OK.  (There is a specific 
interleaving which I got once but can't seem to easily duplicate.)







[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-19 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172891#comment-13172891
 ] 

Hadoop QA commented on HBASE-5063:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12508014/hbase-5063.v2.trunk.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -152 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 76 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these unit tests:
  org.apache.hadoop.hbase.io.hfile.TestHFileBlock
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/553//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/553//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/553//console

This message is automatically generated.





[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-19 Thread Lars Hofhansl (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172892#comment-13172892
 ] 

Lars Hofhansl commented on HBASE-5063:
--

+1 on patch. Great find!
I was wondering how you are going to reproduce this in a test :)






[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-19 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172899#comment-13172899
 ] 

stack commented on HBASE-5063:
--

Shall I commit?  The first test failure is because of an OOME.  The second two 
are a number-format problem, but that seems unrelated.  I'm going to commit.





[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-19 Thread Jonathan Hsieh (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172917#comment-13172917
 ] 

Jonathan Hsieh commented on HBASE-5063:
---

@Stack

Here's what I got from a local run:

{code}
Running org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 506.988 sec
Running org.apache.hadoop.hbase.mapreduce.TestTableMapReduce
Running org.apache.hadoop.hbase.mapred.TestTableMapReduce
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 501.953 sec  FAILURE!
Running org.apache.hadoop.hbase.io.hfile.TestHFileBlock
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 76.135 sec
{code}

mapreduce.TestTableMapReduce seems to be a hang.  (grr... how do I get maven 
just to spit out all test output instead of waiting for the test to finish)
mapred.TestTableMapReduce seems to be a failed MR job.






[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-19 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172921#comment-13172921
 ] 

stack commented on HBASE-5063:
--

That's not your patch though, Jon?

Yeah, on hang, fun, fun, fun, no output.  Could have maven redirect to stdout 
rather than file.





[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-19 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172929#comment-13172929
 ] 

Zhihong Yu commented on HBASE-5063:
---

I ran TestTableMapReduce and didn't get test failure, with Jon's patch.





[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-19 Thread Jonathan Hsieh (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172928#comment-13172928
 ] 

Jonathan Hsieh commented on HBASE-5063:
---

I don't think the failures are due to this patch.  The MR ones have been failing 
recently, so I buy that.

I'd love to know the maven voodoo to make the hanging tests print/save their 
output...





[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-19 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172930#comment-13172930
 ] 

Zhihong Yu commented on HBASE-5063:
---

I think test failures reported by Hadoop QA may have something to do with the 
PreCommit build machines.
Waiting for Giri to increase ulimit.





[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-18 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172052#comment-13172052
 ] 

Zhihong Yu commented on HBASE-5063:
---

bq. I am wondering we shouldn't just fold
I guess what you meant was 'wondering (if) we should just'





[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-18 Thread Lars Hofhansl (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172056#comment-13172056
 ] 

Lars Hofhansl commented on HBASE-5063:
--

Yes :)





[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-18 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172002#comment-13172002
 ] 

Zhihong Yu commented on HBASE-5063:
---

Both TestTableMapReduce tests on TRUNK run successfully on my MacBook.

Recently mapred.TestTableMapReduce.testMultiRegionTable has been reported as a 
failure by Hadoop QA for no obvious reason (not because of 'Too many open 
files'). We should find out why.






[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-18 Thread Lars Hofhansl (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172045#comment-13172045
 ] 

Lars Hofhansl commented on HBASE-5063:
--

Looks like this.masterAddressManager.getMasterAddress() could return null (see 
the first loop), so this could lead to an NPE.

I am wondering whether we shouldn't just fold the check from the first loop 
(where we get masterServerName) into the second loop and remove the first loop 
completely. That is, if masterServerName is null, continue the loop and sleep 
for a bit, which means that the sleep needs to be pulled out of the try/catch. 
If masterServerName is not null, try to connect.
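
A minimal sketch of the single-loop shape described above, not the actual HRegionServer code. The AddressSource and Connector interfaces, the connectToMaster helper, the host names and the bounded attempt count are all hypothetical stand-ins chosen to keep the example self-contained; the real loop presumably keeps retrying until the server is stopped.

```java
import java.io.IOException;

final class MasterRetrySketch {
    /** Hypothetical stand-in for the master address lookup; may return null. */
    interface AddressSource {
        String getMasterAddress();
    }

    /** Hypothetical stand-in for opening the RPC proxy to the master. */
    interface Connector {
        String connect(String address) throws IOException;
    }

    /** Returns the connection, or null if every attempt failed or we were interrupted. */
    static String connectToMaster(AddressSource source, Connector connector,
                                  long sleepMillis, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Re-read the address on every pass so a new active master is picked up.
            String masterServerName = source.getMasterAddress();
            if (masterServerName != null) {
                try {
                    return connector.connect(masterServerName);
                } catch (IOException e) {
                    System.out.println("Unable to connect to master at "
                        + masterServerName + ". Retrying. Error was: " + e);
                }
            }
            // The sleep sits outside the try/catch, so both the "no master yet"
            // case and the "connect failed" case wait before the next pass.
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulated lookups: no master registered, then the dead HM1, then HM2.
        String[] answers = {null, "hm1:60000", "hm2:60000"};
        int[] next = {0};
        AddressSource source =
            () -> answers[Math.min(next[0]++, answers.length - 1)];
        Connector connector = address -> {
            if (address.startsWith("hm1")) {
                throw new IOException("Connection refused");
            }
            return "connected to " + address;
        };
        System.out.println(connectToMaster(source, connector, 10L, 5));
    }
}
```

Because the address is looked up inside the loop, a null result and a refused connection are handled the same way: wait, then ask again.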







[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-17 Thread Jonathan Hsieh (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171765#comment-13171765
 ] 

Jonathan Hsieh commented on HBASE-5063:
---

Here's the exception -- unfortunately it doesn't say which master it is unable 
to connect to.

{code}
11/12/17 18:50:24 WARN regionserver.HRegionServer: Unable to connect to master. Retrying. Error was:
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupConnection(HBaseClient.java:328)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:362)
    at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1024)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:876)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
    at $Proxy8.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:183)
    at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:303)
    at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:280)
    at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:332)
    at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:236)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1616)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:787)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:674)
    at java.lang.Thread.run(Thread.java:619)
{code}
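
One way the warning could name the master it failed to reach, as a rough sketch only. The describeFailure helper, the class name and the host name are hypothetical and are not taken from the patch; the message wording simply mirrors the log line above.

```java
import java.net.ConnectException;
import java.net.InetSocketAddress;

final class MasterConnectLogSketch {
    /** Hypothetical helper: builds the warning text with the target address included. */
    static String describeFailure(InetSocketAddress master, Exception cause) {
        return "Unable to connect to master at " + master.getHostString()
            + ":" + master.getPort() + ". Retrying. Error was: " + cause;
    }

    public static void main(String[] args) {
        // createUnresolved avoids a DNS lookup for this made-up host name.
        InetSocketAddress master =
            InetSocketAddress.createUnresolved("hm1.example.com", 60000);
        System.out.println(
            describeFailure(master, new ConnectException("Connection refused")));
    }
}
```

With the address in the message, a regionserver that keeps retrying the old HM1 would show up directly in its log.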






[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.

2011-12-17 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171779#comment-13171779
 ] 

Hadoop QA commented on HBASE-5063:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12507826/HBASE-5063.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified tests.
    Please justify why no new tests are needed for this patch.
    Also please list what manual steps were performed to verify this patch.

-1 javadoc.  The javadoc tool appears to have generated -152 warning messages.

+1 javac.  The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs.  The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of release audit warnings.

-1 core tests.  The patch failed these unit tests:
    org.apache.hadoop.hbase.client.TestInstantSchemaChange
    org.apache.hadoop.hbase.mapred.TestTableMapReduce
    org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/534//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/534//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/534//console

This message is automatically generated.

