[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173025#comment-13173025 ]

Hudson commented on HBASE-5063:
---
Integrated in HBase-TRUNK-security #38 (See [https://builds.apache.org/job/HBase-TRUNK-security/38/])
HBASE-5063 RegionServers fail to report to backup HMaster after primary goes down

stack :
Files :
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java

RegionServers fail to report to backup HMaster after primary goes down.
---
Key: HBASE-5063
URL: https://issues.apache.org/jira/browse/HBASE-5063
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Jonathan Hsieh
Assignee: Jonathan Hsieh
Priority: Critical
Fix For: 0.92.0
Attachments: HBASE-5063.patch, hbase-5063.v2.0.92.patch, hbase-5063.v2.trunk.patch

# Set up a cluster with two HMasters.
# Observe that HM1 is up and that all RSs are in the RegionServer list on the web page.
# Kill (not even -9) the active HMaster.
# Wait for ZK to time out (default 3 minutes).
# Observe that HM2 is now active.

Tables may show up, but RegionServers never report on the web page. Existing connections are fine. New connections cannot find regionservers.

Note:
* If we start a new HM1 in the same place and kill HM2, the cluster functions normally again after recovery. This seems to indicate that regionservers are stuck trying to talk to the old HM1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173063#comment-13173063 ]

Hudson commented on HBASE-5063:
---
Integrated in HBase-TRUNK #2562 (See [https://builds.apache.org/job/HBase-TRUNK/2562/])
HBASE-5063 RegionServers fail to report to backup HMaster after primary goes down

stack :
Files :
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173498#comment-13173498 ]

Hudson commented on HBASE-5063:
---
Integrated in HBase-0.92-security #46 (See [https://builds.apache.org/job/HBase-0.92-security/46/])
HBASE-5063 RegionServers fail to report to backup HMaster after primary goes down

stack :
Files :
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172545#comment-13172545 ]

Zhihong Yu commented on HBASE-5063:
---
@Jonathan: What do you think of Lars' comments?
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172599#comment-13172599 ]

Jonathan Hsieh commented on HBASE-5063:
---
I think it is valid and will address it. I'd like to write a unit test that captures this issue as well (it is odd that TestMasterFailover does not).
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172784#comment-13172784 ]

Lars Hofhansl commented on HBASE-5063:
--
@Jon: If it helps, I'm happy to express my comment as a small patch. I have not thought through all the implications, though.
Up to you, Jon. I'll perfectly understand if you prefer to write more tests and then come up with a tested patch without distractions. :)
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172832#comment-13172832 ]

Jonathan Hsieh commented on HBASE-5063:
---
@Lars I got tied up with something this morning and just started looking at this again.
It will take a significant amount of work to make this testable, so I'm going to punt on writing a test if that is OK. (There is a specific interleaving which I got once but can't seem to easily duplicate.)
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172891#comment-13172891 ]

Hadoop QA commented on HBASE-5063:
--
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12508014/hbase-5063.v2.trunk.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
-1 javadoc. The javadoc tool appears to have generated -152 warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
  org.apache.hadoop.hbase.io.hfile.TestHFileBlock
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/553//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/553//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/553//console

This message is automatically generated.
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172892#comment-13172892 ]

Lars Hofhansl commented on HBASE-5063:
--
+1 on patch. Great find!
I was wondering how you are going to reproduce this in a test :)
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172899#comment-13172899 ]

stack commented on HBASE-5063:
--
Shall I commit? The first test failure is because of an OOME. The second two are a number-format problem but seem unrelated. I'm going to commit.
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172917#comment-13172917 ]

Jonathan Hsieh commented on HBASE-5063:
---
@Stack Here's what I got from a local run:
{code}
Running org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 506.988 sec
Running org.apache.hadoop.hbase.mapreduce.TestTableMapReduce
Running org.apache.hadoop.hbase.mapred.TestTableMapReduce
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 501.953 sec FAILURE!
Running org.apache.hadoop.hbase.io.hfile.TestHFileBlock
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 76.135 sec
{code}
mapreduce.TestTableMapReduce seems to be a hang. (grr... how do I get maven to just spit out all test output instead of waiting for the test to finish?)
mapred.TestTableMapReduce seems to be a failed MR job.
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172921#comment-13172921 ]

stack commented on HBASE-5063:
--
That's not your patch though, Jon?
Yeah, on a hang, fun, fun, fun, no output. Could have maven redirect to stdout rather than to a file.
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172929#comment-13172929 ]

Zhihong Yu commented on HBASE-5063:
---
I ran TestTableMapReduce with Jon's patch and didn't get a test failure.
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172928#comment-13172928 ]

Jonathan Hsieh commented on HBASE-5063:
---
I don't think the failures are due to this patch. The MR ones have been failing recently, so I buy that.
I'd love to know the maven voodoo to make the hanging tests print/save their output...
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172930#comment-13172930 ]

Zhihong Yu commented on HBASE-5063:
---
I think test failures reported by Hadoop QA may have something to do with the PreCommit build machines.
Waiting for Giri to increase ulimit.
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172052#comment-13172052 ]

Zhihong Yu commented on HBASE-5063:
---
bq. I am wondering we shouldn't just fold
I guess what you meant was 'wondering (if) we should just'
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172056#comment-13172056 ]

Lars Hofhansl commented on HBASE-5063:
--
Yes :)
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172002#comment-13172002 ]

Zhihong Yu commented on HBASE-5063:
---
Both TestTableMapReduce tests on TRUNK run successfully on a MacBook.
Recently mapred.TestTableMapReduce.testMultiRegionTable was mysteriously reported as a failure by Hadoop QA (not because of 'Too many open files'). We should find out why.
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172045#comment-13172045 ]

Lars Hofhansl commented on HBASE-5063:
--------------------------------------

Looks like this.masterAddressManager.getMasterAddress() could return null (see first loop), so this could lead to an NPE.

I am wondering whether we shouldn't just fold the check from the first loop (where we get masterServerName) into the second loop and remove the first loop completely.
I.e., if masterServerName is null, continue the loop and sleep for a bit, which means the sleep needs to be pulled out of the try/catch. If masterServerName is not null, try to connect.
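A rough sketch of the single-loop shape suggested above, for illustration only. It is not the HBASE-5063 patch and does not use HBase classes: MasterConnectSketch, Connector, getMaster and the maxAttempts bound are hypothetical names chosen for the sketch, and the master address is modelled as a plain String. The point is that the address is re-read on every pass, a null address just sleeps and retries, and the sleep sits outside the try/catch.

```java
import java.io.IOException;
import java.util.function.Supplier;

class MasterConnectSketch {
    /** Hypothetical stand-in for whatever opens the RPC proxy to a master. */
    interface Connector<T> {
        T connect(String masterAddress) throws IOException;
    }

    /**
     * Single retry loop: look up the current master address on each pass,
     * skip the connect attempt when it is null, and sleep outside try/catch.
     * Returns null if no connection was made within maxAttempts passes.
     */
    static <T> T getMaster(Supplier<String> addressSource, Connector<T> connector,
                           long sleepMillis, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Re-read the address every time so a failover to another master is seen.
            String address = addressSource.get();
            if (address != null) {
                try {
                    return connector.connect(address);
                } catch (IOException e) {
                    // Include the address so the log says which master was refused.
                    System.err.println("Unable to connect to master " + address
                        + ". Retrying. Error was: " + e);
                }
            }
            // Reached both when the address was null and when the connect failed.
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return null;
    }
}
```

In the real region server the loop would presumably be bounded by a stop flag rather than an attempt count; the count is used here only so the sketch terminates.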
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171765#comment-13171765 ]

Jonathan Hsieh commented on HBASE-5063:
---------------------------------------

Here's the exception -- unfortunately it doesn't say which master it is unable to connect to.

{code}
11/12/17 18:50:24 WARN regionserver.HRegionServer: Unable to connect to master. Retrying. Error was:
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
	at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupConnection(HBaseClient.java:328)
	at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:362)
	at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1024)
	at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:876)
	at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
	at $Proxy8.getProtocolVersion(Unknown Source)
	at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:183)
	at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:303)
	at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:280)
	at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:332)
	at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:236)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1616)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:787)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:674)
	at java.lang.Thread.run(Thread.java:619)
{code}
[jira] [Commented] (HBASE-5063) RegionServers fail to report to backup HMaster after primary goes down.
[ https://issues.apache.org/jira/browse/HBASE-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171779#comment-13171779 ]

Hadoop QA commented on HBASE-5063:
----------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12507826/HBASE-5063.patch
  against trunk revision .

    +1 @author. The patch does not contain any @author tags.

    -1 tests included. The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 javadoc. The javadoc tool appears to have generated -152 warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    -1 core tests. The patch failed these unit tests:
                       org.apache.hadoop.hbase.client.TestInstantSchemaChange
                       org.apache.hadoop.hbase.mapred.TestTableMapReduce
                       org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/534//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/534//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/534//console

This message is automatically generated.