[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284327#comment-13284327 ]

ramkrishna.s.vasudevan commented on HBASE-5682:

@Lars See HBASE-6115. Because we no longer wait for the root location to come up, we now get an NPE.

Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)

Key: HBASE-5682
URL: https://issues.apache.org/jira/browse/HBASE-5682
Project: HBase
Issue Type: Improvement
Components: client
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
Priority: Critical
Fix For: 0.94.0
Attachments: 5682-all-v2.txt, 5682-all-v3.txt, 5682-all-v4.txt, 5682-all.txt, 5682-v2.txt, 5682.txt

Just realized that without this HBASE-4805 is broken. I.e. there's no point keeping a persistent HConnection around if it can be rendered permanently unusable if the ZK connection is lost temporarily.
Note that this is fixed in 0.96 with HBASE-5399 (but that seems too big to backport)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246904#comment-13246904 ]

Hudson commented on HBASE-5682:

Integrated in HBase-0.94-security #7 (See [https://builds.apache.org/job/HBase-0.94-security/7/])
HBASE-5682 Allow HConnectionImplementation to recover from ZK connection loss (Revision 1308596)

Result = SUCCESS
larsh :
Files :
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/TestZooKeeper.java
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243992#comment-13243992 ]

Lars Hofhansl commented on HBASE-5682:

all-v3 is what I'd like to commit tomorrow if there are no objections.
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244093#comment-13244093 ]

Jieshan Bean commented on HBASE-5682:

Everything looks good to me. Only a minor doubt: is it necessary to close zooKeeper before setting it to null?
If HConnectionImplementation#managed is true, HConnectionImplementation#abort doesn't set closed to true; it just calls the close method. That makes sense to me :). So the retry logic introduced in HBASE-5153 seems redundant.
If someone wants to manage the connection themselves and the connection is aborted, we should suggest recreating the HConnection and HTable, right?
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244359#comment-13244359 ]

Lars Hofhansl commented on HBASE-5682:

Presumably close is not needed since the connection is known to be down in this case. To be safe, I'll add that, and make sure it doesn't cause another hang.
I think this is better than HBASE-5153, because it attempts to reconnect when the connection is needed and not when it was lost (in which case it is likely that the next retry will fail as well, leading to long hangs with no chance for the caller to notice).
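
Editorial sketch of the "reconnect when the connection is needed, not when it was lost" idea from the comment above. This is not the code from the attached patch: `LazyTrackerHolder`, `reset()`, and `ensure()` are hypothetical names, and the `Supplier` stands in for whatever creates the ZK-backed trackers.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a holder of ZK-backed trackers; not an HBase class. */
final class LazyTrackerHolder<T> {
    private final Supplier<T> factory; // creates a fresh tracker; may throw while ZK is down
    private T tracker;                 // null means "lost, rebuild on next use"

    LazyTrackerHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Abort path: drop the tracker but leave the holder itself usable. */
    synchronized void reset() {
        tracker = null;
    }

    /** Called before every use: rebuilds only if an earlier reset() cleared the tracker. */
    synchronized T ensure() {
        if (tracker == null) {
            // If the factory throws, tracker stays null and the next caller retries.
            tracker = factory.get();
        }
        return tracker;
    }
}
```

With this shape, a lost ZK session costs nothing until the next caller needs the tracker, and a failed rebuild surfaces to that caller instead of hanging in the background.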
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244363#comment-13244363 ]

Lars Hofhansl commented on HBASE-5682:

Oh, and thanks for taking a look Jieshan :)
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244367#comment-13244367 ]

stack commented on HBASE-5682:

@Nkeywal How's this relate to your TRUNK work (if at all)?
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244420#comment-13244420 ]

Lars Hofhansl commented on HBASE-5682:

One other strangeness I found is that none of the ZKUtil methods actually throw exceptions. They retry (via RecoverableZooKeeper) and then just log a message if there is a failure.
This is especially annoying with ZooKeeperWatcher, because there is no way of telling from the outside whether the connection succeeded or not.
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244436#comment-13244436 ]

stack commented on HBASE-5682:

Can we add an isAlive to ZKW?
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1320#comment-1320 ]

nkeywal commented on HBASE-5682:

bq. none of ZKUtil methods actually throw exceptions

From what I see on 0.96 it should, as the return is not reached: the pattern is to call keeperException, and keeperException throws an exception.

{noformat}
public void keeperException(KeeperException ke) throws KeeperException {
  LOG.error(prefix("Received unexpected KeeperException, re-throwing exception"), ke);
  throw ke;
}
{noformat}
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244468#comment-13244468 ]

Lars Hofhansl commented on HBASE-5682:

Yeah, my comment was wrong. It's not generally doing that.
What I do find is that if the ZK quorum is down, none of getZookeeperWatcher(), masterAddressTracker.start(), and rootRegionTracker.start() actually fail. They just retry and then happily return, which is as designed, because they are asynchronous.
It would be nice to have an isAlive or waitForConnect method on ZKW that would throw if the connection could not be established.
The attached patch is still a vast improvement, but it could be made better (even with the ZK timeout set to 100ms and retries to 3, it still takes 22s for ensureZookeeperTrackers to finish).
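
Editorial sketch of what a waitForConnect that throws might look like. No such method on ZooKeeperWatcher is shown in this thread, so `ConnectGate`, `markConnected()`, `hasConnected()`, and `waitForConnect()` are all hypothetical names, and the sketch models only the first successful connect, not later disconnects. It also necessarily relies on a timeout.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper; assumes a watcher callback calls markConnected() on the first connect event. */
final class ConnectGate {
    private final CountDownLatch connected = new CountDownLatch(1);

    /** To be invoked from the (assumed) watcher callback once the session is connected. */
    void markConnected() {
        connected.countDown();
    }

    /** Non-blocking check: true once markConnected() has been called. */
    boolean hasConnected() {
        return connected.getCount() == 0;
    }

    /** Blocks for at most timeoutMs, then throws instead of returning silently. */
    void waitForConnect(long timeoutMs) throws IOException {
        try {
            if (!connected.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                throw new IOException("ZK not connected after " + timeoutMs + " ms");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IOException("interrupted while waiting for ZK connect", e);
        }
    }
}
```

The caller picks the timeout, so the failure is at least bounded and visible rather than an asynchronous retry that returns as if nothing happened.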
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244470#comment-13244470 ]

Lars Hofhansl commented on HBASE-5682:

Even isAlive or waitForConnect would need to rely on a timeout, so we wouldn't have won anything really.
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244484#comment-13244484 ]

nkeywal commented on HBASE-5682:

In 0.96 this should work, with the restriction that you can get a non-working connection, which will get fixed when you try to use it. It's a different mechanism than the one for HBaseAdmin, as HBaseAdmin first checks the connection. The ZK mechanism is more efficient (you save a remote call to check that the connection is really working), but is more complex. However, it seems it does not work in the end:

bq. What I saw in 0.96 is that the client was blocked for a very long time (gave up after a few minutes), even though I had set all timeouts to low values. This is also deadly in an app server setting. Might be a simple fix there, didn't dig deeper.

@lars What exactly did you do? I can do the fix on 0.96.
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244577#comment-13244577 ]

Lars Hofhansl commented on HBASE-5682:

Let me dig into 0.96 after I get this into 0.94... Wanna cut RC1 soon.
From the past comments here I see no objections to the posted patch... Will commit soon. Please speak up if you disagree.
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243874#comment-13243874 ]

stack commented on HBASE-5682:

On commit, change this '+ LOG.debug("Abort", t);' to include the passed in msg? Else, +1 on the patch.
Let me ask N if he thinks TRUNK can pick up anything from this patch (maybe his keepalive should do this auto-reconnect but maybe it doesn't need it).
What were you doing when it was taking a long time to recover?
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243052#comment-13243052 ]

Lars Hofhansl commented on HBASE-5682:

v2 passes all tests locally.
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243141#comment-13243141 ]

Zhihong Yu commented on HBASE-5682:

+1 on patch.
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243222#comment-13243222 ]

Lars Hofhansl commented on HBASE-5682:

Thanks Ted.
The last question is: Should we do this for all HConnections (not just for unmanaged ones)?
It means that HConnection would be able to recover from loss of the ZK connection, and the abort() method would only clear out the ZK trackers and never close or abort the connection.
I'd be in favor of that.
@Ram and @Jieshan: Since this would be a more robust version of HBASE-5153, could you have a look at this?
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243227#comment-13243227 ]

Zhihong Yu commented on HBASE-5682:

Application to other HConnection makes sense.
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243312#comment-13243312 ]

stack commented on HBASE-5682:

This is a perversion. If we pass in a connection from outside, down in the guts, do special handling that makes the connection and zookeeper handling do reconnect. It's like we should be passing an Interface made at a higher level of abstraction and then, in the implementation, it did this fixup when the connection breaks.
With that out of the way, do whatever you need to make it work. Patch looks fine. How did you test? Would it be hard to make a unit test of it? A unit test would be good for codifying this perversion, since it will be brittle, being not what's expected.
I'm against changing the behavior of the default case in 0.92/0.94.
I'm interested in problems you see in hbase-5153 or issues you have w/ the implementation there that being the 0.96 client.
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243589#comment-13243589 ]

Lars Hofhansl commented on HBASE-5682:
--

Perversion is a hard word. :) It is just rechecking before each use whether the trackers are still usable. The timeout is handled through the HConnection's abort().

The testing I've done:
# ZK down, HBase down, start a client. Then start ZK, then HBase.
# ZK up, HBase down, start client. Then start HBase.
# Both ZK and HBase up, start client, kill HBase, restart HBase.
# Both ZK and HBase up, start client, kill ZK and HBase, restart.

The client just creates a new HTable and then tries to get some rows in a loop. In all cases the client should successfully be able to reconnect when both ZK and HBase are up.

The problem I have seen in 0.94/0.92 without this patch even with managed connections is that after HConnection times out, it is unusable and even getting a new HTable does not fix the problem since behind the scenes the same HConnection is retrieved.

Will think about an automated test. Do you like the version better that always does the recheck (and hence all the conditional for managed go away)?

Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
--
Key: HBASE-5682 URL: https://issues.apache.org/jira/browse/HBASE-5682 Project: HBase Issue Type: Improvement Components: client Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.94.0 Attachments: 5682-all.txt, 5682-v2.txt, 5682.txt
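The manual test client described above (open a table, read rows in a loop, watch whether reads start succeeding once ZK and HBase are back) can be sketched roughly as follows. This is only an illustration: `ReconnectProbe`, `RowReader` and `probe` are hypothetical names, and `RowReader` stands in for "create an HTable and issue a get" so that the sketch runs without a cluster.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative probe loop; none of these names come from HBase. */
class ReconnectProbe {

    /** Hypothetical stand-in for "open a table and read one row". */
    interface RowReader {
        String readRow(String row) throws IOException;
    }

    /**
     * Attempts the read a fixed number of times and records "ok" or "fail"
     * for each attempt, pausing between attempts if pauseMillis > 0.
     */
    static List<String> probe(RowReader reader, String row, int attempts, long pauseMillis) {
        List<String> outcomes = new ArrayList<>();
        for (int i = 0; i < attempts; i++) {
            try {
                reader.readRow(row);
                outcomes.add("ok");
            } catch (IOException e) {
                // A failed read must not end the loop: the point of the test
                // is to see whether later attempts recover.
                outcomes.add("fail");
            }
            if (pauseMillis > 0) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return outcomes;
    }
}
```

With a reader backed by a real connection, a run against scenarios 1 to 4 would be expected to show failures while a service is down and successes afterwards; a connection that never recovers would keep reporting failures.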
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243605#comment-13243605 ]

Lars Hofhansl commented on HBASE-5682:
--

The more I look at it, the more I like the patch that changes the behavior in all cases. It's simple and low risk: just recheck the ZK trackers before they are needed.

Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
--
Key: HBASE-5682 URL: https://issues.apache.org/jira/browse/HBASE-5682 Project: HBase Issue Type: Improvement Components: client Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.94.0 Attachments: 5682-all.txt, 5682-v2.txt, 5682.txt
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243619#comment-13243619 ]

stack commented on HBASE-5682:
--

bq. The problem I have seen in 0.94/0.92 without this patch even with managed connections is that after HConnection times out, it is unusable and even getting a new HTable does not fix the problem since behind the scenes the same HConnection is retrieved.

Didn't we add a check for if the connection is bad?

bq. Will think about an automated test. Do you like the version better that always does the recheck (and hence all the conditional for managed go away)?

How does this work in trunk? In trunk the work has been done so we don't really keep a zk session open any more. For the sake of making tests run smoother, we do keep-alive on the zk session: we hold it open for 5 minutes and let it go if unused.

I'm +1 on making our stuff more resilient. Reusing a dud hconnection, either because the connection is dead or because the zk session died, is hard to figure.

How will this change a user's perception of how this stuff is used? If your answer is that it helps in the extreme case where the connection goes dead, and that's the only change a user perceives, then let's commit. But we should include a test? If you describe one, I can try to help write it.

You think this should go into 0.92?

Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
--
Key: HBASE-5682 URL: https://issues.apache.org/jira/browse/HBASE-5682 Project: HBase Issue Type: Improvement Components: client Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.94.0 Attachments: 5682-all.txt, 5682-v2.txt, 5682.txt
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243620#comment-13243620 ]

stack commented on HBASE-5682:
--

I looked at the 'all' patch. Looks good to me. Am interested in how it changes API usage (if at all).

Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
--
Key: HBASE-5682 URL: https://issues.apache.org/jira/browse/HBASE-5682 Project: HBase Issue Type: Improvement Components: client Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.94.0 Attachments: 5682-all.txt, 5682-v2.txt, 5682.txt
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243637#comment-13243637 ]

Lars Hofhansl commented on HBASE-5682:
--

I am not envisioning any API changes, just that the HConnection would no longer be ripped out from under any HTables when there is a ZK connection loss.

I ran all tests again, and TestReplication and TestZooKeeper have some failures that are related. Looking.

Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
--
Key: HBASE-5682 URL: https://issues.apache.org/jira/browse/HBASE-5682 Project: HBase Issue Type: Improvement Components: client Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.94.0 Attachments: 5682-all.txt, 5682-v2.txt, 5682.txt
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243665#comment-13243665 ]

Lars Hofhansl commented on HBASE-5682:
--

bq. You think this should go into 0.92?

Probably. I guess most folks have clients that they restart frequently, or use thrift or asynchbase. But in its current form, using the standard HBase client in an app server is very error prone if the HBase/ZK cluster is ever serviced without bringing the app server down in lock step.

bq. Didn't we add a check for if the connection is bad?

Yeah, with hbase-5153, but in 0.90 only. At some point we decided the fix there wasn't good and Ram patched it up for 0.90. This should subsume HBASE-5153. I'm happy to even put this in 0.90, but that's up to Ram.

bq. I'm interested in problems you see in hbase-5153 or issues you have w/ the implementation there that being the 0.96 client.

What I saw in 0.96 is that the client was blocked for a very long time (I gave up after a few minutes), even though I had set all timeouts to low values. This is also deadly in an app server setting. Might be a simple fix there, didn't dig deeper.

Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
--
Key: HBASE-5682 URL: https://issues.apache.org/jira/browse/HBASE-5682 Project: HBase Issue Type: Improvement Components: client Reporter: Lars Hofhansl Assignee: Lars Hofhansl Priority: Critical Fix For: 0.94.0 Attachments: 5682-all-v2.txt, 5682-all.txt, 5682-v2.txt, 5682.txt
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243007#comment-13243007 ]

Zhihong Yu commented on HBASE-5682:
---

For abort():
{code}
+if (managed) {
+  // if the connection is managed attempt to reconnect immediately
+  ensureZookeeperTrackers();
{code}
the condition for calling ensureZookeeperTrackers() is different from the other calls in the patch. Please explain.

{code}
+private synchronized void ensureZookeeperTrackers() throws ZooKeeperConnectionException{
{code}
Please add a space before the opening curly brace.

Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
--
Key: HBASE-5682 URL: https://issues.apache.org/jira/browse/HBASE-5682 Project: HBase Issue Type: Improvement Components: client Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.94.1 Attachments: 5682-v2.txt, 5682.txt
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243018#comment-13243018 ]

Lars Hofhansl commented on HBASE-5682:
--

Thanks Ted.
# The condition is different because that is what it did before. I.e. if the connection is managed, the trackers are set up only at construction and during abort in the specific case of SessionExpiredException. If the connection is unmanaged, on the other hand, the trackers are rechecked before they are needed, and hence abort becomes a no-op for any KeeperException. Hence the condition is exactly reversed. This part is actually the key to the patch.
# The space wasn't there before. I actually had the space added and then removed it again :) I'll add it back.

Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
--
Key: HBASE-5682 URL: https://issues.apache.org/jira/browse/HBASE-5682 Project: HBase Issue Type: Improvement Components: client Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.94.1 Attachments: 5682-v2.txt, 5682.txt
[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
[ https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243021#comment-13243021 ]

Lars Hofhansl commented on HBASE-5682:
--

The gist of this change is that (1) the ZK connection is rechecked in all calls where it is needed and re-established if necessary, and (2) if the connection is down the client can find out quickly (by setting timeouts accordingly) and report via IOException to the calling thread.

This is only done for unmanaged HConnections (those that were created with HConnectionManager.createConnection(...) and are hence not reference counted). Reference counted HConnections are treated as before.

This is needed to safely use the HConnection in a multithreaded, long-lived AppServer setting. (In my tests I found that even 0.96 needs some more work here, but that's for a different jira.)

Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)
--
Key: HBASE-5682 URL: https://issues.apache.org/jira/browse/HBASE-5682 Project: HBase Issue Type: Improvement Components: client Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.94.1 Attachments: 5682-v2.txt, 5682.txt
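The recheck-before-use idea in this comment can be illustrated with a small, self-contained sketch. It is not the patch itself: `RecheckingConnection`, `Tracker`, `ensureTracker` and `locate` are hypothetical names, with `Tracker` standing in for the ZK-backed trackers and the factory standing in for whatever re-establishes the ZK connection. The sketch only assumes what the comment states, namely that every call rechecks the tracker first and that an unavailable connection surfaces as an IOException to the caller.

```java
import java.io.IOException;
import java.util.function.Supplier;

/** Illustrative only; none of these names come from HBase. */
class RecheckingConnection {

    /** Hypothetical stand-in for a ZooKeeper-backed tracker. */
    interface Tracker {
        boolean isUsable();
        String lookup(String key) throws IOException;
    }

    private final Supplier<Tracker> trackerFactory;
    private Tracker tracker;      // null until first use or after abort()
    private int reconnects = 0;   // how often a tracker was (re)created

    RecheckingConnection(Supplier<Tracker> trackerFactory) {
        this.trackerFactory = trackerFactory;
    }

    /** Re-creates the tracker if it is missing or no longer usable. */
    private synchronized void ensureTracker() throws IOException {
        if (tracker == null || !tracker.isUsable()) {
            Tracker fresh = trackerFactory.get();
            if (fresh == null || !fresh.isUsable()) {
                tracker = null;
                // Fail fast so the calling thread learns about the outage.
                throw new IOException("tracker unavailable");
            }
            tracker = fresh;
            reconnects++;
        }
    }

    /** Every call that needs the tracker rechecks it first. */
    synchronized String locate(String key) throws IOException {
        ensureTracker();
        return tracker.lookup(key);
    }

    /** Drops the tracker rather than marking the connection dead for good. */
    synchronized void abort() {
        tracker = null;
    }

    synchronized int reconnectCount() {
        return reconnects;
    }
}
```

The design point is that a failure only invalidates the tracker, not the connection object, so a long-lived caller can keep the same connection and simply retry after the outage.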