[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15644934#comment-15644934 ]

Peter Vary commented on HIVE-14979:
---

Thanks [~thejas] for your review!

As for the drawbacks of the current solution, some of them I did think about and tried to highlight in the description of the new configuration value, and some of them I did not. Thanks for pointing out the latter ones.

In both cases we try to provide resilience against temporary GC or network issues, but a session loss and an improper shutdown will have different effects:
- A GC or network issue will cause:
-- Service discovery - Shutting down of the HiveServer2 instance - please correct me if I am wrong
-- Query locks - Possible data corruption
- An improper shutdown will cause:
-- Service discovery - Clients connecting to another server until the timeout is reached - please correct me if I am wrong
-- Query locks - Locks persist until the timeout is reached

After this discussion I tend to agree with you that different situations call for different configurations, so the best solution would be to give the administrator the ability to match their specific needs.

What would be the default values of the new configurations?
- 20 mins for the Service discovery timeout
- 3 mins for the Lock timeout

If we agree on this, I would create a patch for it.

Thanks,
Peter

> Removing stale Zookeeper locks at HiveServer2 initialization
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
> Issue Type: Improvement
> Components: Locking
> Reporter: Peter Vary
> Assignee: Peter Vary
> Attachments: HIVE-14979.3.patch, HIVE-14979.4.patch, HIVE-14979.5.patch, HIVE-14979.patch
>
> HiveServer2 could use Zookeeper to store tokens that indicate that particular
> tables are locked with the creation of persistent Zookeeper objects.
> A problem can occur when a HiveServer2 instance creates a lock on a table and
> the HiveServer2 instance crashes ("Out of Memory" for example) and the locks
> are not released in Zookeeper. This lock will then remain until it is
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization,
> making the admin's life easier.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
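The proposal in the comment above (a shorter timeout for lock sessions next to the existing service discovery timeout) could be sketched roughly as follows. This is only an illustration under stated assumptions: `hive.zookeeper.session.timeout` is the existing property behind HIVE_ZOOKEEPER_SESSION_TIMEOUT, while the key `hive.zookeeper.lock.session.timeout`, the class name, and the small duration parser are hypothetical and not taken from any patch attached to this issue.

```java
import java.time.Duration;
import java.util.Map;

/** Hypothetical sketch: one timeout per use case, each with its own default. */
final class ZkTimeoutConfigSketch {
    // Existing Hive property (HIVE_ZOOKEEPER_SESSION_TIMEOUT).
    static final String DISCOVERY_KEY = "hive.zookeeper.session.timeout";
    // Hypothetical key for a lock-specific timeout; it does not exist in Hive.
    static final String LOCK_KEY = "hive.zookeeper.lock.session.timeout";

    // Defaults proposed in the comment: 20 minutes and 3 minutes.
    static final Duration DISCOVERY_DEFAULT = Duration.ofMinutes(20);
    static final Duration LOCK_DEFAULT = Duration.ofMinutes(3);

    /** Parses values such as "1200000ms", "90s" or "3min"; bare numbers are milliseconds. */
    static Duration parse(String value) {
        String v = value.trim().toLowerCase();
        if (v.endsWith("ms")) {
            return Duration.ofMillis(Long.parseLong(v.substring(0, v.length() - 2).trim()));
        }
        if (v.endsWith("min")) {
            return Duration.ofMinutes(Long.parseLong(v.substring(0, v.length() - 3).trim()));
        }
        if (v.endsWith("s")) {
            return Duration.ofSeconds(Long.parseLong(v.substring(0, v.length() - 1).trim()));
        }
        return Duration.ofMillis(Long.parseLong(v));
    }

    /** Returns the configured timeout for the key, or the given default if the key is unset. */
    static Duration timeoutFor(Map<String, String> conf, String key, Duration defaultValue) {
        String raw = conf.get(key);
        return raw == null ? defaultValue : parse(raw);
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of(DISCOVERY_KEY, "1200000ms");
        System.out.println("discovery: " + timeoutFor(conf, DISCOVERY_KEY, DISCOVERY_DEFAULT));
        System.out.println("locks:     " + timeoutFor(conf, LOCK_KEY, LOCK_DEFAULT));
    }
}
```

Keeping the two lookups separate means an administrator could shorten the lock timeout without making service discovery more sensitive to GC pauses.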
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15616277#comment-15616277 ]

Thejas M Nair commented on HIVE-14979:
--

The current approach of cleanup on restart relies on the restart happening on the same node. In cloud environments, nodes go down more frequently. In on-prem deployments, a node with a hardware failure could result in that node/IP not being available for some time. A new HS2 instance might get started on a different node with a different IP address.
Also, the current approach doesn't handle the case of multiple instances of HS2 running on the same host.

I think going with [persistent ephemeral|http://curator.apache.org/curator-recipes/persistent-ephemeral-node.html] nodes is a better approach. That approach is also not as resilient as I wish it would be, because the fact that this curator recipe exists also shows that there is some flakiness around nodes being around when they should be. So I think we should still keep the session.timeout in the order of minutes.

Regarding the session timeout - Looks like the original setting for the session timeout was 10 mins, and HIVE-9119 changed it to 20 mins.
In the case of zookeeper service discovery, it is not a major issue if the entry in zookeeper stays around for longer. A larger timeout can provide better resilience against temporary gc or network issues. 10 mins might still be OK for this purpose.
However, in the case of the locks we want to wait as little as possible before cleanup, so that after an improper shutdown we can clean up the entries sooner. I think we would still want it to be a couple of minutes for the sake of resiliency.
Since the requirements are different, we could create a separate config for the lock zk session timeout.
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614600#comment-15614600 ]

Lefty Leverenz commented on HIVE-14979:
---

+1 for the description of *hive.zookeeper.release.stale.locks* in patch 5.
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15608883#comment-15608883 ]

Hive QA commented on HIVE-14979:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12835359/HIVE-14979.5.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 10623 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver[hbase_bulk] (batchId=89)
org.apache.hadoop.hive.thrift.TestHadoopAuthBridge23.testDelegationTokenSharedStore (batchId=216)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJarWithoutAddDriverClazz[0] (batchId=164)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[0] (batchId=164)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[1] (batchId=164)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1823/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1823/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-1823/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12835359 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15595316#comment-15595316 ]

Peter Vary commented on HIVE-14979:
---

Hi [~sershe],

I am not sure what you mean by "ride over" of locks. According to my tests, if the HiveServer2 is killed by "kill -9" and restarted, the old locks remain only until their timeout expires (max. 20 min). The new HiveServer2 will have a different sessionId and will create different locks.

So I think that if the session timeout is lowered to a reasonable value, we might not need the patch in the end (it will not hurt to have the extra possibility, but it adds complexity and another source of error).

I am not absolutely sure about LLAP because the code around it is not trivial, but I think it uses a different timeout value in LlapStatusServiceDriver.run():
{code}
HiveConf.setVar(conf, HiveConf.ConfVars.HIVE_ZOOKEEPER_SESSION_TIMEOUT,
    (conf.getLong(CONFIG_LLAP_ZK_REGISTRY_TIMEOUT_MS, CONFIG_LLAP_ZK_REGISTRY_TIMEOUT_MS_DEFAULT) + "ms"));
{code}

The default value of CONFIG_LLAP_ZK_REGISTRY_TIMEOUT_MS_DEFAULT is 10s.

Thanks,
Peter
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593555#comment-15593555 ]

Sergey Shelukhin commented on HIVE-14979:
-

The ZK session timeout does seem excessive... not sure why it's like that. [~thejas] [~vgumashta] can you comment? In a default config, ZK probably won't even allow such a long timeout.
Reading the ZK docs, it does seem like the session timeout would allow the locks to ride over a disconnection. I wonder why e.g. the LLAP registry updates so fast despite this timeout value.

I think we should reduce the timeout to ~3mins and commit this patch unless [~thejas] [~vgumashta] object.
Also I wonder if we need to account for multi-HS2 scenarios at all.
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15591664#comment-15591664 ]

Peter Vary commented on HIVE-14979:
---

I totally agree with you [~sershe]!

Here is what I know at the moment, thanks to the guys who helped out with extra info:
- There is 1 configuration value for the ZooKeeper timeout (HIVE_ZOOKEEPER_SESSION_TIMEOUT), used by both the service discovery and the locks. This is set to 20 minutes by default, and might be overridden with a lower value by the ZooKeeper maxSessionTimeout setting.
- If the HiveServer2 is shut down with normal methods, then it removes the ZooKeeper nodes as expected (at least I have yet to find an example to contradict this)
- If the HiveServer2 dies unexpectedly, then ZooKeeper correctly removes the ephemeral nodes, but only after the session timeout is reached - with the default configuration this could take 20 minutes
- The patch proposes a configuration option which - if enabled - will remove the remaining ZooKeeper lock nodes at HiveServer2 startup, even if the ZooKeeper session timeout has not been reached.
- So far I have read quite a good reason behind the large timeout (see the comment by [~thejas], and http://stackoverflow.com/questions/14275613/concerns-about-zookeepers-lock-recipe). The session timeout relies on ping messages, so a long GC or network congestion could cause session termination. ZooKeeper tries to ping an idle connection after 1/3 of the timeout, so the longer the timeout, the less likely a session is to be terminated overzealously :).

I do not know enough about the external jobs yet, but I also think the remaining jobs could be a problem. All in all, solving them with an increased timeout does not strike me as a good solution: queries in Hive could be huge and could run for hours/days, so a 20-minute timeout still does not solve the problem at all. Am I right here, or am I missing some important points?

Thanks,
Peter
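The trade-off described in the comment above (longer timeout, fewer spurious expirations, slower cleanup) can be made concrete with a little arithmetic. The sketch below is a deliberately simplified model, not ZooKeeper code: it takes the "ping after 1/3 of the timeout" rule from the comment at face value, and the class and method names are made up for illustration.

```java
/** Simplified model of ZooKeeper client heartbeats; an illustration, not ZooKeeper code. */
final class HeartbeatModel {

    /** Idle time after which the client sends a PING, per the 1/3 rule quoted in the comment. */
    static long idlePingDelayMs(long sessionTimeoutMs) {
        return sessionTimeoutMs / 3;
    }

    /**
     * Longest client stall (GC pause, network outage) that this model still tolerates.
     * Worst case: the stall starts just before a PING was due, so the server has
     * already heard nothing for timeout / 3 and only 2/3 of the timeout remains.
     */
    static long worstCaseTolerableStallMs(long sessionTimeoutMs) {
        return sessionTimeoutMs - idlePingDelayMs(sessionTimeoutMs);
    }

    public static void main(String[] args) {
        // The 20-minute default and the ~3-minute value discussed in this thread.
        long[] timeoutsMs = {20 * 60 * 1000L, 3 * 60 * 1000L};
        for (long t : timeoutsMs) {
            System.out.printf("timeout=%ds: ping after %ds idle, tolerates a stall of about %ds%n",
                t / 1000, idlePingDelayMs(t) / 1000, worstCaseTolerableStallMs(t) / 1000);
        }
    }
}
```

In this model a 3-minute timeout still rides over stalls of about two minutes, while stale locks from a crashed server would disappear after 3 minutes instead of 20.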
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15591388#comment-15591388 ]

Peter Vary commented on HIVE-14979:
---

[~vgumashta]: Thanks for the info about the PersistentEphemeralNodes. Those are not used for locks, just as you mentioned.

I was asking for your opinion because HIVE_ZOOKEEPER_SESSION_TIMEOUT is used for every ZooKeeper connection: for the service discovery, the LLAP zookeeper registry, and the locks as well.
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589471#comment-15589471 ]

Sergey Shelukhin commented on HIVE-14979:
-

Hmm... sorry, I still don't quite understand the problem.

TL;DR the patch makes sense if it is to work around some network timeouts, or ZK not deleting nodes the way we expect. Otherwise I think we need to make sure it's compatible with timeout logic and/or just use ZK expiration.

TL:
Do the locks in ZK already expire at some point after HS2 dies? If the locks don't expire, we should make them expire as per below ;) If they do...

From my understanding, ZK cleans up ephemeral nodes immediately when the process goes down in normal case (based on the connection breaking), regardless of the timeout set for session (that is more of a network timeout and would result in nodes being cleaned up if the connection doesn't immediately break or in other "abnormal" cases).

Is the timeout we add some additional logical timeout on top of normal cleanup, so that even when HS2 dies and the connection is broken, ZK doesn't clean up the nodes for some time after the disconnect?

If yes, and we set a large timeout for a reason, we should not clean them up before timeout. The reason for a large timeout could be that the locks are taken for external jobs that don't die immediately (or at all?) when HS2 dies.

If yes, and we set a large timeout for no good reason (=> we believe we can clean them up during startup, as we do in the patch), we should also reduce the timeout (or remove it and use the default).
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589366#comment-15589366 ]

Vaibhav Gumashta commented on HIVE-14979:
-

[~pvary] The locks on ZK (ZooKeeperHiveLockManager) are not stored as persistent ephemeral nodes (curator recipe). The curator recipe is used only in service discovery for HS2.
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15588989#comment-15588989 ]

Hive QA commented on HIVE-14979:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12834153/HIVE-14979.4.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 10567 tests executed
*Failed tests:*
{noformat}
TestBeelineWithHS2ConnectionFile - did not produce a TEST-*.xml file (likely timed out) (batchId=199)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_globallimit] (batchId=27)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJarWithoutAddDriverClazz[0] (batchId=157)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[0] (batchId=157)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[1] (batchId=157)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1662/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1662/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-1662/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12834153 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15588516#comment-15588516 ]

Peter Vary commented on HIVE-14979:
---

In this case it would be good to have a way to remove these "almost persistent" :) lock entries from Zookeeper after a catastrophic failure.

It is worth mentioning that (according to the documentation) the Zookeeper server could overrule the requested timeout with its own maxSessionTimeout configuration variable, which makes the usefulness of this feature very dependent on the Zookeeper and Hive configuration.
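The server-side override mentioned above can be illustrated with a short sketch. It assumes the behaviour described in the ZooKeeper administrator documentation, where minSessionTimeout defaults to 2 * tickTime and maxSessionTimeout to 20 * tickTime; the tickTime of 2000 ms is an assumption taken from the sample ZooKeeper configuration, and the class and method are hypothetical, not ZooKeeper or Hive code.

```java
/** Sketch of how a ZooKeeper server bounds the session timeout requested by a client. */
final class SessionTimeoutNegotiation {

    /**
     * Returns the timeout the server would grant in this model. A negative min or max
     * means "not configured", in which case the documented defaults of
     * 2 * tickTime and 20 * tickTime are used.
     */
    static long negotiate(long requestedMs, long tickTimeMs,
                          long minSessionTimeoutMs, long maxSessionTimeoutMs) {
        long min = minSessionTimeoutMs < 0 ? 2 * tickTimeMs : minSessionTimeoutMs;
        long max = maxSessionTimeoutMs < 0 ? 20 * tickTimeMs : maxSessionTimeoutMs;
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        long hiveDefaultMs = 20 * 60 * 1000L; // 20 minutes requested by HiveServer2
        long tickTimeMs = 2000L;              // assumed tickTime, as in the sample zoo.cfg

        // Server limits left unset: the request is capped at 20 * tickTime.
        System.out.println(negotiate(hiveDefaultMs, tickTimeMs, -1, -1));
        // Server explicitly allows sessions of up to 30 minutes: the request is honoured.
        System.out.println(negotiate(hiveDefaultMs, tickTimeMs, -1, 30 * 60 * 1000L));
    }
}
```

Under these assumptions a requested 20 minutes would be granted as 40 seconds unless maxSessionTimeout is raised on the server, which is why the effective lifetime of stale locks depends on both configurations.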
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587245#comment-15587245 ]

Thejas M Nair commented on HIVE-14979:
--

I believe the large session timeout was set because we have seen some cases where gc pauses or temporary network issues cause persistent ephemeral nodes to go away. It didn't hurt to have the entry around a bit longer, as the client would retry connecting to other nodes.
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15585302#comment-15585302 ]

Peter Vary commented on HIVE-14979:
---

Flaky tests:
* HIVE-14977 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[order_null]
* HIVE-14976 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[union_fast_stats]
* HIVE-14937 org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver[hbase_bulk]

The other failures are not even flaky (50+ failed runs).

So the test failures are not related.

The questions above are still valid.
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15585257#comment-15585257 ]

Hive QA commented on HIVE-14979:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12833912/HIVE-14979.3.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 9 failed/errored test(s), 10594 tests executed
*Failed tests:*
{noformat}
TestBeelineWithHS2ConnectionFile - did not produce a TEST-*.xml file (likely timed out) (batchId=197)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_globallimit] (batchId=27)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[order_null] (batchId=18)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[union_fast_stats] (batchId=46)
org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver[hbase_bulk] (batchId=89)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJarWithoutAddDriverClazz[0] (batchId=155)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[0] (batchId=155)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[1] (batchId=155)
org.apache.hive.jdbc.authorization.TestJdbcWithSQLAuthorization.testBlackListedUdfUsage (batchId=204)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1619/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1619/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-1619/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 9 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12833912 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584955#comment-15584955 ]

Peter Vary commented on HIVE-14979:
---

[~leftylev] Thanks! I was waiting to have the final technical solution before asking for your help :)

Might need another round later if we change more stuff during the review.

Thanks,
Peter
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584947#comment-15584947 ] Lefty Leverenz commented on HIVE-14979: --- I left some suggestions on the review board.
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584875#comment-15584875 ] Peter Vary commented on HIVE-14979: ---
[~sershe], thanks for the review! I am not a ZooKeeper expert, so feel free to correct me if I am wrong somewhere. This was my reasoning:
- Ephemeral nodes are kept alive as long as the session is alive.
- The session stays alive as long as the client keeps sending requests, or extra PING requests when there are no other requests.
If HiveServer2 is down and the session timeout has not yet been reached, the locks will still be there even with ephemeral nodes. In HiveConf the session timeout is set to 1200000ms (20 minutes), which seems pretty excessive to me, but it was set by HIVE-8890 (HiveServer2 dynamic service discovery: use persistent ephemeral nodes curator recipe). This means the ephemeral locks could stay there for 20 minutes after the crash. For this reason I think the administrator would need this removal tool, or we should set the timeout to a lower value.
[~vgumashta], [~thejas]: Is it possible to lower the default of HIVE_ZOOKEEPER_SESSION_TIMEOUT, or would there be performance and/or stability issues if we change this value?
Thanks, Peter
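The timeout arithmetic in the comment above can be checked with a small sketch. It assumes the 20 minute session timeout is written as a millisecond string such as "1200000ms"; the class and method names are hypothetical and not part of Hive. It also ignores any limits the ZooKeeper server itself may place on the negotiated session timeout.

```java
import java.time.Duration;

// Hypothetical helper, not part of Hive: estimates how long an ephemeral
// lock node may outlive a crashed HiveServer2 instance.
class StaleLockWindow {

    // Parses a millisecond value written like "1200000ms" (assumed format).
    static Duration parseMillis(String value) {
        String trimmed = value.trim();
        String digits = trimmed.endsWith("ms")
                ? trimmed.substring(0, trimmed.length() - 2)
                : trimmed;
        return Duration.ofMillis(Long.parseLong(digits.trim()));
    }

    // Worst case: the server dies right after its last heartbeat, so the
    // session (and its ephemeral nodes) lingers for the whole timeout.
    static long worstCaseMinutes(String sessionTimeout) {
        return parseMillis(sessionTimeout).toMinutes();
    }

    public static void main(String[] args) {
        // Timeout discussed in the thread.
        System.out.println(worstCaseMinutes("1200000ms") + " minutes"); // prints "20 minutes"
        // A hypothetical lower value, for comparison only.
        System.out.println(worstCaseMinutes("60000ms") + " minutes");   // prints "1 minutes"
    }
}
```

Under these assumptions, lowering the timeout shrinks the window in which stale ephemeral locks remain visible, at the cost of less tolerance for long GC pauses or network hiccups.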
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583190#comment-15583190 ] Sergey Shelukhin commented on HIVE-14979: - Hmm... cannot the nodes be made ephemeral in ZK, if we indeed want to release them when we crash?
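To illustrate the difference this suggestion relies on, here is a toy in-memory model of persistent versus ephemeral nodes. It is only a sketch: it is not the ZooKeeper or Curator API, and the class name, method names, and paths are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only; this is NOT the ZooKeeper or Curator API.
class ToyLockStore {
    // path -> owning session id, or null for a persistent node
    private final Map<String, Long> nodes = new HashMap<>();

    void createPersistent(String path) {
        nodes.put(path, null);
    }

    void createEphemeral(String path, long sessionId) {
        nodes.put(path, sessionId);
    }

    // Called when the store decides a session is gone (closed or timed out):
    // every ephemeral node owned by that session disappears with it.
    void expireSession(long sessionId) {
        nodes.values().removeIf(owner -> owner != null && owner == sessionId);
    }

    // Sorted view of the remaining node paths.
    Set<String> paths() {
        return new TreeSet<>(nodes.keySet());
    }

    public static void main(String[] args) {
        ToyLockStore store = new ToyLockStore();
        store.createPersistent("/locks/tableA");    // survives a crash
        store.createEphemeral("/locks/tableB", 7L); // tied to session 7

        System.out.println(store.paths()); // prints "[/locks/tableA, /locks/tableB]"
        store.expireSession(7L);           // the crashed server's session expires
        System.out.println(store.paths()); // prints "[/locks/tableA]"
    }
}
```

In this model the persistent node has to be removed explicitly, while the ephemeral one goes away once the owning session is expired, which only happens after the session timeout has elapsed.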
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581715#comment-15581715 ] Peter Vary commented on HIVE-14979: --- CC: [~namit] - I have found out that you were the one who wrote this code. Could you please chime in if you remember anything? Thanks, Peter
[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization
[ https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581634#comment-15581634 ] Peter Vary commented on HIVE-14979: ---
[~ashutoshc] When creating this patch I found this code, which I do not really understand (marked by "->"):
{code}
package org.apache.hadoop.hive.ql.lockmgr.zookeeper;
[..]
public class ZooKeeperHiveLockManager implements HiveLockManager {
[..]
  private static List getLocks(HiveConf conf, HiveLockObject key, String parent,
      boolean verifyTablePartition, boolean fetchData) throws LockException {
[..]
      if (fetchData) {
        try {
          data = new HiveLockObjectData(new String(curatorFramework.getData().watched().forPath(curChild)));
->        data.setClientIp(clientIp);
        } catch (Exception e) {
          LOG.error("Error in getting data for " + curChild, e);
          // ignore error
        }
      }
[..]
{code}
Why do we update the clientIp of every lock when fetching (reading) data from ZooKeeper? By any chance, do you remember why this was needed? It seems to be done on purpose, but during my testing I haven't found any occasion when this was unset. It is set by the following code, which seems quite safe to me and runs every time a new lock is created:
{code}
  private ZooKeeperHiveLock lockPrimitive(HiveLockObject key, HiveLockMode mode,
      boolean keepAlive, boolean parentCreated, Set conflictingLocks) throws Exception {
[..]
    HiveLockObjectData lockData = key.getData();
    lockData.setClientIp(clientIp);
{code}
Thanks, Peter
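The concern about the marked line can be shown with a minimal sketch. It assumes the data stored with a lock already carries the creator's address, which the thread does not confirm; `LockData` and `ClientIpOverwriteDemo` are hypothetical stand-ins rather than Hive classes, and the addresses are documentation examples.

```java
// Hypothetical stand-in for the lock metadata; only clientIp matters here.
class LockData {
    private String clientIp;

    LockData(String clientIp) {
        this.clientIp = clientIp;
    }

    String getClientIp() {
        return clientIp;
    }

    void setClientIp(String clientIp) {
        this.clientIp = clientIp;
    }
}

class ClientIpOverwriteDemo {

    // Mimics the read path quoted above: rebuild the data from storage,
    // then stamp it with the address of the server doing the read.
    static LockData readAndStamp(String storedIp, String readerIp) {
        LockData data = new LockData(storedIp);
        data.setClientIp(readerIp);
        return data;
    }

    // Alternative read path that leaves the stored address untouched.
    static LockData readOnly(String storedIp) {
        return new LockData(storedIp);
    }

    public static void main(String[] args) {
        String creatorIp = "192.0.2.10"; // server that took the lock
        String readerIp = "192.0.2.20";  // server listing the locks

        System.out.println(readAndStamp(creatorIp, readerIp).getClientIp()); // prints "192.0.2.20"
        System.out.println(readOnly(creatorIp).getClientIp());               // prints "192.0.2.10"
    }
}
```

If the assumption holds, any cleanup that matches locks by clientIp after such a read would see the reader's own address on every lock, so it could not tell its own stale locks from those held by other servers.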