[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-11-07 Thread Peter Vary (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15644934#comment-15644934
 ] 

Peter Vary commented on HIVE-14979:
---

Thanks [~thejas] for your review!

As for the drawbacks of the current solution, some of them I did think about 
and tired to highlight in the description of the new configuration value, some 
of them I did not. Thanks for pointing those later ones out.

In both cases we try to provide resilience against temporary GC or network 
issues, and a session loss or an improper shutdown will have different effect:
- GC or network issue will cause:
-- Service discovery - Shutting down of the HiveServer2 instance - please 
correct me if I am wrong
-- Query locks - Possible data corruption
- Improper shutdown will cause:
-- Service discovery - Clients connecting to another server until the timeout 
is reached - please correct me if I am wrong
-- Query locks - Locks persists until the timeout is reached

After this discussion I tend to agree with you that different situations call 
for different configurations, so the best solution would to have the provide 
the administrator the ability to match the specific needs.
What would be the default values of the new configurations?
- 20 mins for the Service discovery timeout
- 3 mins for the Lock timeout

If we agree on this, I would create a patch for it.

Thanks,
Peter


> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.3.patch, HIVE-14979.4.patch, 
> HIVE-14979.5.patch, HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-28 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15616277#comment-15616277
 ] 

Thejas M Nair commented on HIVE-14979:
--

The current approach of cleanup on restart relies on the fact that the restart 
happens on same node. In case of cloud environments, there are more frequent 
instances of nodes going down. In case of on-prem instances, a node having 
hardware failure could result in that node/ip not being available for some 
time. A new HS2 instances might get started on a different node with a 
different IP address. 
Also, the current approach doesn't handle the case of multiple instances of HS2 
running on the same host.

I think going with [persistent 
ephemeral|http://curator.apache.org/curator-recipes/persistent-ephemeral-node.html]
 nodes is better approach. That approach is also as resilient as I wish it 
would be, because the fact that this curator recipe exists, also shows that 
there is some flakiness around nodes being around when it should be. So I think 
we should still keep the session.timeout in order of minutes.


Regarding the session timeout -
Looks like the original setting for the session timeout was 10 mins, and 
HIVE-9119 changed it to 20 mins. 
In case of zookeeper service discovery, it is not a major issue if the entry in 
zookeeper stays around for longer. Larger timeout can provide better resilience 
against temporary gc or network issues. 10 mins might be still OK for this 
purpose.

However, in case of the locks we want to wait as little as possible before 
cleanup, so that in case of improper shutdown, we can cleanup the entries 
sooner. I think we still would want it to be couple of minutes for the sake of 
resiliency. 
Since the requirements are different we could create separate config for the 
lock zk session timeout.



> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.3.patch, HIVE-14979.4.patch, 
> HIVE-14979.5.patch, HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-28 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614600#comment-15614600
 ] 

Lefty Leverenz commented on HIVE-14979:
---

+1 for the description of *hive.zookeeper.release.stale.locks* in patch 5.

> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.3.patch, HIVE-14979.4.patch, 
> HIVE-14979.5.patch, HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-26 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15608883#comment-15608883
 ] 

Hive QA commented on HIVE-14979:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12835359/HIVE-14979.5.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 10623 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver[hbase_bulk] 
(batchId=89)
org.apache.hadoop.hive.thrift.TestHadoopAuthBridge23.testDelegationTokenSharedStore
 (batchId=216)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJarWithoutAddDriverClazz[0]
 (batchId=164)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[0] (batchId=164)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[1] (batchId=164)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1823/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1823/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-1823/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12835359 - PreCommit-HIVE-Build

> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.3.patch, HIVE-14979.4.patch, 
> HIVE-14979.5.patch, HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-21 Thread Peter Vary (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15595316#comment-15595316
 ] 

Peter Vary commented on HIVE-14979:
---

Hi [~sershe],

I am not sure what you mean about "ride over" of locks. By my tests, if the 
HiveServer2 is killed by "kill -9" and restarted, the old locks remain only 
until their timeout expires (max. 20 min). The new HiveServer2 will have a 
different sessionId and will create different locks. So I think, if the session 
timeout is lowered to a reasonable value we might not need the patch in the end 
(it will not hurt to have the extra possibility, but adds complexity and 
another source of error) 

I am not absolutely sure about the LLAP because the code around it is not 
trivial, but I think it uses a different timeout value in 
LlapStatusServiceDriver.run():

{code}
HiveConf.setVar(conf, HiveConf.ConfVars.HIVE_ZOOKEEPER_SESSION_TIMEOUT, (conf
  .getLong(CONFIG_LLAP_ZK_REGISTRY_TIMEOUT_MS, 
CONFIG_LLAP_ZK_REGISTRY_TIMEOUT_MS_DEFAULT) +
  "ms"));
{code}

The default value of CONFIG_LLAP_ZK_REGISTRY_TIMEOUT_MS_DEFAULT is 10s.

Thanks,
Peter

> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.3.patch, HIVE-14979.4.patch, HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-20 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593555#comment-15593555
 ] 

Sergey Shelukhin commented on HIVE-14979:
-

ZK session timeout does seems excessive... not sure why it's like that. 
[~thejas] [~vgumashta] can you comment?
In a default config, ZK probably won't even allow such a long timeout.

Reading ZK docs, it does seem like session timeout would allow the locks to 
ride over disconnection. I wonder why e.g. LLAP registry updates so far despite 
this timeout value. 
I think we should reduce the timeout to ~3mins and commit this patch unless   
[~thejas] [~vgumashta] object.
Also I wonder if we need to account for multi-HS2 scenarios at all.

> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.3.patch, HIVE-14979.4.patch, HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-20 Thread Peter Vary (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15591664#comment-15591664
 ] 

Peter Vary commented on HIVE-14979:
---

I totally agree with you [~sershe]!

Here is what I know at the moment, thanks for the guys who helped out with 
extra info:
- There is 1 configuration value for ZooKeeper timeout 
(HIVE_ZOOKEEPER_SESSION_TIMEOUT) used by the service discovery and the locks as 
well. This is set to 20 minutes by default, and might be overwritten by the 
ZooKeeper maxSessionTimeout value to a lower value.
- If the HiveServer2 is shut down with normal methods, then it removes the 
ZooKeeper nodes as expected (at least I have yet to find an example to 
contradict this)
- If the HiveServer2 dies unexpectedly then ZooKeeper correctly removes the 
ephemeral nodes, but only after the session timeout is reached - with default 
configuration it could be 20 minutes
- The patch proposes a configuration option which - if enabled - at HiveServer2 
startup time will remove the remaining ZooKeeper lock nodes even if the 
ZooKeeper session timeout is not reached.
- So far I read a quiet good reason behind the large timeout (see: the comment 
by [~thejas], and 
http://stackoverflow.com/questions/14275613/concerns-about-zookeepers-lock-recipe).
 Session timeout is reliant on ping messages so a long GC or network congestion 
could cause session termination. ZooKeeper tries to ping an idle connection 
after 1/3 of the timeout, so the longer the timeout, the less probable to have 
a session terminated overzealously :).

I do not know enough about the external jobs yet, but I also think the 
remaining jobs could be a problem. All-in-all solving them with increased 
timeout does not strike me like a good solution: queries in Hive could be huge 
and could run for hours/days, so a 20 minutes timeout still not solves the 
problem at all. Am I right here, or missing some important points?

Thanks,
Peter

> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.3.patch, HIVE-14979.4.patch, HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-20 Thread Peter Vary (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15591388#comment-15591388
 ] 

Peter Vary commented on HIVE-14979:
---

[~vgumashta]: Thanks for the info about the PersistentEphemeralNodes. Those are 
not used for lock just as your mentioned. I was asking your opinion, because 
the HIVE_ZOOKEEPER_SESSION_TIMEOUT is used for every ZooKeeper connection, so 
for the service discovery, ldap zookeeper registry, and for the locks as well.

> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.3.patch, HIVE-14979.4.patch, HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-19 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589471#comment-15589471
 ] 

Sergey Shelukhin commented on HIVE-14979:
-

Hmm... sorry, I still don't quite understand the problem.

TL;DR the patch makes sense if it is to work around some network timeouts, or 
ZK not deleting nodes the way we expect. Otherwise I think we need to make sure 
it's compatible with timeout logic and/or just use ZK expiration.

TL:
Do the locks in ZK already expire at some point after HS2 dies? 
If the locks don't expire, we should make them expire as per below ;)
If they do...
>From my understanding, ZK cleans up ephemeral nodes immediately when the 
>process goes down in normal case (based on the connection breaking), 
>regardless of the timeout set for session (that is more of a network timeout 
>and would result in nodes being cleaned up if the connection doesn't 
>immediately break or in other "abnormal" cases). 
Is the timeout we add some additional logical timeout on top of normal cleanup, 
so that even when HS2 dies and the connection is broken, ZK doesn't clean up 
the nodes for some time after the disconnect?

If yes, and we set a large timeout for a reason, we should not clean them up 
before timeout. The reason for a large timeout could be that the locks are 
taken for external jobs that don't die immediately (or at all?) when HS2 dies.
If yes, and we set a large timeout for no good reason (=> we believe we can 
clean them up during startup, as we do in the patch), we should also reduce the 
timeout (or remove it and use the default).






> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.3.patch, HIVE-14979.4.patch, HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-19 Thread Vaibhav Gumashta (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589366#comment-15589366
 ] 

Vaibhav Gumashta commented on HIVE-14979:
-

[~pvary] The locks on ZK (ZooKeeperHiveLockManager) are not stored as 
persistent ephemeral nodes (curator recipe). The curator recipe is used only in 
service discovery for HS2.

> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.3.patch, HIVE-14979.4.patch, HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-19 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15588989#comment-15588989
 ] 

Hive QA commented on HIVE-14979:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12834153/HIVE-14979.4.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 10567 tests 
executed
*Failed tests:*
{noformat}
TestBeelineWithHS2ConnectionFile - did not produce a TEST-*.xml file (likely 
timed out) (batchId=199)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_globallimit] 
(batchId=27)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJarWithoutAddDriverClazz[0]
 (batchId=157)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[0] (batchId=157)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[1] (batchId=157)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1662/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1662/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-1662/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12834153 - PreCommit-HIVE-Build

> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.3.patch, HIVE-14979.4.patch, HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-19 Thread Peter Vary (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15588516#comment-15588516
 ] 

Peter Vary commented on HIVE-14979:
---

In this case it would be good to have a way to remove these "almost persistent" 
:) lock entries from Zookeeper in case of a catastrophic failure.

Worth to mention, that (according to the documentation) the Zookeeper server 
could overrule the requested timeout with its own maxSessionTimeout 
configuration variable, which makes the usefulness of this feature very 
(Zookeeper and Hive) configuration dependent.

> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.3.patch, HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-18 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587245#comment-15587245
 ] 

Thejas M Nair commented on HIVE-14979:
--

I believe the large session timeout was set because we have seen some cases 
where gc pauses or temporary network issues cause persistent ephemeral nodes to 
go away.
It didn't hurt to have the entry around a bit longer, as the client would retry 
connect to other nodes.


> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.3.patch, HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-18 Thread Peter Vary (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15585302#comment-15585302
 ] 

Peter Vary commented on HIVE-14979:
---

Flaky tests:
* HIVE-14977 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[order_null]
* HIVE-14976 
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[union_fast_stats]
* HIVE-14937 
org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver[hbase_bulk]
The other failures are not even flaky (50+ failed runs)
So test failures are not related

The questions above are still valid.

> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.3.patch, HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15585257#comment-15585257
 ] 

Hive QA commented on HIVE-14979:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12833912/HIVE-14979.3.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 9 failed/errored test(s), 10594 tests 
executed
*Failed tests:*
{noformat}
TestBeelineWithHS2ConnectionFile - did not produce a TEST-*.xml file (likely 
timed out) (batchId=197)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_globallimit] 
(batchId=27)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[order_null] (batchId=18)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[union_fast_stats] 
(batchId=46)
org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver[hbase_bulk] 
(batchId=89)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJarWithoutAddDriverClazz[0]
 (batchId=155)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[0] (batchId=155)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[1] (batchId=155)
org.apache.hive.jdbc.authorization.TestJdbcWithSQLAuthorization.testBlackListedUdfUsage
 (batchId=204)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1619/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1619/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-1619/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 9 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12833912 - PreCommit-HIVE-Build

> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.3.patch, HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-18 Thread Peter Vary (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584955#comment-15584955
 ] 

Peter Vary commented on HIVE-14979:
---

[~leftylev] Thanks!

I was waiting to have the final technical solution before asking for your help 
:)
Might need another round later if we change more stuff during the review.

Thanks,
Peter

> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-18 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584947#comment-15584947
 ] 

Lefty Leverenz commented on HIVE-14979:
---

I left some suggestions on the review board.

> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-18 Thread Peter Vary (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584875#comment-15584875
 ] 

Peter Vary commented on HIVE-14979:
---

[~sershe], thanks for the review!

I am not a zookeeper expert so feel free to correct me if I am wrong somewhere.

This was my reasoning:
- Ephemeral nodes kept alive until the session is alive
- The session is alive until the client sending requests, or extra PING request 
if there is no other request. 

If the HiveServer2 is down and the session timeout is not yet reached, then 
even for the ephemeral nodes the locks will be there. In HiveConf the 
SessionTimeout is set to 120ms which seems pretty excessive to me, but set 
by HIVE-8890 (HiveServer2 dynamic service discovery: use persistent ephemeral 
nodes curator recipe). This means the ephemeral locks could stay there after 
the crash for 20 minutes. 

For this reason I think the administrator would need this removal tool, or we 
should set the timeout to a lower value.

[~vgumashta], [~thejas]: Is it possible to lower the default of the 
HIVE_ZOOKEEPER_SESSION_TIMEOUT, or there will be performance and/or stability 
issues if we change this value?

Thanks,
Peter



> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-17 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583190#comment-15583190
 ] 

Sergey Shelukhin commented on HIVE-14979:
-

Hmm... cannot the nodes be made ephemeral in ZK, if we indeed want to release 
them when we crash?

> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-17 Thread Peter Vary (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581715#comment-15581715
 ] 

Peter Vary commented on HIVE-14979:
---

CC: [~namit] - I have found out, that you were the one who wrote this code? 
Could you please chime in if you remember anything?

Thanks,
Peter

> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

2016-10-17 Thread Peter Vary (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581634#comment-15581634
 ] 

Peter Vary commented on HIVE-14979:
---

[~ashutoshc] When creating this patch I have found this code which I do not 
really understand (marked by "->"):
{code}
package org.apache.hadoop.hive.ql.lockmgr.zookeeper;
[..]
public class ZooKeeperHiveLockManager implements HiveLockManager {
[..]
  private static List getLocks(HiveConf conf,
  HiveLockObject key, String parent, boolean verifyTablePartition, boolean 
fetchData)
  throws LockException {
[..]
if (fetchData) {
  try {
data = new HiveLockObjectData(new 
String(curatorFramework.getData().watched().forPath(curChild)));
->data.setClientIp(clientIp);
  } catch (Exception e) {
LOG.error("Error in getting data for " + curChild, e);
// ignore error
  }
}
[..]
{code}

Why do we update the clientIp of every lock when fetching (reading) data from 
zookeeper. By any chance do you remember anything why this was needed? Seems 
like it is done by purpose but during my testing I haven't find any occasion 
when this was unset.

This is set by this code which seems to me that is quiet safe, and done every 
time when a new lock is created:
{code}
  private ZooKeeperHiveLock lockPrimitive(HiveLockObject key,
  HiveLockMode mode, boolean keepAlive, boolean parentCreated,
  Set conflictingLocks)
  throws Exception {
[..]
HiveLockObjectData lockData = key.getData();
lockData.setClientIp(clientIp);
{code}

Thanks,
Peter

> Removing stale Zookeeper locks at HiveServer2 initialization
> 
>
> Key: HIVE-14979
> URL: https://issues.apache.org/jira/browse/HIVE-14979
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-14979.patch
>
>
> HiveServer2 could use Zookeeper to store token that indicate that particular 
> tables are locked with the creation of persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> the HiveServer2 instances crashes ("Out of Memory" for example) and the locks 
> are not released in Zookeeper. This lock will then remain until it is 
> manually cleared by an admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> helping the admins life.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)