[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-24 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Patch Available  (was: Open)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.10.patch, 
> HIVE-21469.2.patch, HIVE-21469.3.patch, HIVE-21469.4.patch, 
> HIVE-21469.5.patch, HIVE-21469.6.patch, HIVE-21469.7.patch, 
> HIVE-21469.8.patch, HIVE-21469.9.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-24 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Attachment: HIVE-21469.10.patch

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.10.patch, 
> HIVE-21469.2.patch, HIVE-21469.3.patch, HIVE-21469.4.patch, 
> HIVE-21469.5.patch, HIVE-21469.6.patch, HIVE-21469.7.patch, 
> HIVE-21469.8.patch, HIVE-21469.9.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-24 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Open  (was: Patch Available)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch, 
> HIVE-21469.6.patch, HIVE-21469.7.patch, HIVE-21469.8.patch, HIVE-21469.9.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-24 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Patch Available  (was: Open)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch, 
> HIVE-21469.6.patch, HIVE-21469.7.patch, HIVE-21469.8.patch, HIVE-21469.9.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-24 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Attachment: HIVE-21469.9.patch

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch, 
> HIVE-21469.6.patch, HIVE-21469.7.patch, HIVE-21469.8.patch, HIVE-21469.9.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-24 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Open  (was: Patch Available)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch, 
> HIVE-21469.6.patch, HIVE-21469.7.patch, HIVE-21469.8.patch, HIVE-21469.9.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-23 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Attachment: HIVE-21469.8.patch

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch, 
> HIVE-21469.6.patch, HIVE-21469.7.patch, HIVE-21469.8.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-23 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Patch Available  (was: Open)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch, 
> HIVE-21469.6.patch, HIVE-21469.7.patch, HIVE-21469.8.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-23 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Attachment: HIVE-21469.8.patch

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch, 
> HIVE-21469.6.patch, HIVE-21469.7.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-23 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Attachment: (was: HIVE-21469.8.patch)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch, 
> HIVE-21469.6.patch, HIVE-21469.7.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-23 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Open  (was: Patch Available)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch, 
> HIVE-21469.6.patch, HIVE-21469.7.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-23 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Patch Available  (was: Open)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch, 
> HIVE-21469.6.patch, HIVE-21469.7.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-23 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Attachment: HIVE-21469.7.patch

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch, 
> HIVE-21469.6.patch, HIVE-21469.7.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-23 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Open  (was: Patch Available)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch, 
> HIVE-21469.6.patch, HIVE-21469.7.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-22 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Attachment: HIVE-21469.6.patch

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch, HIVE-21469.6.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-22 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Patch Available  (was: Open)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch, HIVE-21469.6.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-22 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Open  (was: Patch Available)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch, HIVE-21469.6.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-22 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Patch Available  (was: Open)

I was using {{Collections.unmodifableList()}} and it did *not* like that.

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-22 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Open  (was: Patch Available)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-22 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Attachment: HIVE-21469.5.patch

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-21 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Patch Available  (was: Open)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-21 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Attachment: HIVE-21469.4.patch

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-21 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Open  (was: Patch Available)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-21 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Attachment: HIVE-21469.3.patch

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-21 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Open  (was: Patch Available)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-21 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Patch Available  (was: Open)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-19 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Patch Available  (was: Open)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-19 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Attachment: HIVE-21469.2.patch

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-19 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Open  (was: Patch Available)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-18 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Status: Patch Available  (was: Open)

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked anyway.  Also, when the 
> {{InterruptedException}} is caught in the try block, the thread's interrupted 
> flag is cleared.  The flag is not reset in this code and therefore we lose 
> the fact that this thread has been interrupted.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-18 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Description: 
A lot of sins in this class to resolve:

{code:java}
  @Override
  public void setContext(HiveLockManagerCtx ctx) throws LockException {
 try {
  curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
  parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
  try{
curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" + 
 parent, new byte[0]);
  } catch (Exception e) {
// ignore if the parent already exists
if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
KeeperException.Code.NODEEXISTS) {
  LOG.warn("Unexpected ZK exception when creating parent node /" + 
parent, e);
}
  }
{code}

Every time a new session is created and this {{setContext}} method is called, 
it attempts to create the root node.  I have seen that, even though the root 
node exists, an create node action is written to the ZK logs.  Check first if 
the node exists before trying to create it.

{code:java}
  try {
curatorFramework.delete().forPath(zLock.getPath());
  } catch (InterruptedException ie) {
curatorFramework.delete().forPath(zLock.getPath());
  }
{code}

There has historically been a quite a few bugs regarding leaked locks.  The 
Driver will signal the session {{Thread}} by performing an interrupt.  That 
interrupt can happen any time and it can kill a create/delete action within the 
ZK framework.  We can see one example of workaround for this.  If the ZK action 
is interrupted, simply do it again.  Well, what if it's interrupted yet again?  
The lock will be leaked.  Also, when the {{InterruptedException}} is caught in 
the try block, the thread's interrupted flag is cleared.  The flag is not reset 
in this code and therefore we lose the fact that this thread has been 
interrupted.  This flag should be preserved so that other code paths will know 
that it's time to exit.

{code:java}
if (tryNum > 1) {
  Thread.sleep(sleepTime);
}
unlockPrimitive(hiveLock, parent, curatorFramework);
break;
  } catch (Exception e) {
if (tryNum >= numRetriesForUnLock) {
  String name = ((ZooKeeperHiveLock)hiveLock).getPath();
  throw new LockException("Node " + name + " can not be deleted after " 
+ numRetriesForUnLock + " attempts.",
  e);
}
  }
{code}

... related... the sleep here may be interrupted, but we still need to delete 
the lock (again, for fear of leaking it).  This sleep should be 
uninterruptible.  If we need to get the lock deleted, and there's a problem, 
interrupting the sleep will cause the code to eventually exit and locks will be 
leaked.

It also requires a bunch more TLC.

  was:
A lot of sins in this class to resolve:

{code:java}
  @Override
  public void setContext(HiveLockManagerCtx ctx) throws LockException {
 try {
  curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
  parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
  try{
curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" + 
 parent, new byte[0]);
  } catch (Exception e) {
// ignore if the parent already exists
if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
KeeperException.Code.NODEEXISTS) {
  LOG.warn("Unexpected ZK exception when creating parent node /" + 
parent, e);
}
  }
{code}

Every time a new session is created and this {{setContext}} method is called, 
it attempts to create the root node.  I have seen that, even though the root 
node exists, an create node action is written to the ZK logs.  Check first if 
the node exists before trying to create it.

{code:java}
  try {
curatorFramework.delete().forPath(zLock.getPath());
  } catch (InterruptedException ie) {
curatorFramework.delete().forPath(zLock.getPath());
  }
{code}

There has historically been a quite a few bugs regarding leaked locks.  The 
Driver will signal the session {{Thread}} by performing an interrupt.  That 
interrupt can happen any time and it can kill a create/delete action within the 
ZK framework.  We can see one example of workaround for this.  If the ZK action 
is interrupted, simply do it again.  Well, what if it's interrupted yet again?  
The lock will be leaked anyway.  Also, when the {{InterruptedException}} is 
caught in the try block, the thread's interrupted flag is cleared.  The flag is 
not reset in this code and therefore we lose the fact that this thread has been 
interrupted.

{code:java}
if (tryNum > 1) {
  Thread.sleep(sleepTime);
}
unlockPrimitive(hiveLock, parent, curatorFramework);
break;
  } catch (Exception e) {
if (tr

[jira] [Updated] (HIVE-21469) Review of ZooKeeperHiveLockManager

2019-03-18 Thread David Mollitor (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-21469:
--
Attachment: HIVE-21469.1.patch

> Review of ZooKeeperHiveLockManager
> --
>
> Key: HIVE-21469
> URL: https://issues.apache.org/jira/browse/HIVE-21469
> Project: Hive
>  Issue Type: Improvement
>  Components: Locking
>Affects Versions: 4.0.0, 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HIVE-21469.1.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>   curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>   parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>   try{
> curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>   } catch (Exception e) {
> // ignore if the parent already exists
> if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>   LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
> }
>   }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>   try {
> curatorFramework.delete().forPath(zLock.getPath());
>   } catch (InterruptedException ie) {
> curatorFramework.delete().forPath(zLock.getPath());
>   }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked anyway.  Also, when the 
> {{InterruptedException}} is caught in the try block, the thread's interrupted 
> flag is cleared.  The flag is not reset in this code and therefore we lose 
> the fact that this thread has been interrupted.
> {code:java}
> if (tryNum > 1) {
>   Thread.sleep(sleepTime);
> }
> unlockPrimitive(hiveLock, parent, curatorFramework);
> break;
>   } catch (Exception e) {
> if (tryNum >= numRetriesForUnLock) {
>   String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>   throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>   e);
> }
>   }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)