Rushabh Shah created HBASE-27957:
------------------------------------

             Summary: HConnection (and ZooKeeperWatcher threads) leak in case of 
AUTH_FAILED exception.
                 Key: HBASE-27957
                 URL: https://issues.apache.org/jira/browse/HBASE-27957
             Project: HBase
          Issue Type: Bug
    Affects Versions: 2.4.17, 1.7.2
            Reporter: Rushabh Shah


Observed this in a production environment running some version of the 1.7 release.
The application did not have the right keytab set up for authentication. It 
was trying to create an HConnection, and the ZooKeeper server threw an AUTH_FAILED 
exception.
After the application had been in this state for a few hours, we saw thousands of 
zk-event-processor threads with the stack trace below.
{noformat}
"zk-event-processor-pool1-t1" #1275 daemon prio=5 os_prio=0 cpu=1.04ms 
elapsed=41794.58s tid=0x00007fd7805066d0 nid=0x1245 waiting on condition  
[0x00007fd75df01000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.18.0.102/Native Method)
        - parking to wait for  <0x00007fd9874a85e0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.18.0.102/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.18.0.102/AbstractQueuedSynchronizer.java:2081)
        at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.18.0.102/LinkedBlockingQueue.java:433)
        at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.18.0.102/ThreadPoolExecutor.java:1054)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.18.0.102/ThreadPoolExecutor.java:1114)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.18.0.102/ThreadPoolExecutor.java:628)
{noformat}
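For anyone who wants to check whether a client JVM is in the same state, a minimal sketch along these lines counts live threads by name prefix from inside the process. The class and method names (ThreadNameCounter, countThreadsWithPrefix) are made up for illustration and are not part of HBase; the prefix is taken from the thread name in the dump above.

```java
public class ThreadNameCounter {

  /** Counts live threads in this JVM whose name starts with the given prefix. */
  static long countThreadsWithPrefix(String prefix) {
    // Thread.getAllStackTraces() returns a map keyed by every live thread.
    return Thread.getAllStackTraces().keySet().stream()
        .filter(t -> t.getName().startsWith(prefix))
        .count();
  }

  public static void main(String[] args) {
    // Prefix assumed from the thread dump: "zk-event-processor-pool1-t1".
    long leaked = countThreadsWithPrefix("zk-event-processor");
    System.out.println("zk-event-processor threads: " + leaked);
  }
}
```

A count that keeps growing while the application retries connection creation would be consistent with the leak described here.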
{code:java|title=ConnectionManager.java|borderStyle=solid}
HConnectionImplementation(Configuration conf, boolean managed,
    ExecutorService pool, User user, String clusterId) throws IOException {
  ...
  ...
  try {
    this.registry = setupRegistry();
    retrieveClusterId();
    ...
    ...
  } catch (Throwable e) {
    // avoid leaks: registry, rpcClient, ...
    LOG.debug("connection construction failed", e);
    close();
    throw e;
  }
{code}
retrieveClusterId internally calls ZKConnectionRegistry#getClusterId:
{code:java|title=ZKConnectionRegistry.java|borderStyle=solid}
  private String clusterId = null;

  @Override
  public String getClusterId() {
    if (this.clusterId != null) return this.clusterId;
    // No synchronized here, worse case we will retrieve it twice, that's
    //  not an issue.
    try (ZooKeeperKeepAliveConnection zkw = hci.getKeepAliveZooKeeperWatcher()) {
      this.clusterId = ZKClusterId.readClusterIdZNode(zkw);
      if (this.clusterId == null) {
        LOG.info("ClusterId read in ZooKeeper is null");
      }
    } catch (KeeperException | IOException e) {
      // ---> WE ARE SWALLOWING THIS EXCEPTION AND RETURNING NULL.
      LOG.warn("Can't retrieve clusterId from Zookeeper", e);
    }
    return this.clusterId;
  }
{code}

ZKConnectionRegistry#getClusterId threw the following exception. (Our logging 
system trims stack traces longer than 5 lines.)
{noformat}
Cause: org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hbase/hbaseid
StackTrace: 
org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1213)
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:285)
org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:470)
{noformat}

We should propagate the KeeperException from ZKConnectionRegistry#getClusterId all the 
way back to the HConnectionImplementation constructor, so that the constructor's catch 
block closes all the watcher threads and rethrows the exception to the caller.
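
To make the difference concrete, here is a small self-contained sketch of the two behaviours. It is a simplified model, not the actual HBase code: Registry, Connection, and poolClosedAfterConstruction are hypothetical stand-ins, a plain ExecutorService plays the role of the zk-event-processor pool, and a simulated IOException stands in for the AuthFailed KeeperException.

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ClusterIdPropagationSketch {

  /** Hypothetical stand-in for the registry; not the real ZKConnectionRegistry API. */
  interface Registry {
    String getClusterId() throws IOException;
  }

  /** Hypothetical stand-in for HConnectionImplementation. */
  static final class Connection implements AutoCloseable {
    // Plays the role of the zk-event-processor pool owned by the watcher.
    private final ExecutorService eventPool;
    private String clusterId;

    Connection(Registry registry, ExecutorService eventPool) throws IOException {
      this.eventPool = eventPool;
      try {
        this.clusterId = registry.getClusterId();
      } catch (Throwable e) {
        // Reached only when the registry propagates the failure.
        close();
        throw e;
      }
    }

    @Override
    public void close() {
      eventPool.shutdownNow();
    }
  }

  /** Returns true if the pool was shut down after attempting to build a connection. */
  static boolean poolClosedAfterConstruction(Registry registry) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    pool.submit(() -> { }); // force the worker thread to start
    try {
      new Connection(registry, pool); // the caller never closes it in this model
    } catch (IOException expected) {
      // Construction failed; the constructor has already cleaned up.
    }
    boolean closed = pool.isShutdown();
    pool.shutdownNow(); // tidy up so the demo JVM can exit
    return closed;
  }

  public static void main(String[] args) {
    // Models the current behaviour: the failure is hidden and null is returned.
    Registry swallowing = () -> null;
    // Models the proposed behaviour: the failure reaches the constructor.
    Registry throwing = () -> { throw new IOException("AuthFailed (simulated)"); };

    System.out.println("swallowing registry, pool closed: "
        + poolClosedAfterConstruction(swallowing));
    System.out.println("throwing registry, pool closed: "
        + poolClosedAfterConstruction(throwing));
  }
}
```

In this model the swallowing registry leaves the pool running because the constructor's catch block is never entered, while the throwing registry triggers close() before the exception reaches the caller.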




