[ https://issues.apache.org/jira/browse/HBASE-28428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HBASE-28428:
-----------------------------------
    Labels: pull-request-available  (was: )

> ConnectionRegistry APIs should have timeout
> -------------------------------------------
>
>                 Key: HBASE-28428
>                 URL: https://issues.apache.org/jira/browse/HBASE-28428
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 2.4.17, 3.0.0-beta-1, 2.5.8
>            Reporter: Viraj Jasani
>            Assignee: Divneet Kaur
>            Priority: Major
>              Labels: pull-request-available
>
> Came across a couple of instances where an active master failover happens
> around the same time as a Zookeeper leader failover, leading to a stuck HBase
> client if one of its threads is blocked on one of the ConnectionRegistry rpc
> calls. ConnectionRegistry APIs are wrapped with CompletableFuture; however,
> their usages do not have any timeouts, which can leave the entire client
> stuck indefinitely because we take some global locks while waiting. For
> instance, _getKeepAliveMasterService()_ takes {_}masterLock{_}, hence if
> getting the active master from _masterAddressZNode_ gets stuck, we can block
> any admin operation that needs {_}getKeepAliveMasterService(){_}.
>
> Sample stacktrace that blocked all client operations requiring a table
> descriptor from Admin:
> {code:java}
> jdk.internal.misc.Unsafe.park
> java.util.concurrent.locks.LockSupport.park
> java.util.concurrent.CompletableFuture$Signaller.block
> java.util.concurrent.ForkJoinPool.managedBlock
> java.util.concurrent.CompletableFuture.waitingGet
> java.util.concurrent.CompletableFuture.get
> org.apache.hadoop.hbase.client.ConnectionImplementation.get
> org.apache.hadoop.hbase.client.ConnectionImplementation.access$?
> org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStubNoRetries
> org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStub
> org.apache.hadoop.hbase.client.ConnectionImplementation.getKeepAliveMasterService
> org.apache.hadoop.hbase.client.ConnectionImplementation.getMaster
> org.apache.hadoop.hbase.client.MasterCallable.prepare
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable
> org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor
> org.apache.hadoop.hbase.client.HTable.getDescriptor
> org.apache.phoenix.query.ConnectionQueryServicesImpl.getTableDescriptor
> org.apache.phoenix.query.DelegateConnectionQueryServices.getTableDescriptor
> org.apache.phoenix.util.IndexUtil.isGlobalIndexCheckerEnabled
> org.apache.phoenix.execute.MutationState.filterIndexCheckerMutations
> org.apache.phoenix.execute.MutationState.sendBatch
> org.apache.phoenix.execute.MutationState.send
> org.apache.phoenix.execute.MutationState.send
> org.apache.phoenix.execute.MutationState.commit
> org.apache.phoenix.jdbc.PhoenixConnection$?.call
> org.apache.phoenix.jdbc.PhoenixConnection$?.call
> org.apache.phoenix.call.CallRunner.run
> org.apache.phoenix.jdbc.PhoenixConnection.commit
> {code}
> Another similar incident is captured in PHOENIX-7233. In that case,
> retrieving the clusterId from its ZNode got stuck, which blocked the client
> from creating any more HBase Connections.
> Stacktrace for reference:
> {code:java}
> jdk.internal.misc.Unsafe.park
> java.util.concurrent.locks.LockSupport.park
> java.util.concurrent.CompletableFuture$Signaller.block
> java.util.concurrent.ForkJoinPool.managedBlock
> java.util.concurrent.CompletableFuture.waitingGet
> java.util.concurrent.CompletableFuture.get
> org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId
> org.apache.hadoop.hbase.client.ConnectionImplementation.<init>
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance?
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance
> jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance
> java.lang.reflect.Constructor.newInstance
> org.apache.hadoop.hbase.client.ConnectionFactory.lambda$createConnection$?
> org.apache.hadoop.hbase.client.ConnectionFactory$$Lambda$?.run
> java.security.AccessController.doPrivileged
> javax.security.auth.Subject.doAs
> org.apache.hadoop.security.UserGroupInformation.doAs
> org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
> org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection
> org.apache.phoenix.query.ConnectionQueryServicesImpl.access$?
> org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
> org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
> org.apache.phoenix.util.PhoenixContextExecutor.call
> org.apache.phoenix.query.ConnectionQueryServicesImpl.init
> org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices
> org.apache.phoenix.jdbc.HighAvailabilityGroup.connectToOneCluster
> org.apache.phoenix.jdbc.ParallelPhoenixConnection.getConnection
> org.apache.phoenix.jdbc.ParallelPhoenixConnection.lambda$new$?
> org.apache.phoenix.jdbc.ParallelPhoenixConnection$$Lambda$?.get
> org.apache.phoenix.jdbc.ParallelPhoenixContext.lambda$chainOnConnClusterContext$?
> org.apache.phoenix.jdbc.ParallelPhoenixContext$$Lambda$?.apply
> {code}
> We should provide a configurable timeout for all ConnectionRegistry APIs.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
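Both stacktraces park inside an unbounded CompletableFuture.get(). A minimal sketch of the proposed direction, using only standard java.util.concurrent (this is illustrative code, not the HBase patch; the helper name and timeout value are made up for the example):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RegistryTimeoutSketch {

    // Hypothetical helper: wait on a registry-style future for at most
    // timeoutMs instead of parking the calling thread indefinitely
    // (which, while holding a lock such as masterLock, wedges the client).
    static <T> T getWithTimeout(CompletableFuture<T> future, long timeoutMs)
            throws Exception {
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // give up rather than hold locks forever
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // A future that completes promptly: the bounded get succeeds.
        CompletableFuture<String> ok =
                CompletableFuture.completedFuture("cluster-id");
        System.out.println(getWithTimeout(ok, 1000));

        // A future that never completes, simulating a ZNode read stuck
        // during a ZooKeeper leader failover: the caller fails fast.
        CompletableFuture<String> stuck = new CompletableFuture<>();
        try {
            getWithTimeout(stuck, 50);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

With a fail-fast path like this, the caller can surface a retryable error instead of blocking every operation queued behind the same lock; the actual timeout would come from client configuration.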