[jira] [Commented] (HBASE-28428) ConnectionRegistry APIs should have timeout

2024-03-11 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825489#comment-17825489
 ] 

Viraj Jasani commented on HBASE-28428:
--

Yes, we have plans to migrate to MasterRegistry (RpcConnectionRegistry).

However, as of today, we do have ZooKeeper timeouts, though they may not be 
aggressive enough.

A timeout for the CompletableFuture-based connection registry APIs would be 
very useful in case the client thread somehow gets stuck due to network or 
OS-level issues. The idea here is to pass a timeout to Future#get for all 
connection registry APIs.
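
To illustrate the idea of bounding Future#get, here is a minimal sketch that uses only the JDK. The class name, the helper name getWithTimeout, and the timeout values are assumptions made for illustration; this is not the actual HBase change, only one way a bounded wait on a CompletableFuture could look.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class RegistryTimeoutSketch {

  // Hypothetical helper: waits for a registry-style future, but never longer than timeoutMs.
  static <T> T getWithTimeout(CompletableFuture<T> future, long timeoutMs) throws IOException {
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Completes the future with a CancellationException so dependents stop waiting;
      // it does not interrupt whatever work was supposed to complete the future.
      future.cancel(true);
      throw new IOException("Connection registry call timed out after " + timeoutMs + " ms", e);
    } catch (InterruptedException e) {
      // Restore the interrupt flag before surfacing the failure to the caller.
      Thread.currentThread().interrupt();
      throw (IOException) new InterruptedIOException("Interrupted while waiting").initCause(e);
    } catch (ExecutionException e) {
      Throwable cause = e.getCause();
      if (cause instanceof IOException) {
        throw (IOException) cause;
      }
      throw new IOException(cause);
    }
  }

  public static void main(String[] args) throws IOException {
    // A future that is already complete returns its value immediately.
    System.out.println(getWithTimeout(CompletableFuture.completedFuture("cluster-id"), 100));

    // A future that never completes stands in for a stuck registry call.
    CompletableFuture<String> stuck = new CompletableFuture<>();
    try {
      getWithTimeout(stuck, 100);
    } catch (IOException e) {
      System.out.println("failed fast: " + e.getMessage());
    }
  }
}
```

With a bounded wait like this, a thread holding a shared lock would release it after the timeout instead of parking forever, so other callers get a chance to retry.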

> ConnectionRegistry APIs should have timeout
> ---
>
> Key: HBASE-28428
> URL: https://issues.apache.org/jira/browse/HBASE-28428
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 2.4.17, 3.0.0-beta-1, 2.5.8
> Reporter: Viraj Jasani
> Assignee: Lokesh Khurana
> Priority: Major
>
> Came across a couple of instances where active master failover happens around 
> the same time as ZooKeeper leader failover, leaving the HBase client stuck if 
> one of its threads is blocked on one of the ConnectionRegistry RPC calls. 
> ConnectionRegistry APIs are wrapped with CompletableFuture. However, their 
> usages do not have any timeouts, which can potentially leave the entire 
> client stuck indefinitely, as we take some global locks. For 
> instance, _getKeepAliveMasterService()_ takes _masterLock_, hence if getting 
> the active master from _masterAddressZNode_ gets stuck, we can block any 
> admin operation that needs _getKeepAliveMasterService()_.
>  
> Sample stack trace of a thread that blocked all client operations requiring 
> a table descriptor from Admin:
> {code:java}
> jdk.internal.misc.Unsafe.park
> java.util.concurrent.locks.LockSupport.park
> java.util.concurrent.CompletableFuture$Signaller.block
> java.util.concurrent.ForkJoinPool.managedBlock
> java.util.concurrent.CompletableFuture.waitingGet
> java.util.concurrent.CompletableFuture.get
> org.apache.hadoop.hbase.client.ConnectionImplementation.get
> org.apache.hadoop.hbase.client.ConnectionImplementation.access$?
> org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStubNoRetries
> org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStub
> org.apache.hadoop.hbase.client.ConnectionImplementation.getKeepAliveMasterService
> org.apache.hadoop.hbase.client.ConnectionImplementation.getMaster
> org.apache.hadoop.hbase.client.MasterCallable.prepare
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable
> org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor
> org.apache.hadoop.hbase.client.HTable.getDescriptor
> org.apache.phoenix.query.ConnectionQueryServicesImpl.getTableDescriptor
> org.apache.phoenix.query.DelegateConnectionQueryServices.getTableDescriptor
> org.apache.phoenix.util.IndexUtil.isGlobalIndexCheckerEnabled
> org.apache.phoenix.execute.MutationState.filterIndexCheckerMutations
> org.apache.phoenix.execute.MutationState.sendBatch
> org.apache.phoenix.execute.MutationState.send
> org.apache.phoenix.execute.MutationState.send
> org.apache.phoenix.execute.MutationState.commit
> org.apache.phoenix.jdbc.PhoenixConnection$?.call
> org.apache.phoenix.jdbc.PhoenixConnection$?.call
> org.apache.phoenix.call.CallRunner.run
> org.apache.phoenix.jdbc.PhoenixConnection.commit {code}
> Another similar incident is captured in PHOENIX-7233. In this case, 
> retrieving the clusterId from the ZNode got stuck, and that blocked the client 
> from creating any more HBase connections. Stack trace for reference:
> {code:java}
> jdk.internal.misc.Unsafe.park
> java.util.concurrent.locks.LockSupport.park
> java.util.concurrent.CompletableFuture$Signaller.block
> java.util.concurrent.ForkJoinPool.managedBlock
> java.util.concurrent.CompletableFuture.waitingGet
> java.util.concurrent.CompletableFuture.get
> org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId
> org.apache.hadoop.hbase.client.ConnectionImplementation.
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance?
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance
> jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance
> java.lang.reflect.Constructor.newInstance
> org.apache.hadoop.hbase.client.ConnectionFactory.lambda$createConnection$?
> org.apache.hadoop.hbase.client.ConnectionFactory$$Lambda$?.run
> java.security.AccessController.doPrivileged
> javax.security.auth.Subject.doAs
> org.apache.hadoop.security.UserGroupInformation.doAs
> org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
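> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
> org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection
> org.apache.phoenix.query.ConnectionQueryServicesImpl.access$?
> org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
> org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call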

[jira] [Commented] (HBASE-28428) ConnectionRegistry APIs should have timeout

2024-03-11 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825280#comment-17825280
 ] 

Duo Zhang commented on HBASE-28428:
---

Oh, you are still using the ZooKeeper-based connection registry. Then I think 
the problem here is that you need timeout settings for ZooKeeper operations?

[jira] [Commented] (HBASE-28428) ConnectionRegistry APIs should have timeout

2024-03-11 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825279#comment-17825279
 ] 

Duo Zhang commented on HBASE-28428:
---

I think the correct way here is to introduce separate retry/timeout configs 
for the connection registry, like what we have for meta operations.
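
As a rough sketch of what a separate, registry-specific timeout setting could look like, the example below resolves a dedicated key first and falls back to a general client timeout. It uses java.util.Properties in place of the HBase Configuration class so that it stays self-contained, and both property names are hypothetical placeholders, not real HBase configuration keys.

```java
import java.util.Properties;

final class RegistryConfigSketch {

  // Hypothetical key for a registry-specific timeout (not a real HBase property).
  static final String REGISTRY_TIMEOUT_KEY = "example.client.registry.operation.timeout.ms";
  // Hypothetical key for the general client timeout used as the fallback.
  static final String GENERAL_TIMEOUT_KEY = "example.client.operation.timeout.ms";
  // Assumed default, used when neither key is set.
  static final long DEFAULT_TIMEOUT_MS = 60_000L;

  // Prefer the registry-specific value, then the general value, then the default.
  static long registryTimeoutMs(Properties conf) {
    String general = conf.getProperty(GENERAL_TIMEOUT_KEY, Long.toString(DEFAULT_TIMEOUT_MS));
    return Long.parseLong(conf.getProperty(REGISTRY_TIMEOUT_KEY, general));
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println(registryTimeoutMs(conf)); // default applies

    conf.setProperty(GENERAL_TIMEOUT_KEY, "30000");
    System.out.println(registryTimeoutMs(conf)); // general timeout applies

    conf.setProperty(REGISTRY_TIMEOUT_KEY, "5000");
    System.out.println(registryTimeoutMs(conf)); // registry-specific value wins
  }
}
```

Keeping the registry setting separate lets it be tuned more aggressively than ordinary operations without changing the behavior of the rest of the client.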