[jira] [Comment Edited] (HBASE-21796) RecoverableZooKeeper indefinitely retries a client stuck in AUTH_FAILED
[ https://issues.apache.org/jira/browse/HBASE-21796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774622#comment-16774622 ]

Ankit Singhal edited comment on HBASE-21796 at 2/22/19 12:34 AM:
-----------------------------------------------------------------

{quote}need to look more closely at when AUTH_FAILED would be thrown by the ZK server.{quote}
Probably in our case, AUTH_FAILED is thrown when ZK was not able to communicate with the KDC:
{noformat}
client.ZooKeeperSaslClient: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Connection refused (Connection refused))]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.
{noformat}
{quote}.003 adds some extra configuration properties ({{hbase.zookeeper.authfailed.retries.number}} default: 15, {{hbase.zookeeper.authfailed.pause}} default: 100){quote}
Can we use RetryCounter instead of:
{noformat}
+        observedAuthFailed++;
+        if (observedAuthFailed > allowedAuthFailedRetries) {
+          throw new RuntimeException("Exceeded the configured retries for handling ZooKeeper"
+              + " AUTH_FAILED exceptions (" + allowedAuthFailedRetries + ")");
+        }
+        // Avoid a fast retry loop.
+        if (LOG.isTraceEnabled()) {
+          LOG.trace("Sleeping " + authFailedPause + "ms before re-creating ZooKeeper object after"
+              + " AUTH_FAILED state");
+        }
+        TimeUnit.MILLISECONDS.sleep(authFailedPause);
{noformat}
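For illustration, a rough sketch of how {{RetryCounter}} could replace the manual bookkeeping above. This is a sketch of the suggestion, not the .003 patch; the class and member names ({{AuthFailedRetryExample}}, {{authFailedRetryFactory}}, {{onAuthFailed}}) are hypothetical:
{code:java}
import org.apache.hadoop.hbase.util.RetryCounter;
import org.apache.hadoop.hbase.util.RetryCounterFactory;

// Hypothetical names throughout; a sketch of the RetryCounter suggestion,
// not code from the attached patches.
public class AuthFailedRetryExample {
  private final RetryCounterFactory authFailedRetryFactory;

  public AuthFailedRetryExample(int allowedAuthFailedRetries, int authFailedPauseMs) {
    // Built once from the proposed hbase.zookeeper.authfailed.* properties.
    this.authFailedRetryFactory =
        new RetryCounterFactory(allowedAuthFailedRetries, authFailedPauseMs);
  }

  /** One counter per reconnect loop. */
  RetryCounter newAuthFailedCounter() {
    return authFailedRetryFactory.create();
  }

  /** Invoked each time the client is observed in AUTH_FAILED. */
  void onAuthFailed(RetryCounter counter) throws InterruptedException {
    if (!counter.shouldRetry()) {
      throw new RuntimeException("Exceeded the configured retries for handling ZooKeeper"
          + " AUTH_FAILED exceptions (" + counter.getMaxAttempts() + ")");
    }
    // Replaces the manual pause: sleeps the configured interval and
    // increments the attempt count in one call, avoiding a fast retry loop.
    counter.sleepUntilNextRetry();
  }
}
{code}
{{sleepUntilNextRetry()}} both sleeps and bumps the attempt count, so the counter, the bounds check, and the pause all collapse into the factory configuration.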
> RecoverableZooKeeper indefinitely retries a client stuck in AUTH_FAILED
> ------------------------------------------------------------------------
>
>                 Key: HBASE-21796
>                 URL: https://issues.apache.org/jira/browse/HBASE-21796
>             Project: HBase
>          Issue Type: Bug
>          Components: Zookeeper
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Major
>             Fix For: 1.5.1
>
>         Attachments: HBASE-21796.001.branch-1.patch, HBASE-21796.002.branch-1.patch, HBASE-21796.003.branch-1.patch
>
>
> We've observed the following situation inside of a RegionServer which leaves an HConnection in a broken state as a result of the ZooKeeper client having hit AUTH_FAILED in the Phoenix secondary indexing code-path. The result was that the HConnection used to write the secondary index updates failed every time the client re-attempted the write, but we had no outward signs from the HConnection that there was a problem with that HConnection instance.
> The ZooKeeper programmer docs tell us that if a ZooKeeper instance goes to the {{AUTH_FAILED}} state, we must open a new ZooKeeper instance: [https://zookeeper.apache.org/doc/r3.4.13/zookeeperProgrammers.html#ch_zkSessions]
> When a new HConnection (or one without a cached meta location) tries to access ZooKeeper to find meta's location or the cluster ID, it spins indefinitely because we can never access ZooKeeper: our client is broken by the AUTH_FAILED state. For the Phoenix use-case (where we're trying to use this HConnection within the RS), this breaks things pretty fast.
> The circumstances that caused us to observe this are not an HBase (or Phoenix or ZooKeeper) problem. The AUTH_FAILED exception we see is a result of networking issues on a user's system. Despite this, we can make our handling of this situation better.
> We already have logic inside of RecoverableZooKeeper to re-create a ZooKeeper object when we need one (e.g. session expired/closed). We can
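To make the description's remedy concrete, here is a minimal sketch (hypothetical, not from the attached patches) of the re-creation the ZooKeeper docs call for: a handle stuck in AUTH_FAILED can never recover, so the only option is to close it and construct a fresh instance. {{reconnectIfAuthFailed}} and its parameters are illustrative names:
{code:java}
import java.io.IOException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch of the check the description implies; not patch code.
public class AuthFailedReconnectExample {
  static ZooKeeper reconnectIfAuthFailed(ZooKeeper zk, String quorum,
      int sessionTimeoutMs, Watcher watcher) throws IOException, InterruptedException {
    if (zk.getState() != ZooKeeper.States.AUTH_FAILED) {
      // Handle is still usable (or recoverable on its own, e.g. CONNECTING).
      return zk;
    }
    // AUTH_FAILED is terminal for this handle: discard it and start over.
    zk.close();
    return new ZooKeeper(quorum, sessionTimeoutMs, watcher);
  }
}
{code}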