[jira] [Commented] (HBASE-28339) HBaseReplicationEndpoint creates new ZooKeeper client every time it tries to reconnect

2024-02-02 Thread Duo Zhang (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813879#comment-17813879 ]

Duo Zhang commented on HBASE-28339:
---

Thanks [~andor]!

> HBaseReplicationEndpoint creates new ZooKeeper client every time it tries to 
> reconnect
> --
>
> Key: HBASE-28339
> URL: https://issues.apache.org/jira/browse/HBASE-28339
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.6.0, 2.4.17, 3.0.0-beta-1, 2.5.7, 2.7.0
>Reporter: Andor Molnar
>Assignee: Andor Molnar
>Priority: Major
>
> Abstract base class {{HBaseReplicationEndpoint}}, and therefore 
> {{HBaseInterClusterReplicationEndpoint}}, creates a new ZooKeeper client 
> instance every time an error occurs in communication and it tries to 
> reconnect. This was not a problem with ZooKeeper 3.4.x, because the 
> TGT Login thread was a static reference, created only once for all clients 
> in the same JVM. With the upgrade to ZooKeeper 3.5.x the login thread is 
> dedicated to the client instance, so we get a new login thread every time 
> the replication endpoint reconnects.
> {code:java}
> /**
>  * A private method used to re-establish a zookeeper session with a peer cluster.
>  */
> protected void reconnect(KeeperException ke) {
>   if (
>     ke instanceof ConnectionLossException || ke instanceof SessionExpiredException
>       || ke instanceof AuthFailedException
>   ) {
>     String clusterKey = ctx.getPeerConfig().getClusterKey();
>     LOG.warn("Lost the ZooKeeper connection for peer " + clusterKey, ke);
>     try {
>       reloadZkWatcher();
>     } catch (IOException io) {
>       LOG.warn("Creation of ZookeeperWatcher failed for peer " + clusterKey, io);
>     }
>   }
> }{code}
> {code:java}
> /**
>  * Closes the current ZKW (if not null) and creates a new one
>  * @throws IOException If anything goes wrong connecting
>  */
> synchronized void reloadZkWatcher() throws IOException {
>   if (zkw != null) zkw.close();
>   zkw = new ZKWatcher(ctx.getConfiguration(), "connection to cluster: " + ctx.getPeerId(), this);
>   getZkw().registerListener(new PeerRegionServerListener(this));
> }{code}
> If the target cluster of replication is unavailable for some reason, the 
> replication endpoint keeps trying to reconnect to ZooKeeper, constantly 
> destroying and creating Login threads, which carpet-bombs the KDC host 
> with login requests.
>  
> I'm not sure how to fix this yet, trying to create a unit test first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-28339) HBaseReplicationEndpoint creates new ZooKeeper client every time it tries to reconnect

2024-02-02 Thread Viraj Jasani (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813841#comment-17813841 ]

Viraj Jasani commented on HBASE-28339:
--

Thank you [~andor]!!



[jira] [Commented] (HBASE-28339) HBaseReplicationEndpoint creates new ZooKeeper client every time it tries to reconnect

2024-02-02 Thread Andor Molnar (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813745#comment-17813745 ]

Andor Molnar commented on HBASE-28339:
--

Turned out that this is an issue in ZooKeeper, not in HBase. All of HBase's 
ZooKeeper connections are affected, not just replication.

ZK introduced the login-thread-per-client feature in ZOOKEEPER-2139, but the 
implementation shuts down and creates a new ZooKeeperSaslClient (and Login 
thread) every time it tries to connect. HBase waits for the answer 
indefinitely, because the ZK request timeout is set to 0 by default.

ClientCnxn.startConnect()
{code:java}
if (clientConfig.isSaslClientEnabled()) {
    try {
        if (zooKeeperSaslClient != null) {
            zooKeeperSaslClient.shutdown();
        }
        zooKeeperSaslClient =
            new ZooKeeperSaslClient(SaslServerPrincipal.getServerPrincipal(addr, clientConfig),
                clientConfig);
        ... {code}
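The snippet above recreates the SASL client (and its Login thread) on every connect attempt. A minimal sketch of the fix shape under discussion, guarding creation so the costly login happens once and is reused across reconnects. `SaslClientHolder` and `LoginThread` are hypothetical stand-ins for illustration, not real ZooKeeper classes:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SaslClientHolder {
    // Counts how many costly TGT logins (KDC round trips) were performed.
    static final AtomicInteger LOGINS = new AtomicInteger();

    static final class LoginThread {
        LoginThread() {
            LOGINS.incrementAndGet(); // stands in for the expensive KDC login
        }
    }

    private LoginThread login; // created lazily, then reused on reconnect

    synchronized LoginThread connect() {
        if (login == null) {
            login = new LoginThread(); // first attempt only
        }
        return login; // subsequent reconnect attempts reuse the same login
    }

    public static void main(String[] args) {
        SaslClientHolder holder = new SaslClientHolder();
        for (int i = 0; i < 5; i++) {
            holder.connect(); // five simulated reconnect attempts
        }
        System.out.println(LOGINS.get()); // one login, not five
    }
}
```

With the guard in place, five reconnect attempts produce a single login instead of five, which is exactly what stops the KDC from being flooded.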
I'll close this ticket and upload a fix for ZooKeeper soon.

cc [~zhangduo] 



[jira] [Commented] (HBASE-28339) HBaseReplicationEndpoint creates new ZooKeeper client every time it tries to reconnect

2024-02-01 Thread Andor Molnar (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813498#comment-17813498 ]

Andor Molnar commented on HBASE-28339:
--

No, it doesn't seem to. Unfortunately there's no backoff in the ZooKeeper 
client either; it tries to connect every second. _But_ if we could re-use the 
same ZK client instance instead of creating a new one, we wouldn't re-create 
the Login thread.



[jira] [Commented] (HBASE-28339) HBaseReplicationEndpoint creates new ZooKeeper client every time it tries to reconnect

2024-02-01 Thread Duo Zhang (Jira)


[ https://issues.apache.org/jira/browse/HBASE-28339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813446#comment-17813446 ]

Duo Zhang commented on HBASE-28339:
---

We do not have backoff when retrying here?
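If backoff were added to the reconnect loop, it could take the usual capped-exponential shape. This is an illustrative sketch only, not HBase's retry policy; the base of 100 ms and cap of 3 s are assumed values for the example:

```java
public class ReconnectBackoff {
    // 0-based attempt: base * 2^attempt, capped so waits stay bounded.
    static long delayMs(int attempt, long baseMs, long capMs) {
        long d = baseMs << Math.min(attempt, 20); // clamp the shift to avoid overflow
        return Math.min(d, capMs);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println(delayMs(attempt, 100, 3000));
        }
        // prints 100, 200, 400, 800, 1600, 3000
    }
}
```

Even with the login-thread fix on the ZooKeeper side, spacing out attempts like this would keep an unreachable peer cluster from being hammered once a second.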
