[jira] [Commented] (HBASE-28339) HBaseReplicationEndpoint creates new ZooKeeper client every time it tries to reconnect
[ https://issues.apache.org/jira/browse/HBASE-28339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813879#comment-17813879 ]

Duo Zhang commented on HBASE-28339:
-----------------------------------

Thanks [~andor]!

> HBaseReplicationEndpoint creates new ZooKeeper client every time it tries to reconnect
> --------------------------------------------------------------------------------------
>
>                 Key: HBASE-28339
>                 URL: https://issues.apache.org/jira/browse/HBASE-28339
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.6.0, 2.4.17, 3.0.0-beta-1, 2.5.7, 2.7.0
>            Reporter: Andor Molnar
>            Assignee: Andor Molnar
>            Priority: Major
>
> The abstract base class {{HBaseReplicationEndpoint}}, and therefore {{HBaseInterClusterReplicationEndpoint}}, creates a new ZooKeeper client instance every time an error occurs in communication and it tries to reconnect. This was not a problem with ZooKeeper 3.4.x, because the TGT Login thread was a static reference, created only once for all clients in the same JVM. Since the upgrade to ZooKeeper 3.5.x the login thread is dedicated to the client instance, so a new login thread is started every time the replication endpoint reconnects.
> {code:java}
> /**
>  * A private method used to re-establish a zookeeper session with a peer cluster.
>  */
> protected void reconnect(KeeperException ke) {
>   if (
>     ke instanceof ConnectionLossException || ke instanceof SessionExpiredException
>       || ke instanceof AuthFailedException
>   ) {
>     String clusterKey = ctx.getPeerConfig().getClusterKey();
>     LOG.warn("Lost the ZooKeeper connection for peer " + clusterKey, ke);
>     try {
>       reloadZkWatcher();
>     } catch (IOException io) {
>       LOG.warn("Creation of ZookeeperWatcher failed for peer " + clusterKey, io);
>     }
>   }
> }{code}
> {code:java}
> /**
>  * Closes the current ZKW (if not null) and creates a new one
>  * @throws IOException If anything goes wrong connecting
>  */
> synchronized void reloadZkWatcher() throws IOException {
>   if (zkw != null) zkw.close();
>   zkw = new ZKWatcher(ctx.getConfiguration(),
>     "connection to cluster: " + ctx.getPeerId(), this);
>   getZkw().registerListener(new PeerRegionServerListener(this));
> }{code}
> If the target cluster of replication is unavailable for some reason, the replication endpoint keeps trying to reconnect to ZooKeeper, destroying and creating new Login threads constantly, which will carpet-bomb the KDC host with login requests.
>
> I'm not sure how to fix this yet; trying to create a unit test first.
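For context on the snippets quoted above: a ConnectionLossException is recoverable by the existing ZooKeeper client, which keeps retrying in the background, while SessionExpiredException and AuthFailedException genuinely require a new client instance. A minimal, hypothetical sketch of a reconnect that rebuilds the ZKWatcher, and with it the Login thread, only in the latter cases; method and field names mirror the quoted code, and this is not a committed fix:

{code:java}
// Hypothetical sketch, not a committed fix: only rebuild the client when the
// session is actually gone. On ConnectionLossException the existing client
// reconnects on its own, so its Login thread can stay alive.
protected void reconnect(KeeperException ke) {
  if (ke instanceof SessionExpiredException || ke instanceof AuthFailedException) {
    String clusterKey = ctx.getPeerConfig().getClusterKey();
    LOG.warn("Lost the ZooKeeper session for peer " + clusterKey, ke);
    try {
      reloadZkWatcher(); // new client and new Login thread; unavoidable here
    } catch (IOException io) {
      LOG.warn("Creation of ZookeeperWatcher failed for peer " + clusterKey, io);
    }
  } else if (ke instanceof ConnectionLossException) {
    // Keep the current client; it retries the connection internally.
    LOG.warn("ZooKeeper connection lost for peer "
      + ctx.getPeerConfig().getClusterKey() + ", waiting for it to recover", ke);
  }
}
{code}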
[jira] [Commented] (HBASE-28339) HBaseReplicationEndpoint creates new ZooKeeper client every time it tries to reconnect
[ https://issues.apache.org/jira/browse/HBASE-28339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813841#comment-17813841 ]

Viraj Jasani commented on HBASE-28339:
--------------------------------------

Thank you [~andor]!!
[jira] [Commented] (HBASE-28339) HBaseReplicationEndpoint creates new ZooKeeper client every time it tries to reconnect
[ https://issues.apache.org/jira/browse/HBASE-28339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813745#comment-17813745 ]

Andor Molnar commented on HBASE-28339:
--------------------------------------

It turns out this is an issue in ZooKeeper, not in HBase. All of HBase's ZooKeeper connections are affected, not just replication. ZooKeeper introduced the Login-thread-per-client feature in ZOOKEEPER-2139, but the implementation shuts down and creates a new ZooKeeperSaslClient (and Login thread) every time it tries to connect. HBase will wait for the answer indefinitely, because the ZK request timeout is set to 0 by default.

From ClientCnxn.startConnect():
{code:java}
if (clientConfig.isSaslClientEnabled()) {
    try {
        if (zooKeeperSaslClient != null) {
            zooKeeperSaslClient.shutdown();
        }
        zooKeeperSaslClient = new ZooKeeperSaslClient(
            SaslServerPrincipal.getServerPrincipal(addr, clientConfig), clientConfig);
        ...
{code}

I'll close this ticket and upload a fix for ZooKeeper soon.

cc [~zhangduo]
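A side note on the request-timeout behavior mentioned above: the ZooKeeper client reads the {{zookeeper.request.timeout}} system property, and its default of 0 means a request blocks until the connection recovers. A hedged, illustrative example of bounding that wait on the client side; this only limits the blocking and does not address the login-thread churn itself:

{code:java}
// Illustrative only: bound how long a ZooKeeper request may block instead of
// accepting the default of 0 (wait indefinitely). Must be set before the
// ZooKeeper client instance is created.
System.setProperty("zookeeper.request.timeout", "60000"); // value in milliseconds
{code}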
[jira] [Commented] (HBASE-28339) HBaseReplicationEndpoint creates new ZooKeeper client every time it tries to reconnect
[ https://issues.apache.org/jira/browse/HBASE-28339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813498#comment-17813498 ]

Andor Molnar commented on HBASE-28339:
--------------------------------------

No, it doesn't seem to. Neither does the ZooKeeper client, unfortunately; it tries to connect every second. _But_ if we could re-use the same ZK client instance instead of creating a new one, we wouldn't re-create the Login thread.
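A rough sketch of that reuse idea follows; {{isSessionAlive()}} is a hypothetical helper standing in for a check of the underlying client's state (e.g. via ZooKeeper.States#isAlive), and none of this is committed code:

{code:java}
// Hypothetical sketch: replace the ZKWatcher only when the underlying session
// is no longer alive; otherwise reuse the same client and its Login thread.
synchronized void reloadZkWatcher() throws IOException {
  if (zkw != null && isSessionAlive(zkw)) {
    // Connection-level errors recover inside the existing client; do nothing.
    return;
  }
  if (zkw != null) {
    zkw.close();
  }
  zkw = new ZKWatcher(ctx.getConfiguration(),
    "connection to cluster: " + ctx.getPeerId(), this);
  zkw.registerListener(new PeerRegionServerListener(this));
}
{code}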
[jira] [Commented] (HBASE-28339) HBaseReplicationEndpoint creates new ZooKeeper client every time it tries to reconnect
[ https://issues.apache.org/jira/browse/HBASE-28339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813446#comment-17813446 ]

Duo Zhang commented on HBASE-28339:
-----------------------------------

We do not have backoff when retrying here?
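For reference, one shape such a backoff could take; this is a sketch only, and the counter, bounds, and method name are made up for illustration:

{code:java}
// Illustrative sketch (not existing HBase code): exponential backoff between
// reconnect attempts so an unavailable peer does not cause a tight reconnect
// loop, which, with ZooKeeper 3.5.x, also means a flood of KDC logins.
private int reconnectAttempts = 0;

protected void reconnectWithBackoff(KeeperException ke) {
  // 1s, 2s, 4s, ... capped at 60s; a real implementation would reset the
  // counter once the connection is successfully re-established.
  long backoffMs = Math.min(60_000L, 1_000L << Math.min(reconnectAttempts, 6));
  reconnectAttempts++;
  try {
    Thread.sleep(backoffMs);
  } catch (InterruptedException ie) {
    Thread.currentThread().interrupt();
    return;
  }
  reconnect(ke); // the existing method quoted in the issue description
}
{code}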