Hello, I'm running approximately 20 Java processes on one host. Each process connects to ZooKeeper but places very little load on it. The ZooKeeper cluster consists of 9 nodes.

When the ZooKeeper cluster is healthy, all is well. However, when the cluster goes down, the clients create significant load on the host as they attempt to reconnect. Each client tries to connect to each of the 9 nodes listed for the cluster, in succession. If the connection fails to all hosts, it waits 1 second before trying again. So with 20 clients and 9 nodes, I get about 180 attempted connections per second on one host. The ZooKeeper outage was already a problem; now the clients are creating excessive load as well, compounding the issue.

This is the code I've narrowed it down to. Unfortunately the 1 second delay between attempts is hard-coded.

https://github.com/apache/zookeeper/blob/release-3.4.6/src/java/main/org/apache/zookeeper/ClientCnxn.java#L940

    private void startConnect() throws IOException {
        state = States.CONNECTING;

        InetSocketAddress addr;
        if (rwServerAddress != null) {
            addr = rwServerAddress;
            rwServerAddress = null;
        } else {
            addr = hostProvider.next(1000);
        }

Is the typical pattern to use a load balancer, so that the client specifies only one endpoint and as a result attempts only 1 connection per second? Any other recommendations?

I would have thought this was a common problem, but my searches failed to find existing discussions of it.

Thanks,
Luke
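
P.S. To make the arithmetic above concrete, here is a rough sketch of the retry pattern as I understand it. It is only a model, not ZooKeeper code: the class name ReconnectLoad and the method attemptsPerClient are made up for illustration, and it assumes every connection attempt fails immediately (for example, connection refused), so the only time spent is the fixed pause after each full pass over the server list.

```java
/** Rough model of the reconnect pattern: try every server, then pause once per full pass. */
public class ReconnectLoad {

    /**
     * Connection attempts made by one client within windowMs.
     * Assumption: each attempt fails instantly, so only the pause consumes time.
     */
    public static int attemptsPerClient(int serverCount, long spinDelayMs, long windowMs) {
        int attempts = 0;
        long elapsedMs = 0;
        while (elapsedMs < windowMs) {
            attempts += serverCount;   // one failed pass over every server in the list
            elapsedMs += spinDelayMs;  // fixed pause before the next pass
        }
        return attempts;
    }

    public static void main(String[] args) {
        int clients = 20;          // Java processes on the host
        int servers = 9;           // nodes in the ZooKeeper cluster
        long spinDelayMs = 1000;   // the value passed to hostProvider.next(...)

        int perSecond = clients * attemptsPerClient(servers, spinDelayMs, 1000);
        System.out.println(perSecond + " connection attempts per second");
    }
}
```

With the numbers from my setup this prints "180 connection attempts per second". In the same model, a longer pause would cut the rate proportionally, which is why the hard-coded 1000 matters here.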
