I'd say adding some randomness here would help, e.g. waiting a random 700-1300 ms between attempts instead of the hard-coded one second, so the clients don't all retry at the same moment.
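
Roughly what I have in mind, as a sketch only. `JitteredDelay` and `nextDelayMs` are made-up names for illustration, not anything in the ZooKeeper client, and this assumes the fixed 1000 passed to `hostProvider.next(...)` could be replaced by a computed value:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; not part of the ZooKeeper API.
final class JitteredDelay {

    // Returns a delay in milliseconds drawn uniformly from [minMs, maxMs].
    static long nextDelayMs(long minMs, long maxMs) {
        // nextLong's upper bound is exclusive, hence the + 1.
        return ThreadLocalRandom.current().nextLong(minMs, maxMs + 1);
    }

    public static void main(String[] args) {
        // Print a few sample delays in the 700-1300 ms window.
        for (int i = 0; i < 5; i++) {
            System.out.println(nextDelayMs(700, 1300));
        }
    }
}
```

With a spread like this the 20 processes would drift apart after a few rounds rather than all hitting the hosts in the same instant. The average retry rate stays about the same, so it smooths the spikes but does not reduce the total number of attempts.
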
2014-06-19 18:14 GMT-04:00 Luke Stephenson <[email protected]>:

> Hello,
>
> I'm running approximately 20 java processes on one host. Each process
> connects to zookeeper, but places very little load on zookeeper. The
> zookeeper cluster consists of 9 nodes.
>
> When the zookeeper cluster is healthy, all is well. However when the
> zookeeper cluster goes down, the clients create significant load on the
> host as they attempt to reconnect to zookeeper.
>
> Each zookeeper client attempts to connect to each of the 9 nodes listed in
> the zookeeper cluster, in succession. If the connection fails to all hosts
> it will wait 1 second before trying again. So every second I've got 180
> attempted connections on one host. I already had a problem with the
> zookeeper cluster being down, now the clients are creating excessive load
> as well compounding the issue.
>
> This is the code which I've narrowed it down to. Unfortunately the 1
> second delay between attempts is hard coded.
>
> https://github.com/apache/zookeeper/blob/release-3.4.6/src/java/main/org/apache/zookeeper/ClientCnxn.java#L940
>
> private void startConnect() throws IOException {
>     state = States.CONNECTING;
>
>     InetSocketAddress addr;
>     if (rwServerAddress != null) {
>         addr = rwServerAddress;
>         rwServerAddress = null;
>     } else {
>         addr = hostProvider.next(1000);
>     }
>
> Is the typical pattern to use a load balancer so that the client only
> specifies one endpoint and as a result only attempts to establish 1
> connection per second? Any other recommendations?
>
> I would have thought this was a common problem, but my searches failed to
> find existing discussions on it.
>
> Thanks
>
> Luke
>
> PS Apologies if you have received this twice. I initially published from
> nabble which appears to have failed.
>
