[ https://issues.apache.org/jira/browse/KAFKA-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jun Rao resolved KAFKA-3984.
----------------------------
    Resolution: Duplicate

Marking this as duplicate. The fix will be done in KAFKA-5473.

> Broker doesn't retry reconnecting to an expired Zookeeper connection
> --------------------------------------------------------------------
>
>                 Key: KAFKA-3984
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3984
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.9.0.1, 0.10.1.1
>            Reporter: Braedon Vickers
>
> We've been having issues with the network connectivity of our Kafka cluster,
> and this seems to be triggering an issue where the brokers stop trying to
> reconnect to Zookeeper, leaving us with a broken cluster even when the
> network has recovered.
> When network issues begin we see {{java.net.NoRouteToHostException}}
> exceptions from {{org.apache.zookeeper.ClientCnxn}} as it attempts to
> re-establish the connection. If the network issue resolves itself while we
> are only getting these errors the broker seems to reconnect fine.
> However, a lot of the time we end up with a message like this:
> {code}[2016-07-22 00:21:44,181] FATAL Could not establish session with zookeeper (kafka.server.KafkaHealthcheck)
> org.I0Itec.zkclient.exception.ZkException: Unable to connect to <zookeeper hosts>
>       at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:71)
>       at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1279)
> ...
> Caused by: java.net.UnknownHostException: <zookeeper host>
>       at java.net.InetAddress.getAllByName(InetAddress.java:1126)
>       at java.net.InetAddress.getAllByName(InetAddress.java:1192)
>       at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
>       at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
> ...
> {code}
> (apologies for the partial stack traces - I'm having to try and reconstruct
> them from a less than ideal centralised logging setup.)
> If this happens, the broker stops trying to reconnect to Zookeeper, and we
> have to restart it.
> It looks like while the {{org.apache.zookeeper.ZooKeeper}} client's state
> isn't {{Expired}} it will keep retrying the connection, and will recover OK
> when the network is back. However, once it changes to {{Expired}} (not
> entirely sure how that happens - based on the session timeout perhaps?)
> zkclient closes the existing client and attempts to create a new one. If the
> network is still down, the client constructor throws a
> {{java.net.UnknownHostException}}, zkclient calls
> {{handleSessionEstablishmentError()}} on {{KafkaHealthcheck}}, and
> {{KafkaHealthcheck.handleSessionEstablishmentError()}} logs a "Fatal" error
> and does nothing else.
> It seems like some form of retry needs to happen here, or the broker is stuck
> with no Zookeeper connection indefinitely.
> {{KafkaHealthcheck.handleSessionEstablishmentError()}} used to kill the JVM,
> but that was removed in https://issues.apache.org/jira/browse/KAFKA-2405.
> Killing the JVM would be better than doing nothing, as then your init system
> could restart it, allowing it to recover once the network was back.
> Our cluster is running 0.9.0.1, so not sure if it affects 0.10.0.0 as well.
> However, it seems likely, as there don't seem to be any code changes in
> kafka or zkclient that would affect this behaviour.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
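
For illustration only, here is a rough sketch of the "some form of retry" the report asks for: a bounded retry loop with exponential backoff around session creation, giving up after a fixed number of attempts. It is not Kafka or zkclient code and not the fix tracked in KAFKA-5473. {{ReconnectSketch}}, {{SessionFactory}}, {{Outcome}} and {{retryWithBackoff}} are hypothetical names, and the host name in {{main}} is a placeholder.

```java
import java.net.UnknownHostException;

public class ReconnectSketch {

    /** Result of the retry loop. */
    public enum Outcome { CONNECTED, GAVE_UP }

    /**
     * Hypothetical stand-in for "create a new ZooKeeper client".
     * Assumed to throw (e.g. UnknownHostException) while the network is down.
     */
    public interface SessionFactory {
        void connect() throws Exception;
    }

    /**
     * Tries to connect up to maxAttempts times, doubling the delay between
     * attempts up to maxDelayMs. Returns GAVE_UP if every attempt fails or
     * the thread is interrupted while waiting.
     */
    public static Outcome retryWithBackoff(SessionFactory factory, int maxAttempts,
                                           long initialDelayMs, long maxDelayMs) {
        long delay = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                factory.connect();
                return Outcome.CONNECTED;
            } catch (Exception e) {
                System.err.printf("attempt %d/%d failed: %s%n", attempt, maxAttempts, e);
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return Outcome.GAVE_UP;
                }
                delay = Math.min(delay * 2, maxDelayMs);
            }
        }
        return Outcome.GAVE_UP;
    }

    public static void main(String[] args) {
        // Simulated network that recovers on the third attempt.
        int[] calls = {0};
        SessionFactory flaky = () -> {
            if (++calls[0] < 3) {
                throw new UnknownHostException("zk.example.invalid");
            }
        };
        Outcome outcome = retryWithBackoff(flaky, 5, 10, 100);
        System.out.println(outcome + " after " + calls[0] + " attempts");
        // prints "CONNECTED after 3 attempts"
    }
}
```

On {{GAVE_UP}}, a caller could fall back to the behaviour the reporter prefers over doing nothing: exit the process so that an init system can restart it.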