[ 
https://issues.apache.org/jira/browse/KAFKA-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-3984.
----------------------------
    Resolution: Duplicate

Marking this as duplicate. The fix will be done in KAFKA-5473.

> Broker doesn't retry reconnecting to an expired Zookeeper connection
> --------------------------------------------------------------------
>
>                 Key: KAFKA-3984
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3984
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.9.0.1, 0.10.1.1
>            Reporter: Braedon Vickers
>
> We've been having issues with the network connectivity of our Kafka cluster, 
> and this seems to be triggering an issue where the brokers stop trying to 
> reconnect to Zookeeper, leaving us with a broken cluster even when the 
> network has recovered.
> When network issues begin, we see {{java.net.NoRouteToHostException}} 
> exceptions from {{org.apache.zookeeper.ClientCnxn}} as it attempts to 
> re-establish the connection. If the network issue resolves itself while we 
> are only getting these errors, the broker seems to reconnect fine.
> However, a lot of the time we end up with a message like this:
> {code}
> [2016-07-22 00:21:44,181] FATAL Could not establish session with zookeeper (kafka.server.KafkaHealthcheck)
> org.I0Itec.zkclient.exception.ZkException: Unable to connect to <zookeeper hosts>
>       at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:71)
>       at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1279)
> ...
> Caused by: java.net.UnknownHostException: <zookeeper host>
>       at java.net.InetAddress.getAllByName(InetAddress.java:1126)
>       at java.net.InetAddress.getAllByName(InetAddress.java:1192)
>       at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
>       at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
> ...
> {code}
> (apologies for the partial stack traces - I'm having to try and reconstruct 
> them from a less than ideal centralised logging setup.)
> If this happens, the broker stops trying to reconnect to Zookeeper, and we 
> have to restart it.
> It looks like, as long as the {{org.apache.zookeeper.ZooKeeper}} client's 
> state isn't {{Expired}}, it will keep retrying the connection, and will 
> recover OK when the network is back. However, once it changes to {{Expired}} 
> (not entirely sure how that happens - based on the session timeout perhaps?), 
> zkclient closes the existing client and attempts to create a new one. If the 
> network is still down, the client constructor throws a 
> {{java.net.UnknownHostException}}, zkclient calls 
> {{handleSessionEstablishmentError()}} on {{KafkaHealthcheck}}, and that 
> handler logs a "Fatal" error and does nothing else.
> It seems like some form of retry needs to happen here, or the broker is stuck 
> with no Zookeeper connection indefinitely. 
> {{KafkaHealthcheck.handleSessionEstablishmentError()}} used to kill the JVM, 
> but that was removed in https://issues.apache.org/jira/browse/KAFKA-2405. 
> Killing the JVM would be better than doing nothing, as then your init system 
> could restart it, allowing it to recover once the network was back.
> Our cluster is running 0.9.0.1, so I'm not sure if it affects 0.10.0.0 as 
> well. However, it seems likely, as there don't seem to be any code changes 
> in Kafka or zkclient that would affect this behaviour.
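
For illustration only, the "some form of retry" asked for above could look
roughly like the sketch below. It is not the fix planned in KAFKA-5473 and it
uses no Kafka or zkclient API: SessionFactory, Sleeper, connectWithRetry and
backoffMillis are hypothetical names invented for this example, and the host
name is a placeholder. The idea is simply to retry session creation with a
capped exponential backoff and to surface a hard failure once the attempts
are exhausted, instead of logging once and giving up.

```java
import java.net.UnknownHostException;
import java.util.concurrent.atomic.AtomicInteger;

public class SessionRetrySketch {

    /** Hypothetical stand-in for "create a new ZooKeeper client"; not a zkclient API. */
    interface SessionFactory {
        String connect() throws Exception;
    }

    /** Hypothetical hook so the wait between attempts can be stubbed out. */
    interface Sleeper {
        void sleep(long millis) throws InterruptedException;
    }

    /** Delay before the next attempt: doubles per attempt (1-based), capped at maxMillis. */
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis;
        for (int i = 1; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /** Retries factory.connect() up to maxAttempts times, then fails loudly. */
    static String connectWithRetry(SessionFactory factory, int maxAttempts,
                                   long baseMillis, long maxMillis, Sleeper sleeper)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return factory.connect();
            } catch (Exception e) {
                last = e;
                // Wait before the next attempt, but not after the final one.
                if (attempt < maxAttempts) {
                    sleeper.sleep(backoffMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        // The caller decides what "fatal" means here, e.g. exiting so that
        // an init system can restart the process.
        throw new IllegalStateException(
                "Could not establish session after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Simulated network outage: the first two attempts fail to resolve the host.
        SessionFactory flaky = () -> {
            if (calls.incrementAndGet() < 3) {
                throw new UnknownHostException("zk.example.invalid");
            }
            return "session-" + calls.get();
        };
        // No-op sleeper keeps the demo fast; real code would use Thread.sleep.
        String session = connectWithRetry(flaky, 5, 100, 1000, millis -> {});
        System.out.println(session + " after " + calls.get() + " attempts");
        // prints "session-3 after 3 attempts"
    }
}
```

Throwing once the attempts run out leaves the choice between retrying
forever and exiting the JVM to the caller, which covers both options raised
in the description.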



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
