https://bugzilla.wikimedia.org/show_bug.cgi?id=69667
--- Comment #7 from Andrew Otto <[email protected]> ---
I have done some more sleuthing on the zookeeper timeouts today, and I want to record my thoughts.

- The timeout can happen when the broker is connected to any zookeeper.

- Timeouts happen more often than timeout-induced leader changes. That is, a Zookeeper session timeout does not necessarily mean produce errors.

- When there are timeout-induced leader changes, they seem to be caused by the zookeeper leader expiring a zookeeper session, rather than by a broker reconnecting to zookeeper because of a timeout. That is, both zookeeper server and kafka broker (zookeeper client) set a timeout of 16 seconds. (I believe they negotiate to the lower timeout if the settings don't match.) If kafka notices the timeout first, it will just close the connection and reconnect with the same session id. If the timeout happens to be noticed by zookeeper before kafka, then zookeeper expires the session. Kafka's own timeout will trigger after the session has been expired by zookeeper, and when it attempts to reconnect, it will be told it has an expired session, which causes it to have to reconnect a second time to ask for a new session.

- This also seems related to timeouts occurring when the broker (usually analytics1021) is also the controller for the kafka cluster. (The Controller is the one in charge of intra-cluster things like ISRs and leaders...I think).
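The two reconnect paths described in the third bullet can be sketched as a tiny model. This is only an illustration of the reasoning in this comment, not Kafka or ZooKeeper code: the names `Outcome` and `classify_timeout` are hypothetical, and treating a tie as "kafka noticed first" is an arbitrary assumption.

```python
from dataclasses import dataclass

SESSION_TIMEOUT_S = 16.0  # timeout value quoted in the comment


@dataclass
class Outcome:
    first_to_notice: str    # "kafka" or "zookeeper"
    reconnects: int         # connections the broker has to make afterwards
    keeps_session_id: bool  # False means the broker must ask for a new session


def classify_timeout(kafka_notices_at: float, zk_notices_at: float) -> Outcome:
    """Hypothetical model of the race between broker and zookeeper.

    Both arguments are seconds since the last successful heartbeat.
    A tie is treated as kafka noticing first (an assumption).
    """
    if kafka_notices_at <= zk_notices_at:
        # Broker closes the connection and reconnects with the same session id.
        return Outcome("kafka", reconnects=1, keeps_session_id=True)
    # Zookeeper expired the session first: the first reconnect is told the
    # session is expired, so a second reconnect is needed for a new session.
    return Outcome("zookeeper", reconnects=2, keeps_session_id=False)


if __name__ == "__main__":
    print(classify_timeout(SESSION_TIMEOUT_S, SESSION_TIMEOUT_S + 0.5))
    print(classify_timeout(SESSION_TIMEOUT_S + 0.5, SESSION_TIMEOUT_S))
```

Only the second case, where zookeeper notices first, matches the pattern that seems to lead to leader changes.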
As far as I can tell, all of the timeout-induced leader changes that we see are also accompanied by the following error messages.

On the broker:

  kafka.controller.KafkaController$SessionExpirationListener - [SessionExpirationListener on 21], ZK expired; shut down all controller components and try to re-elect

On the zookeeper leader:

  [ProcessThread:-1:PrepRequestProcessor@419] - Got user-level KeeperException when processing sessionid:0x46fd72a6d4243f type:create cxid:0x1 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/kafka/eqiad/controller Error:KeeperErrorCode = NodeExists for /kafka/eqiad/controller

That is, when the timed-out broker finally is able to reconnect (with a new session id), it attempts to re-register its previous controller status with zookeeper, only to find that another broker has taken over as controller.

I just went back to my email thread on the Kafka Users group and noticed that someone had replied[1] to my sleuthing about broker GCs back in July! I never saw this email response! Gah! I will look into this suggestion tomorrow.

[1] http://mail-archives.apache.org/mod_mbox/kafka-users/201407.mbox/%3CCAFbh0Q2f71qgs5JDNFxkm7SSdZyYMH=zpeoxotueqfkqexq...@mail.gmail.com%3E
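To check other incidents for the same pattern, the zookeeper-side KeeperException message quoted in this comment could be picked out of the logs with something like the sketch below. The regular expression is written against that single quoted line only, and assumes the fields are separated by single spaces on one line; the function names are hypothetical and real log lines may carry extra prefixes such as timestamps.

```python
import re

# Log line copied from this comment (whitespace between fields is assumed).
ZK_LINE = (
    "[ProcessThread:-1:PrepRequestProcessor@419] - Got user-level "
    "KeeperException when processing sessionid:0x46fd72a6d4243f type:create "
    "cxid:0x1 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a "
    "Error Path:/kafka/eqiad/controller "
    "Error:KeeperErrorCode = NodeExists for /kafka/eqiad/controller"
)

PATTERN = re.compile(
    r"sessionid:(?P<session>0x[0-9a-f]+)\s+type:(?P<type>\w+).*?"
    r"Error Path:(?P<path>\S+)\s+Error:KeeperErrorCode = (?P<code>\w+)"
)


def parse_keeper_exception(line):
    """Return the session id, request type, path and error code, or None."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None


def is_controller_reregistration_failure(
    line, controller_path="/kafka/eqiad/controller"
):
    """True if the line is a failed create on the controller znode."""
    fields = parse_keeper_exception(line)
    return (
        fields is not None
        and fields["type"] == "create"
        and fields["path"] == controller_path
        and fields["code"] == "NodeExists"
    )


if __name__ == "__main__":
    print(parse_keeper_exception(ZK_LINE))
    print(is_controller_reregistration_failure(ZK_LINE))
```

Matching lines could then be lined up by time against the broker's SessionExpirationListener messages to confirm that each leader change coincides with an expired controller session.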
