https://bugzilla.wikimedia.org/show_bug.cgi?id=69667
--- Comment #2 from Andrew Otto <[email protected]> ---
The core of this issue is a timeout of the Zookeeper connection, which neither Gage nor I have been able to solve.

Quick summary: Kafka brokers need to maintain a live connection with Zookeeper in order to remain in the ISR. Brokers set a timeout. If a broker can't talk to Zookeeper within this timeout, it will close the connection it has and attempt to open a new one, most likely to a different Zookeeper host than it had before. Zookeeper notices when the broker closes this connection, and then tells all of the other brokers that this broker has left the ISR. The broker is then demoted from leadership of any partitions it was leading. It usually takes this broker less than a second to reconnect to another Zookeeper host and rejoin the cluster.

Thus far, when this happens, Gage or I have logged in and manually started a preferred replica election. This brings the offending broker's leadership status back to normal.

This would only be an annoyance if it weren't for the small varnishkafka hiccups it causes. Ideally, when a broker loses partition leadership, producers would be notified of the metadata change quickly enough that their buffers don't fill up. We have noticed that for some higher volume partitions (upload and/or bits), some varnishkafkas drop messages during the short time that leadership metadata is being propagated.

Action items:

- Solve the Zookeeper timeout. Not sure how to replicate or fix this right now.

- Keep varnishkafka from dropping messages on metadata change. There is likely some tuning we can do to make sure we don't drop messages when partition leadership changes.

- Investigate auto partition balancing. This is a new feature in Kafka 0.8.1. It would eliminate the manual step of starting a preferred replica election. It won't solve either of the above problems, but it would allow the cluster to rebalance itself without manual intervention when this happens.
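
To make the second action item a bit more concrete, here is a rough back-of-the-envelope sketch of why only the higher volume partitions would drop messages. It assumes a producer simply queues messages while it has no usable leader and drops whatever does not fit. The function names and all of the numbers are hypothetical, not measurements from our cluster, and the "queue capacity" is assumed to correspond to whatever buffer limit varnishkafka's producer library enforces.

```python
def max_tolerable_stall(queue_capacity, msgs_per_sec, already_queued=0):
    """Seconds a producer can keep buffering with no usable partition
    leader before its queue is full (hypothetical model)."""
    if msgs_per_sec <= 0:
        return float("inf")
    free_slots = max(0, queue_capacity - already_queued)
    return free_slots / msgs_per_sec


def messages_dropped(queue_capacity, msgs_per_sec, stall_sec, already_queued=0):
    """Messages that no longer fit in the queue during a stall of
    stall_sec seconds (hypothetical model)."""
    produced = msgs_per_sec * stall_sec
    free_slots = max(0, queue_capacity - already_queued)
    return max(0, produced - free_slots)


if __name__ == "__main__":
    # Made-up numbers: a 100,000 message queue and a 3 second stall.
    for name, rate in [("high volume", 50000), ("low volume", 1000)]:
        print(name,
              "tolerates", max_tolerable_stall(100000, rate), "s,",
              "drops", messages_dropped(100000, rate, 3), "messages")
```

Under this model the tuning options are either a larger queue or a shorter window before the producer learns about the new leader; which of the two is practical for varnishkafka still needs to be checked.
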
See also: https://rt.wikimedia.org/Ticket/Display.html?id=6877
