https://bugzilla.wikimedia.org/show_bug.cgi?id=69667

--- Comment #2 from Andrew Otto <[email protected]> ---
The core of this issue is a timeout of the Zookeeper connection, which neither
Gage nor I have been able to solve.

Quick summary:  Kafka brokers need to maintain a live connection with Zookeeper
in order to remain in the ISR.  Brokers set a timeout.  If a broker can't talk
to Zookeeper within this timeout, it will close the connection it has and
attempt to open a new one, most likely to a different Zookeeper host than it
had before.  Zookeeper notices when the broker closes this connection, and then
tells all of the other brokers that this broker has left the ISR.  The broker
then loses leadership of any partitions it was leading.  It usually takes the
broker less than a second to reconnect to another Zookeeper host and rejoin the
cluster.  Thus far, when this happens, Gage or I have logged in and manually
started a preferred replica election.  This brings the offending broker's
leadership status back to normal.

This would only be an annoyance if it weren't for the small varnishkafka
hiccups it causes.  Ideally, when a broker loses partition leadership,
producers would be notified of the metadata change quickly enough that their
buffers don't fill up.  We have noticed that for some higher volume partitions (upload
and/or bits), some varnishkafkas drop messages during the short time that
leadership metadata is being propagated.
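
To make the buffer concern concrete, here is a rough back-of-the-envelope
sketch in Python.  The function names are hypothetical and the numbers in the
example are made up for illustration; they are not measurements from our
cluster or actual varnishkafka settings.  It assumes a producer that queues
messages at a steady rate while it has no reachable leader, and drops
whatever does not fit in its buffer.

```python
def seconds_until_buffer_full(buffer_capacity_msgs, msgs_per_second, already_buffered=0):
    """How long a producer can keep queueing with no reachable leader
    before its buffer overflows (illustrative model only)."""
    if msgs_per_second <= 0:
        return float("inf")
    free_slots = max(buffer_capacity_msgs - already_buffered, 0)
    return free_slots / msgs_per_second


def messages_dropped(buffer_capacity_msgs, msgs_per_second, outage_seconds, already_buffered=0):
    """Messages that do not fit in the buffer during an outage of the given length."""
    produced = msgs_per_second * outage_seconds
    free_slots = max(buffer_capacity_msgs - already_buffered, 0)
    return max(produced - free_slots, 0)


if __name__ == "__main__":
    # Hypothetical numbers: a 100000-message buffer and 50000 msgs/sec.
    print(seconds_until_buffer_full(100000, 50000))
    print(messages_dropped(100000, 50000, outage_seconds=3))
```

Under this model, a high volume partition leaves very little headroom: the
larger the message rate relative to the buffer, the shorter the metadata
propagation window has to be before messages are lost.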


Action items:
- Solve Zookeeper timeout.
  Not sure how to reproduce or fix this right now.

- Keep varnishkafka from dropping messages on metadata change.
  There is likely some tuning we can do to make sure we don't
  drop messages when partition leadership changes.

- Investigate auto partition balancing.
  This is a new feature in Kafka 0.8.1.  This would eliminate
  the manual step of starting a preferred replica election.
  This won't solve either of the above problems, but would
  allow the cluster to rebalance itself without manual
  intervention when this happens.
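
For the auto partition balancing item, here is a small Python sketch of the
check I understand it to perform.  It assumes the preferred replica for a
partition is the first broker in its replica list, and that a broker counts
as imbalanced when the fraction of partitions it should lead but currently
does not exceeds a threshold.  The 10% default below is meant to mirror
what I believe is the default of leader.imbalance.per.broker.percentage;
treat that as an assumption.  The function names, broker ids, and partition
layout are all hypothetical.

```python
from collections import defaultdict


def leader_imbalance(partitions):
    """Return {broker: fraction of partitions for which the broker is the
    preferred replica but is not the current leader}.

    partitions maps (topic, partition) -> {"replicas": [...], "leader": id}.
    """
    preferred_count = defaultdict(int)
    not_leading = defaultdict(int)
    for state in partitions.values():
        preferred = state["replicas"][0]  # assumption: first replica is preferred
        preferred_count[preferred] += 1
        if state["leader"] != preferred:
            not_leading[preferred] += 1
    return {b: not_leading[b] / preferred_count[b] for b in preferred_count}


def brokers_needing_election(partitions, threshold=0.10):
    """Brokers whose imbalance ratio exceeds the threshold, sorted by id."""
    ratios = leader_imbalance(partitions)
    return sorted(b for b, ratio in ratios.items() if ratio > threshold)


# Hypothetical layout: broker 21 lost leadership of ("upload", 0) to broker 22.
example_partitions = {
    ("upload", 0): {"replicas": [21, 22], "leader": 22},
    ("upload", 1): {"replicas": [22, 21], "leader": 22},
    ("bits", 0): {"replicas": [21, 22], "leader": 21},
}

if __name__ == "__main__":
    print(leader_imbalance(example_partitions))
    print(brokers_needing_election(example_partitions))
```

This is roughly the state Gage or I look for by hand before starting a
preferred replica election; with auto balancing the cluster would make the
same decision on its own.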

See also:
https://rt.wikimedia.org/Ticket/Display.html?id=6877
