https://bugzilla.wikimedia.org/show_bug.cgi?id=69667

--- Comment #7 from Andrew Otto <[email protected]> ---
I have done some more sleuthing on the zookeeper timeouts today, and I want to
record my thoughts.

- The timeout can happen when the broker is connected to any zookeeper.

- Timeouts happen more often than timeout-induced leader changes.  That is, a
Zookeeper session timeout does not necessarily lead to produce errors.

- When there are timeout-induced leader changes, they seem to be caused by the
zookeeper leader expiring a zookeeper session, rather than a broker
reconnecting to zookeeper because of a timeout.  That is, both zookeeper server
and kafka broker (zookeeper client) set a timeout of 16 seconds.  (I believe
they negotiate to the lower timeout if the settings don't match.)  If kafka
notices the timeout first, it will just close the connection and reconnect with
the same session id.  If the timeout happens to be noticed by zookeeper before
kafka, then zookeeper expires the session.  Kafka's own timeout will trigger
after the session has been expired by zookeeper, and when it attempts to
reconnect, it will be told it has an expired session, which causes it to have
to reconnect a second time to ask for a new session.

- This also seems related to timeouts occurring when the broker (usually
analytics1021) is also the controller for the kafka cluster.  (The Controller
is the one in charge of intra-cluster things like ISRs and leaders...I think.)
As far as I can tell, all of the timeout-induced leader changes that we see are
also accompanied by (on broker)

  kafka.controller.KafkaController$SessionExpirationListener  -
[SessionExpirationListener on 21], ZK expired; shut down all controller
components and try to re-elect

and (on zookeeper leader)

  [ProcessThread:-1:PrepRequestProcessor@419] - Got user-level KeeperException
when processing sessionid:0x46fd72a6d4243f type:create cxid:0x1
zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error
Path:/kafka/eqiad/controller Error:KeeperErrorCode = NodeExists for
/kafka/eqiad/controller

error messages.  That is, when the timed-out broker finally is able to
reconnect (with a new session id), it attempts to re-register its previous
controller status with zookeeper, only to find that another broker has taken
over as controller.
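
To make the two cases above concrete, here is a minimal Python sketch of the
race as described in this comment.  It is a toy model, not ZooKeeper or Kafka
code: the names (negotiate_timeout, timeout_outcome, ToyZk, the session-*
ids) are all made up for illustration, the "lower timeout wins" rule is only
my belief from above, and treating a tie as "zookeeper noticed first" is an
arbitrary choice.

```python
from dataclasses import dataclass


class NodeExists(Exception):
    """Toy stand-in for ZooKeeper's NodeExists KeeperException."""


@dataclass
class Outcome:
    session_expired: bool  # did zookeeper expire the session?
    reconnects: int        # reconnect attempts the broker needs


def negotiate_timeout(client_ms: int, server_ms: int) -> int:
    # Assumption from the comment ("I believe"): the lower setting wins.
    return min(client_ms, server_ms)


def timeout_outcome(broker_notices_ms: int, zk_notices_ms: int) -> Outcome:
    if broker_notices_ms < zk_notices_ms:
        # Broker closes the connection and reconnects with the same session id.
        return Outcome(session_expired=False, reconnects=1)
    # Zookeeper expired the session first: the first reconnect is rejected
    # as expired, and a second one is needed to ask for a new session.
    return Outcome(session_expired=True, reconnects=2)


class ToyZk:
    """Hypothetical in-memory model of ephemeral znodes (path -> session)."""

    def __init__(self):
        self.ephemeral = {}

    def create_ephemeral(self, path, session_id):
        if path in self.ephemeral:
            raise NodeExists(path)
        self.ephemeral[path] = session_id

    def expire(self, session_id):
        # Expiring a session removes every ephemeral node it owned.
        self.ephemeral = {
            p: s for p, s in self.ephemeral.items() if s != session_id
        }


def replay_controller_loss() -> str:
    zk = ToyZk()
    path = "/kafka/eqiad/controller"
    zk.create_ephemeral(path, "session-old")    # timed-out broker was controller
    zk.expire("session-old")                    # zookeeper expires its session
    zk.create_ephemeral(path, "session-other")  # another broker takes over
    try:
        zk.create_ephemeral(path, "session-new")  # old controller comes back
    except NodeExists:
        return "NodeExists"
    return "created"
```

In this model, timeout_outcome(16100, 16000) is the bad case (session
expired, two reconnects), and replay_controller_loss() ends in NodeExists,
which matches the error seen on the zookeeper leader.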


I just went back to my email thread to the Kafka Users group and noticed that
someone had replied[1] to my sleuthing about broker GCs back in July!  I never
saw this email response!  Gah!  I will look into this suggestion tomorrow.

[1]
http://mail-archives.apache.org/mod_mbox/kafka-users/201407.mbox/%3CCAFbh0Q2f71qgs5JDNFxkm7SSdZyYMH=zpeoxotueqfkqexq...@mail.gmail.com%3E
