Yicheng Fang created ZOOKEEPER-2899:
---------------------------------------

             Summary: Zookeeper not receiving packets after ZXID overflows
                 Key: ZOOKEEPER-2899
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2899
             Project: ZooKeeper
          Issue Type: Bug
          Components: leaderElection
    Affects Versions: 3.4.5
         Environment: 5 host ensemble, 1500+ client connections each, 300K+ 
nodes
OS: Ubuntu precise
JAVA 7
JuniperQFX510048T NIC, 10000Mb/s, ixgbe driver
6-core Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
4 HDD 600G each 
            Reporter: Yicheng Fang


ZK was used with Kafka (version 0.10.0) for coordination. We had a large number 
of Kafka consumers writing consumption offsets to ZK.

We observed the issue twice within the last year. Each time, after the ZXID 
overflowed, ZK stopped receiving packets even though leader election looked 
successful in the logs and the ZK servers were up. As a result, the whole Kafka 
system came to a halt.
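
For readers unfamiliar with what "ZXID overflows" means here, the following is 
a minimal sketch, assuming the usual ZooKeeper layout in which a ZXID is a 
64-bit value with the epoch in the high 32 bits and a per-epoch counter in the 
low 32 bits. The class `ZxidLayout` and its methods are hypothetical names 
used only for illustration; they are not taken from the ZooKeeper code base.

```java
// Hypothetical helper for illustration only; not part of ZooKeeper.
// Assumes: zxid = (epoch << 32) | counter, with a 32-bit counter.
public final class ZxidLayout {
    private static final long COUNTER_MASK = 0xffffffffL;

    /** High 32 bits: the leader epoch. */
    public static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    /** Low 32 bits: the transaction counter within the epoch. */
    public static long counterOf(long zxid) {
        return zxid & COUNTER_MASK;
    }

    /** Combine an epoch and a counter into a single 64-bit ZXID. */
    public static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & COUNTER_MASK);
    }

    /** True when the counter has no room left for another transaction. */
    public static boolean counterExhausted(long zxid) {
        return counterOf(zxid) == COUNTER_MASK;
    }

    public static void main(String[] args) {
        // Last ZXID that fits in epoch 5 (the epoch number is arbitrary).
        long last = makeZxid(5, COUNTER_MASK);
        System.out.printf("zxid=0x%016x epoch=%d counter=%d exhausted=%b%n",
                last, epochOf(last), counterOf(last), counterExhausted(last));

        // The next transaction needs a new epoch with the counter reset,
        // which in practice means a new leader election.
        long next = makeZxid(epochOf(last) + 1, 0);
        System.out.printf("zxid=0x%016x epoch=%d counter=%d%n",
                next, epochOf(next), counterOf(next));
    }
}
```

Under this assumption, an overflow is the point at which the low 32 bits are 
used up, so the ensemble has to move to a new epoch before it can assign 
another ZXID.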

In an attempt to reproduce (and hopefully fix) the issue, I set up test ZK 
and Kafka clusters and fed them production-like test traffic. Although I was 
not able to reproduce the issue itself, I did see that the Kafka consumers, 
which use ZK clients, essentially DoSed the ensemble: they filled up the 
`submittedRequests` queue in `PrepRequestProcessor`, pushing read latencies 
to 100ms and above.
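
To illustrate why a filled request queue shows up as latency, here is a rough 
back-of-the-envelope sketch. It assumes a queue that is drained at a fixed 
rate and is not bounded, which may not match the real processor pipeline 
exactly. The class `BacklogEstimate` and all rates below are hypothetical 
values chosen for illustration, not measurements from the test clusters.

```java
// Hypothetical model for illustration only; the rates are made-up inputs,
// not measurements from the ensemble described in this report.
public final class BacklogEstimate {

    /** Requests still queued after `seconds` of sustained load (never negative). */
    public static long backlogAfter(long arrivalsPerSec, long servicedPerSec, long seconds) {
        return Math.max(0L, (arrivalsPerSec - servicedPerSec) * seconds);
    }

    /** Milliseconds a newly queued request waits behind `backlog` earlier requests. */
    public static double queueWaitMillis(long backlog, long servicedPerSec) {
        return backlog * 1000.0 / servicedPerSec;
    }

    public static void main(String[] args) {
        long arrivalsPerSec = 12_000;  // assumed client request rate
        long servicedPerSec = 10_000;  // assumed drain rate of the queue

        long backlog = backlogAfter(arrivalsPerSec, servicedPerSec, 1);
        double waitMs = queueWaitMillis(backlog, servicedPerSec);
        System.out.printf("backlog after 1s: %d requests, added wait: %.1f ms%n",
                backlog, waitMs);
    }
}
```

In this simplified model, any sustained excess of arrivals over the drain rate 
grows the backlog without limit, and every request behind that backlog 
inherits the wait.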

More details are included in the comments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
