I've been performance testing 3.5.2 and hit an interesting unavailability issue.
When the server is very busy (64k connections, 16k writes per second), the leader can get busy enough that connections get throttled. Enough throttling causes sessions to expire. As sessions expire, CPU consumption rises and the quorum is effectively unavailable. Interestingly, even if you shut down all the clients, the quorum won't heal for nearly 10 minutes.

The issue is that the outstandingChanges queue has 250k items in it and the closeSession code scans it linearly under a lock, so every expiring session pays for a full pass over the queue while holding that lock. Replacing the linear scan with a hash table lookup improves this, but the real solution is likely some backpressure on clients when the outstandingChanges queue grows too large.

Here is a sample fix: https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c

With this change, the quorum heals about 30 seconds after the clients disconnect.

Is there a way to prevent runaway growth in this queue? I'm wondering if changing the definition of "throttling" to take into account the size of this queue might help mitigate this. The end goal is that traffic settles asymptotically at some stable level without suffering a collapse.

Thanks,
-Mike
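
For illustration, here is a minimal, self-contained sketch of the general idea of replacing a linear scan with a hash lookup. It is not the code from the linked commit and not ZooKeeper's actual implementation: `PendingChange`, `IndexedChangeQueue`, and all method names are hypothetical, and it assumes changes are only ever removed from the head of the queue. It also ignores deletes, rollbacks, and everything else the real closeSession path has to handle.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one queued change; not ZooKeeper's actual record type.
final class PendingChange {
    final String path;
    final long ephemeralOwner; // owning session id, or 0 if the node is not ephemeral

    PendingChange(String path, long ephemeralOwner) {
        this.path = path;
        this.ephemeralOwner = ephemeralOwner;
    }
}

// FIFO queue of pending changes plus a per-session index kept in step with it.
final class IndexedChangeQueue {
    private final Deque<PendingChange> queue = new ArrayDeque<>();
    private final Map<Long, Deque<PendingChange>> bySession = new HashMap<>();

    synchronized void add(PendingChange c) {
        queue.addLast(c);
        if (c.ephemeralOwner != 0) {
            bySession.computeIfAbsent(c.ephemeralOwner, k -> new ArrayDeque<>()).addLast(c);
        }
    }

    // Removes the oldest change (assumption: changes only ever leave from the head).
    synchronized PendingChange poll() {
        PendingChange c = queue.pollFirst();
        if (c != null && c.ephemeralOwner != 0) {
            Deque<PendingChange> owned = bySession.get(c.ephemeralOwner);
            // Per-session order matches queue order, so c is the head of its session's deque.
            owned.pollFirst();
            if (owned.isEmpty()) {
                bySession.remove(c.ephemeralOwner);
            }
        }
        return c;
    }

    // Linear approach: visits every queued change while holding the lock.
    // sessionId is assumed to be non-zero.
    synchronized List<String> pathsOwnedByLinear(long sessionId) {
        List<String> paths = new ArrayList<>();
        for (PendingChange c : queue) {
            if (c.ephemeralOwner == sessionId) {
                paths.add(c.path);
            }
        }
        return paths;
    }

    // Indexed approach: one hash lookup, then only that session's own changes.
    synchronized List<String> pathsOwnedBy(long sessionId) {
        List<String> paths = new ArrayList<>();
        Deque<PendingChange> owned = bySession.get(sessionId);
        if (owned != null) {
            for (PendingChange c : owned) {
                paths.add(c.path);
            }
        }
        return paths;
    }

    // Queue depth; a throttling policy could consult this (hypothetical hook).
    synchronized int size() {
        return queue.size();
    }
}
```

With 250k queued items, `pathsOwnedByLinear` touches all of them for every expiring session, while `pathsOwnedBy` touches only the entries that session owns. The cost is an extra map update on every add and poll. The `size()` accessor is the kind of signal a queue-aware throttling definition could consult, though choosing a threshold is a separate question.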