I've been performance testing 3.5.2 and hit an interesting unavailability issue.

When the server is very busy (64k connections, 16k writes per
second) the leader can get busy enough that connections get throttled.
Enough throttling causes sessions to expire. As sessions expire, the
CPU consumption rises and the quorum is effectively unavailable.
Interestingly, if you shut down all the clients, the quorum won't heal
for nearly 10 minutes.

The issue is that the outstandingChanges queue has 250k items in it
and the closeSession code scans this linearly under a lock. Replacing
the linear scan with a hash table lookup improves this, but likely the
real solution is some backpressure on clients as a result of an
oversized outstandingChanges queue.
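
To make the hash table idea concrete, here is a minimal sketch in Java. It
is an illustration only, not taken from any ZooKeeper patch. The names
PendingChange and IndexedChangeQueue are hypothetical, and it assumes
changes are only appended at the tail and removed from the head, which
ignores other details of the real code. Each session gets its own FIFO
next to the global one, so a session close only touches that session's
entries instead of all 250k.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a queued change; not ZooKeeper's own record type. */
final class PendingChange {
    final long ownerSessionId;
    final String path;

    PendingChange(long ownerSessionId, String path) {
        this.ownerSessionId = ownerSessionId;
        this.path = path;
    }
}

/** FIFO of pending changes with a per-session index (assumed design). */
final class IndexedChangeQueue {
    private final ArrayDeque<PendingChange> queue = new ArrayDeque<>();
    private final Map<Long, ArrayDeque<PendingChange>> bySession = new HashMap<>();

    /** Appends a change to the global queue and to its session's queue. */
    synchronized void add(PendingChange change) {
        queue.addLast(change);
        bySession.computeIfAbsent(change.ownerSessionId, id -> new ArrayDeque<>())
                 .addLast(change);
    }

    /** Removes the oldest change, or returns null if the queue is empty. */
    synchronized PendingChange pollCommitted() {
        PendingChange change = queue.pollFirst();
        if (change == null) {
            return null;
        }
        // Both queues are FIFO, so the global head is also the head
        // of its own session's queue.
        ArrayDeque<PendingChange> perSession = bySession.get(change.ownerSessionId);
        perSession.pollFirst();
        if (perSession.isEmpty()) {
            bySession.remove(change.ownerSessionId);
        }
        return change;
    }

    /** Paths with pending changes for one session, oldest first. */
    synchronized List<String> pathsOwnedBy(long sessionId) {
        List<String> paths = new ArrayList<>();
        ArrayDeque<PendingChange> perSession = bySession.get(sessionId);
        if (perSession != null) {
            for (PendingChange change : perSession) {
                paths.add(change.path);
            }
        }
        return paths;
    }

    synchronized int size() {
        return queue.size();
    }
}
```

The lock is still held during the lookup, but the work done under it is
proportional to one session's pending changes rather than to the whole
queue.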

Here is a sample fix:
https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c

This results in the quorum healing about 30 seconds after the clients
disconnect.

Is there a way to prevent runaway growth in this queue? I'm wondering
if changing the definition of "throttling" to take into account the
size of this queue might help mitigate this. The end goal is that some
stable amount of traffic is reached asymptotically without suffering a
collapse.
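
One possible shape for a queue-aware throttle, sketched in Java under
stated assumptions: the class QueueAwareThrottle, its parameters, and the
watermark values are all hypothetical and not part of ZooKeeper. The idea
is to throttle when either the in-flight request count or the queue
length is too high, and to use separate high and low watermarks so the
decision does not flap while the queue drains.

```java
/** Hypothetical throttle that also considers the pending-change queue length. */
final class QueueAwareThrottle {
    private final int inFlightLimit;
    private final int highWatermark;
    private final int lowWatermark;
    private boolean queueThrottled;

    QueueAwareThrottle(int inFlightLimit, int highWatermark, int lowWatermark) {
        if (lowWatermark > highWatermark) {
            throw new IllegalArgumentException("lowWatermark must not exceed highWatermark");
        }
        this.inFlightLimit = inFlightLimit;
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    /**
     * Returns true if new requests should be held back.
     * Queue-based throttling starts at highWatermark and only stops
     * once the queue has drained to lowWatermark or below.
     */
    synchronized boolean shouldThrottle(int inFlightRequests, int queuedChanges) {
        if (queuedChanges >= highWatermark) {
            queueThrottled = true;
        } else if (queuedChanges <= lowWatermark) {
            queueThrottled = false;
        }
        return queueThrottled || inFlightRequests > inFlightLimit;
    }
}
```

With, say, new QueueAwareThrottle(1000, 10000, 5000), clients would be
held back once 10,000 changes are queued and released again only after
the queue falls to 5,000, which is one way to approach a stable load
instead of oscillating around a single limit.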

Thanks,
-Mike
