I also faced the same issue. There is a test patch in ZOOKEEPER-2570 which can be
used to quickly check performance gains in each modification. Hope it is
On Thu, Oct 13, 2016 at 1:27 AM, Mike Solomon <ms...@dropbox.com> wrote:
> I've been performance testing 3.5.2 and hit an interesting unavailability
> When the server is very busy (64k connections, 16k writes per
> second) the leader can get busy enough that connections get throttled.
> Enough throttling causes sessions to expire. As sessions expire, the
> CPU consumption rises and the quorum is effectively unavailable.
> Interestingly, if you shut down all the clients, the quorum won't heal
> for nearly 10 minutes.
> The issue is that the outstandingChanges queue has 250k items in it
> and the closeSession code scans this linearly under a lock. Replacing
> the linear scan with a hash table lookup improves this, but likely the
> real solution is some backpressure on clients as a result of an
> oversized outstandingChanges queue.
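
To illustrate the hash-lookup idea described above, here is a minimal sketch. It is not the ZooKeeper code and not the sample fix referred to in this thread; the class PendingChangeIndex, its Change record and all method names are hypothetical. It assumes each queued change carries an owning session id and a znode path, and keeps a per-session index next to the FIFO so that a session close can look up its pending paths without scanning the whole queue.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not ZooKeeper code: a FIFO of pending changes plus a
// per-session index, so closing a session does not scan the whole queue.
final class PendingChangeIndex {
    // One queued change: the owning session and the path it touches (assumed shape).
    static final class Change {
        final long sessionId;
        final String path;

        Change(long sessionId, String path) {
            this.sessionId = sessionId;
            this.path = path;
        }
    }

    private final Deque<Change> queue = new ArrayDeque<>();
    // sessionId -> (path -> number of queued changes for that path)
    private final Map<Long, Map<String, Integer>> countsBySession = new HashMap<>();

    // Append a change and record it in the index.
    synchronized void add(long sessionId, String path) {
        queue.addLast(new Change(sessionId, path));
        countsBySession
            .computeIfAbsent(sessionId, k -> new HashMap<>())
            .merge(path, 1, Integer::sum);
    }

    // Remove the oldest change and keep the index in sync with the queue.
    synchronized Change poll() {
        Change c = queue.pollFirst();
        if (c == null) {
            return null;
        }
        Map<String, Integer> counts = countsBySession.get(c.sessionId);
        // Returning null from the remapping function drops the path entry.
        counts.computeIfPresent(c.path, (p, n) -> n == 1 ? null : n - 1);
        if (counts.isEmpty()) {
            countsBySession.remove(c.sessionId);
        }
        return c;
    }

    // Hash lookup of the session's pending paths; the cost depends on the
    // number of paths for this session, not on the total queue length.
    synchronized Set<String> pathsOwnedBy(long sessionId) {
        Map<String, Integer> counts = countsBySession.get(sessionId);
        if (counts == null) {
            return Collections.emptySet();
        }
        return new HashSet<>(counts.keySet());
    }

    synchronized int size() {
        return queue.size();
    }
}
```

The trade-off is extra bookkeeping on every enqueue and dequeue in exchange for a cheap lookup at session close. It does not bound the queue itself, so it would not replace backpressure.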
> Here is a sample fix:
> This results in the quorum healing about 30 seconds after the clients
> Is there a way to prevent runaway growth in this queue? I'm wondering
> if changing the definition of "throttling" to take into account the
> size of this queue might help mitigate this. The end goal is that some
> stable amount of traffic is reached asymptotically without suffering a