Hi Mike,

I also faced the same issue. There is a test patch in ZOOKEEPER-2570 that can be used to quickly check the performance gain from each modification. Hope it is useful.
-Arshad

On Thu, Oct 13, 2016 at 1:27 AM, Mike Solomon <[email protected]> wrote:
> I've been performance testing 3.5.2 and hit an interesting unavailability
> issue.
>
> When the server is very busy (64k connections, 16k writes per
> second) the leader can get busy enough that connections get throttled.
> Enough throttling causes sessions to expire. As sessions expire, the
> CPU consumption rises and the quorum is effectively unavailable.
> Interestingly, if you shut down all the clients, the quorum won't heal
> for nearly 10 minutes.
>
> The issue is that the outstandingChanges queue has 250k items in it
> and the closeSession code scans this linearly under a lock. Replacing
> the linear scan with a hash table lookup improves this, but likely the
> real solution is some backpressure on clients as a result of an
> oversized outstandingChanges queue.
>
> Here is a sample fix:
> https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c
>
> This results in the quorum healing about 30 seconds after the clients
> disconnect.
>
> Is there a way to prevent runaway growth in this queue? I'm wondering
> if changing the definition of "throttling" to take into account the
> size of this queue might help mitigate this. The end goal is that some
> stable amount of traffic is reached asymptotically without suffering a
> collapse.
>
> Thanks,
> -Mike
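
P.S. For anyone following the thread, the hash-lookup idea described above can be sketched roughly as follows. This is only an illustration under my own assumptions, not the code from the linked commit or from ZOOKEEPER-2570: the class PendingChanges, its Change type, and the methods pendingPathsFor and shouldThrottle are hypothetical names, and the real queue entries carry more state than a session id and a path. The point is just that a per-session index kept next to the FIFO lets a session close touch only that session's entries instead of scanning all 250k, and that the queue size is cheap to consult if throttling were to take it into account.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a FIFO of pending changes plus a per-session index.
// None of these names are taken from the ZooKeeper code base.
class PendingChanges {
    // Simplified stand-in for a queued change record (assumption).
    static final class Change {
        final long ownerSession;
        final String path;

        Change(long ownerSession, String path) {
            this.ownerSession = ownerSession;
            this.path = path;
        }
    }

    // Global FIFO, in the order changes were submitted.
    private final ArrayDeque<Change> queue = new ArrayDeque<>();
    // Same entries, grouped by owning session, each group also in FIFO order.
    private final Map<Long, ArrayDeque<Change>> bySession = new HashMap<>();

    synchronized void add(Change c) {
        queue.addLast(c);
        bySession.computeIfAbsent(c.ownerSession, k -> new ArrayDeque<>()).addLast(c);
    }

    // Removes the oldest change. Because both structures are FIFO, the head
    // of the global queue is also the head of its session's group.
    synchronized Change poll() {
        Change c = queue.pollFirst();
        if (c == null) {
            return null;
        }
        ArrayDeque<Change> perSession = bySession.get(c.ownerSession);
        perSession.pollFirst();
        if (perSession.isEmpty()) {
            bySession.remove(c.ownerSession);
        }
        return c;
    }

    // Hash lookup instead of a linear scan: the cost depends on the number
    // of entries owned by this session, not on the total queue length.
    synchronized List<String> pendingPathsFor(long sessionId) {
        List<String> paths = new ArrayList<>();
        ArrayDeque<Change> perSession = bySession.get(sessionId);
        if (perSession != null) {
            for (Change c : perSession) {
                paths.add(c.path);
            }
        }
        return paths;
    }

    synchronized int size() {
        return queue.size();
    }

    // Hypothetical backpressure hook: true once the queue exceeds a limit
    // chosen by the caller.
    synchronized boolean shouldThrottle(int maxOutstanding) {
        return queue.size() > maxOutstanding;
    }
}
```

A session close would then call pendingPathsFor(sessionId) while holding the lock, and the connection-handling side could check shouldThrottle(limit) before accepting more writes. What limit is sensible, and where such a check belongs, are open questions that the sketch does not answer.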
