Very interesting patch, Mike. I've left a couple of review comments (hope you don't mind) on the https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c commit. :)
Cheers,
Eddie

On Thu, Oct 13, 2016 at 4:06 PM, Arshad Mohammad <arshad.mohamma...@gmail.com> wrote:

> Hi Mike
> I also faced the same issue. There is a test patch in ZOOKEEPER-2570 which can be
> used to quickly check performance gains in each modification. Hope it is
> useful.
>
> -Arshad
>
> On Thu, Oct 13, 2016 at 1:27 AM, Mike Solomon <ms...@dropbox.com> wrote:
>
> > I've been performance testing 3.5.2 and hit an interesting unavailability
> > issue.
> >
> > When the server is very busy (64k connections, 16k writes per
> > second) the leader can get busy enough that connections get throttled.
> > Enough throttling causes sessions to expire. As sessions expire, the
> > CPU consumption rises and the quorum is effectively unavailable.
> > Interestingly, if you shut down all the clients, the quorum won't heal
> > for nearly 10 minutes.
> >
> > The issue is that the outstandingChanges queue has 250k items in it
> > and the closeSession code scans this linearly under a lock. Replacing
> > the linear scan with a hash table lookup improves this, but likely the
> > real solution is some backpressure on clients as a result of an
> > oversized outstandingChanges queue.
> >
> > Here is a sample fix:
> > https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c
> >
> > This results in the quorum healing about 30 seconds after the clients
> > disconnect.
> >
> > Is there a way to prevent runaway growth in this queue? I'm wondering
> > if changing the definition of "throttling" to take into account the
> > size of this queue might help mitigate this. The end goal is that some
> > stable amount of traffic is reached asymptotically without suffering a
> > collapse.
> >
> > Thanks,
> > -Mike
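
For anyone following the thread without opening the commit, here is a rough sketch of the two ideas Mike describes: replacing the linear scan on session close with a hash lookup, and letting the queue size feed into the throttling decision. This is not the code from the commit and it does not use ZooKeeper's real classes. PendingChangeIndex, Change, pathsOwnedBy, shouldThrottle and throttleLimit are all hypothetical names made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a pending-change queue; NOT ZooKeeper's real classes.
class PendingChangeIndex {

    // Hypothetical record: the path that was touched and the session that owns it.
    static final class Change {
        final String path;
        final long ownerSession;

        Change(String path, long ownerSession) {
            this.path = path;
            this.ownerSession = ownerSession;
        }
    }

    private final ArrayDeque<Change> queue = new ArrayDeque<>();
    // session id -> (path -> number of queued changes for that path)
    private final Map<Long, Map<String, Integer>> bySession = new HashMap<>();
    // Assumed tunable for this sketch; not an existing ZooKeeper setting.
    private final int throttleLimit;

    PendingChangeIndex(int throttleLimit) {
        this.throttleLimit = throttleLimit;
    }

    // Enqueue a change and record it in the per-session index.
    synchronized void add(Change c) {
        queue.addLast(c);
        bySession.computeIfAbsent(c.ownerSession, k -> new HashMap<>())
                 .merge(c.path, 1, Integer::sum);
    }

    // Dequeue the oldest change and keep the index in step with the queue.
    synchronized Change poll() {
        Change c = queue.pollFirst();
        if (c == null) {
            return null;
        }
        Map<String, Integer> paths = bySession.get(c.ownerSession);
        // Decrement the count; returning null from the function removes the entry.
        paths.computeIfPresent(c.path, (p, n) -> n > 1 ? n - 1 : null);
        if (paths.isEmpty()) {
            bySession.remove(c.ownerSession);
        }
        return c;
    }

    // Hash lookup instead of walking the whole queue: the cost depends on the
    // number of paths queued for this one session, not on the total queue length.
    synchronized Set<String> pathsOwnedBy(long session) {
        Map<String, Integer> paths = bySession.get(session);
        return paths == null
                ? Collections.<String>emptySet()
                : new HashSet<>(paths.keySet());
    }

    // Backpressure hook: callers could stop reading from clients while this is true.
    synchronized boolean shouldThrottle() {
        return queue.size() >= throttleLimit;
    }
}
```

The per-path counter is there because the same path can be queued more than once for a session, so the index entry should only disappear when the last matching change has been dequeued. What a sensible limit would be, and how it should interact with the existing throttling, is exactly the open question in Mike's mail.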