I've pulled this into a separate branch after incorporating some feedback:
https://github.com/msolo/zookeeper/commits/msolo-optimize-close-session

On Fri, Oct 14, 2016 at 12:03 AM, Mike Solomon <[email protected]> wrote:
> Thanks for the comments - I'll incorporate them in a future fix. There
> is actually a flaw in this code as it's currently implemented - it
> does not match the original behavior and I need to think more
> carefully.
>
> Arshad, I think ZOOKEEPER-2570 is a somewhat different issue. The
> root cause in both cases is that the ProcessRequestThread is
> overloaded, but large multi-op transactions are probably a degenerate
> case.
>
> On Thu, Oct 13, 2016 at 1:12 PM, Edward Ribeiro
> <[email protected]> wrote:
>> Very interesting patch, Mike.
>>
>> I've left a couple of review comments (hope you don't mind) in the
>> https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c
>> commit. :)
>>
>> Cheers,
>> Eddie
>>
>>
>> On Thu, Oct 13, 2016 at 4:06 PM, Arshad Mohammad <
>> [email protected]> wrote:
>>
>>> Hi Mike
>>> I also faced the same issue. There is a test patch in ZOOKEEPER-2570
>>> which can be used to quickly check the performance gain from each
>>> modification. Hope it is useful.
>>>
>>> -Arshad
>>>
>>> On Thu, Oct 13, 2016 at 1:27 AM, Mike Solomon <[email protected]> wrote:
>>>
>>> > I've been performance testing 3.5.2 and hit an interesting unavailability
>>> > issue.
>>> >
>>> > When the server is very busy (64k connections, 16k writes per
>>> > second) the leader can get busy enough that connections get throttled.
>>> > Enough throttling causes sessions to expire. As sessions expire, the
>>> > CPU consumption rises and the quorum is effectively unavailable.
>>> > Interestingly, if you shut down all the clients, the quorum won't heal
>>> > for nearly 10 minutes.
>>> >
>>> > The issue is that the outstandingChanges queue has 250k items in it
>>> > and the closeSession code scans this linearly under a lock. Replacing
>>> > the linear scan with a hash table lookup improves this, but likely the
>>> > real solution is some backpressure on clients as a result of an
>>> > oversized outstandingChanges queue.
>>> >
>>> > Here is a sample fix:
>>> > https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c
>>> >
>>> > This results in the quorum healing about 30 seconds after the clients
>>> > disconnect.
>>> >
>>> > Is there a way to prevent runaway growth in this queue? I'm wondering
>>> > if changing the definition of "throttling" to take into account the
>>> > size of this queue might help mitigate this. The end goal is that some
>>> > stable amount of traffic is reached asymptotically without suffering a
>>> > collapse.
>>> >
>>> > Thanks,
>>> > -Mike
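
For anyone following the thread without reading the commit, the idea in the
original message (replace the linear scan of the queue with a hash lookup,
and consider the queue size when throttling) can be sketched roughly as
below. This is only an illustration and not the code from the linked commit
or from ZooKeeper itself: the class PendingChanges, its Change type, and the
methods pathsOwnedBy, pathsOwnedByScan and shouldThrottle are all
hypothetical names, and the sketch assumes each queued change carries the id
of the session that owns it (0 meaning "not ephemeral").

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a queue of not-yet-committed changes. */
final class PendingChanges {
    /** One queued change: a znode path and its owning session (0 = none). */
    static final class Change {
        final String path;
        final long ownerSessionId;

        Change(String path, long ownerSessionId) {
            this.path = path;
            this.ownerSessionId = ownerSessionId;
        }
    }

    // Global FIFO of all queued changes.
    private final Deque<Change> queue = new ArrayDeque<>();
    // Secondary index: session id -> that session's queued changes, in FIFO order.
    private final Map<Long, Deque<Change>> bySession = new HashMap<>();

    /** Appends a change and records it in the per-session index. */
    synchronized void add(String path, long ownerSessionId) {
        Change c = new Change(path, ownerSessionId);
        queue.addLast(c);
        if (ownerSessionId != 0) {
            bySession.computeIfAbsent(ownerSessionId, k -> new ArrayDeque<>()).addLast(c);
        }
    }

    /** Removes the oldest change (e.g. once applied) and keeps the index in sync. */
    synchronized Change poll() {
        Change c = queue.pollFirst();
        if (c != null && c.ownerSessionId != 0) {
            // Both queues are FIFO, so the global head is also the head of
            // its session's queue.
            Deque<Change> owned = bySession.get(c.ownerSessionId);
            owned.pollFirst();
            if (owned.isEmpty()) {
                bySession.remove(c.ownerSessionId);
            }
        }
        return c;
    }

    /** Hash lookup: cost depends on this session's changes, not the whole queue. */
    synchronized List<String> pathsOwnedBy(long sessionId) {
        List<String> paths = new ArrayList<>();
        Deque<Change> owned = bySession.get(sessionId);
        if (owned != null) {
            for (Change c : owned) {
                paths.add(c.path);
            }
        }
        return paths;
    }

    /** Linear scan kept for comparison: walks every queued change. */
    synchronized List<String> pathsOwnedByScan(long sessionId) {
        List<String> paths = new ArrayList<>();
        for (Change c : queue) {
            if (c.ownerSessionId == sessionId) {
                paths.add(c.path);
            }
        }
        return paths;
    }

    /** Hypothetical backpressure check: true once the queue exceeds the limit. */
    synchronized boolean shouldThrottle(int limit) {
        return queue.size() > limit;
    }
}
```

The trade-off in this sketch is some extra bookkeeping on every add and poll
in exchange for a close-session lookup that no longer walks all 250k entries
while holding the lock. Whether a size-based throttle like shouldThrottle is
the right form of backpressure is the open question from the original
message; the limit here is an arbitrary parameter.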
