Very interesting patch, Mike.

I've left a couple of review comments (hope you don't mind) in the
https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c
commit. :)

Cheers,
Eddie


On Thu, Oct 13, 2016 at 4:06 PM, Arshad Mohammad <
arshad.mohamma...@gmail.com> wrote:

> Hi Mike
> I also faced the same issue. There is a test patch in ZOOKEEPER-2570 which
> can be used to quickly check the performance gains from each modification.
> Hope it is useful.
>
> -Arshad
>
> On Thu, Oct 13, 2016 at 1:27 AM, Mike Solomon <ms...@dropbox.com> wrote:
>
> > I've been performance testing 3.5.2 and hit an interesting unavailability
> > issue.
> >
> > When the server is very busy (64k connections, 16k writes per
> > second) the leader can get busy enough that connections get throttled.
> > Enough throttling causes sessions to expire. As sessions expire, the
> > CPU consumption rises and the quorum is effectively unavailable.
> > Interestingly, if you shut down all the clients, the quorum won't heal
> > for nearly 10 minutes.
> >
> > The issue is that the outstandingChanges queue has 250k items in it
> > and the closeSession code scans this linearly under a lock. Replacing
> > the linear scan with a hash table lookup improves this, but the real
> > solution is likely some backpressure on clients when the
> > outstandingChanges queue gets oversized.
> >
> > Here is a sample fix:
> > https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c
> >
> > This results in the quorum healing about 30 seconds after the clients
> > disconnect.
> >
> > Is there a way to prevent runaway growth in this queue? I'm wondering
> > if changing the definition of "throttling" to take into account the
> > size of this queue might help mitigate this. The end goal is that some
> > stable amount of traffic is reached asymptotically without suffering a
> > collapse.
> >
> > Thanks,
> > -Mike
> >
>
