What I see is that ClientCnxn.processEvent hits the bottom catch block and immediately retries, causing the spin. I can reproduce it by suspending an event thread... or it happens randomly under conditions that I haven't pinned down yet.
Brian

On Thu, Sep 27, 2012 at 8:07 PM, Patrick Hunt <[email protected]> wrote:
> Hi Brian, well, in my proposal the default would be the current
> behavior, with the discretion of the zk operator to change, so it
> shouldn't be any worse.
>
> You've piqued my interest - a single client attempting to connect is
> responsible for bringing down the entire cluster? Could you provide
> more details?
>
> Patrick
>
> On Thu, Sep 27, 2012 at 4:58 PM, Brian Tarbox <[email protected]> wrote:
> > I would lobby not to change this... I'm still occasionally dealing with
> > clients spinning trying to connect... which brings down the whole cluster
> > until that one client is killed.
> >
> > Brian
> >
> > On Thu, Sep 27, 2012 at 7:55 PM, Patrick Hunt <[email protected]> wrote:
> >
> >> The random sleep was explicitly added to reduce herd effects and
> >> general "spinning client" problems, IIRC. Keep in mind that ZK
> >> generally trades off performance for availability. It wouldn't be a
> >> good idea to remove it in general. If anything, we should have a more
> >> aggressive backoff policy in the case where clients are just spinning.
> >>
> >> Perhaps a pluggable approach here? Where the default is something like
> >> what we already have, but allow users to implement their own policy if
> >> they like. We could have a few implementations "out of the box": 1)
> >> current, 2) no wait, 3) exponential backoff after trying each server
> >> in the ensemble, etc. This would also allow for experimentation.
> >>
> >> Patrick
> >>
> >> On Thu, Sep 27, 2012 at 2:28 PM, Michi Mutsuzaki <[email protected]> wrote:
> >> > Hi Sergei,
> >> >
> >> > Your suggestion sounds reasonable to me. I think the sleep was added
> >> > so that the client doesn't spin when the entire zookeeper cluster is
> >> > down. The client could try to connect to each server without sleep,
> >> > and sleep for 1 second only after failing to connect to all the
> >> > servers in the cluster.
> >> >
> >> > Thanks!
> >> > --Michi
> >> >
> >> > On Thu, Sep 27, 2012 at 1:34 PM, Sergei Babovich
> >> > <[email protected]> wrote:
> >> >> Hi,
> >> >> Zookeeper implements a delay of up to 1 second before trying to
> >> >> reconnect.
> >> >>
> >> >> ClientCnxn$SendThread
> >> >>
> >> >>     @Override
> >> >>     public void run() {
> >> >>         ...
> >> >>         while (state.isAlive()) {
> >> >>             try {
> >> >>                 if (!clientCnxnSocket.isConnected()) {
> >> >>                     if (!isFirstConnect) {
> >> >>                         try {
> >> >>                             Thread.sleep(r.nextInt(1000));
> >> >>                         } catch (InterruptedException e) {
> >> >>                             LOG.warn("Unexpected exception", e);
> >> >>                         }
> >> >>
> >> >> This creates "outages" (even with a simple retry on ConnectionLoss) of
> >> >> up to 1s even with a perfectly healthy cluster, as in a rolling restart
> >> >> scenario. In our case it could be a problem under high load, creating a
> >> >> spike in the number of requests waiting on a zk operation.
> >> >> Would it be a better strategy to perform the reconnect attempt
> >> >> immediately at least one time? Or is there more to it?
> >
> >
> > --
> > http://about.me/BrianTarbox

--
http://about.me/BrianTarbox
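[Editor's note: the pluggable policy Patrick sketches above (no wait on the first pass over the ensemble, exponential backoff with jitter after that) could look roughly like the following. This is a hypothetical illustration, not ZooKeeper API; the names BackoffPolicy and ExponentialBackoff are invented for this sketch.]

```java
import java.util.Random;

// Hypothetical pluggable reconnect policy, per Patrick's suggestion.
interface BackoffPolicy {
    /** Milliseconds to sleep before the next connect attempt.
     *  failedAttempts counts failures since the last successful connect. */
    long delayMs(int failedAttempts, int ensembleSize);
}

/** Option 3 from the thread: try every server in the ensemble once with no
 *  delay, then back off exponentially (with jitter, to keep the herd-effect
 *  protection of the current random sleep) up to a cap. */
class ExponentialBackoff implements BackoffPolicy {
    private final long baseMs;
    private final long capMs;
    private final Random rnd = new Random();

    ExponentialBackoff(long baseMs, long capMs) {
        this.baseMs = baseMs;
        this.capMs = capMs;
    }

    @Override
    public long delayMs(int failedAttempts, int ensembleSize) {
        if (failedAttempts < ensembleSize) {
            return 0; // first pass over the ensemble: reconnect immediately
        }
        int rounds = failedAttempts / ensembleSize; // completed full passes
        long backoff = Math.min(capMs, baseMs << Math.min(rounds - 1, 20));
        // Sleep somewhere in [backoff/2, backoff) so clients don't sync up.
        return backoff / 2 + (long) (rnd.nextDouble() * (backoff / 2));
    }
}
```

With baseMs = 1000 this preserves the current average delay once the whole ensemble has been tried, while eliminating the up-to-1s "outage" during a rolling restart that Sergei describes, since the first reconnect attempt is immediate.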
