What I see is that ClientCnxn.processEvent hits the bottom catch block and immediately retries, causing the spin. I can reproduce it by suspending an event thread... or it happens randomly under conditions that I haven't pinned down yet.
Brian

On Thu, Sep 27, 2012 at 8:07 PM, Patrick Hunt <[email protected]> wrote:
> Hi Brian, well, in my proposal the default would be the current
> behavior, with the discretion of the zk operator to change, so it
> shouldn't be any worse.
>
> You've piqued my interest - a single client attempting to connect is
> responsible for bringing down the entire cluster? Could you provide
> more details?
>
> Patrick
>
> On Thu, Sep 27, 2012 at 4:58 PM, Brian Tarbox <[email protected]> wrote:
> > I would lobby not to change this... I'm still occasionally dealing with
> > clients spinning trying to connect... which brings down the whole cluster
> > until that one client is killed.
> >
> > Brian
> >
> > On Thu, Sep 27, 2012 at 7:55 PM, Patrick Hunt <[email protected]> wrote:
> >
> >> The random sleep was explicitly added to reduce herd effects and
> >> general "spinning client" problems, IIRC. Keep in mind that ZK
> >> generally trades off performance for availability. It wouldn't be a
> >> good idea to remove it in general. If anything, we should have a more
> >> aggressive backoff policy in the case where clients are just spinning.
> >>
> >> Perhaps a pluggable approach here? Where the default is something like
> >> what we already have, but allow users to implement their own policy if
> >> they like. We could have a few implementations "out of the box": 1)
> >> current, 2) no wait, 3) exponential backoff after trying each server
> >> in the ensemble, etc. This would also allow for experimentation.
> >>
> >> Patrick
> >>
> >> On Thu, Sep 27, 2012 at 2:28 PM, Michi Mutsuzaki <[email protected]> wrote:
> >> > Hi Sergei,
> >> >
> >> > Your suggestion sounds reasonable to me. I think the sleep was added
> >> > so that the client doesn't spin when the entire zookeeper cluster is
> >> > down. The client could try to connect to each server without sleep,
> >> > and sleep for 1 second only after failing to connect to all the
> >> > servers in the cluster.
> >> >
> >> > Thanks!
> >> > --Michi
> >> >
> >> > On Thu, Sep 27, 2012 at 1:34 PM, Sergei Babovich
> >> > <[email protected]> wrote:
> >> >> Hi,
> >> >> Zookeeper implements a delay of up to 1 second before trying to
> >> >> reconnect.
> >> >>
> >> >> ClientCnxn$SendThread
> >> >>
> >> >>     @Override
> >> >>     public void run() {
> >> >>         ...
> >> >>         while (state.isAlive()) {
> >> >>             try {
> >> >>                 if (!clientCnxnSocket.isConnected()) {
> >> >>                     if (!isFirstConnect) {
> >> >>                         try {
> >> >>                             Thread.sleep(r.nextInt(1000));
> >> >>                         } catch (InterruptedException e) {
> >> >>                             LOG.warn("Unexpected exception", e);
> >> >>                         }
> >> >>
> >> >> This creates "outages" (even with a simple retry on ConnectionLoss) of
> >> >> up to 1s even with a perfectly healthy cluster, as in a rolling restart
> >> >> scenario. In our case it could be a problem under high load, creating a
> >> >> spike in the number of requests waiting on a zk operation.
> >> >> Would it be a better strategy to perform the reconnect attempt
> >> >> immediately at least one time? Or is there more to it?
> >
> >
> > --
> > http://about.me/BrianTarbox

--
http://about.me/BrianTarbox
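[Editor's note: the pluggable policy Patrick sketches above (no wait on the first pass over the ensemble, exponential backoff with jitter after that) could look roughly like the following. This is a hypothetical illustration, not ZooKeeper API; the names BackoffPolicy and ExponentialBackoff are invented for this sketch.]

```java
import java.util.Random;

// Hypothetical pluggable reconnect policy, per Patrick's suggestion.
interface BackoffPolicy {
    /** Milliseconds to sleep before the next connect attempt.
     *  failedAttempts counts failures since the last successful connect. */
    long delayMs(int failedAttempts, int ensembleSize);
}

/** Option 3 from the thread: try every server in the ensemble once with no
 *  delay, then back off exponentially (with jitter, to keep the herd-effect
 *  protection of the current random sleep) up to a cap. */
class ExponentialBackoff implements BackoffPolicy {
    private final long baseMs;
    private final long capMs;
    private final Random rnd = new Random();

    ExponentialBackoff(long baseMs, long capMs) {
        this.baseMs = baseMs;
        this.capMs = capMs;
    }

    @Override
    public long delayMs(int failedAttempts, int ensembleSize) {
        if (failedAttempts < ensembleSize) {
            return 0; // first pass over the ensemble: reconnect immediately
        }
        int rounds = failedAttempts / ensembleSize; // completed full passes
        long backoff = Math.min(capMs, baseMs << Math.min(rounds - 1, 20));
        // Sleep somewhere in [backoff/2, backoff) so clients don't sync up.
        return backoff / 2 + (long) (rnd.nextDouble() * (backoff / 2));
    }
}
```

With baseMs = 1000 this preserves the current average delay once the whole ensemble has been tried, while eliminating the up-to-1s "outage" during a rolling restart that Sergei describes, since the first reconnect attempt is immediate.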
