Hi Brian, well, in my proposal the default would be the current behavior, with the discretion of the zk operator to change it, so it shouldn't be any worse.
You've piqued my interest - a single client attempting to connect is
responsible for bringing down the entire cluster? Could you provide more
details?

Patrick

On Thu, Sep 27, 2012 at 4:58 PM, Brian Tarbox <[email protected]> wrote:
> I would lobby not to change this...I'm still occasionally dealing with
> clients spinning trying to connect...which brings down the whole cluster
> until that one client is killed.
>
> Brian
>
> On Thu, Sep 27, 2012 at 7:55 PM, Patrick Hunt <[email protected]> wrote:
>
>> The random sleep was explicitly added to reduce herd effects and
>> general "spinning client" problems iirc. Keep in mind that ZK
>> generally trades off performance for availability. It wouldn't be a
>> good idea to remove it in general. If anything we should have a more
>> aggressive backoff policy in the case where clients are just spinning.
>>
>> Perhaps a pluggable approach here? Where the default is something like
>> what we already have, but allow users to implement their own policy if
>> they like. We could have a few implementations "out of the box": 1)
>> current, 2) no wait, 3) exponential backoff after trying each server
>> in the ensemble, etc... This would also allow for experimentation.
>>
>> Patrick
>>
>> On Thu, Sep 27, 2012 at 2:28 PM, Michi Mutsuzaki <[email protected]> wrote:
>> > Hi Sergei,
>> >
>> > Your suggestion sounds reasonable to me. I think the sleep was added
>> > so that the client doesn't spin when the entire zookeeper cluster is
>> > down. The client could try to connect to each server without sleep,
>> > and sleep for 1 second only after failing to connect to all the
>> > servers in the cluster.
>> >
>> > Thanks!
>> > --Michi
>> >
>> > On Thu, Sep 27, 2012 at 1:34 PM, Sergei Babovich <[email protected]> wrote:
>> >> Hi,
>> >> Zookeeper implements a delay of up to 1 second before trying to
>> >> reconnect.
>> >>
>> >> ClientCnxn$SendThread
>> >>
>> >>     @Override
>> >>     public void run() {
>> >>         ...
>> >>         while (state.isAlive()) {
>> >>             try {
>> >>                 if (!clientCnxnSocket.isConnected()) {
>> >>                     if (!isFirstConnect) {
>> >>                         try {
>> >>                             Thread.sleep(r.nextInt(1000));
>> >>                         } catch (InterruptedException e) {
>> >>                             LOG.warn("Unexpected exception", e);
>> >>                         }
>> >>
>> >> This creates "outages" (even with a simple retry on ConnectionLoss) of
>> >> up to 1s even with a perfectly healthy cluster, as in a rolling-restart
>> >> scenario. In our case it might be a problem under high load, creating a
>> >> spike in the number of requests waiting on a zk operation.
>> >> Would it be a better strategy to perform a reconnect attempt
>> >> immediately at least one time? Or is there more to it?
>
>
> --
> http://about.me/BrianTarbox
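The pluggable-policy idea discussed above (current random sleep, no wait, exponential backoff after a full pass over the ensemble) could be sketched roughly like this. Note this is a hypothetical illustration: the `ReconnectDelayPolicy` interface and the three class names are made up for the sketch and are not part of the ZooKeeper API.

```java
import java.util.Random;

// Hypothetical interface; not an actual ZooKeeper API.
interface ReconnectDelayPolicy {
    /** Delay in ms to sleep before the given (0-based) connect attempt. */
    long delayMs(int attempt);
}

/** 1) Current behavior: random sleep of up to 1s between attempts. */
class RandomDelayPolicy implements ReconnectDelayPolicy {
    private final Random r = new Random();
    public long delayMs(int attempt) {
        return attempt == 0 ? 0 : r.nextInt(1000);
    }
}

/** 2) No wait: always retry immediately. */
class NoWaitPolicy implements ReconnectDelayPolicy {
    public long delayMs(int attempt) { return 0; }
}

/**
 * 3) Exponential backoff after trying each server in the ensemble:
 * no delay during the first pass over all servers, then
 * baseMs * 2^(completed rounds - 1), capped at capMs.
 */
class ExponentialBackoffPolicy implements ReconnectDelayPolicy {
    private final int ensembleSize;
    private final long baseMs;
    private final long capMs;

    ExponentialBackoffPolicy(int ensembleSize, long baseMs, long capMs) {
        this.ensembleSize = ensembleSize;
        this.baseMs = baseMs;
        this.capMs = capMs;
    }

    public long delayMs(int attempt) {
        int rounds = attempt / ensembleSize;
        if (rounds == 0) return 0; // first pass: try every server right away
        long d = baseMs << Math.min(rounds - 1, 20); // shift-capped to avoid overflow
        return Math.min(d, capMs);
    }
}
```

With a 3-node ensemble, `new ExponentialBackoffPolicy(3, 1000, 8000)` gives no delay for attempts 0-2, 1000 ms for attempts 3-5, 2000 ms for attempts 6-8, and so on up to the 8000 ms cap, which addresses both Sergei's rolling-restart concern (the first reconnect is immediate) and Patrick's spinning-client concern (a dead cluster backs clients off aggressively).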
