>> It looks similar to ZOOKEEPER-2164 but there it is a connection timeout
>> where Node 2 is not reachable.

It sounds to me that the issue described in this email is the same as
ZOOKEEPER-2164, after checking the attached logs of node 1 and node 3. Let's
look at a specific time window in both logs, for example starting at
2016-10-03 13:58:40. What happened:

- n1 and n3 exchange notifications and n3 is elected as leader.
- n1 tries to connect to n2, fails, and retries a couple of times.
- During this period, n1 can't connect to n3 because of the pending n1->n2
connection attempts.
- n3 aborts and goes back to the LOOKING state. And this repeats.
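To put rough numbers on those repeating rounds, here is a back-of-the-envelope
sketch in Python. It is purely illustrative, not a reading of ZooKeeper's
source: the constants mirror the values reported later in this thread
(tickTime=5000 ms, initLimit=20), the 200 ms starting timeout is an assumption,
and the doubling behavior is as the reporter describes it.

```python
def election_round_timeouts(tick_time_ms=5000, init_limit=20,
                            initial_timeout_ms=200):
    """Model of the reported behavior: the per-round notification timeout
    doubles on every failed election round until it is capped by
    initLimit * tickTime."""
    cap_ms = tick_time_ms * init_limit  # 5000 * 20 = 100,000 ms = 100 s
    timeout = initial_timeout_ms
    timeouts = []
    while timeout < cap_ms:
        timeouts.append(timeout)
        timeout *= 2
    timeouts.append(cap_ms)  # further failed rounds stay pinned at the cap
    return timeouts

if __name__ == "__main__":
    print(election_round_timeouts())
```

Once the cap is reached, every additional failed round costs a full
initLimit * tickTime (100 s with the values above), which is how hours can
elapse before the two surviving nodes converge.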

There is a patch for this:
https://issues.apache.org/jira/browse/ZOOKEEPER-900 - it might be worth
trying that patch.
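For context, the general technique behind that patch (bounding each connection
attempt so that a dead or unreachable peer cannot stall the caller) can be
sketched with plain sockets. This is illustrative Python only, not ZooKeeper's
actual connection code, and `try_connect` is a hypothetical helper:

```python
import socket

def try_connect(host, port, timeout_s=2.0):
    """Attempt a TCP connect bounded by a timeout (hypothetical helper,
    for illustration). The caller is never blocked longer than timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "connected"
    except ConnectionRefusedError:
        # Host is up but nothing is listening: the "rebooted, but the
        # zookeeper process is not started" case from this thread.
        return "refused"
    except (socket.timeout, OSError):
        # Host unreachable or packets dropped: we give up after timeout_s
        # instead of waiting for the OS default, which can be much longer.
        return "timeout"
```

Without such a bound, a synchronous connect toward a silently dead peer can
hold up processing of notifications from the healthy peer, which matches the
"n1 can't connect to n3 because of n1->n2 connection attempts" symptom above.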


On Thu, Oct 13, 2016 at 10:16 AM, Anand Parthasarathy <
anpar...@avinetworks.com> wrote:

> Just wanted to let you know that at this time, one of the nodes is powered
> off and the other two nodes took more than 10 minutes to converge. Our
> script exits, so we don't know exactly when it converged. Normally, it
> takes < 100 seconds to converge.
>
> Thanks,
> Anand.
>
> On Thu, Oct 13, 2016 at 10:09 AM, Anand Parthasarathy <
> anpar...@avinetworks.com> wrote:
>
> > Hi Michael,
> >
> > We have reproduced this issue on a private AWS setup that has public IP
> > access. I will send you the details of the instance IP and the credentials
> > separately. If it needs to be shared with more people, I am happy to share
> > with them as well.
> >
> > Thanks
> > Anand.
> >
> > On Tue, Oct 11, 2016 at 3:46 PM, Michael Han <h...@cloudera.com> wrote:
> >
> >> Hi Anand,
> >>
> >> >> We have isolated it to a test setup, where we are able
> >> to reproduce this somewhat consistently if we keep a node powered off.
> >>
> >> Do you mind sharing your setup / steps to reproduce, if the setup only
> >> involves ZooKeeper without other dependencies?
> >>
> >>
> >> On Tue, Oct 11, 2016 at 2:56 PM, Anand Parthasarathy <
> >> anpar...@avinetworks.com> wrote:
> >>
> >> > Folks,
> >> >
> >> > Sending a quick note again to find out if there is any insight the
> >> > community can offer in terms of a solution or workaround. We use
> >> > zookeeper for service discovery in our product, and this issue has
> >> > surfaced at a large customer site a couple of times, so we need to
> >> > figure out a solution soon.
> >> >
> >> > Thanks,
> >> > Anand.
> >> >
> >> > On Mon, Oct 10, 2016 at 10:15 AM, Anand Parthasarathy <
> >> > anpar...@avinetworks.com> wrote:
> >> >
> >> > > Folks,
> >> > >
> >> > > Any insight into this, or any workarounds you can think of to
> >> > > mitigate this issue? We have isolated it to a test setup, where we
> >> > > are able to reproduce this somewhat consistently if we keep a node
> >> > > powered off.
> >> > >
> >> > > Thanks,
> >> > > Anand.
> >> > >
> >> > > On Sat, Oct 8, 2016 at 10:05 AM, Anand Parthasarathy <
> >> > > anpar...@avinetworks.com> wrote:
> >> > >
> >> > >> Hi Flavio,
> >> > >>
> >> > >> I have attached the logs from node 1 and node 3. Node 2 was powered
> >> > >> off around 10-03 12:36. Leader election kept going until 10-03
> >> > >> 15:57:16, when it finally converged.
> >> > >>
> >> > >> Thanks,
> >> > >> Anand.
> >> > >>
> >> > >> On Sat, Oct 8, 2016 at 7:55 AM, Flavio Junqueira <f...@apache.org>
> >> > wrote:
> >> > >>
> >> > >>> Hi Anand,
> >> > >>>
> >> > >>> I don't understand whether 1 and 3 were able to, or even trying
> >> > >>> to, connect to each other. They should be able to elect a leader
> >> > >>> between them and make progress. You might want to upload the logs
> >> > >>> and let us know.
> >> > >>>
> >> > >>> -Flavio
> >> > >>>
> >> > >>> > On 08 Oct 2016, at 02:11, Anand Parthasarathy <
> >> > >>> > anpar...@avinetworks.com> wrote:
> >> > >>> >
> >> > >>> > Hi,
> >> > >>> >
> >> > >>> > We are currently using zookeeper version 3.4.6 and run a 3-node
> >> > >>> > setup in our system. We see that occasionally, when a node is
> >> > >>> > powered off (in this instance, it was actually the leader node),
> >> > >>> > the remaining two nodes do not form a quorum for a really long
> >> > >>> > time. Looking at the logs, the sequence appears to be as follows:
> >> > >>> > - Node 2 is the zookeeper leader
> >> > >>> > - Node 2 is powered off
> >> > >>> > - Node 1 and Node 3 recognize this and start the election
> >> > >>> > - Node 3 times out after initLimit * tickTime with "Timeout while
> >> > >>> > waiting for quorum" for Round N
> >> > >>> > - Node 1 times out after initLimit * tickTime with "Exception
> >> > >>> > while trying to follow leader" for Round N+1 at the same time.
> >> > >>> > - And the process continues, with N incrementing sequentially.
> >> > >>> > - This happens for a long time.
> >> > >>> > - In one instance, we used tickTime=5000 and initLimit=20 and it
> >> > >>> > took around 3.5 hours to converge.
> >> > >>> > - In a given round, Node 1 tries connecting to Node 2, gets
> >> > >>> > connection refused, and waits for the notification timeout, which
> >> > >>> > doubles every iteration until it hits initLimit * tickTime. The
> >> > >>> > connection is refused because node 2 comes back up after the
> >> > >>> > reboot, but the zookeeper process is not started (due to a
> >> > >>> > different failure).
> >> > >>> >
> >> > >>> > It looks similar to ZOOKEEPER-2164, but there it is a connection
> >> > >>> > timeout, where Node 2 is not reachable.
> >> > >>> >
> >> > >>> > Could you please share whether you have seen this issue and, if
> >> > >>> > so, what workaround can be employed in 3.4.6?
> >> > >>> >
> >> > >>> > Thanks,
> >> > >>> > Anand.
> >> > >>>
> >> > >>>
> >> > >>
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Cheers
> >> Michael.
> >>
> >
> >
>



-- 
Cheers
Michael.
