Just wanted to let you know that, at this time, one of the nodes is powered
off and the other two nodes took more than 10 minutes to converge. Our
script exits, so we don't know exactly when it converged. Normally, it
takes < 100 seconds to converge.
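
One way to record the actual convergence time, rather than having the script
exit, might be a polling loop along the lines below. This is only a sketch
under assumptions: it relies on ZooKeeper's four-letter `srvr` command being
reachable on the client port (2181 is assumed here), and the function names
(`parse_mode`, `query_srvr`, `wait_for_quorum`) are made up for illustration.

```python
import socket
import time


def parse_mode(srvr_output):
    """Return the value of the 'Mode:' line (e.g. 'leader'), or None if absent."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def query_srvr(host, port=2181, timeout=5.0):
    """Send the 'srvr' four-letter command; return the reply, or '' on any socket error."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"srvr")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks).decode("utf-8", errors="replace")
    except OSError:
        # Connection refused, timeout, unreachable host, etc.
        return ""


def wait_for_quorum(hosts, query=query_srvr, poll_seconds=5.0,
                    max_wait_seconds=None, clock=time.time, sleep=time.sleep):
    """Poll until some host reports 'Mode: leader'.

    Returns (elapsed_seconds, {host: mode}) on success, or None if
    max_wait_seconds is set and expires first.
    """
    start = clock()
    while True:
        modes = {host: parse_mode(query(host)) for host in hosts}
        if "leader" in modes.values():
            return clock() - start, modes
        if max_wait_seconds is not None and clock() - start >= max_wait_seconds:
            return None
        sleep(poll_seconds)
```

Calling `wait_for_quorum(["node1", "node3"])` (the host names are placeholders)
would block until one of the two surviving nodes reports itself as leader and
then return how long that took.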

Thanks,
Anand.

On Thu, Oct 13, 2016 at 10:09 AM, Anand Parthasarathy <
anpar...@avinetworks.com> wrote:

> Hi Michael,
>
> We have reproduced this issue on a private AWS setup that has public IP
> access. I will send you the details of the instance IP and the credentials
> separately. If it needs to be shared with more people, I am happy to share
> with them as well.
>
> Thanks
> Anand.
>
> On Tue, Oct 11, 2016 at 3:46 PM, Michael Han <h...@cloudera.com> wrote:
>
>> Hi Anand,
>>
>> >> We have isolated it to a test setup, where we are able
>> to reproduce this somewhat consistently if we keep a node powered off.
>>
>> Do you mind sharing your setup / steps to reproduce, if the setup only
>> involves ZooKeeper without other dependencies?
>>
>>
>> On Tue, Oct 11, 2016 at 2:56 PM, Anand Parthasarathy <
>> anpar...@avinetworks.com> wrote:
>>
>> > Folks,
>> >
>> > Sending a quick note again to find out if there is any insight the
>> > community can offer in terms of a solution or workaround? We use
>> > zookeeper for service discovery in our product and this issue has
>> > surfaced in a large customer site a couple of times and we need to
>> > figure out a solution soon.
>> >
>> > Thanks,
>> > Anand.
>> >
>> > On Mon, Oct 10, 2016 at 10:15 AM, Anand Parthasarathy <
>> > anpar...@avinetworks.com> wrote:
>> >
>> > > Folks,
>> > >
>> > > Any insight into this or any workarounds that you can think of to
>> > > mitigate against this issue? We have isolated it to a test setup,
>> > > where we are able to reproduce this somewhat consistently if we keep
>> > > a node powered off.
>> > >
>> > > Thanks,
>> > > Anand.
>> > >
>> > > On Sat, Oct 8, 2016 at 10:05 AM, Anand Parthasarathy <
>> > > anpar...@avinetworks.com> wrote:
>> > >
>> > >> Hi Flavio,
>> > >>
>> > >> I have attached the logs from node 1 and node 3. Node 2 was powered
>> > >> off around 10-03 12:36. Leader election kept going until 10-03
>> > >> 15:57:16 when it finally converged.
>> > >>
>> > >> Thanks,
>> > >> Anand.
>> > >>
>> > >> On Sat, Oct 8, 2016 at 7:55 AM, Flavio Junqueira <f...@apache.org>
>> > >> wrote:
>> > >>
>> > >>> Hi Anand,
>> > >>>
>> > >>> I don't understand whether 1 and 3 were able or even trying to
>> > >>> connect to each other. They should be able to elect a leader between
>> > >>> them and make progress. You might want to upload logs and let us know.
>> > >>>
>> > >>> -Flavio
>> > >>>
>> > >>> > On 08 Oct 2016, at 02:11, Anand Parthasarathy <
>> > >>> > anpar...@avinetworks.com> wrote:
>> > >>> >
>> > >>> > Hi,
>> > >>> >
>> > >>> > We are currently using ZooKeeper 3.4.6 with a 3-node setup in our
>> > >>> > system. We see that occasionally, when a node is powered off (in
>> > >>> > this instance, it was actually the leader node), the remaining two
>> > >>> > nodes do not form a quorum for a really long time. Looking at the
>> > >>> > logs, it appears the sequence is as follows:
>> > >>> > - Node 2 is the zookeeper leader
>> > >>> > - Node 2 is powered off
>> > >>> > - Node 1 and Node 3 recognize this and start the election
>> > >>> > - Node 3 times out after initLimit * tickTime with "Timeout while
>> > >>> >   waiting for quorum" for Round N
>> > >>> > - Node 1 times out after initLimit * tickTime with "Exception while
>> > >>> >   trying to follow leader" for Round N+1 at the same time.
>> > >>> > - And the process continues where N is sequentially incrementing.
>> > >>> > - This happens for a long time.
>> > >>> > - In one instance, we used tickTime=5000 and initLimit=20 and it
>> > >>> >   took around 3.5 hours to converge.
>> > >>> > - In a given round, Node 1 tries connecting to Node 2, gets
>> > >>> >   connection refused, and waits for the notification timeout, which
>> > >>> >   increases by 2 every iteration until it hits the initLimit. The
>> > >>> >   connection is refused because node 2 comes up after reboot, but
>> > >>> >   the zookeeper process is not started (due to a different failure).
>> > >>> >
>> > >>> > It looks similar to ZOOKEEPER-2164 but there it is a connection
>> > >>> > timeout where Node 2 is not reachable.
>> > >>> >
>> > >>> > Could you please share whether you have seen this issue and, if so,
>> > >>> > what workaround can be employed in 3.4.6?
>> > >>> >
>> > >>> > Thanks,
>> > >>> > Anand.
>> > >>>
>> > >>>
>> > >>
>> > >
>> >
>>
>>
>>
>> --
>> Cheers
>> Michael.
>>
>
>
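
The timing figures in the original report quoted above (tickTime=5000,
initLimit=20, around 3.5 hours to converge) can be put in perspective with a
short calculation. This is a rough sketch only: it assumes every failed round
lasts the full initLimit * tickTime window, which the logs may or may not bear
out, and the function names are illustrative.

```python
# Values taken from the original report; everything else is an assumption.
TICK_TIME_MS = 5000
INIT_LIMIT = 20
OBSERVED_CONVERGENCE_HOURS = 3.5


def round_timeout_seconds(tick_time_ms, init_limit):
    """Per-round timeout described in the report: initLimit * tickTime."""
    return tick_time_ms * init_limit / 1000.0


def approx_failed_rounds(total_seconds, per_round_seconds):
    """Whole rounds that fit in the observed window, assuming each
    failed round runs for the full timeout (an assumption)."""
    return int(total_seconds // per_round_seconds)


timeout_s = round_timeout_seconds(TICK_TIME_MS, INIT_LIMIT)
rounds = approx_failed_rounds(OBSERVED_CONVERGENCE_HOURS * 3600, timeout_s)

print(f"per-round timeout: {timeout_s:.0f} s")
print(f"approximate failed rounds in {OBSERVED_CONVERGENCE_HOURS} h: {rounds}")
```

Under that assumption, each round takes 100 seconds, so 3.5 hours corresponds
to roughly 126 consecutive failed rounds.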
