Folks,

Sending another quick note to ask whether the community can offer any insight into a solution or workaround. We use ZooKeeper for service discovery in our product, and this issue has surfaced a couple of times at a large customer site, so we need to find a solution soon.

Thanks,
Anand.

On Mon, Oct 10, 2016 at 10:15 AM, Anand Parthasarathy <anpar...@avinetworks.com> wrote:

> Folks,
>
> Any insight into this or any workarounds that you can think of to mitigate
> against this issue? We have isolated it to a test setup, where we are able
> to reproduce this somewhat consistently if we keep a node powered off.
>
> Thanks,
> Anand.
>
> On Sat, Oct 8, 2016 at 10:05 AM, Anand Parthasarathy
> <anpar...@avinetworks.com> wrote:
>
>> Hi Flavio,
>>
>> I have attached the logs from node 1 and node 3. Node 2 was powered off
>> around 10-03 12:36. Leader election kept going until 10-03 15:57:16 when it
>> finally converged.
>>
>> Thanks,
>> Anand.
>>
>> On Sat, Oct 8, 2016 at 7:55 AM, Flavio Junqueira <f...@apache.org> wrote:
>>
>>> Hi Anand,
>>>
>>> I don't understand whether 1 and 3 were able or even trying to connect
>>> to each other. They should be able to elect a leader between them and make
>>> progress. You might want to upload logs and let us know.
>>>
>>> -Flavio
>>>
>>> > On 08 Oct 2016, at 02:11, Anand Parthasarathy
>>> > <anpar...@avinetworks.com> wrote:
>>> >
>>> > Hi,
>>> >
>>> > We are currently using zookeeper 3.4.6 version and use a 3 node solution
>>> > in our system. We see that occasionally, when a node is powered off (in
>>> > this instance, it was actually a leader node), the remaining two nodes do
>>> > not form a quorum for a really long time. Looking at the logs, it appears
>>> > the sequence is as follows:
>>> > - Node 2 is the zookeeper leader
>>> > - Node 2 is powered off
>>> > - Node 1 and Node 3 recognize and start the election
>>> > - Node 3 times out after initLimit * tickTime with "Timeout while waiting
>>> >   for quorum" for Round N
>>> > - Node 1 times out after initLimit * tickTime with "Exception while trying
>>> >   to follow leader" for Round N+1 at the same time.
>>> > - And the process continues where N is sequentially incrementing.
>>> > - This happens for a long time.
>>> > - In one instance, we used tickTime=5000 and initLimit=20 and it took
>>> >   around 3.5 hours to converge.
>>> > - In a given round, Node 1 will try connecting to Node 2, gets connection
>>> >   refused waits for notification timeout which increases by 2 every
>>> >   iteration until it hits the initLimit. Connection Refused is because the
>>> >   node 2 comes up after reboot, but zookeeper process is not started (due
>>> >   to a different failure).
>>> >
>>> > It looks similar to ZOOKEEPER-2164 but there it is a connection timeout
>>> > where Node 2 is not reachable.
>>> >
>>> > Could you pls. share if you have seen this issue and if so, what is the
>>> > workaround that can be employed in 3.4.6.
>>> >
>>> > Thanks,
>>> > Anand.
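
For reference, the timing quoted in the original report can be worked out with a few lines of arithmetic. This is only a rough sketch, not taken from the ZooKeeper code: it uses the tickTime and initLimit values from the thread (tickTime is in milliseconds, initLimit is a count of ticks), the function names are made up for illustration, and it assumes that every failed round lasts exactly one full initLimit * tickTime timeout, which the logs may or may not bear out.

```python
# Values quoted in the thread for the affected ensemble.
TICK_TIME_MS = 5000   # tickTime, in milliseconds
INIT_LIMIT = 20       # initLimit, in ticks


def round_timeout_seconds(tick_time_ms: int, init_limit: int) -> float:
    """Per-round wait before a node gives up: initLimit * tickTime, in seconds."""
    return tick_time_ms * init_limit / 1000.0


def rounds_in(duration_hours: float, tick_time_ms: int, init_limit: int) -> float:
    """Number of rounds that fit in the given duration.

    Assumption (hypothetical): each round lasts exactly one full timeout.
    """
    return duration_hours * 3600.0 / round_timeout_seconds(tick_time_ms, init_limit)


print(round_timeout_seconds(TICK_TIME_MS, INIT_LIMIT))  # 100.0 seconds per round
print(rounds_in(3.5, TICK_TIME_MS, INIT_LIMIT))         # 126.0 rounds in 3.5 hours
```

Under that assumption, 3.5 hours corresponds to roughly 126 consecutive failed rounds of 100 seconds each, so a smaller initLimit or tickTime would shorten each round but would not by itself explain why the rounds keep failing.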