Hi Michael,

We have reproduced this issue on a private AWS setup that has public IP access. I will send you the details of the instance IP and the credentials separately. If it needs to be shared with more people, I am happy to share with them as well.

Thanks,
Anand.

On Tue, Oct 11, 2016 at 3:46 PM, Michael Han <h...@cloudera.com> wrote:

> Hi Anand,
>
> >> We have isolated it to a test setup, where we are able
> >> to reproduce this somewhat consistently if we keep a node powered off.
>
> Do you mind share your setup / steps to reproduce if the setup only
> involves ZooKeeper without other dependencies?
>
>
> On Tue, Oct 11, 2016 at 2:56 PM, Anand Parthasarathy <
> anpar...@avinetworks.com> wrote:
>
> > Folks,
> >
> > Sending a quick note again to find out if there is any insight the
> > community can offer in terms of a solution or workaround? We use zookeeper
> > for service discovery in our product and this issue has surfaced in a large
> > customer site a couple of times and we need to figure out a solution soon.
> >
> > Thanks,
> > Anand.
> >
> > On Mon, Oct 10, 2016 at 10:15 AM, Anand Parthasarathy <
> > anpar...@avinetworks.com> wrote:
> >
> > > Folks,
> > >
> > > Any insight into this or any workarounds that you can think of to mitigate
> > > against this issue? We have isolated it to a test setup, where we are able
> > > to reproduce this somewhat consistently if we keep a node powered off.
> > >
> > > Thanks,
> > > Anand.
> > >
> > > On Sat, Oct 8, 2016 at 10:05 AM, Anand Parthasarathy <
> > > anpar...@avinetworks.com> wrote:
> > >
> > >> Hi Flavio,
> > >>
> > >> I have attached the logs from node 1 and node 3. Node 2 was powered off
> > >> around 10-03 12:36. Leader election kept going until 10-03 15:57:16 when it
> > >> finally converged.
> > >>
> > >> Thanks,
> > >> Anand.
> > >>
> > >> On Sat, Oct 8, 2016 at 7:55 AM, Flavio Junqueira <f...@apache.org> wrote:
> > >>
> > >>> Hi Anand,
> > >>>
> > >>> I don't understand whether 1 and 3 were able or even trying to connect
> > >>> to each other. They should be able to elect a leader between them and make
> > >>> progress. You might want to upload logs and let us know.
> > >>>
> > >>> -Flavio
> > >>>
> > >>> > On 08 Oct 2016, at 02:11, Anand Parthasarathy <
> > >>> > anpar...@avinetworks.com> wrote:
> > >>> >
> > >>> > Hi,
> > >>> >
> > >>> > We are currently using zookeeper 3.4.6 version and use a 3 node solution in
> > >>> > our system. We see that occasionally, when a node is powered off (in this
> > >>> > instance, it was actually a leader node), the remaining two nodes do not
> > >>> > form a quorum for a really long time. Looking at the logs, it appears the
> > >>> > sequence is as follows:
> > >>> > - Node 2 is the zookeeper leader
> > >>> > - Node 2 is powered off
> > >>> > - Node 1 and Node 3 recognize and start the election
> > >>> > - Node 3 times out after initLimit * tickTime with "Timeout while waiting
> > >>> >   for quorum" for Round N
> > >>> > - Node 1 times out after initLimit * tickTime with "Exception while trying
> > >>> >   to follow leader" for Round N+1 at the same time.
> > >>> > - And the process continues where N is sequentially incrementing.
> > >>> > - This happens for a long time.
> > >>> > - In one instance, we used tickTime=5000 and initLimit=20 and it took
> > >>> >   around 3.5 hours to converge.
> > >>> > - In a given round, Node 1 will try connecting to Node 2, gets connection
> > >>> >   refused waits for notification timeout which increases by 2 every iteration
> > >>> >   until it hits the initLimit. Connection Refused is because the node 2 comes
> > >>> >   up after reboot, but zookeeper process is not started (due to a different
> > >>> >   failure).
> > >>> >
> > >>> > It looks similar to ZOOKEEPER-2164 but there it is a connection timeout
> > >>> > where Node 2 is not reachable.
> > >>> >
> > >>> > Could you pls. share if you have seen this issue and if so, what is the
> > >>> > workaround that can be employed in 3.4.6.
> > >>> >
> > >>> > Thanks,
> > >>> > Anand.
> > >>>
> > >>
> > >
> >
>
> --
> Cheers
> Michael.
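
For anyone following the thread, the timing in the original report can be sanity-checked with a little arithmetic. The sketch below is not from any of the messages; it only uses the tickTime=5000 and initLimit=20 values and the roughly 3.5 hour convergence time quoted above. The function names are made up for illustration, and the round count assumes every failed election round runs for the full initLimit * tickTime window, which the attached logs may or may not bear out.

```python
# Values quoted in the original report (tickTime is in milliseconds).
TICK_TIME_MS = 5000
INIT_LIMIT = 20
OBSERVED_CONVERGENCE_HOURS = 3.5


def round_timeout_seconds(tick_time_ms: int, init_limit: int) -> float:
    """Per-round timeout described in the report: initLimit * tickTime."""
    return init_limit * tick_time_ms / 1000.0


def estimated_rounds(total_hours: float, timeout_s: float) -> float:
    """Rough number of failed rounds, ASSUMING each one lasts the full timeout."""
    return total_hours * 3600.0 / timeout_s


if __name__ == "__main__":
    timeout_s = round_timeout_seconds(TICK_TIME_MS, INIT_LIMIT)
    rounds = estimated_rounds(OBSERVED_CONVERGENCE_HOURS, timeout_s)
    print(f"per-round timeout: {timeout_s:.0f} s")
    print(f"rounds in {OBSERVED_CONVERGENCE_HOURS} h (upper-bound style estimate): {rounds:.0f}")
```

Under those assumptions each round costs 100 seconds, so a 3.5 hour stall would correspond to on the order of a hundred consecutive failed rounds.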