Is this issue related to the recent email thread "Working around Leader election Listner thread death"
https://issues.apache.org/jira/browse/ZOOKEEPER-2186

On Thu, Apr 14, 2016 at 2:54 AM, Flavio Junqueira <[email protected]> wrote:

> Other than some kind of funky packet filtering rule, I'm not sure why
> you'd not be receiving the ACKs.
>
> I think that reconfiguring isn't the right way of addressing the problem.
> If you have some underlying issue, configuration or even bad hardware, then
> adding more nodes will not fix it. Even worse, it might lurk there for
> some time and come back to bite you later.
>
> If you do lose a machine (e.g., permanent failure, decommission), then it
> does make sense to reconfigure the ensemble.
>
> -Flavio
>
> > On 14 Apr 2016, at 01:12, s influxdb <[email protected]> wrote:
> >
> > Thanks Flavio.
> >
> > Would you know why node 2 could not receive ACKs from the other 2 nodes?
> >
> > What is the workaround in scenarios like these, where in a 3 node
> > cluster 1 node is not responding?
> > ** If we do a rolling restart there is a possibility of downtime
> > ** Add 2 more nodes to the configs and do a rolling restart
> > ** Could you think of any way to fix node 2 so that it rejoins the
> > cluster?
> >
> > Would appreciate your reply.
> >
> > On Tue, Apr 12, 2016 at 1:33 AM, Flavio Junqueira <[email protected]> wrote:
> > Good to hear you've been able to sort it out.
> >
> > -Flavio
> >
> > > On 12 Apr 2016, at 03:02, s influxdb <[email protected]> wrote:
> > >
> > > Created a parallel independent zookeeper cluster on the same set of
> > > machines with different ports and that worked. This indicates the
> > > port was the issue.
> > >
> > > On Mon, Apr 11, 2016 at 1:35 PM, s influxdb <[email protected]> wrote:
> > >
> > >> Reboot of the server didn't help.
> > >>
> > >> On Thu, Apr 7, 2016 at 6:50 PM, s influxdb <[email protected]> wrote:
> > >>
> > >>> I ran tcpdump on all three nodes.
> > >>> It looks like for every [PSH, ACK] there is a missing [ACK] from
> > >>> the other nodes to this 2nd node on port 3888.
> > >>>
> > >>> On Thu, Apr 7, 2016 at 1:29 PM, s influxdb <[email protected]> wrote:
> > >>>
> > >>>> Thanks Flavio for your quick replies.
> > >>>> The zookeeper version is 3.4.6
> > >>>>
> > >>>> On Thu, Apr 7, 2016 at 1:23 PM, Flavio P JUNQUEIRA <[email protected]>
> > >>>> wrote:
> > >>>>
> > >>>>> You need to determine why it is not receiving notification
> > >>>>> messages. From the information you've given, it doesn't look
> > >>>>> like a zookeeper code issue.
> > >>>>>
> > >>>>> BTW, which version are you using?
> > >>>>>
> > >>>>> -Flavio
> > >>>>>
> > >>>>> On 7 Apr 2016 21:20, "s influxdb" <[email protected]> wrote:
> > >>>>>
> > >>>>>> Nothing on the iptables firewall.
> > >>>>>>
> > >>>>>> What options do I have to reconnect this node to the cluster?
> > >>>>>>
> > >>>>>> On Thu, Apr 7, 2016 at 10:14 AM, s influxdb <[email protected]>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> telnet works on 2888 and 3888 to the other nodes. Now I see
> > >>>>>>> java.net.SocketTimeoutException: connect timed out messages
> > >>>>>>> in the logs for node 2.
> > >>>>>>>
> > >>>>>>> On Thu, Apr 7, 2016 at 3:05 AM, Flavio Junqueira <[email protected]>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> I only see notifications from the node to itself. It says
> > >>>>>>>> that it is connected to 1, but it doesn't seem to be
> > >>>>>>>> receiving the notification from 1. It also doesn't seem to
> > >>>>>>>> be receiving the connection request from 3.
> > >>>>>>>>
> > >>>>>>>> Last time I saw something like this it was due to iptables
> > >>>>>>>> rules, but if it was working before and no configuration has
> > >>>>>>>> changed, then I don't know what it could be.
> > >>>>>>>>
> > >>>>>>>> -Flavio
> > >>>>>>>>
> > >>>>>>>>> On 07 Apr 2016, at 05:43, s influxdb <[email protected]>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>> This is the pastie:
> > >>>>>>>>> http://pastie.org/10788301
> > >>>>>>>>>
> > >>>>>>>>> On Wed, Apr 6, 2016 at 9:41 PM, s influxdb <[email protected]>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> We had one of the nodes giving OOM
> > >>>>>>>>>> java.lang.OutOfMemoryError: unable to create new native
> > >>>>>>>>>> thread and then becoming unresponsive.
> > >>>>>>>>>>
> > >>>>>>>>>> We tried to add the node back to the cluster but with no
> > >>>>>>>>>> luck.
> > >>>>>>>>>>
> > >>>>>>>>>> It doesn't seem to receive any "Notification" messages
> > >>>>>>>>>> from the other nodes. It keeps "Sending notifications" in
> > >>>>>>>>>> a loop.
> > >>>>>>>>>>
> > >>>>>>>>>> Please see attached the logs of the node that is out of
> > >>>>>>>>>> rotation.
> > >>>>>>>>>>
> > >>>>>>>>>> Any inputs appreciated.
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks
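
For anyone repeating the telnet check on 2888 and 3888 described in the thread, here is a minimal Python sketch that does the same thing for several peers at once. The function names and host names are hypothetical and not part of ZooKeeper. Note that, like telnet, a successful connect only shows the TCP handshake completed; it would not reveal the missing [ACK]s that tcpdump showed later in the conversation.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_ensemble(peers, ports=(2888, 3888), timeout=3.0):
    """Map each (host, port) pair to whether a connection could be opened.

    2888 and 3888 are the ports mentioned in the thread; adjust them if
    your configuration uses different ones.
    """
    return {
        (host, port): port_reachable(host, port, timeout)
        for host in peers
        for port in ports
    }
```

Run from the affected node, a call such as check_ensemble(["node1", "node3"]) (placeholder host names) returns a dictionary showing which peer ports accepted a connection.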
