Re: node 2 not rejoining cluster

s influxdb Thu, 01 Sep 2016 08:36:16 -0700

Is this issue related to the recent email thread

"Working around Leader election Listener thread death"?

https://issues.apache.org/jira/browse/ZOOKEEPER-2186

On Thu, Apr 14, 2016 at 2:54 AM, Flavio Junqueira <[email protected]> wrote:

> Other than some kind of funky packet filtering rule, I'm not sure why
> you'd not be receiving the ACKs.
>
> I think that reconfiguring isn't the right way of addressing the problem.
> If you have some underlying issue, configuration or even bad hardware, then
> adding more nodes will not fix it. Even worse, it might be lurking there for
> some time and might come back to bite you later.
>
> If you do lose a machine (e.g., permanent failure, decommission), then it
> does make sense to reconfigure the ensemble.
>
> -Flavio
>
>
> > On 14 Apr 2016, at 01:12, s influxdb <[email protected]> wrote:
> >
> > Thanks Flavio.
> >
> > Would you know why node2 could not receive ACKs from the other 2 nodes?
> >
> > What is the workaround in scenarios like this, where in a 3 node cluster
> > 1 node is not responding?
> > ** If we do a rolling restart there is a possibility of downtime
> > ** Add 2 more nodes to the configs and do a rolling restart
> > ** Could you think of any way to fix node 2 so that it rejoins the
> > cluster?
> >
> > Would appreciate your reply.
> >
> >
> >
> > On Tue, Apr 12, 2016 at 1:33 AM, Flavio Junqueira <[email protected]> wrote:
> > Good to hear you've been able to sort it out.
> >
> > -Flavio
> >
> > > On 12 Apr 2016, at 03:02, s influxdb <[email protected]> wrote:
> > > >
> > > > Created a parallel, independent ZooKeeper cluster on the same set of
> > > > machines with different ports, and that worked. This indicates the port
> > > > was the issue.
> > >
> > > > On Mon, Apr 11, 2016 at 1:35 PM, s influxdb <[email protected]> wrote:
> > >
> > >> reboot of the server didn't help
> > >>
> > > >> On Thu, Apr 7, 2016 at 6:50 PM, s influxdb <[email protected]> wrote:
> > >>
> > > >>> I ran tcpdump on all three nodes.
> > > >>> It looks like for every [PSH, ACK] there is a missing [ACK] from
> > > >>> the other nodes to this 2nd node on port 3888.
> > >>>
> > >>>
> > > >>> On Thu, Apr 7, 2016 at 1:29 PM, s influxdb <[email protected]> wrote:
> > >>>
> > >>>> Thanks Flavio for your quick replies.
> > >>>> The zookeeper version is 3.4.6
> > >>>>
> > >>>>
> > >>>>
> > > >>>> On Thu, Apr 7, 2016 at 1:23 PM, Flavio P JUNQUEIRA <[email protected]>
> > > >>>> wrote:
> > >>>>
> > > >>>>> You need to determine why it is not receiving notification messages.
> > > >>>>> From the information you've given, it doesn't look like a ZooKeeper
> > > >>>>> code issue.
> > >>>>>
> > >>>>> BTW, which version are you using?
> > >>>>>
> > >>>>> -Flavio
> > > >>>>> On 7 Apr 2016 21:20, "s influxdb" <[email protected]> wrote:
> > >>>>>
> > > >>>>>> Nothing on the iptables firewall.
> > > >>>>>>
> > > >>>>>> What options do I have to reconnect this node to the cluster?
> > >>>>>>
> > >>>>>>
> > > >>>>>> On Thu, Apr 7, 2016 at 10:14 AM, s influxdb <[email protected]> wrote:
> > >>>>>>
> > > >>>>>>> telnet works on 2888 and 3888 to the other nodes. Now I see
> > > >>>>>>> java.net.SocketTimeoutException: connect timed out messages in the
> > > >>>>>>> logs for node 2.
> > >>>>>>>
> > > >>>>>>> On Thu, Apr 7, 2016 at 3:05 AM, Flavio Junqueira <[email protected]> wrote:
> > >>>>>>>
> > > >>>>>>>> I only see notifications from the node to itself. It says that it is
> > > >>>>>>>> connected to 1, but it doesn't seem to be receiving the notification
> > > >>>>>>>> from 1. It also doesn't seem to be receiving the connection request
> > > >>>>>>>> from 3.
> > > >>>>>>>>
> > > >>>>>>>> The last time I saw something like this, it was due to iptables rules,
> > > >>>>>>>> but if it was working before and no configuration has changed, then I
> > > >>>>>>>> don't know what it could be.
> > >>>>>>>>
> > >>>>>>>> -Flavio
> > >>>>>>>>
> > > >>>>>>>>> On 07 Apr 2016, at 05:43, s influxdb <[email protected]> wrote:
> > >>>>>>>>>
> > > >>>>>>>>> This is the pastie:
> > > >>>>>>>>> http://pastie.org/10788301
> > >>>>>>>>>
> > > >>>>>>>>> On Wed, Apr 6, 2016 at 9:41 PM, s influxdb <[email protected]> wrote:
> > >>>>>>>>>
> > > >>>>>>>>>> One of our nodes hit an OOM (java.lang.OutOfMemoryError: unable to
> > > >>>>>>>>>> create new native thread) and then became unresponsive.
> > > >>>>>>>>>>
> > > >>>>>>>>>> We tried to add the node back to the cluster but with no luck.
> > > >>>>>>>>>>
> > > >>>>>>>>>> It doesn't seem to receive any notification messages from the other
> > > >>>>>>>>>> nodes. It keeps sending notifications in a loop.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Please see attached the logs of the node that is out of rotation.
> > >>>>>>>>>>
> > >>>>>>>>>> Any inputs appreciated.
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks
> > >>>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>
> > >>
> >
> >
>
>
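
For anyone repeating the telnet checks on 2888 and 3888 described in the quoted thread, here is a minimal Python sketch of the same test. The host names in PEERS are hypothetical placeholders (they are not from this thread), and the 3 second timeout is an arbitrary assumption. Note that a successful connect only shows the TCP handshake completes; in the thread, telnet worked while node 2 still logged connect timeouts, so this does not rule out the problem.

```python
import socket

# Hypothetical peer list (assumption); replace with the hosts from your own config.
PEERS = ["zk1.example.com", "zk3.example.com"]
# 2888 and 3888 are the ports mentioned in this thread.
PORTS = [2888, 3888]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_peers(peers, ports, timeout=3.0):
    """Map each (host, port) pair to whether a TCP connect succeeded."""
    return {(h, p): can_connect(h, p, timeout) for h in peers for p in ports}


if __name__ == "__main__":
    for (host, port), ok in sorted(check_peers(PEERS, PORTS).items()):
        print(f"{host}:{port} {'reachable' if ok else 'NOT reachable'}")
```

Running it from each node in turn gives a quick matrix of which peer ports accept connections; anything beyond that (such as the missing ACKs seen in tcpdump) still needs packet-level inspection.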
