I completely agree. This is definitely not the first time this has caused problems.
At a minimum, a more helpful error message when there is a missing myid file or a collision in server IDs would have been a life saver (a rough sketch of the kind of check I have in mind follows below the quoted thread).

On Mon, Jul 9, 2012 at 8:16 AM, Camille Fournier <[email protected]> wrote:

> I was thinking the same thing when I answered that email earlier this week
> about the lack of myid causing an error that is difficult to trace. I kind
> of hate the myid file; why is it necessary in the first place? There must
> be a cleaner way for us to identify servers and avoid conflicts.
>
> C
>
> On Mon, Jul 9, 2012 at 10:14 AM, Marshall McMullen <[email protected]> wrote:
>
> > As it turns out, it was a configuration problem. We use zookeeper in an
> > embedded manner, so our application code creates the myid file
> > programmatically when we start zookeeper. After the reboot, it was
> > creating the 'myid' file and putting the wrong value in there. This was
> > the value of another ensemble node already in the cluster. I can't
> > believe how much time was wasted on such a simple configuration problem.
> > Given how fatal this was, it might have been useful if ZK could have
> > detected multiple servers with the same ID and given a more helpful
> > error message. But in any event, the problem is solved now... thanks for
> > taking the time to respond, Camille.
> >
> > On Mon, Jul 9, 2012 at 8:09 AM, Camille Fournier <[email protected]> wrote:
> >
> > > That is very strange. What do the logs of the misbehaving server say?
> > > What do the logs of the other servers say? What does a stack dump of
> > > the misbehaving server look like?
> > > Also, just to clarify: if you don't do anything but fully stop and
> > > restart the cluster (no deleting version-2 files, etc.), will the
> > > whole ensemble reform successfully?
> > >
> > > C
> > >
> > > On Mon, Jul 9, 2012 at 12:44 AM, Marshall McMullen <[email protected]> wrote:
> > >
> > > > I'm trying to get to the bottom of a problem we're seeing where,
> > > > after I forcibly reboot an ensemble node (running on Linux) via
> > > > "reboot -f", it is unable to rejoin the ensemble and no clients can
> > > > connect to it. Has anyone ever seen a problem like this before?
> > > >
> > > > I have been investigating this under
> > > > https://issues.apache.org/jira/browse/ZOOKEEPER-1453 as on the
> > > > surface it looked like there was some sort of transaction/log
> > > > corruption going on. But now I'm not so sure of that.
> > > >
> > > > What bothers me the most right now is that I am unable to reliably
> > > > get the node in question to rejoin the ensemble. I've removed the
> > > > contents of the "version-2" directory and restarted zookeeper to no
> > > > avail. It regenerates an epoch file but never obtains the new
> > > > database from a peer. I even went so far as to copy the on-disk
> > > > database from another node and restart zookeeper, and I still can't
> > > > get it to rejoin the ensemble. I've also seen anomalous behavior
> > > > where, once I get it into this failed state, I just stop all three
> > > > zookeeper server processes entirely and then start them all back
> > > > up... then everything connects and all three nodes are in the
> > > > ensemble. But this really shouldn't be necessary.
> > > >
> > > > None of this matches the behavior I expected. Any insight would be
> > > > greatly appreciated.
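
For what it's worth, here is a rough sketch (plain Java, not anything from the ZooKeeper codebase) of the kind of pre-flight check I mean for embedded deployments like Marshall's: before starting the server, verify that dataDir/myid exists, parses, matches the id this node was provisioned with, and is actually declared as a server.N entry in zoo.cfg. The class name, config path, and "provisionedId" argument are made up for illustration, and a purely local check like this obviously can't catch another live node already running with the same id; it just turns the silent misconfiguration into a loud, early failure.

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical pre-flight check, run before handing zoo.cfg to the embedded
// ZooKeeper server: fail fast, with a readable message, if dataDir/myid is
// missing, doesn't match the id this node was provisioned as, or names a
// server id that zoo.cfg never declares.
public final class MyidPreflightCheck {

    public static void validate(Path zooCfg, long provisionedId) throws IOException {
        Properties cfg = new Properties();
        try (InputStream in = Files.newInputStream(zooCfg)) {
            cfg.load(in);
        }

        String dataDir = cfg.getProperty("dataDir");
        if (dataDir == null) {
            throw new IllegalStateException("zoo.cfg has no dataDir entry: " + zooCfg);
        }

        Path myid = Paths.get(dataDir).resolve("myid");
        if (!Files.exists(myid)) {
            throw new IllegalStateException("missing " + myid
                + ": expected it to contain server id " + provisionedId);
        }

        long actualId = Long.parseLong(
            new String(Files.readAllBytes(myid), StandardCharsets.UTF_8).trim());
        if (actualId != provisionedId) {
            throw new IllegalStateException(myid + " contains " + actualId
                + " but this node was provisioned as server." + provisionedId);
        }
        if (cfg.getProperty("server." + actualId) == null) {
            throw new IllegalStateException("zoo.cfg has no server." + actualId
                + " entry matching the id in " + myid);
        }
    }

    public static void main(String[] args) throws IOException {
        // Example: this node believes it is server.2 of the ensemble.
        validate(Paths.get("/etc/zookeeper/zoo.cfg"), 2L);
    }
}

Something like this could run right before the code that writes the myid file and starts the embedded server, so a bad or missing id shows up as one clear exception instead of hours of chasing quorum weirdness.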
