This is a really useful discussion; I really appreciate it! I'm not too worried about the restarts that I saw, and they are totally unrelated to the upgrade. The upgrade is only relevant insofar as I was seeking confidence that I would not see the issue once upgraded to 3.5.5, but I'm inclined to believe the restarts were due to Exhibitor.
Whether or not I can create a mixed-version ensemble is a far more important question to me, since I'm currently trying to devise an upgrade strategy that avoids taking downtime.

Thanks,
Jerry

On Thu, Oct 3, 2019 at 6:59 AM Enrico Olivelli <eolive...@gmail.com> wrote:

> I think it is possible to perform a rolling upgrade from 3.4; all of my
> customers migrated one year ago, and without any issue (reported to my
> team).
>
> Norbert, where did you find that information?
>
> Btw, I would like to set up tests about backward compatibility,
> server-to-server and client-to-server.
>
> Enrico
>
> On Thu, Oct 3, 2019 at 3:16 PM Jörn Franke <jornfra...@gmail.com>
> wrote:
>
> > I tried only from 3.4.14, and there it was possible. I recommend first
> > upgrading to the latest 3.4 version and then to 3.5.
> >
> > > On Oct 2, 2019, at 9:40 PM, Jerry Hebert <jerry.heb...@gmail.com> wrote:
> > >
> > > Hi Jörn,
> > >
> > > No, this was a very intermittent issue. We've been running this ensemble
> > > for about four years now and have never seen this problem, so it seems
> > > to be super heisenbuggy. Our upgrade process will be more involved than
> > > what you described (we're switching networks, instance types, and
> > > underlying automation, and removing Exhibitor), but I'm glad you asked,
> > > because I have a question about that too. :)
> > >
> > > Are you saying that a 3.5.5 node can synchronize with a 3.4.11 ensemble?
> > > I wasn't sure if that would work or not. E.g., maybe I could bring up
> > > the new 3.5.5 nodes and temporarily form a 10-node ensemble (five 3.4.11
> > > nodes, five 3.5.5 nodes), let them sync, and then kill off the old
> > > 3.4.11 boxes?
> > >
> > > Thanks,
> > > Jerry
> > >
> > >> On Wed, Oct 2, 2019 at 12:29 PM Jörn Franke <jornfra...@gmail.com> wrote:
> > >>
> > >> Have you tried stopping the node, deleting the data and log
> > >> directories, upgrading to 3.5.5, starting the node, and waiting until
> > >> it is synchronized?
> > >>
> > >>>> On Oct 2, 2019, at 8:14 PM, Jerry Hebert <jerry.heb...@gmail.com> wrote:
> > >>>
> > >>> Hi all,
> > >>>
> > >>> My first post here! I'm hoping you all might be able to offer some
> > >>> guidance or redirect me to an existing ticket. We have a five-node
> > >>> ensemble on 3.4.11 that we're currently in the process of upgrading
> > >>> to 3.5.5. We recently saw some bizarre behavior in our ensemble that
> > >>> I was hoping to find some sort of pre-existing ticket or discussion
> > >>> about, but I was having difficulty finding hits for this in Jira.
> > >>>
> > >>> The behavior that we saw from our metrics is that one of our nodes
> > >>> (not sure if it was a follower or the leader) started to demonstrate
> > >>> instability (high CPU, high RAM) and crashed. Not a big deal, but as
> > >>> soon as it crashed, the other four nodes all immediately restarted,
> > >>> resulting in a short outage. One node crashing should never cause an
> > >>> ensemble restart, of course, so I assumed that this must be a bug in
> > >>> ZK. The nodes that restarted had no indication of errors in their
> > >>> logs; they just restarted. Does this sound familiar to any of you?
> > >>>
> > >>> Also, we are using Exhibitor on that ensemble, so it's also possible
> > >>> that the restart was caused by Exhibitor.
> > >>>
> > >>> My hope is that this issue will be behind us once the 3.5.5 upgrade
> > >>> is complete, but I'd ideally like to find some concrete evidence of
> > >>> this.
> > >>>
> > >>> Thanks!
> > >>> Jerry
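On the mixed-version question: with a rolling upgrade, the upgraded 3.5.5 node keeps the same five-member server list as the 3.4.11 ensemble, so it rejoins the existing quorum rather than forming a new one. A hypothetical zoo.cfg for the node being upgraded might look like this (hostnames and paths are placeholders, not from the thread):

```
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# Server list is unchanged from the 3.4.11 ensemble; 3.5 still accepts
# this classic static config with a separate clientPort line.
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
server.4=zk4.example.com:2888:3888
server.5=zk5.example.com:2888:3888
```

This is the per-node rolling approach the other posters describe, as opposed to temporarily growing to a 10-node ensemble, which would itself require reconfiguring and restarting every 3.4 node.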
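For what it's worth, the stop/wipe/upgrade/resync step Jörn describes could be sketched roughly like this for a single node. Everything here is an assumption, not from the thread: a systemd unit named `zookeeper`, data under `/var/lib/zookeeper`, and a hypothetical `install_zookeeper_355` helper that lays down the 3.5.5 binaries.

```shell
#!/usr/bin/env sh
# Rough sketch of one rolling-upgrade step; paths and unit names are
# placeholders and will differ per deployment.

ZK_HOST=localhost
ZK_PORT=2181

# Succeeds when `srvr` output (read from stdin) shows the node serving
# as leader or follower, i.e. it has rejoined and synced with the quorum.
zk_in_quorum() {
  grep -Eq '^Mode: (leader|follower)'
}

upgrade_one_node() {
  systemctl stop zookeeper               # 1. stop the old 3.4.11 node
  rm -rf /var/lib/zookeeper/version-2    # 2. wipe data; it resyncs from the leader
  install_zookeeper_355                  # 3. hypothetical: install 3.5.5 binaries
  systemctl start zookeeper              # 4. start the upgraded node
  # 5. poll the four-letter-word command `srvr` until the node is back
  #    in quorum before touching the next server
  until echo srvr | nc "$ZK_HOST" "$ZK_PORT" | zk_in_quorum; do
    sleep 2
  done
}
```

Running this one server at a time, and waiting for quorum membership before moving on, keeps a majority of the five nodes up throughout the upgrade.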