Re: One node crashing in 3.4.11 triggered a full ensemble restart

Jörn Franke Thu, 03 Oct 2019 06:16:48 -0700

I have only tried this starting from 3.4.14, and there it worked. I
recommend upgrading to the latest 3.4 release first and then to 3.5.
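
When upgrading one node at a time like this, one way to confirm that a restarted node has rejoined the ensemble is to send it the `srvr` four-letter command and read the `Mode:` line of the reply. The following is a minimal Python sketch under stated assumptions, not a tested procedure: the client port 2181 and the helper names (`four_letter_word`, `parse_mode`, `has_rejoined`) are illustrative, and on 3.5.x the command has to be permitted by the `4lw.commands.whitelist` setting.

```python
import socket


def four_letter_word(host, port=2181, cmd="srvr", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply.

    host and port are assumptions; use the node's actual client address.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(reply):
    """Return the value of the 'Mode:' line (e.g. 'follower'), or None."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def has_rejoined(reply):
    """True if the reply reports the node as an active ensemble member."""
    return parse_mode(reply) in {"leader", "follower", "observer"}
```

A node that is still starting up or has no quorum answers without a `Mode:` line, so `has_rejoined(four_letter_word("zk1.example.com"))` (host name hypothetical) stays false until the node is serving again. It only makes sense to move on to the next node once this check passes.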


> On 02.10.2019 at 21:40, Jerry Hebert <[email protected]> wrote:
> 
> Hi Jörn,
> 
> No, this was a very intermittent issue. We've been running this ensemble
> for about four years now and have never seen this problem so it seems to be
> super heisenbuggy. Our upgrade process will be more involved than what you
> described (we're switching networks, instance types, underlying automation
> and removing Exhibitor) but I'm glad you asked because I have a question
> about that too. :)
> 
> Are you saying that a 3.5.5 node can synchronize with a 3.4.11 ensemble? I
> wasn't sure if that would work or not. For example, maybe I could bring up
> the new 3.5.5 nodes and temporarily form a 10-node ensemble (five 3.4.11
> nodes, five 3.5.5 nodes), let them sync, and then kill off the old 3.4.11
> boxes?
> 
> Thanks,
> Jerry
> 
>> On Wed, Oct 2, 2019 at 12:29 PM Jörn Franke <[email protected]> wrote:
>> 
>> Have you tried stopping the node, deleting the data and log directories,
>> upgrading to 3.5.5, starting the node, and waiting until it is
>> synchronized?
>> 
>>>> On 02.10.2019 at 20:14, Jerry Hebert <[email protected]> wrote:
>>> 
>>> Hi all,
>>> 
>>> My first post here! I'm hoping you all might be able to offer some
>>> guidance or redirect me to an existing ticket. We have a five-node
>>> ensemble on 3.4.11 that we're currently in the process of upgrading to
>>> 3.5.5. We recently saw some bizarre behavior in our ensemble, and I was
>>> hoping to find some sort of pre-existing ticket or discussion about it,
>>> but I was having difficulty finding hits for this in Jira.
>>> 
>>> The behavior that we saw from our metrics is that one of our nodes (not
>>> sure if it was a follower or a leader) started to demonstrate
>>> instability (high CPU, high RAM) and it crashed. Not a big deal, but as
>>> soon as it crashed, all four of the other nodes immediately restarted,
>>> resulting in a short outage. One node crashing should never cause an
>>> ensemble restart of course, so I assumed that this must be a bug in ZK.
>>> The nodes that restarted had no indication of errors in their logs; they
>>> simply restarted. Does this sound familiar to any of you?
>>> 
>>> Also, we are using Exhibitor on that ensemble so it's also possible that
>>> the restart was caused by Exhibitor.
>>> 
>>> My hope is that this issue will be behind us once the 3.5.5 upgrade is
>>> complete but I'd ideally like to find some concrete evidence of this.
>>> 
>>> Thanks!
>>> Jerry
>>
