Experimental status of readonlymode feature

2019-10-03 Thread Lewis Gardner
Hi,

The readonlymode feature was added 8 years ago but is still marked as 
experimental and requires setting a JVM system property to enable it.

What steps are required to promote this feature to "fully supported" status 
and allow enablement via the "readonlymode.enabled" setting in zoo.cfg?
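
For context, the current (experimental) enablement looks roughly like the
sketch below -- a server-side JVM property plus a client-side opt-in. The
property name and the four-argument client constructor are the ones documented
for 3.5.x, but verify the exact spelling against your version; the connect
string is just a placeholder:

    # server side (e.g. in conf/java.env): the experimental gate is a JVM
    # system property rather than a zoo.cfg entry
    SERVER_JVMFLAGS="-Dreadonlymode.enabled=true"

    // client side (Java): the client must explicitly allow read-only sessions
    ZooKeeper zk = new ZooKeeper(
            "zk1:2181,zk2:2181,zk3:2181",   // placeholder connect string
            30000,                          // session timeout (ms)
            event -> { },                   // watcher
            true);                          // canBeReadOnly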

thanks,
Lewis



Re: One node crashing in 3.4.11 triggered a full ensemble restart

2019-10-03 Thread Jerry Hebert
This is a really useful discussion; I really appreciate it! I'm not too
worried about the restarts that I saw, and they are totally unrelated to
the upgrade. The upgrade is only relevant insofar as I was seeking
confidence that I would not see the issue once upgraded to 3.5.5, but I'm
inclined to believe the restarts were due to Exhibitor.

Whether or not I can create a mixed-version ensemble is a far more
important question to me, since I'm currently trying to devise an upgrade
strategy that avoids taking downtime.

Thanks,
Jerry


Re: One node crashing in 3.4.11 triggered a full ensemble restart

2019-10-03 Thread Enrico Olivelli
I think it is possible to perform a rolling upgrade from 3.4; all of my
customers migrated one year ago without any issues (reported to my team).

Norbert, where did you find that information?

By the way, I would like to set up tests for backward compatibility, both
server-to-server and client-to-server; a rough sketch of the
client-to-server case is below.
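
To make the client-to-server case concrete, a smoke test (my own sketch, not
an existing test in the project) could be as small as running the following
against a server of each version, e.g. a 3.5 client against a 3.4 server and
vice versa; the connect string and znode path are placeholders:

    import org.apache.zookeeper.*;
    import java.util.concurrent.CountDownLatch;

    public class CompatSmokeTest {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // wait for the session to reach SyncConnected before using it
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();
            // basic write/read round trip as the compatibility check
            String path = zk.create("/compat-check", "hello".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("read back: " + new String(zk.getData(path, false, null)));
            zk.close();
        }
    }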

Enrico


Re: One node crashing in 3.4.11 triggered a full ensemble restart

2019-10-03 Thread Jörn Franke
I tried only from 3.4.14, and there it was possible. I recommend first
upgrading to the latest 3.4 version and then to 3.5.
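
In practice the per-node step is roughly the sketch below; the service name
and ports are placeholders, and on 3.5 the ruok/mntr four-letter-word
commands have to be whitelisted first (as discussed later in this thread):

    # hypothetical rolling step, repeated one server at a time (followers first)
    systemctl stop zookeeper            # stop the old 3.4 process
    # swap in the 3.5.5 binaries, keeping dataDir and zoo.cfg unchanged
    systemctl start zookeeper           # start the upgraded node
    echo ruok | nc localhost 2181       # expect "imok"
    echo mntr | nc localhost 2181 | grep zk_server_state   # wait for follower/leader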

> On 02.10.2019 at 21:40, Jerry Hebert wrote:
> 
> Hi Jörn,
> 
> No, this was a very intermittent issue. We've been running this ensemble
> for about four years now and have never seen this problem so it seems to be
> super heisenbuggy. Our upgrade process will be more involved than what you
> described (we're switching networks, instance types, underlying automation
> and removing Exhibitor) but I'm glad you asked because I have a question
> about that too. :)
> 
> Are you saying that a 3.5.5 node can synchronize with a 3.4.11 ensemble? I
> wasn't sure if that would work or not. e.g., maybe I could bring up the new
> 3.5.5 ensemble and temporarily form a 10-node ensemble (five 3.4.11 nodes,
> five 3.5.5 nodes), let them sync and then kill off the old 3.4.11 boxes?
> 
> Thanks,
> Jerry
> 
>> On Wed, Oct 2, 2019 at 12:29 PM Jörn Franke  wrote:
>> 
>> Have you tried to stop the node, delete the data and log directory,
>> upgrade to 3.5.5, start the node and wait until it is synchronized?
>> 
>>> On 02.10.2019 at 20:14, Jerry Hebert wrote:
>>> 
>>> Hi all,
>>> 
>>> My first post here! I'm hoping you all might be able to offer some
>>> guidance or redirect me to an existing ticket. We have a five-node
>>> ensemble on 3.4.11 that we're currently in the process of upgrading to
>>> 3.5.5. We recently saw some bizarre behavior in our ensemble that I was
>>> hoping to find some sort of pre-existing ticket or discussion about, but
>>> I was having difficulty finding hits for this in Jira.
>>> 
>>> The behavior that we saw from our metrics is that one of our nodes (not
>>> sure if it was a follower or a leader) started to demonstrate
>>> instability (high CPU, high RAM) and it crashed. Not a big deal, but as
>>> soon as it crashed, the other four nodes all immediately restarted,
>>> resulting in a short outage. One node crashing should never cause an
>>> ensemble restart, of course, so I assumed that this must be a bug in ZK.
>>> The nodes that restarted had no indication of errors in their logs; they
>>> simply restarted. Does this sound familiar to any of you?
>>> 
>>> Also, we are using Exhibitor on that ensemble so it's also possible that
>>> the restart was caused by Exhibitor.
>>> 
>>> My hope is that this issue will be behind us once the 3.5.5 upgrade is
>>> complete but I'd ideally like to find some concrete evidence of this.
>>> 
>>> Thanks!
>>> Jerry
>> 


Re: One node crashing in 3.4.11 triggered a full ensemble restart

2019-10-03 Thread Jörn Franke
I can confirm that a rolling update from ZK 3.4 to ZK 3.5 is possible if and
only if a ZK ensemble is used; standalone updates may introduce difficulties.
Of course I cannot speak for all possible setups, but for a ZK ensemble with
multiple Solr instances it is possible.


Re: One node crashing in 3.4.11 triggered a full ensemble restart

2019-10-03 Thread Shawn Heisey

On 10/3/2019 2:45 AM, Norbert Kalmar wrote:

> As for running a mixed version of 3.5 and 3.4 quorum - I'm afraid it will
> not work. From 3.5 we have a check on PROTOCOL_VERSION. 3.4 did not have
> this protocol version, so when the nodes try to communicate it will throw
> an exception. Plus, it is not a goal to keep quorum protocol backward
> compatible, so chances are even without the check it would not work.


This document suggests that a mixed environment of 3.4 and 3.5 will work:

https://cwiki.apache.org/confluence/display/ZOOKEEPER/ReleaseManagement

But you seem to be saying that it won't.

As a committer on the Lucene/Solr project (which uses ZK) I am wondering 
what we can tell our users about upgrading ZK.  I was under the 
impression from the wiki page I linked that they could do a rolling 
upgrade with zero downtime, where they do one ZK server at a time.  Are 
you saying that this is not possible?


The Upgrade FAQ that you linked doesn't say anything about 3.4 and 3.5 
not working together.  The only big gotcha I see there is 
ZOOKEEPER-3056, which has a workaround.
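
For anyone who hits it, the workaround described in the Upgrade FAQ boils
down to a server-side JVM flag roughly like the one below (check the FAQ and
the JIRA for the exact property name and the version it applies to); it is
only needed for the upgrade window, until the 3.5 server has written its own
snapshot:

    # workaround for ZOOKEEPER-3056, e.g. in conf/java.env on the upgraded server
    SERVER_JVMFLAGS="-Dzookeeper.snapshot.trust.empty=true"
    # remove the flag once the upgraded server has written a snapshot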


(I think of 4lw whitelisting as just a config problem with a new 
default, not a true upgrade issue)


Thanks,
Shawn


Re: One node crashing in 3.4.11 triggered a full ensemble restart

2019-10-03 Thread Norbert Kalmar
Hi,

Here are the issues we encountered so far upgrading to 3.5.5 from 3.4:
https://cwiki.apache.org/confluence/display/ZOOKEEPER/Upgrade+FAQ

As Enrico mentioned, nothing similar to your case so far. One is that no
snapshot has been taken yet; the other is that the four-letter-word commands
need to be whitelisted (see the example below).
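
For the whitelisting, a zoo.cfg line along the following lines restores the
commands that monitoring tools rely on; the directive name is the one from
the 3.5 admin docs, and the command list here is only an example:

    # zoo.cfg on each 3.5 server: whitelist the four-letter-word commands you need
    4lw.commands.whitelist=ruok, stat, mntr, conf, isro
    # or, less restrictively: 4lw.commands.whitelist=*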

As for running a mixed version of 3.5 and 3.4 quorum - I'm afraid it will
not work. From 3.5 we have a check on PROTOCOL_VERSION. 3.4 did not have
this protocol version, so when the nodes try to communicate it will throw
an exception. Plus, it is not a goal to keep quorum protocol backward
compatible, so chances are even without the check it would not work.

Regards,
Norbert

On Thu, Oct 3, 2019 at 12:09 AM Enrico Olivelli wrote:

> On Wed, Oct 2, 2019 at 22:52, Jerry Hebert wrote:
>
> > Hi Enrico,
> >
> > The nodes that restarted did not have any errors in their logs; they
> > seemed to simply restart successfully, so I think your hunch about the
> > external system is probably correct.
> >
> > Could you comment on my second question above regarding cross-version
> > migration, or should I make a new thread?
> >
>
> I am not aware of any issue about an upgrade from 3.4 to 3.5 similar to
> your case. It is expected to work.
>
> Enrico
>
> > > Are you saying that a 3.5.5 node can synchronize with a 3.4.11
> > > ensemble? I wasn't sure if that would work or not. e.g., maybe I could
> > > bring up the new 3.5.5 ensemble and temporarily form a 10-node
> > > ensemble (five 3.4.11 nodes, five 3.5.5 nodes), let them sync and then
> > > kill off the old 3.4.11 boxes?
> >
> > Thanks!
> > Jerry
> >
> > On Wed, Oct 2, 2019 at 1:12 PM Enrico Olivelli wrote:
> >
> > > Any particular error/stacktrace in the logs?
> > > If it is ZooKeeper that is killing itself, it should log it; otherwise
> > > it is some other external system. I am sorry, I don't know Exhibitor.
> > >
> > > Hope that helps
> > > Enrico