Re: Bootstrapping a new cluster and using the reconfig feature

2021-12-30 Thread Alexander Shraer
I used master, but as long as you’re on 3.5 or later you should be fine.
I’m not sure when the acl option was added, but the standaloneEnabled and
reconfig features were in 3.5.0.

On Thu, Dec 30, 2021 at 11:51 AM Eric Edgar
 wrote:

> This is great.  I should confirm what zk version you are using for your
> tests.
> Thanks,
> Eric
>
> On Thu, Dec 30, 2021 at 1:10 PM Alexander Shraer 
> wrote:
>
> > The reconfig is in process means something failed during reconfiguration
> > and it couldn't complete. Perhaps the new server disconnected in the
> middle
> > and never came back up. Notice that the second server's config file gets
> > overwritten after it connects to the leader, and if it reboots at this
> > stage it won't be able to connect again without you manually overwriting
> > its config file again (since in the server's config server 2 is not part
> of
> > the ensemble).
> >
> > I checked it locally (running both servers on my laptop), and it worked.
> > Perhaps start from that ?
> >
> > Like you said, I disabled acl by adding
> >
> > "-Dzookeeper.skipACL=yes"
> >
> > Here's the first server's config file: conf/zoo_replicated1.cfg
> >
> > dataDir=/Users/shralex/my-zookeeper/zookeeper1
> >
> > syncLimit=2
> >
> > initLimit=5
> >
> > tickTime=2000
> >
> > clientPort=2791
> >
> > reconfigEnabled=true
> >
> > standaloneEnabled=false
> >
> > server.1=localhost:2721:2731:participant;localhost:2791
> >
> > The second server's: conf/zoo_replicated2.cfg
> >
> > dataDir=/Users/shralex/my-zookeeper/zookeeper2
> >
> > syncLimit=2
> >
> > initLimit=5
> >
> > tickTime=2000
> >
> > clientPort=2792
> >
> > reconfigEnabled=true
> >
> > standaloneEnabled=false
> >
> > server.1=localhost:2721:2731:participant;localhost:2791
> >
> > server.2=localhost:2741:2751:participant;localhost:2792
> >
> > create 2 directories for the servers: zookeeper1 and zookeeper2 and
> create
> > myid files in each
> >
> > echo 1 > zookeeper1/myid
> >
> > echo 2 > zookeeper2/myid
> >
> > I find it easier for debugging to allow zkServer.sh to log to stdout. You
> > can do this by changing zkServer.sh:
> > - change nohup "$JAVA" to just "$JAVA"
> > - remove " > "$_ZOO_DAEMON_OUT" 2>&1 < /dev/null"
> >
> > In two shells start both servers by
> >
> > export ZOOCFG=zoo_replicated1.cfg  (change for server 2)
> >
> > ./bin/zkServer.sh start
> >
> > In a third shell I start the client by connecting it to server 2 as you
> did
> >
> > ./bin/zkCli.sh -server 127.0.0.1:2792
> >
> > I run the following in the shell:
> >
> > [zk: 127.0.0.1:2792(CONNECTED) 2] config
> >
> > server.1=localhost:2721:2731:participant;localhost:2791
> >
> > version=1
> >
> > [zk: 127.0.0.1:2792(CONNECTED) 2] reconfig -add
> > "server.2=localhost:2741:2751:participant;localhost:2792"
> >
> > Committed new configuration:
> >
> > server.1=localhost:2721:2731:participant;localhost:2791
> >
> > server.2=localhost:2741:2751:participant;localhost:2792
> >
> > version=20003
> >
> > On Thu, Dec 30, 2021 at 10:47 AM Eric Edgar
> >  wrote:
> >
> > > I am a little closer I think.  I disabled auth for testing using the
> > server
> > > flags .. but now I am getting a different error that the reconfig is in
> > > process and I see a zookeeper.dynamic.next file on both servers but
> > nothing
> > > happens after that.
> > > What would cause that file to not be merged into a new cfg?
> > > Eric
> > >
> > > On Thu, Dec 30, 2021 at 11:47 AM Eric Edgar <
> eric.ed...@smartthings.com>
> > > wrote:
> > >
> > > > Alex,
> > > > so I have 2 nodes .. the first has itself in the dynamic list with an
> > id
> > > > of 1.
> > > > server.1=10.1.1.104:2888:3888:participant;0.0.0.0:2181
> > > >
> > > > I have brought the second node up with an id of 2
> > > > server.1=10.1.1.104:2888:3888:participant;0.0.0.0:2181
> > > > server.2=10.1.1.40:2888:3888:participant;2181
> > > >
> > > > then i am trying to run from the second node.  zkCli.sh -server
> > > 10.1.1.104
> > > > reconfig -add "server.2=10.1.1.40:2888:3888:participant;2181"
> > > >

Re: Bootstrapping a new cluster and using the reconfig feature

2021-12-30 Thread Alexander Shraer
The reconfig is in process means something failed during reconfiguration
and it couldn't complete. Perhaps the new server disconnected in the middle
and never came back up. Notice that the second server's config file gets
overwritten after it connects to the leader, and if it reboots at this
stage it won't be able to connect again without you manually overwriting
its config file again (since in the server's config server 2 is not part of
the ensemble).

I checked it locally (running both servers on my laptop), and it worked.
Perhaps start from that ?

Like you said, I disabled acl by adding

"-Dzookeeper.skipACL=yes"

Here's the first server's config file: conf/zoo_replicated1.cfg

dataDir=/Users/shralex/my-zookeeper/zookeeper1

syncLimit=2

initLimit=5

tickTime=2000

clientPort=2791

reconfigEnabled=true

standaloneEnabled=false

server.1=localhost:2721:2731:participant;localhost:2791

The second server's: conf/zoo_replicated2.cfg

dataDir=/Users/shralex/my-zookeeper/zookeeper2

syncLimit=2

initLimit=5

tickTime=2000

clientPort=2792

reconfigEnabled=true

standaloneEnabled=false

server.1=localhost:2721:2731:participant;localhost:2791

server.2=localhost:2741:2751:participant;localhost:2792

Create two directories for the servers, zookeeper1 and zookeeper2, and create
a myid file in each:

echo 1 > zookeeper1/myid

echo 2 > zookeeper2/myid

I find it easier for debugging to allow zkServer.sh to log to stdout. You
can do this by changing zkServer.sh:
- change nohup "$JAVA" to just "$JAVA"
- remove " > "$_ZOO_DAEMON_OUT" 2>&1 < /dev/null"

In two shells start both servers by

export ZOOCFG=zoo_replicated1.cfg  (change for server 2)

./bin/zkServer.sh start

In a third shell I start the client by connecting it to server 2 as you did

./bin/zkCli.sh -server 127.0.0.1:2792

I run the following in the shell:

[zk: 127.0.0.1:2792(CONNECTED) 2] config

server.1=localhost:2721:2731:participant;localhost:2791

version=1

[zk: 127.0.0.1:2792(CONNECTED) 2] reconfig -add
"server.2=localhost:2741:2751:participant;localhost:2792"

Committed new configuration:

server.1=localhost:2721:2731:participant;localhost:2791

server.2=localhost:2741:2751:participant;localhost:2792

version=20003
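
At this point a second "config" call from the same client should show both
members; based on the committed configuration above, the output should look
roughly like:

[zk: 127.0.0.1:2792(CONNECTED) 3] config
server.1=localhost:2721:2731:participant;localhost:2791
server.2=localhost:2741:2751:participant;localhost:2792
version=20003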

On Thu, Dec 30, 2021 at 10:47 AM Eric Edgar
 wrote:

> I am a little closer I think.  I disabled auth for testing using the server
> flags .. but now I am getting a different error that the reconfig is in
> process and I see a zookeeper.dynamic.next file on both servers but nothing
> happens after that.
> What would cause that file to not be merged into a new cfg?
> Eric
>
> On Thu, Dec 30, 2021 at 11:47 AM Eric Edgar 
> wrote:
>
> > Alex,
> > so I have 2 nodes .. the first has itself in the dynamic list with an id
> > of 1.
> > server.1=10.1.1.104:2888:3888:participant;0.0.0.0:2181
> >
> > I have brought the second node up with an id of 2
> > server.1=10.1.1.104:2888:3888:participant;0.0.0.0:2181
> > server.2=10.1.1.40:2888:3888:participant;2181
> >
> > then i am trying to run from the second node.  zkCli.sh -server
> 10.1.1.104
> > reconfig -add "server.2=10.1.1.40:2888:3888:participant;2181"
> >
> >
> >
> > I get this error on the first server
> > 2021-12-30 17:37:02,880 [myid:1] - INFO  [ProcessThread(sid:1
> > cport:-1)::PrepRequestProcessor@461] - Incremental reconfig
> > 2021-12-30 17:37:02,880 [myid:1] - WARN  [ProcessThread(sid:1
> > cport:-1)::PrepRequestProcessor@532] - Reconfig failed - there must be a
> > connected and synced quorum in new configuration
> > 2021-12-30 17:37:02,880 [myid:1] - INFO  [ProcessThread(sid:1
> > cport:-1)::PrepRequestProcessor@935] - Got user-level KeeperException
> > when processing sessionid:0x1002dfe65610014 type:reconfig cxid:0x1
> > zxid:0x160033 txntype:-1 reqpath:n
> >
> >
> > on the second server issuing the reconfig command I get this error
> > No quorum of new config is connected and up-to-date with the leader of
> > last commmitted config - try invoking reconfiguration after new servers
> are
> > connected and synced
> >
> > I have not set any security at this point.
> >
> > I am not sure what I am missing at this point, assuming I don't need 2
> > nodes fully clustered in advance as mentioned by Chris.
> >
> > Thanks,
> > Eric
> >
> > On Thu, Dec 30, 2021 at 11:03 AM Alexander Shraer 
> > wrote:
> >
> >> This is already possible, since the 3.5.0 release:
> >>
> >>
> https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html#sc_reconfig_standaloneEnabled
> >>
> >> After your single node is up and running, you can connect other nodes

Re: Bootstrapping a new cluster and using the reconfig feature

2021-12-30 Thread Alexander Shraer
This is already possible, since the 3.5.0 release:
https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html#sc_reconfig_standaloneEnabled

After your single node is up and running, you can connect other nodes to it
as described in the reconfig manual. See "Adding servers" in the link above.
Essentially, you need to specify the new server's initial config files so
that they can find some existing server and start syncing data. Once a
quorum of the new config is up to date, you can invoke the reconfig command
to officially make them part of the configuration.
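
For example, a joining server's initial config might look like this (a
sketch with hypothetical hosts; the important part is that it lists itself
plus at least one member of the current ensemble so it can find the leader
and sync):

# zoo.cfg on the joining server, id 3, before the reconfig runs
standaloneEnabled=false
reconfigEnabled=true
dataDir=/var/lib/zookeeper
server.1=zk1:2888:3888:participant;2181
server.3=zk3:2888:3888:participant;2181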

Thanks,
Alex

On Thu, Dec 30, 2021 at 8:57 AM Eric Edgar
 wrote:

> Also would it be possible to update the code for this edge case,  eg if the
> current quorum is 1, and you want to add a node then add a flag saying I
> trust the single master and reconfigure itself into a 2 node cluster?
> Thanks,
> Eric
>
> On Thu, Dec 30, 2021 at 10:49 AM Eric Edgar 
> wrote:
>
> > Are there any examples with a k8 orchestrator or some sort of docker init
> > scripts handling the initial cluster configuration?
> > Thanks,
> > Eric
> >
> > On Thu, Dec 30, 2021 at 9:44 AM Chris T.  wrote:
> >
> >> If you want to run a zookeeper cluster you have to start with at least 2
> >> members. From there you can scale up with the dynamic reconfig commands.
> >> Regards
> >> Chris
> >>
> >> On 30 December 2021 16:40:40 Eric Edgar
> >>  wrote:
> >>
> >> > I am experimenting with zk and the reconfig feature and trying to
> >> > understand if I can start a single zk node and then reconfig/bootstrap
> >> the
> >> > other 2 nodes into the ensemble.  The reconfig command is throwing an
> >> error
> >> > that there isn't a quorum yet.  Is this line of thinking possible?  or
> >> do I
> >> > need to setup the first 3 nodes manually the first time?
> >> > I am basing this experiment off of this web page.
> >> >
> >>
> https://blog.container-solutions.com/dynamic-zookeeper-cluster-with-docker
> >> >
> >> > /opt/zookeeper/zookeeper/bin/zkCli.sh -server 10.1.1.104:2181
> reconfig
> >> -add
> >> > "server.2=10.1.1.40:2888:3888:participant;2181"
> >> > No quorum of new config is connected and up-to-date with the leader of
> >> last
> >> > commmitted config - try invoking reconfiguration after new servers are
> >> > connected and synced
> >> >
> >> > /opt/zookeeper/zookeeper/bin/zkCli.sh -server 10.1.1.104:2181 config
> >> > server.1=10.1.1.104:2888:3888:participant;0.0.0.0:2181
> >> >
> >> > cat ./zoo.cfg
> >> > autopurge.purgeInterval=1
> >> > initLimit=10
> >> > syncLimit=5
> >> > autopurge.snapRetainCount=6
> >> > tickTime=2000
> >> > dataDir=/mnt/zookeeper/data
> >> > reconfigEnabled=true
> >> > standaloneEnabled=false
> >> >
> >>
> dynamicConfigFile=/opt/zookeeper/zookeeper/conf/zoo.cfg.dynamic.16
> >> >
> >> > What is the best solution for an unattended bootstrap setup of a new
> >> > cluster from scratch?
> >> >
> >> >
> >> > This was something that we were able to accomplish with exhibitor on
> >> older
> >> > versions of zookeeper in the past.
> >>
> >>
>


Re: Dynamic Reconfiguration usage

2021-03-09 Thread Alexander Shraer
Hi,

The only things that can be changed dynamically are the ones in the dynamic
configuration file:
- list of servers,
- their ports,
- their roles (follower or observer)
- the quorum system definition (majority or hierarchical).

AFAIK all other parameters are in the static config file.
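
For example, from zkCli the membership changes look like this (a sketch with
hypothetical hosts, using the incremental reconfig syntax from the 3.5
manual):

# add a new observer
reconfig -add "server.5=zk5:2888:3888:observer;2181"
# remove servers 3 and 4 from the ensemble
reconfig -remove 3,4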

Thanks,
Alex

On Tue, Mar 9, 2021 at 2:22 PM rammohan ganapavarapu <
rammohanga...@gmail.com> wrote:

> Hi,
>
> Is the dynamic reconfiguration
>  feature
> only used for the server config or can be used for any other zookeeper
> configuration parameters?
>
> For example, can i change the log level from debug to info using dynamic
> reconfig with out restarting processes.
>
> Thanks,
> Ram
>


Re: Clarification on ZooKeeper Timeliness Guarantee

2021-03-05 Thread Alexander Shraer
Hi,

It sounds tricky to rely on this, because the clocks aren't perfectly in
sync across the clients and servers and clock rates may drift. For example,
the way syncLimit is counted by the leader may be slower than how B
measures it, so the leader might not drop the connection before B's read
even if the connection is having issues. If you make enough assumptions
about the clocks and server processing speeds, e.g., set syncLimit (or the
time at which B reads) very conservatively then its probably ok. But it's
much better not to rely on this, and to have correctness independent of
timing assumptions.
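
If B needs its read to reflect A's write without leaning on clocks, the
usual pattern is an explicit sync before the read, e.g. from zkCli (a
minimal sketch; /example is the znode from the question below):

sync /example
get /example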

Alex


On Thu, Mar 4, 2021 at 6:59 AM Paulo Motta  wrote:

> Hi,
>
> ZooKeeper's documentation [1] mentions as Timeliness consistency guarantee:
>
> ---
> The clients view of the system is guaranteed to be up-to-date within a
> certain time bound. (On the order of tens of seconds.) Either system
> changes will be seen by a client within this bound, or the client will
> detect a service outage.
> --
>
> Can we safely assume this guarantee is governed by the syncLimit
> configuration property? That is, if a client A successfully writes to a
> znode /example at T0, and another client B successfully reads from /example
> at T0 + syncLimit + 1 without any updates in between to this ZNode, client
> B is *guaranteed* to read the value written by A, even without explicitly
> calling the *sync API*?
>
> Thanks,
>
> Paulo
>
> [1] https://zookeeper.apache.org/doc/r3.3.3/zookeeperProgrammers.html
>


Re: upgrade from 3.4.5 to 3.5.6

2020-03-28 Thread Alexander Shraer
+1 to what Mate said (I wrote the quoted instructions).
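
A quick way to check whether a rolling upgrade is hitting the symptom Mate
describes below (a sketch; the log path is hypothetical and depends on your
log4j setup):

grep -c "Invalid server id" /var/log/zookeeper/zookeeper.log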



On Tue, Mar 24, 2020 at 7:03 AM Szalay-Bekő Máté 
wrote:

> Hi Kuldeep,
>
> I just want to provide you some background info about our documentation.
> The reason to upgrade to 3.4.6 first is to avoid the following error:
>
> > 2013-01-30 11:32:10,663 [myid:2] - WARN [localhost/127.0.0.1:2784
> :QuorumCnxManager@349] - Invalid server id: -65536
>
> This error comes because of the protocol changes between ZooKeeper server
> nodes during connection initiation for leader election. In ZooKeeper 3.5 a
> protocol version was introduced (see ZOOKEEPER-107) and since that time the
> fist long value sent in the initial message is not the server ID but the
> protocol version (-65536). In ZooKeeper 3.4.6 we made the old 3.4
> ZooKeepers backward compatible, so they are able to parse both the old and
> the new protocol format (see ZOOKEEPER-1633). This issue happens only when
> you need to use old (3.4.0 - 3.4.5) and new (3.5.0+) ZooKeeper servers
> together in the same cluster. During a rolling upgrade, this is usually the
> case to have old and new ZooKeepers present together.
>
> The fact that you haven't seen any issues might be caused by the order of
> the servers. In ZooKeeper the connection initiation between the servers
> during the leader election follows a specific rule. As far as I remember
> always the server with the larger ID 'wins the challenge', so it is
> possible, that the old server didn't need to parse any initial message (if
> it had the largest ID) and this is why you haven't seen the issue. Also
> having 2 nodes up from the 3 nodes cluster still makes the cluster work (so
> you should also check if all the servers are part of the quorum).
>
> I agree with Enrico and Norbert, the safest and most stable way is upgrade
> first to 3.4.latest, then go to 3.5.latest. Still, if you don't see that
> you would hit this specific issue (e.g. no "Invalid server id" in the log
> files), and all the three servers can handle traffic, then maybe you don't
> need to upgrade first to 3.4.latest, it is your decision. Definitely you
> should test it first, as suggested by the others.
>
> Kind regards,
> Mate
>
> On Tue, Mar 24, 2020 at 12:29 PM Norbert Kalmar
>  wrote:
>
> > Hi,
> >
> > That guide is to upgrade to 3.5.0, which was an alpha version. A lot has
> > changed for the first stable release of 3.5.5 and then a few more, even
> > rolling upgrade issues have been fixed for 3.5.6.
> > This is a more up-to-date guide:
> > https://cwiki.apache.org/confluence/display/ZOOKEEPER/Upgrade+FAQ
> >
> > If you have done your testing (with prod snapshot!), then you can skip
> 3.4
> > latest upgrade, but keep in mind we do our recommendations for a reason.
> > There were issues reported and/or found during testing. Some are fixed
> with
> > 3.5.6, some only happens if certain conditions stand (IOException: No
> > snapshot found - mentioned in the guide, fixed in 3.5.6).
> >
> > So it is up to you, I would still recommend doing a 3.4 upgrade first,
> if
> > it's feasible.
> >
> > Regards,
> > Norbert
> >
> > On Tue, Mar 24, 2020 at 11:45 AM kuldeep singh <
> kuldeep.sing...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > Current Zookeeper version :- 3.4.5
> > > Upgraded version:- 3.5.6
> > >
> > > We are not going with 3.5.7. Our final decision is zookeeper version is
> > > 3.5.6
> > > as per your reply first we need to move latest version of 3.4.x, like
> > below
> > >
> > > 3.4.5 -> 3.4.14 -> 3.5.6 (Correct me if I am wrong here)
> > >
> > > But if We are not facing any problem that i have shared you that we
> have
> > > set up of 3 node cluster where 2 node are on 3.5.6 version and 1 node
> on
> > > 3.4.5, Everything is running fine and didn't get any issue, So what
> other
> > > problem we can face if we directly move to 3.5.6
> > >
> > > Thanks,
> > > -
> > > Kuldeep Singh Budania
> > > Software Architect
> > >
> > >
> > > On Tue, Mar 24, 2020 at 3:58 PM Enrico Olivelli 
> > > wrote:
> > >
> > > > Hi
> > > > You have to upgrade to latest 3.4.x Zookeeper then you will upgrade
> to
> > > > 3.5.7.
> > > > All should run well without issues
> > > >
> > > >
> > > > Enrico
> > > >
> > > > On Tue, Mar 24, 2020, 10:18 kuldeep singh
> > > > wrote:
> > > >
> > > > > Hi Team,
> > > > >
> > > > > We are upgrading zookeeper from 3.4.5 to 3.5.6. I have set up 3
> node
> > > > > cluster where 2 node are on 3.5.6 version and 1 node on 3.4.5.
> > > > >
> > > > > Everything is running fine and didn't get any issue on my system.
> > > > >
> > > > > but I found something on the apache site that first we need to
> > > > > upgrade to 3.4.6, then we can upgrade to 3.5.6. So is it mandatory
> > > > > to go to 3.4.6 first.
> > > > >
> > > > > *Upgrading to 3.5.0*
> > > > >
> > > > > Upgrading a running ZooKeeper ensemble to 3.5.0 should be done only
> > > after
> > > > > upgrading your ensemble to the 3.4.6 release. Note that this is
> 

Re: question on ZAB protocol

2020-02-15 Thread Alexander Shraer
Yes I believe that this is possible, not only in ZK but in many other
systems when your connection to the database fails and you don’t know
whether your transaction committed or aborted. Improving this is part of
the forever open Zookeeper-22 JIRA.

Alex

On Sat, Feb 15, 2020 at 6:35 PM jonefeewang  wrote:

> Norbert Kalmar-2 wrote
> > Hi,
> >
> > A would not have confirmed in this case to the client the write. Sending
> > ACK means the followers have written the transaction to disc. Leader (in
> > this case A) still needs to send COMMIT message to the followers.
> > It goes like this:
> > - LEADER(A) receives a write, so it creates a transaction and send it to
> > all FOLLOWERs.
> > - FOLLOWERs receive the transaction and writes it to disc (txnlog). It
> > does
> > NOT apply to the datatree.
> > - After writing to disc FOLLOWERs send ACK to LEADER(A) (Nothing at this
> > point is acknowledged to the client)
> > - After LEADER(A) receives quorum of ACK, then, and only then will it
> > apply
> > to the datatree and send COMMIT message to all FOLLOWERs to do the same.
> > And also ACK to client that the write is complete. And at this point the
> > data sent by the client is saved in the txnlogs of the quorum.
> >
> > Hope this helps,
> >
> > Regards,
> > Norbert
> >
> > On Sat, Feb 15, 2020 at 5:20 AM 
>
> > hnwyllmm@
>
> >  wrote:
> >
> >> How do you know A has sent the ack to client before he die ?
> >>
> >> Sent from my iPhone
> >>
> >> > On Feb 15, 2020, at 09:15, jonefeewang 
>
> > jonefeewang@
>
> >  wrote:
> >> >
> >> > I also have the same question like this below:
> >> >
> >> >
> >> > let's say we have nodes A B C D E, now A is the leader
> >> >
> >> > A broadcasts <1,1>,  it reaches B, then A, B die, C D E elect someone,
> >> > the new system is going to throw away <1,1> since it does not know its
> >> > existence, right?
> >> >
> >> > start from scratch,
> >> > A broadcasts<1,1> , it reaches all, all send ACK to A, but A dies
> >> > before receiving the ACK, then BCDE elects someone, and the new leader
> >> > sees <1,1> in log, so it broadcasts <1,1> to BCDE, which all commit
> >> > it.  now if we look back, when A dies, the client should get a "write
> >> > failure", but now after BCDE relection, the written value does get
> >> > into the system ??? the client and the cluster has an inconsistent
> view
> >> ??
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Sent from: http://zookeeper-user.578899.n2.nabble.com/
> >>
> >>
>
>
> Sorry, I think I need to make the question more clear :
>
> 1. A broadcasts<1,1> , it reaches all, all send ACK to A
> 2. A dies before receiving the ACK,
> 3. BCDE elects someone, and the new leader sees <1,1> in log, so it
> broadcasts <1,1> to BCDE, which all commit it.
>
>  now if we look back, when A dies, the client should get a "write
>  failure", but now after BCDE relection, the written value does get into
> the
> system 。
>
> so in the end, the client got a write error (and probably thinks this write
> did not succeed), but the server cluster did write this value into its log
> and datatree.
>
> so the client and the cluster has an inconsistent view.
>
>
>
>
> --
> Sent from: http://zookeeper-user.578899.n2.nabble.com/
>


Re: Upgrade guide from 3.4.x to 3.5.x?

2020-02-14 Thread Alexander Shraer
Hi, please see “upgrading to 3.5” section here:
https://zookeeper.apache.org/doc/r3.5.4-beta/zookeeperReconfig.html

On Fri, Feb 14, 2020 at 8:48 PM shrikant kalani 
wrote:

> Hi Allen
>
> We recently upgraded our Zookeeper clusters from 3.4.13 to 3.5.5.
>
> Yes, rolling upgrades are possible and it is backward compatible, meaning a
> zkclient running on version 3.4.13 can still interact with zkserver 3.5.5.
>
> Unless you want to leverage the dynamic reconfiguration options, the rest
> of the configuration is very similar. With the new version there are other
> interesting features like authentication with Kerberos and TLS, and the
> Admin UI, which are all optional.
>
> Thanks
> Srikant Kalani
> Sent from my iPhone
>
> > On 15 Feb 2020, at 6:11 AM, allen chan 
> wrote:
> >
> > Hello
> >
> > I have been trying to find a guide that describes upgrade process from
> > 3.4.x to 3.5.x.
> > I cannot find anything on the main zookeeper page.
> > What i am looking for are breaking changes, configuration changes,
> > compatibility matrix, is rolling upgrade ok?
> >
> > Thanks
> > --
> > Allen Michael Chan
>


Re: AW: Configuration management for zoo.cfg

2019-12-15 Thread Alexander Shraer
Yes, that sounds like a good change to me.


Alex

On Sun, Dec 15, 2019 at 4:15 PM Aristedes Maniatis  wrote:

> How about a new property:
>
> dynamicConfigHistory=3
>
> which would preserve 3 historic config files. Or
>
> dynamicConfigHistory=0
>
> which would keep none.
>
>
> Does that sound like a reasonable approach? A default value of 0 would
> be what most people expect, although it is a change for people already
> wanting a folder full of files.
>
> I agree that dynamicConfigFile should point to the actual dynamic file,
> not sometimes to the real file and sometimes to the prefix of the real
> file. If there is any history worth keeping, then rolling them over log
> style (either with timestamps or config id) is a much more understood
> behaviour.
>
>
>
> Ari
>
>
>
> On 16/12/19 9:53am, Alexander Shraer wrote:
> > I wasn't sure whether extracting such information from the log is simple,
> > and since reconfigurations may impact the cluster in significant ways (or
> > in the extreme bring it down completely)
> > an easily accessible record seemed good to have, at least for debugging.
> I
> > agree that this can be made configurable, and would also not mind very
> much
> > not having a history at all, if others don't find it very useful.
> > However this is a breaking change so probably requires more people to
> chime
> > in.
> >
> >> In case of some network issue, where a node repeatedly flaps, why would
> > you want to fill the directory with possibly thousands of files?
> >
> > Automating reconfigurations was not part of the release, only the basic
> > mechanism was provided and not for example the policy of when you'd want
> to
> > reconfigure and what changes to do.
> > But I agree that an automatic system like that should take care of this
> > situation.
> >
> >
> > Alex
> >
> >
> > On Sun, Dec 15, 2019 at 2:26 PM Aristedes Maniatis 
> wrote:
> >
> >> Can you explain a bit more about the use-case for when you'd want to
> >> keep the history of the dynamic file. Surely the log file will contain
> >> information about peers joining and leaving the cluster and is easier to
> >> parse if you care about tracking that sort of thing.
> >>
> >> In case of some network issue, where a node repeatedly flaps, why would
> >> you want to fill the directory with possibly thousands of files?
> >>
> >>
> >> Ari
> >>
> >>
> >> On 15/12/19 2:35pm, Alexander Shraer wrote:
> >>> Hi Ari,
> >>>
> >>> Yes, you're totally right about the design goals.
> >>>
> >>> A mode where historic files aren't preserved could be useful. This
> >>> could perhaps be added to the static config file as a parameter.
> >>>
> >>> Alternatively / in addition, maybe we could slightly change the way
> >> history
> >>> is saved. I don't really like the fact that we're actually using
> >>> the file name to determine the version of the config (rather than
> >>> information inside the file), this is used internally in ZK to decide
> >> which
> >>> config to use (the one with higher number wins).
> >>> This method could fix this issue as well:
> >>> - File name always stays the same, addressing your problem, and we
> don't
> >>> need to edit the static config file every time.
> >>> - Dynamic config file contains the config version as a key.
> >>> - Before overwriting the dynamic config file, we store a file with the
> >>> previous config, including the version in the file name.
> >>>
> >>> This would change the current behavior a bit, hopefully no one is
> relying
> >>> on the file name to contain the version.
> >>>
> >>> This should not be difficult to implement, would you like to open a
> Jira
> >>> and take a stab at implementing it ? I can review it.
> >>>
> >>> Something to notice about the "version" of the config - currently when
> >> the
> >>> config is stored in memory, it appears as a key in the configuration.
> >> When
> >>> stored in the temporary config file (pre-commit), it appears as an
> >> explicit
> >>> key, but when committed it does not appear inside the dynamic file -
> only
> >>> in the file name. This is controlled by the last argument of
> >>>QuorumPeerConfig.writeDynamicConfig.
> >>>
> >>> See also QuorumP

Re: AW: Configuration management for zoo.cfg

2019-12-15 Thread Alexander Shraer
Another potential advantage is that if something bad happened, you would
have the latest working config readily available.

On Sun, Dec 15, 2019 at 2:53 PM Alexander Shraer  wrote:

> I wasn't sure whether extracting such information from the log is simple,
> and since reconfigurations may impact the cluster in significant ways (or
> in the extreme bring it down completely)
> an easily accessible record seemed good to have, at least for debugging. I
> agree that this can be made configurable, and would also not mind very much
> not having a history at all, if others don't find it very useful.
> However this is a breaking change so probably requires more people to
> chime in.
>
> > In case of some network issue, where a node repeatedly flaps, why would
> you want to fill the directory with possibly thousands of files?
>
> Automating reconfigurations was not part of the release, only the basic
> mechanism was provided and not for example the policy of when you'd want to
> reconfigure and what changes to do.
> But I agree that an automatic system like that should take care of this
> situation.
>
>
> Alex
>
>
> On Sun, Dec 15, 2019 at 2:26 PM Aristedes Maniatis  wrote:
>
>> Can you explain a bit more about the use-case for when you'd want to
>> keep the history of the dynamic file. Surely the log file will contain
>> information about peers joining and leaving the cluster and is easier to
>> parse if you care about tracking that sort of thing.
>>
>> In case of some network issue, where a node repeatedly flaps, why would
>> you want to fill the directory with possibly thousands of files?
>>
>>
>> Ari
>>
>>
>> On 15/12/19 2:35pm, Alexander Shraer wrote:
>> > Hi Ari,
>> >
>> > Yes, you're totally right about the design goals.
>> >
>> > A mode where historic files aren't preserved could be useful. This
>> > could perhaps be added to the static config file as a parameter.
>> >
>> > Alternatively / in addition, maybe we could slightly change the way
>> history
>> > is saved. I don't really like the fact that we're actually using
>> > the file name to determine the version of the config (rather than
>> > information inside the file), this is used internally in ZK to decide
>> which
>> > config to use (the one with higher number wins).
>> > This method could fix this issue as well:
>> > - File name always stays the same, addressing your problem, and we don't
>> > need to edit the static config file every time.
>> > - Dynamic config file contains the config version as a key.
>> > - Before overwriting the dynamic config file, we store a file with the
>> > previous config, including the version in the file name.
>> >
>> > This would change the current behavior a bit, hopefully no one is
>> relying
>> > on the file name to contain the version.
>> >
>> > This should not be difficult to implement, would you like to open a Jira
>> > and take a stab at implementing it ? I can review it.
>> >
>> > Something to notice about the "version" of the config - currently when
>> the
>> > config is stored in memory, it appears as a key in the configuration.
>> When
>> > stored in the temporary config file (pre-commit), it appears as an
>> explicit
>> > key, but when committed it does not appear inside the dynamic file -
>> only
>> > in the file name. This is controlled by the last argument of
>> >   QuorumPeerConfig.writeDynamicConfig.
>> >
>> > See also QuorumPeerConfig.java parse() parseProperties() etc and
>> > QuorumPeer.java setQuorumVerifier().
>> >
>> > Thanks,
>> > Alex
>> >
>> > On Sat, Dec 14, 2019 at 6:32 PM Aristedes Maniatis 
>> wrote:
>> >
>> >> Will anything bad happen if I make the config file read-only for
>> >> zookeeper? I assume the design goals here were:
>> >>
>> >> * atomic rewrites of the dynamic config, preserving historic files
>> >>
>> >> * ability for zookeeper to know which was the most recent config file
>> on
>> >> restart
>> >>
>> >>
>> >> Those goals are a bit unnecessary for me. I don't really care about
>> >> historic configuration, so just writing to a temp file and moving over
>> >> the existing one would work great. Alternatively tracking the current
>> >> file in memory without rewriting the zoo.cfg would also be great, since
>> >> I don't care about the effo

Re: AW: Configuration management for zoo.cfg

2019-12-15 Thread Alexander Shraer
I wasn't sure whether extracting such information from the log is simple,
and since reconfigurations may impact the cluster in significant ways (or
in the extreme bring it down completely)
an easily accessible record seemed good to have, at least for debugging. I
agree that this can be made configurable, and would also not mind very much
not having a history at all, if others don't find it very useful.
However this is a breaking change so probably requires more people to chime
in.

> In case of some network issue, where a node repeatedly flaps, why would
you want to fill the directory with possibly thousands of files?

Automating reconfigurations was not part of the release, only the basic
mechanism was provided and not for example the policy of when you'd want to
reconfigure and what changes to do.
But I agree that an automatic system like that should take care of this
situation.


Alex


On Sun, Dec 15, 2019 at 2:26 PM Aristedes Maniatis  wrote:

> Can you explain a bit more about the use-case for when you'd want to
> keep the history of the dynamic file. Surely the log file will contain
> information about peers joining and leaving the cluster and is easier to
> parse if you care about tracking that sort of thing.
>
> In case of some network issue, where a node repeatedly flaps, why would
> you want to fill the directory with possibly thousands of files?
>
>
> Ari
>
>
> On 15/12/19 2:35pm, Alexander Shraer wrote:
> > Hi Ari,
> >
> > Yes, you're totally right about the design goals.
> >
> > A mode where historic files aren't preserved could be useful. This
> > could perhaps be added to the static config file as a parameter.
> >
> > Alternatively / in addition, maybe we could slightly change the way
> history
> > is saved. I don't really like the fact that we're actually using
> > the file name to determine the version of the config (rather than
> > information inside the file), this is used internally in ZK to decide
> which
> > config to use (the one with higher number wins).
> > This method could fix this issue as well:
> > - File name always stays the same, addressing your problem, and we don't
> > need to edit the static config file every time.
> > - Dynamic config file contains the config version as a key.
> > - Before overwriting the dynamic config file, we store a file with the
> > previous config, including the version in the file name.
> >
> > This would change the current behavior a bit, hopefully no one is relying
> > on the file name to contain the version.
> >
> > This should not be difficult to implement, would you like to open a Jira
> > and take a stab at implementing it ? I can review it.
> >
> > Something to notice about the "version" of the config - currently when
> the
> > config is stored in memory, it appears as a key in the configuration.
> When
> > stored in the temporary config file (pre-commit), it appears as an
> explicit
> > key, but when committed it does not appear inside the dynamic file - only
> > in the file name. This is controlled by the last argument of
> >   QuorumPeerConfig.writeDynamicConfig.
> >
> > See also QuorumPeerConfig.java parse() parseProperties() etc and
> > QuorumPeer.java setQuorumVerifier().
> >
> > Thanks,
> > Alex
> >
> > On Sat, Dec 14, 2019 at 6:32 PM Aristedes Maniatis 
> wrote:
> >
> >> Will anything bad happen if I make the config file read-only for
> >> zookeeper? I assume the design goals here were:
> >>
> >> * atomic rewrites of the dynamic config, preserving historic files
> >>
> >> * ability for zookeeper to know which was the most recent config file on
> >> restart
> >>
> >>
> >> Those goals are a bit unnecessary for me. I don't really care about
> >> historic configuration, so just writing to a temp file and moving over
> >> the existing one would work great. Alternatively tracking the current
> >> file in memory without rewriting the zoo.cfg would also be great, since
> >> I don't care about the effort on startup to rediscover peers.
> >>
> >> Is there a way to get Zookeeper to play better with not rewriting its
> >> own config file for my use case?
> >>
> >>
> >> Ari
> >>
> >>
> >> On 12/12/19 5:53am, Alexander Shraer wrote:
> >>> It will change, the number represents the version of the configuration,
> >> and
> >>> will be updated if you issue a reconfiguration command. Its basically
> the
> >>> zxid of the command.
> >>>
> >>>
> >>> Alex
> >>>
> >>> On Tue, Dec 10, 2019 at 11:25 PM Aristedes Maniatis 
> >> wrote:
> >>>> On 11/12/19 6:21pm, arne.bachm...@dlr.de wrote:
> >>>>> Hey Ari,
> >>>>>
> >>>>> I directly used the filename   zoo.cfg.dynamic.1   and never
> >>>> had a
> >>>>> problem.
> >>>>> Arne
> >>>> Hmmm... that's an oddly obvious answer. I just assumed the 1
> >>>> would change randomly. What's even the point of it?
> >>>>
> >>>>
> >>>> Ari
> >>>>
> >>>>
>


Re: AW: Configuration management for zoo.cfg

2019-12-14 Thread Alexander Shraer
Hi Ari,

Yes, you're totally right about the design goals.

A mode where historic files aren't preserved could be useful. This
could perhaps be added to the static config file as a parameter.

Alternatively / in addition, maybe we could slightly change the way history
is saved. I don't really like the fact that we're actually using
the file name to determine the version of the config (rather than
information inside the file), this is used internally in ZK to decide which
config to use (the one with higher number wins).
This method could fix this issue as well:
- File name always stays the same, addressing your problem, and we don't
need to edit the static config file every time.
- Dynamic config file contains the config version as a key.
- Before overwriting the dynamic config file, we store a file with the
previous config, including the version in the file name.

This would change the current behavior a bit, hopefully no one is relying
on the file name to contain the version.

This should not be difficult to implement, would you like to open a Jira
and take a stab at implementing it ? I can review it.

Something to notice about the "version" of the config - currently when the
config is stored in memory, it appears as a key in the configuration. When
stored in the temporary config file (pre-commit), it appears as an explicit
key, but when committed it does not appear inside the dynamic file - only
in the file name. This is controlled by the last argument of
 QuorumPeerConfig.writeDynamicConfig.

See also QuorumPeerConfig.java parse() parseProperties() etc and
QuorumPeer.java setQuorumVerifier().

Thanks,
Alex

On Sat, Dec 14, 2019 at 6:32 PM Aristedes Maniatis  wrote:

> Will anything bad happen if I make the config file read-only for
> zookeeper? I assume the design goals here were:
>
> * atomic rewrites of the dynamic config, preserving historic files
>
> * ability for zookeeper to know which was the most recent config file on
> restart
>
>
> Those goals are a bit unnecessary for me. I don't really care about
> historic configuration, so just writing to a temp file and moving over
> the existing one would work great. Alternatively tracking the current
> file in memory without rewriting the zoo.cfg would also be great, since
> I don't care about the effort on startup to rediscover peers.
>
> Is there a way to get Zookeeper to play better with not rewriting its
> own config file for my use case?
>
>
> Ari
>
>
> On 12/12/19 5:53am, Alexander Shraer wrote:
> > It will change, the number represents the version of the configuration,
> and
> > will be updated if you issue a reconfiguration command. Its basically the
> > zxid of the command.
> >
> >
> > Alex
> >
> > On Tue, Dec 10, 2019 at 11:25 PM Aristedes Maniatis 
> wrote:
> >
> >> On 11/12/19 6:21pm, arne.bachm...@dlr.de wrote:
> >>> Hey Ari,
> >>>
> >>> I directly used the filename   zoo.cfg.dynamic.1   and never
> >> had a
> >>> problem.
> >>> Arne
> >>
> >> Hmmm... that's an oddly obvious answer. I just assumed the 1
> >> would change randomly. What's even the point of it?
> >>
> >>
> >> Ari
> >>
> >>
>


Re: AW: Configuration management for zoo.cfg

2019-12-11 Thread Alexander Shraer
It will change, the number represents the version of the configuration, and
will be updated if you issue a reconfiguration command. Its basically the
zxid of the command.
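
Concretely, you would see something like this on disk (a sketch; the suffix
is the zxid of the reconfig command, so the actual value will differ):

conf/zoo.cfg.dynamic.1          # the initial dynamic config
conf/zoo.cfg.dynamic.100000016  # written when a reconfig commits
# zoo.cfg is rewritten so dynamicConfigFile points at the newest file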


Alex

On Tue, Dec 10, 2019 at 11:25 PM Aristedes Maniatis  wrote:

>
> On 11/12/19 6:21pm, arne.bachm...@dlr.de wrote:
> > Hey Ari,
> >
> > I directly used the filename   zoo.cfg.dynamic.1   and never
> had a
> > problem.
> > Arne
>
>
> Hmmm... that's an oddly obvious answer. I just assumed the 1
> would change randomly. What's even the point of it?
>
>
> Ari
>
>


Re: Re: a misunderstanding of ZAB

2019-09-05 Thread Alexander Shraer
> the global state is neither COMMITTED nor DROPPED.

Just like in Paxos, if a quorum ACKs (it does not matter whether L1
received the ACKs before crashing) then it's guaranteed not to be lost.
If less than a quorum ACKed, then it's unknown until recovery happens, in
which case it could be committed or dropped, depending on what L2 knows.
The global state isn't known to any process though.

> so for a client, if write query takes too much time, the client may
receive Timeout Exception, and it must query servers again to know whether
previous write is SUCCESS or FAIL?
Yes. ZK doesn't currently have a good way of finding this out,
https://issues.apache.org/jira/browse/ZOOKEEPER-22
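
In practice the client has to re-check after a timeout, e.g. from zkCli (a
sketch; the path and data are hypothetical, and this only disambiguates the
outcome if no other client could have created the same node):

create /app/task1 "v1"    # times out: outcome unknown
stat /app/task1           # after reconnecting: if the node exists, the
                          # create committed; if not, it was lost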


Alex



On Wed, Sep 4, 2019 at 10:37 PM 121476...@qq.com <121476...@qq.com> wrote:

> thank you, Michael. seems i got the idea.
> in case2, when L1 fails before receiving a quorum's ACKs, the global state
> is neither COMMITTED nor DROPPED.
> Until a new leader elected and syncs to his followers, if he has p1,then
> p1 will be committed; if he has not p1, then p1 will be dropped.
> so for a client, if write query takes too much time, the client may
> receive Timeout Exception, and it must query servers again to know whether
> previous write is SUCCESS or FAIL?
>
>
>
> 121476...@qq.com
>
> From: Michael Han
> Date: 2019-09-04 02:26
> To: user
> Subject: Re: a misunderstanding of ZAB
> +1 with what Alex has said.
>
> The commit case is easy to understand. For skip case, think this example:
>
> old quorum: F1 F2 F3 F4 F5, with F1 as L1. L1 has p on F1 and F2.
> new quorum: F1 F2 F3 F4 F5, with F3 as L2. It's possible, because although
> F1 and F2 has latest zxid, they could be partitioned away and F3 F4 F5 are
> enough to form quorum to elect a new leader.
>
> Now partition healed, the commit of p on F1 and F2 should be dropped (in
> ZK, this is what "TRUNC" sync is for).
>
> >> L2 become new leader, he should skip p1.
>
> If your L2 is F2 here, p1 will not be skipped, since p1 is available on F2
> the new leader.
>
> On Tue, Sep 3, 2019 at 10:35 AM Alexander Shraer 
> wrote:
>
> > In case2, it is possible that p1 is committed or dropped. It depends on
> > whether L2 knows about p1.
> > Note that L2 needs the support of a quorum to become leader, and in ZK
> > since there is no state copy from followers to leader, the leader
> candidate
> > needs to have the longest log.
> > So, if L2's log includes p1 it will be committed; otherwise it will be
> > dropped.
> >
> > In case1 L2's log necessarily includes p1 since it is present at a quorum
> > and without having it in the log it's not possible to have a log more
> > up-to-date than that of a quorum / get the support of a quorum to become
> > leader.
> >
> > Alex
> >
> >
> > On Tue, Sep 3, 2019 at 4:52 AM Norbert Kalmar
>  > >
> > wrote:
> >
> > > Hi,
> > >
> > > That's a good question. So if I understand correctly, you are asking
> what
> > > happens if there is a new Leader Election in ZooKeeper, what is the
> "last
> > > seen zxid". I checked the ZAB protocol, it is not entirely clear for me
> > as
> > > well, but my understanding is that the last seen zxid is the last
> > > transaction, which is read from txnlogs in case of a recovery.
> Honestly,
> > > there's nothing else this could be read from. So if it hasn't been
> > > committed to the datatree (and that exists in memory anyway, at least
> > until
> > > a snapshot is taken), it is still the last txn that is logged by one of
> > the
> > > followers, so he will win the Leader Election, and the followers will
> get
> > > this txn as well.
> > > Anyone agree/disagree? :)
> > >
> > > Regards,
> > > Norbert
> > >
> > > On Mon, Sep 2, 2019 at 4:50 AM 121476...@qq.com <121476...@qq.com>
> > wrote:
> > >
> > > > hi, I'm new to zookeeper, and this problem has confused me for nearly
> two
> > > > months...
> > > > papers tell me that zab must satisfy:
> > > > A message delivered by one sever must be delivered on quorum.
> > > > A message skipped must always be skipped.
> > > > Then consider two cases below, L is short for leader, F is short for
> > > > follower, p is short for proposal.
> > > > Case1:
> > > > L send p1 to F2 F3 F4 F5.
> > > > F2 F3 ack p1, reach a quorum.
> > > > L1 is about to send commit but failed...
> > > > L2 become new leader, he should commit.
> > > >
> > > > Case2:
> > > > L1 send p1 to F2 F3 F4 F5.
> > > > Only F2 ack p1, not reach a quorum.
> > > > Then L1 failed...
> > > > L2 become new leader, he should skip p1.
> > > >
> > > > i think L2 should handle the cases in election phase. but
> how
> > L2
> > > > can know the global state and decide if commit p1 or skip p1?
> > > > if anyone helps, I will much appreciate it.
> > > >
> > > >
> > > >
> > > > 121476...@qq.com
> > > >
> > >
> >
>


Re: a misunderstanding of ZAB

2019-09-03 Thread Alexander Shraer
In case2, it is possible that p1 is committed or dropped. It depends on
whether L2 knows about p1.
Note that L2 needs the support of a quorum to become leader, and in ZK
since there is no state copy from followers to leader, the leader candidate
needs to have the longest log.
So, if L2's log includes p1 it will be committed; otherwise it will be
dropped.

In case1 L2's log necessarily includes p1 since it is present at a quorum
and without having it in the log it's not possible to have a log more
up-to-date than that of a quorum / get the support of a quorum to become
leader.

Alex


On Tue, Sep 3, 2019 at 4:52 AM Norbert Kalmar 
wrote:

> Hi,
>
> That's a good question. So if I understand correctly, you are asking what
> happens if there is a new Leader Election in ZooKeeper, what is the "last
> seen zxid". I checked the ZAB protocol, it is not entirely clear for me as
> well, but my understanding is that the last seen zxid is the last
> transaction, which is read from txnlogs in case of a recovery. Honestly,
> there's nothing else this could be read from. So if it hasn't been
> committed to the datatree (and that exists in memory anyway, at least until
> a snapshot is taken), it is still the last txn that is logged by one of the
> followers, so he will win the Leader Election, and the followers will get
> this txn as well.
> Anyone agree/disagree? :)
>
> Regards,
> Norbert
>
> On Mon, Sep 2, 2019 at 4:50 AM 121476...@qq.com <121476...@qq.com> wrote:
>
> > hi, I'm new to zookeeper, and this problem has confused me for nearly two
> > months...
> > papers tell me that zab must satisfy:
> > A message delivered by one sever must be delivered on quorum.
> > A message skipped must always be skipped.
> > Then consider two cases below, L is short for leader, F is short for
> > follower, p is short for proposal.
> > Case1:
> > L send p1 to F2 F3 F4 F5.
> > F2 F3 ack p1, reach a quorum.
> > L1 is about to send commit but failed...
> > L2 become new leader, he should commit.
> >
> > Case2:
> > L1 send p1 to F2 F3 F4 F5.
> > Only F2 ack p1, not reach a quorum.
> > Then L1 failed...
> > L2 become new leader, he should skip p1.
> >
> > i think L2 should handle the cases in election phase. but how L2
> > can know the global state and decide if commit p1 or skip p1?
> > if anyone helps, I will much appreciate it.
> >
> >
> >
> > 121476...@qq.com
> >
>


Re: About ZooKeeper Dynamic Reconfiguration

2019-08-21 Thread Alexander Shraer
That's great! Thanks for sharing.

> Added benefit is that we can also control which data center gets the
quorum
> in case of a network outage between the two.

Can you explain how this works? In case of a network outage between two
DCs, one of them has a quorum of participants and the other doesn't.
The participants in the smaller set should not be operational at this time,
since they can't get quorum, no ?

Thanks,
Alex
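
A minimal sketch of the probe-and-demote step Chris describes below,
assuming the 3.5 AdminServer on its default port 8080 (hostnames and the
demotion policy are hypothetical):

if ! curl -sf http://zk4:8080/commands/ruok > /dev/null; then
  # zk4 looks dead: demote it to observer so it no longer counts toward quorum
  ./bin/zkCli.sh -server zk1:2181 \
    reconfig -add "server.4=zk4:2888:3888:observer;2181"
fi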

On Wed, Aug 21, 2019 at 7:55 AM Cee Tee  wrote:

> We have solved this by implementing a 'zookeeper cluster balancer', it
> calls the admin server api of each zookeeper to get the current status and
> will issue dynamic reconfigure commands to change dead servers into
> observers so the quorum is not in danger. Once the dead servers reconnect,
> they take the observer role and are then reconfigured into participants
> again.
>
> Added benefit is that we can also control which data center gets the
> quorum
> in case of a network outage between the two.
> Regards
> Chris
>
> On 21 August 2019 16:42:37 Alexander Shraer  wrote:
>
> > Hi,
> >
> > Reconfiguration, as implemented, is not automatic. In your case, when
> > failures happen, this doesn't change the ensemble membership.
> > When 2 of 5 fail, this is still a minority, so everything should work
> > normally, you just won't be able to handle an additional failure. If
> you'd
> > like
> > to remove them from the ensemble, you need to issue an explicit
> > reconfiguration command to do so.
> >
> > Please see details in the manual:
> > https://zookeeper.apache.org/doc/r3.5.5/zookeeperReconfig.html
> >
> > Alex
> >
> > On Wed, Aug 21, 2019 at 7:29 AM Gao,Wei  wrote:
> >
> >> Hi
> >>I encounter a problem which blocks my development of load balance
> using
> >> ZooKeeper 3.5.5.
> >>Actually, I have a ZooKeeper cluster which comprises of five zk
> >> servers. And the dynamic configuration file is as follows:
> >>
> >>   server.1=zk1:2888:3888:participant;0.0.0.0:2181
> >>   server.2=zk2:2888:3888:participant;0.0.0.0:2181
> >>   server.3=zk3:2888:3888:participant;0.0.0.0:2181
> >>   server.4=zk4:2888:3888:participant;0.0.0.0:2181
> >>   server.5=zk5:2888:3888:participant;0.0.0.0:2181
> >>
> >>   The zk cluster can work fine if every member works normally. However,
> if
> >> say two of them are suddenly down without previously being notified,
> >> the dynamic configuration file shown above will not be synchronized
> >> dynamically, which leads to the zk cluster fail to work normally.
> >>   I think this is a very common case which may happen at any time. If
> so,
> >> how can we resolve it?
> >>   Really look forward to hearing from you!
> >> Thanks
> >>
>
>
>
>


Re: About ZooKeeper Dynamic Reconfiguration

2019-08-21 Thread Alexander Shraer
Hi,

Reconfiguration, as implemented, is not automatic. In your case, when
failures happen, this doesn't change the ensemble membership.
When 2 of 5 fail, this is still a minority, so everything should work
normally, you just won't be able to handle an additional failure. If you'd
like
to remove them from the ensemble, you need to issue an explicit
reconfiguration command to do so.

Please see details in the manual:
https://zookeeper.apache.org/doc/r3.5.5/zookeeperReconfig.html
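
For example, to explicitly drop two failed servers (say ids 4 and 5) from
the five-server ensemble below, a single command against any live member
does it:

./bin/zkCli.sh -server zk1:2181 reconfig -remove 4,5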

Alex

On Wed, Aug 21, 2019 at 7:29 AM Gao,Wei  wrote:

> Hi
>    I encounter a problem which blocks my development of load balancing using
> ZooKeeper 3.5.5.
>    Actually, I have a ZooKeeper cluster which comprises five zk
> servers. And the dynamic configuration file is as follows:
>
>   server.1=zk1:2888:3888:participant;0.0.0.0:2181
>   server.2=zk2:2888:3888:participant;0.0.0.0:2181
>   server.3=zk3:2888:3888:participant;0.0.0.0:2181
>   server.4=zk4:2888:3888:participant;0.0.0.0:2181
>   server.5=zk5:2888:3888:participant;0.0.0.0:2181
>
>   The zk cluster can work fine if every member works normally. However, if
> say two of them are suddenly down without previously being notified,
> the dynamic configuration file shown above will not be synchronized
> dynamically, which leads the zk cluster to fail to work normally.
>   I think this is a very common case which may happen at any time. If so,
> how can we resolve it?
>   Really look forward to hearing from you!
> Thanks
>


Re: Apache Zookeeper Bugs

2019-08-01 Thread Alexander Shraer
Thanks Xiaoqin! Would you be able to open a Jira for this and perhaps
submit a PR ?
https://cwiki.apache.org/confluence/display/ZOOKEEPER/HowToContribute

On Thu, Aug 1, 2019 at 8:23 AM Xiaoqin Fu  wrote:

> Dear developers:
>  I am a Ph.D. student at Washington State University. I applied a dynamic
> taint analyzer (distTaint) to Apache ZooKeeper (version 3.4.11), and then
> found several bugs, which exist from 3.4.11 through 3.4.14 and in 3.5.5, on
> tainted paths:
> 1. In org.apache.zookeeper.server.ZooKeeperServer:
> public ZooKeeperServer(FileTxnSnapLog txnLogFactory, int tickTime,
> int minSessionTimeout, int maxSessionTimeout, ZKDatabase zkDb)
> {
> ..
> LOG.info("Created server with tickTime " + tickTime
> + " minSessionTimeout " + getMinSessionTimeout()
> + " maxSessionTimeout " + getMaxSessionTimeout()
> + " datadir " + txnLogFactory.getDataDir()
> + " snapdir " + txnLogFactory.getSnapDir());
> ..
> }
> public void finishSessionInit(ServerCnxn cnxn, boolean valid)
> ..
> if (!valid) {
> LOG.info("Invalid session 0x"
> + Long.toHexString(cnxn.getSessionId())
> + " for client "
> + cnxn.getRemoteSocketAddress()
> + ", probably expired");
> cnxn.sendBuffer(ServerCnxnFactory.closeConn);
> } else {
> LOG.info("Established session 0x"
> + Long.toHexString(cnxn.getSessionId())
> + " with negotiated timeout " +
> cnxn.getSessionTimeout()
> + " for client "
> + cnxn.getRemoteSocketAddress());
> cnxn.enableRecv();
> }
> ..
> }
> Sensitive information about DataDir, SnapDir, SessionId and
> RemoteSocketAddress may be leaked. I think that it is better to add
> LOG.isInfoEnabled() conditional statements:
>public ZooKeeperServer(FileTxnSnapLog txnLogFactory, int tickTime,
> int minSessionTimeout, int maxSessionTimeout, ZKDatabase zkDb)
> {
> ..
> if (LOG.isInfoEnabled())
> LOG.info("Created server with tickTime " + tickTime
> + " minSessionTimeout " + getMinSessionTimeout()
> + " maxSessionTimeout " + getMaxSessionTimeout()
> + " datadir " + txnLogFactory.getDataDir()
> + " snapdir " + txnLogFactory.getSnapDir());
> ..
> }
> public void finishSessionInit(ServerCnxn cnxn, boolean valid) {
> ..
> if (!valid) {
> if (LOG.isInfoEnabled())
> LOG.info("Invalid session 0x"
> + Long.toHexString(cnxn.getSessionId())
> + " for client "
> + cnxn.getRemoteSocketAddress()
> + ", probably expired");
> cnxn.sendBuffer(ServerCnxnFactory.closeConn);
> } else {
> if (LOG.isInfoEnabled())
> LOG.info("Established session 0x"
> + Long.toHexString(cnxn.getSessionId())
> + " with negotiated timeout " +
> cnxn.getSessionTimeout()
> + " for client "
> + cnxn.getRemoteSocketAddress());
> cnxn.enableRecv();
> }
> ..
> }
> The LOG.isInfoEnabled() conditional statement already exists in
> org.apache.zookeeper.server.persistence.FileTxnLog:
> public synchronized boolean append(TxnHeader hdr, Record txn) throws
> IOException {
> { ..
>   if(LOG.isInfoEnabled()){
> LOG.info("Creating new log file: " + Util.makeLogName(hdr.getZxid()));
>   }
> ..
> }
>
> 2. In org.apache.zookeeper.ClientCnxn$SendThread,
> void readResponse(ByteBuffer incomingBuffer) throws IOException {
> ..
> LOG.warn("Got server path " + event.getPath()
> + " which is too short for chroot path "
> + chrootPath);
> ..
> }
> Sensitive information about event path and chroot path may be leaked. The
> LOG.isWarnEnabled() conditional statement should be added:
>void readResponse(ByteBuffer incomingBuffer) throws IOException {
> ..
> if (LOG.isWarnEnabled())
> LOG.warn("Got server path " + event.getPath()
> + " which is too short for chroot path "
> + chrootPath);
> ..
> }
>
> 3. In org.apache.zookeeper.server.ZooTrace, there are two relevant methods
> which all use the same conditional statements:
> public static void logTraceMessage(Logger log, long mask, String msg) {
> if (isTraceEnabled(log, mask)) {
> log.trace(msg);
> }
> }
>
> static public void logQuorumPacket(Logger log, long mask,
> char direction, QuorumPacket qp)
> {
> if (isTraceEnabled(log, mask)) {
> logTraceMessage(log, mask, direction +
> " " + LearnerHandler.packetToString(qp));
>  }
> }
>
> 

Re: How to commit last epoch proposal in zab

2019-07-29 Thread Alexander Shraer
The commit is not actually written to the log. The log is updated before a
server ACKs a proposal - and what's in the log is what matters for
recovery.
In your example, server1 sent a commit for p2, so it got at least one ACK
from server2 or server3. Since in your example server2 has been elected, it
has the longest log, so it has to have p2 in the log. So p2 is going to be
committed by NEWLEADER.

I might not be remembering this completely accurately, hope someone can
correct me if I missed something.
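
Concretely, the recovery in this example plays out roughly as follows (a
sketch of the reasoning above, not a transcript of the code):

1. server1 (old leader) crashes holding p1, p2, c1, c2.
2. Since server1 committed p2 (c2), a quorum ACKed p2, and that quorum must
   include server2 or server3.
3. server2 is elected with the longest log, so its log necessarily contains
   p2 (and c1).
4. server2 sends DIFF + NEWLEADER to server3; once a quorum ACKs NEWLEADER,
   everything in the new leader's log up to and including p2 is committed
   on server2 as well as server3.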


On Sat, Jul 27, 2019 at 5:23 AM chen dongming  wrote:

> hi,
>
> I have a question about zab.
>
> server1(leader): p1, p2, c1, c2
>
> server2: p1, p2, c1
>
> server3: p1, p2
>
> At this time, server1 is down, server2 become leader.
>
> I read the code of LearnerHandler.java, I think p1, p2 in server3 can be
> committed by DIFF, NEWLEADER.
>
> But when is p2 in server2 committed?
>
>
>


Re: ZK 3.5.5 : SecureClientPort and Server Specs

2019-07-01 Thread Alexander Shraer
I think that Fred is correct - secureClientPort and secureClientPortAddress
were not made part of the dynamic configuration (yet ?), so unlike other
parameters, they are static.
Fred, perhaps you could open a Jira to ask for this feature ?

Thanks,
Alex

On Mon, Jul 1, 2019 at 2:58 PM Andor Molnar  wrote:

> Hi Fred,
>
> I don’t think this server spec is accurate.
> clientPort and clientPortAddress as well as secureClientPort and
> secureClientPortAddress are defined in the main section of config file, not
> within Cluster Options:
>
>
> https://zookeeper.apache.org/doc/r3.5.5/zookeeperAdmin.html#sc_configuration
> <
> https://zookeeper.apache.org/doc/r3.5.5/zookeeperAdmin.html#sc_configuration
> >
>
> e.g. You should have something like:
>
> clientPort=2181
> clientPortAddress=127.0.0.1
> secureClientPort=1181
> secureClientPortAddress=…
>
> server.1=…
> server.2=…
>
> In your zoo.cfg config file.
>
> Regards,
> Andor
>
>
>
> > On 2019. Jun 19., at 17:28, Fred Eisele 
> wrote:
> >
> > The server specification is ...
> > server.<positive id> = <address1>:<port1>:<port2>[:role];[<client port address>:]<client port>
> > The clientPort and clientPortAddress are accommodated but I do not see a
> > provision for secureClientPort.
> > I presume this means it is a static parameter as before?
>
>


Great talk from Ben Reed about the origins of ZooKeeper

2019-06-26 Thread Alexander Shraer
https://atscaleconference.com/videos/systems-scale-2019-welcome-keynote/


Re: majority of non-failing machines VS quorum

2019-05-30 Thread Alexander Shraer
yep, for odd n that's right.

On Thu, May 30, 2019 at 1:30 PM Joel Mestres 
wrote:

> ok great, so the minimum quorum is always F + 1 considering n as odd,
> right?
>
> On Thu, May 30, 2019 at 3:38 PM Alexander Shraer 
> wrote:
>
> > If you're using "majority quorums" (the default in ZK),
> F=floor((n-1)/2). A
> > quorum is any set containing a majority (or more) of servers.
> > The basic requirement is that any two quorums must intersect.
> >
> >
> > On Wed, May 29, 2019 at 6:55 PM Patrick Hunt  wrote:
> >
> > > 2n+1 = ensemble size required to survive n failed zkservers (servers
> not
> > in
> > > the quorum)
> > >
> > > iow: 3 nodes means 1 zkserver can fail and the service is still up. 5
> and
> > > you can survive 2 failures.
> > >
> > > Patrick
> > >
> > > On Wed, May 29, 2019 at 4:43 PM Joel Mestres  >
> > > wrote:
> > >
> > > > Hello, what is the relation between the number F of failing machines
> > > > that the cluster can tolerate and the quorum configuration? Does F
> > > > determine the possible quorum sizes, or can F be greater / smaller
> > > > than the quorum?
> > > > thanks in advance for your response!
> > > >
> > >
> >
>


Re: majority of non-failing machines VS quorum

2019-05-30 Thread Alexander Shraer
If you're using "majority quorums" (the default in ZK), F=floor((n-1)/2). A
quorum is any set containing a majority (or more) of servers.
The basic requirement is that any two quorums must intersect.
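
A quick worked check of the formula (plain Java, for illustration only):

public class QuorumMath {
    public static void main(String[] args) {
        for (int n = 3; n <= 7; n += 2) {
            int f = (n - 1) / 2;     // max tolerated failures: floor((n-1)/2)
            int quorum = n / 2 + 1;  // smallest majority; equals f + 1 for odd n
            System.out.printf("n=%d  f=%d  quorum=%d%n", n, f, quorum);
        }
        // prints: n=3 f=1 quorum=2 / n=5 f=2 quorum=3 / n=7 f=3 quorum=4
    }
}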


On Wed, May 29, 2019 at 6:55 PM Patrick Hunt  wrote:

> 2n+1 = ensemble size required to survive n failed zkservers (servers not in
> the quorum)
>
> iow: 3 nodes means 1 zkserver can fail and the service is still up. 5 and
> you can survive 2 failures.
>
> Patrick
>
> On Wed, May 29, 2019 at 4:43 PM Joel Mestres 
> wrote:
>
> > Hello, what is the relation between the number F of failing machines
> > that the cluster can tolerate and the quorum configuration? Does F
> > determine the possible quorum sizes, or can F be greater / smaller than
> > the quorum?
> > thanks in advance for your response!
> >
>


Re: Dynamic Config

2019-05-30 Thread Alexander Shraer
Hi,

1. Right - only the configuration parameters that live in the dynamic file
are controlled by dynamic reconfig. The dynamic files are
kept in sync across all the ZK servers, whereas the static files may not be
the same.

There is a backward compatibility mode, where you start up a server without
a dynamic file, and ZK copies over whatever
it can from the static to the dynamic file. From that point, you're not
supposed to manually change the dynamic file - ZK
manages that for you, and you only affect the configuration via reconfig
commands.

2. Dynamic files are written out upon commit of new configurations created
via reconfig, or, more precisely, when a server learns about such a commit.
The number is the zxid of the commit.

3. I don't think there's any purge job that was implemented, so the old
copies will remain on disk until removed manually.

4. There is a fixed set of things that can live in the dynamic file. You
can't just put anything there, because ZK still looks for other config
parameters in the static file.
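
For illustration, a typical split between the two files might look like this
(paths, hosts and the zxid suffix are made up):

# zoo.cfg - static, per-server settings; not touched by reconfig
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
dynamicConfigFile=/etc/zookeeper/zoo.cfg.dynamic.100000041

# zoo.cfg.dynamic.100000041 - membership, managed by ZK; the suffix is
# the zxid at which this configuration was committed
server.1=host1:2888:3888:participant;2181
server.2=host2:2888:3888:participant;2181
server.3=host3:2888:3888:participant;2181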

Please see details in the manual:
https://zookeeper.apache.org/doc/r3.5.5/zookeeperReconfig.html

Thanks,
Alex

On Thu, May 30, 2019 at 10:49 AM rammohan ganapavarapu <
rammohanga...@gmail.com> wrote:

> Hi,
>
> One more question
>
> 4. Is there any list of configs that should only be in the static file to
> boot up zookeeper? Or can I have something like this?
>
> cat zoo.cfg
> dynamicConfigFile=/opt/zookeeper/conf/dynamic.cfg
>
> cat dynamic.cfg
> # All zookeeper configurations
>
> will this work?
>
> On Thu, May 30, 2019 at 9:59 AM rammohan ganapavarapu <
> rammohanga...@gmail.com> wrote:
>
> > Hi,
> >
> > I have  few questions regarding dynamic reconfig feature,
> >
> > 1. Can this feature only reconfigure the properties defined in the
> > dynamic configuration file, and not the configs in the static default
> > zoo.cfg file?
> > 2. What are the criteria for creating a new version extension for the
> > dynamic config file (e.g. zoo.cfg.dynamic.1)? I mean, when does zk
> > create a new version file? I changed a property in the static file and
> > restarted zk but it didn't create a new version file, so does it only
> > create a new version when a config in the dynamic file changes?
> > 3. How many copies/versions of these dynamic config files will get
> > created, and is there a purge task that zk runs to clean up older
> > version files?
> >
> >
> >
> > Thanks,
> > Ram
> >
>


Re: Is it safe to reuse zookeeper replica ID when reprovisioning?

2019-04-01 Thread Alexander Shraer
Just wanted to add that it's not important to wait until the replaced node
has fully synced - what's important is to wait until a quorum that doesn't
include it has the latest data before starting the replacement process
(which is like manually losing data).

So, you could logically remove it (this makes it non-voting, and makes sure
that a quorum that doesn't include it is up-to-date). Then you can
immediately add it back, even if it isn't fully synced yet. This is
probably also better to do in case you do have a failure - if C fails and
never recovers but A has the latest data and B is a voter, then B can recover
from A and they can continue normally.
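
From the CLI, the remove-then-add sequence looks roughly like this (server
id, hosts and ports are illustrative):

[zk: host1:2181(CONNECTED) 0] reconfig -remove 2
[zk: host1:2181(CONNECTED) 1] reconfig -add server.2=host2:2888:3888:participant;2181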




On Mon, Apr 1, 2019 at 5:46 PM Alexander Shraer  wrote:

> Lets say you have nodes A, B, C. Only B and C have latest data. You're
> trying to replace B.
> You replace B with a new server but before its in sync, C fails. What
> happens ?
>
> Option 1 (no reconfiguration): A and B are both registered as voting
> members, they form a majority out of 3, B syncs from A and they happily
> continue together. Since neither have the latest data, this is data loss.
> Option 2 (with reconfiguration): By logically removing B first, you're
> bringing A up to date. So A and C both have the latest data now. A is going
> to be stalled while C is down and will not form a quorum with B, since B
> isn't registered to be able to vote. If C never recovers, you can recover
> manually by updating config files.
>
>
> On Mon, Apr 1, 2019 at 5:10 PM David Anderson  wrote:
>
>> On Mon, Apr 1, 2019 at 4:48 PM Alexander Shraer 
>> wrote:
>>
>> > Hi,
>> >
>> > I think that one of the problems with the proposed method is that you
>> may
>> > end-up having a majority of servers that don't have the latest state
>> > (imagine that there is a minority failure while your replaced
>> > node hasn't been brought up to date yet).
>>
>>
>> > Have you considered using dynamic reconfiguration ? Removing the nodes
>> > logically first, then replacing them and adding back in ? You can do
>> > multiple servers at a time this way.
>>
>>
>> Does dynamic reconfiguration as you suggest here buy me anything in a
>> 3-node cluster? No matter what I'm going to be at N+0 during the
>> transition, so doesn't it just add more steps for the same result?
>
>
>> Or, you can give new servers higher ids, add them using reconfig, and
>> later
>> > remove the old servers. Reconfiguration ensures that a quorum always has
>> > the data.
>> >
>>
>> My admittedly terrible motivation for avoiding that is that I want to
>> preserve hostnames, to avoid reconfiguring clients. This is in a cloud
>> environment where DNS is tied to instance name, so I can't play tricks at
>> the network layer - at some point I have to delete the old instances and
>> set up new ones with the same name. I suppose I could do a careful dance
>> where I grow to 5 nodes, then do a rolling removal/readd of the first 3,
>> so
>> that I can stay at N+1 during the replacement, and just trust that clients
>> can reach at least one of the first 3 replicas to discover the entire
>> cluster.
>>
>> - Dave
>>
>>
>> > Alex
>> >
>> >
>> >
>> > On Mon, Apr 1, 2019 at 2:51 PM David Anderson  wrote:
>> >
>> > > Hi,
>> > >
>> > > I have a running Zookeeper (3.5) cluster where the machines need to be
>> > > replaced. I was thinking of just setting the same ID on each new
>> > > machine, and then doing a rolling replacement: take down old ID 1,
>> > > start new ID 1, let it rejoin the cluster and replicate the state,
>> > > then continue with the other replicas.
>> > >
>> > > I'm finding conflicting information on the internet about the safety
>> > > of this. The Apache Kafka FAQ says to do exactly this when replacing a
>> > > failed Zookeeper replica, and the new machine will just replicate the
>> > > state before participating in the quorum. Other places on the internet
>> > > say that reusing the ID without also copying over the state directory
>> > > will break assumptions that ZAB makes about replicas, with bad (but
>> > > nondescript) consequences.
>> > >
>> > > So, is it safe to reuse IDs in the way I described? If not, what's the
>> > > suggested procedure for a rolling replacement of all cluster replicas?
>> > >
>> > > Thanks,
>> > > - Dave
>> > >
>> >
>>
>


Re: Is it safe to reuse zookeeper replica ID when reprovisioning?

2019-04-01 Thread Alexander Shraer
Lets say you have nodes A, B, C. Only B and C have latest data. You're
trying to replace B.
You replace B with a new server but before its in sync, C fails. What
happens ?

Option 1 (no reconfiguration): A and B are both registered as voting
members, they form a majority out of 3, B syncs from A and they happily
continue together. Since neither have the latest data, this is data loss.
Option 2 (with reconfiguration): By logically removing B first, you're
bringing A up to date. So A and C both have the latest data now. A is going
to be stalled while C is down and will not form a quorum with B, since B
isn't registered to be able to vote. If C never recovers, you can recover
manually by updating config files.


On Mon, Apr 1, 2019 at 5:10 PM David Anderson  wrote:

> On Mon, Apr 1, 2019 at 4:48 PM Alexander Shraer  wrote:
>
> > Hi,
> >
> > I think that one of the problems with the proposed method is that you may
> > end-up having a majority of servers that don't have the latest state
> > (imagine that there is a minority failure while your replaced
> > node hasn't been brought up to date yet).
>
>
> > Have you considered using dynamic reconfiguration ? Removing the nodes
> > logically first, then replacing them and adding back in ? You can do
> > multiple servers at a time this way.
>
>
> Does dynamic reconfiguration as you suggest here buy me anything in a
> 3-node cluster? No matter what I'm going to be at N+0 during the
> transition, so doesn't it just add more steps for the same result?


> Or, you can give new servers higher ids, add them using reconfig, and later
> > remove the old servers. Reconfiguration ensures that a quorum always has
> > the data.
> >
>
> My admittedly terrible motivation for avoiding that is that I want to
> preserve hostnames, to avoid reconfiguring clients. This is in a cloud
> environment where DNS is tied to instance name, so I can't play tricks at
> the network layer - at some point I have to delete the old instances and
> set up new ones with the same name. I suppose I could do a careful dance
> where I grow to 5 nodes, then do a rolling removal/readd of the first 3, so
> that I can stay at N+1 during the replacement, and just trust that clients
> can reach at least one of the first 3 replicas to discover the entire
> cluster.
>
> - Dave
>
>
> > Alex
> >
> >
> >
> > On Mon, Apr 1, 2019 at 2:51 PM David Anderson  wrote:
> >
> > > Hi,
> > >
> > > I have a running Zookeeper (3.5) cluster where the machines need to be
> > > replaced. I was thinking of just setting the same ID on each new
> > > machine, and then doing a rolling replacement: take down old ID 1,
> > > start new ID 1, let it rejoin the cluster and replicate the state,
> > > then continue with the other replicas.
> > >
> > > I'm finding conflicting information on the internet about the safety
> > > of this. The Apache Kafka FAQ says to do exactly this when replacing a
> > > failed Zookeeper replica, and the new machine will just replicate the
> > > state before participating in the quorum. Other places on the internet
> > > say that reusing the ID without also copying over the state directory
> > > will break assumptions that ZAB makes about replicas, with bad (but
> > > nondescript) consequences.
> > >
> > > So, is it safe to reuse IDs in the way I described? If not, what's the
> > > suggested procedure for a rolling replacement of all cluster replicas?
> > >
> > > Thanks,
> > > - Dave
> > >
> >
>


Re: Is it safe to reuse zookeeper replica ID when reprovisioning?

2019-04-01 Thread Alexander Shraer
Hi,

I think that one of the problems with the proposed method is that you may
end-up having a majority of servers that don't have the latest state
(imagine that there is a minority failure while your replaced
node hasn't been brought up to date yet).

Have you considered using dynamic reconfiguration ? Removing the nodes
logically first, then replacing them and adding back in ? You can do
multiple servers at a time this way.
Or, you can give new servers higher ids, add them using reconfig, and later
remove the old servers. Reconfiguration ensures that a quorum always has
the data.

Alex



On Mon, Apr 1, 2019 at 2:51 PM David Anderson  wrote:

> Hi,
>
> I have a running Zookeeper (3.5) cluster where the machines need to be
> replaced. I was thinking of just setting the same ID on each new
> machine, and then doing a rolling replacement: take down old ID 1,
> start new ID 1, let it rejoin the cluster and replicate the state,
> then continue with the other replicas.
>
> I'm finding conflicting information on the internet about the safety
> of this. The Apache Kafka FAQ says to do exactly this when replacing a
> failed Zookeeper replica, and the new machine will just replicate the
> state before participating in the quorum. Other places on the internet
> say that reusing the ID without also copying over the state directory
> will break assumptions that ZAB makes about replicas, with bad (but
> nondescript) consequences.
>
> So, is it safe to reuse IDs in the way I described? If not, what's the
> suggested procedure for a rolling replacement of all cluster replicas?
>
> Thanks,
> - Dave
>


Re: Zookeeper syncing with Curator

2019-03-18 Thread Alexander Shraer
> I have to make sure that a read always reflects *all previous writes*
(which might be performed on another
zookeeper server and has not reached all other instances).

By doing a sync before reading, as you say, the read should indeed reflect
all *completed* previous writes, i.e., writes that were acknowledged to the
client issuing them,
even if some of the ZK replicas didn't receive them yet.

There is a caveat here, which is that the current implementation of sync
doesn't involve a quorum, and therefore its correctness is dependent on
certain timing assumptions.
Under some (hopefully very rare) leader replacement scenarios, sync might
not reflect the latest data in the system. There is a JIRA to fix this:
https://issues.apache.org/jira/browse/ZOOKEEPER-2136

I believe that if you issue a read after a sync, your read will be queued
at the local ZK Server until the sync completes, and only executed at that
time; you don't need to wait for sync completion before enqueuing the read.
The sync does not explicitly transfer data, its just a way to "flush" all
previous updates from the leader to your local server. So when the server
hears back a sync response, it knows that it also has all previous updates.

If I recall correctly, currently there is no effect on the path you specify
in sync, so it just brings all the data of your local server up-to-date. I
doubt that will change but I may be wrong. Syncing "/" is probably safest
even if something changes in the future.
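
In code, the sync-then-read pattern with the plain Java client might look
like this (a minimal sketch; error and retry handling omitted):

import org.apache.zookeeper.ZooKeeper;

public class SyncThenRead {
    // Issue sync("/") and queue the read right behind it; the local server
    // executes the read only after the sync completes, so the result
    // reflects all writes that completed before the sync was issued.
    static void strongRead(ZooKeeper zk, String path) {
        zk.sync("/", (rc, p, ctx) -> { /* completion is informational */ }, null);
        zk.getData(path, false,
                (rc, p, ctx, data, stat) ->
                        System.out.println("read " + p + ": " +
                                (data == null ? null : new String(data))),
                null);
    }
}

With Curator, if I remember its API right, the equivalent is
client.sync().forPath("/") followed by the read.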

> ZooKeeper is an eventually consistent system.

I have to disagree with Jordan a bit here. ZooKeeper is a strongly
consistent system, it is implemented using a variant of Paxos. From the
perspective of an individual replica, sure, the data is propagated
eventually. But strong / weak consistency of a system is usually determined
by considering the semantics of the API it provides to clients. If you just
do reads+writes you get "sequential consistency". If you do sync+read,
you'll get linearizability (if the JIRA above is fixed).  RDBMS provides
different abstractions (transactions, queries, secondary indices). ZK only
deals with individual operations and batches, not interactive transactions.
But for these you do get strong semantics in ZK.

> In a dynamic ensemble with lots of concurrent reads/writes there is no
such thing a read reflecting all active writes.

I think that the key here is that ZK allows you to do strong reads
(sync+read), which will reflect all *completed* writes. Not active writes
(not sure how these would be defined). Dynamic reconfiguration was designed
not to change the properties of a static ZK ensemble.


Alex


On Fri, Mar 15, 2019 at 3:55 PM Jordan Zimmerman 
wrote:

> Curator does nothing additional with sync. Sync is a feature of ZooKeeper
> not Curator. Curator merely exposes an API for it.
>
> -JZ
>
> > On Mar 14, 2019, at 9:35 AM, Robin Wolters 
> > 
> wrote:
> >
> > That is indeed an option, thanks.
> >
> > But for my own curiosity, how does the sync operation behave for Curator?
> > 1) Does it also sync the child nodes of the specified path?
> > 2) Does it sync (transfer data for) a node even if it was up to date?
> > 3) In Curator, would I have to wait for the callback of sync or can I
> > just use sync and go ahead, knowing the next operation is queued?
> >
> > Regards,
> > Robin
> >
> > On Wed, 13 Mar 2019 at 17:07, Jordan Zimmerman
> >  wrote:
> >>
> >> It sounds like you’re describing one of the Barrier recipes. Curator
> has several. I’d look to those as a possible solution.
> >>
> >> 
> >> Jordan Zimmerman
> >>
> >>> On Mar 13, 2019, at 9:56 AM, Robin Wolters 
> >>> 
> wrote:
> >>>
> >>> Thanks for the reply. I understand that this is not possible in
> general.
> >>>
> >>> In my case the read and write are started from the same overarching
> >>> application (but different zookeeper connections and hence possibly
> >>> different nodes).
> >>> I start the read only after I know the write has succeeded, but I
> >>> don't know if it has reached all nodes yet.
> >>> So I expected that a sync gives me the guarantee that the next read
> >>> reflects at least this specific write.
> >>> It's okay if possible further writes are not in yet.
> >>>
> >>> Is this "selective" consistency not possible with my approach?
> >>>
> >>> Best regards,
> >>> Robin
> >>>
> >>> On Wed, 13 Mar 2019 at 15:47, Jordan Zimmerman
> >>>  wrote:
> 
>  ZooKeeper is an eventually consistent system. Reads are always
> consistent in that they reflect previous writes, however it is not possible
> to do what you describe. Reads are fulfilled by the Node your client is
> connected to. Writes are always through the leader Node. In a dynamic
> ensemble with lots of concurrent reads/writes there is no such thing a read
> reflecting all active writes.
> 
>  You should consider a RDBMS like MySQL instead of something like
> ZooKeeper.
> 
>  
>  Jordan Zimmerman
> 
> > On Mar 

Re: test zookeeper observer

2018-10-26 Thread Alexander Shraer
Hi, look at that server’s log - it should say that it is observing.
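
A quick sketch of how an observer is usually set up and verified (host
names illustrative; see the observers guide for the authoritative syntax):

# On the observer itself, in zoo.cfg:
peerType=observer

# In every server's config, mark it in the server line:
server.4=observerhost:2888:3888:observer

# The 'stat' four-letter word then reports the role:
$ echo stat | nc observerhost 2181 | grep Mode
Mode: observer
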
On Fri, Oct 26, 2018 at 5:21 AM lamriq  wrote:

> Hello
>
> I added a new Zookeeper server as observer, but I am not sure whether it
> works well or not. How can I test that the observer sends OBSERVERINFO
> and doesn't vote?
>
> Regards
> Rabii
>
>
>
> --
> Sent from: http://zookeeper-user.578899.n2.nabble.com/
>


Re: dynamic config file number

2018-06-18 Thread Alexander Shraer
The way it was implemented is that the version (which is printed in your
log, like version=1f001cc8d5) is not stored in the
dynamic config file, but is actually part of its file name. It corresponds
to the zxid at which the configuration was committed.
You should never change that manually, or copy it from a different cluster.
Instead you should either start with a static config file
which will then be automatically converted to a dynamic one, or with an
un-numbered dynamic one, as you suggest.
https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html#sc_reconfig_file

I don't remember exactly, but I'm guessing that when a server boots, it
uses the version in the file name to bootstrap its config info.
Then, when you reconfig, the zxid of the reconfig (which is also the
version of the new config) is lower than the config version your cluster
has (probably the new cluster committed fewer ops than the previous one, so
its zxid is smaller)
so it fails with an error that the config is stale (has lower zxid /
version than the one the server already has).


Alex

On Mon, Jun 18, 2018 at 8:04 AM, oo4load  wrote:

> I had a problem getting dynamic reconfig to work on new / clean clusters,
> if
> I copied the zoo.cfg and zoo.cfg.dynamic.(number) file over from an older
> installation.
>
>
> Here's what happens:
>
> [zk: localhost:2181(CONNECTED) 2] config
> server.1=srv5703h:2888:3888:participant;0.0.0.0:2181
> server.2=srv5703k:2888:3888:participant;0.0.0.0:2181
> server.3=srv5704y:2888:3888:participant;0.0.0.0:2181
> version=1f001cc8d5
>
> [zk: localhost:2181(CONNECTED) 3] reconfig -remove 3
> Committed new configuration:
> server.1=srv5703h:2888:3888:participant;0.0.0.0:2181
> server.2=srv5703k:2888:3888:participant;0.0.0.0:2181
> server.3=srv5704y:2888:3888:participant;0.0.0.0:2181
> version=1f001cc8d5
>
>
> As you can see the config version doesn't change.
> If you check the filesystem, on each Zookeeper a ".next" file is created
> with the new config, but it seems like it's never committed.
>
> -rw-r-. 1 prof prof 282 Jun 18 12:39 zoo.cfg
> -rw-r-. 1 prof prof 159 Jun 18 15:25 zoo.cfg.dynamic.1f001cc8d5
> -rw-r-. 1 prof prof 123 Jun 18 15:26 zoo.cfg.dynamic.next
>
>
> On the Zookeepers where the reconfig command was NOT run, the logs show the
> following message:
> 2018-06-18 15:26:56,491 [myid:3] - INFO  [ProcessThread(sid:3
> cport:-1)::PrepRequestProcessor@476] - Incremental reconfig
> 2018-06-18 15:26:56,493 [myid:3] - ERROR [ProcessThread(sid:3
> cport:-1)::QuorumPeer@1460] - setLastSeenQuorumVerifier called with stale
> config 4294967306. Current version: 133145872597
>
>
> After growing a ton of grey hairs we figured out that a new cluster must
> start with an "unnumbered" dynamic config file, and copying over an
> existing
> config always fails. Can anyone explain why that is ?
>
> Thanks,
>
> Chris
>
>
>
> --
> Sent from: http://zookeeper-user.578899.n2.nabble.com/
>


Re: Is the value of $MYID allowed to change across runs in an HA ZK deployment?

2018-02-05 Thread Alexander Shraer
Hi Jay,

Perhaps it also depends on the restart? If the restart is done gradually,
for example a leader is in the middle of collecting votes when one of the
voters gets a new id and votes twice instead of once ? If the restart is a
barrier, where all servers are shut down and then restarted, this shouldn't
happen.

In 3.5, cluster membership is written into the ZK database as well as
configuration files, and contains server ids and parameters (ports, IPs,
etc). If ids change, it sounds like the membership
information may be wrong.

Perhaps there are also some implications on the security-related configs ?
Someone else may want to comment on these.

In general, changing ids doesn't feel like a very safe method to me...

Cheers,
Alex




On Mon, Feb 5, 2018 at 10:47 AM,  wrote:

> Greetings Zookeepers,
>
> I'm investigating possible ways for Zookeeper to run safely on top of
> Kubernetes clusters.
>
> When the zookeeper containers come online, the value for $MYID is
> initially derived from the Kubernetes pod name.  All active pod names
> are guaranteed to be unique within the cluster at any given point in time.
>
> Example values:
>
> - zookeeper-0
> - zookeeper-1
> - zookeeper-2
>
> and the formula for $MYID is ((the trailing number of the pod name) + 1):
>
> - zookeeper-0 => $MYID=1
> - zookeeper-1 => $MYID=2
> - zookeeper-2 => $MYID=3
>
> The part I'm uncertain of is the relationship between $MYID and ensuring
> each zookeeper data set stays in sync with the rest of the cluster,
> particularly across container restarts.  Restarts can lead to Zookeeper
> data set being launched with a different value of $MYID compared with
> the previous run.  I.e., Zookeeper may have already run on any given
> data set in the past when the myid file contained a different value.
>
> Is it part of the mechanism used to ensure all follower members are in
> sync with the current leader?  It seems to me that if the leader (or
> followers) keep track of their peers via myid and it gets changed, there
> could be problems.
>
> Initial testing (without much load) has gone fine and things seem to
> work fine when launched with updated $MYID values.  I've also been
> perusing the ZK source code and inspecting how myid is used, and nothing
> stood out to indicate that this will lead to future problems.  However,
> experience dictates that with distributed systems the devil is often in
> nuanced details, so I'm hoping the experts out there may be able to shed
> light about the internal dependencies on the value of myid.
>
> Specific questions:
>
> - Is myid relied on to never change, or does it only need to be
> unique within the cluster at any given time?
>
> - What are the risks with changing myid in relation to ZK data set
> directories across runs?
>
> Your insights will be greatly appreciated!
>
> Kind regards,
> Jay Taylor
>
>
>


Re: how zookeeper promise FIFO client order

2017-11-14 Thread Alexander Shraer
Hi,

Specific implementations of Raft may guarantee client program order, but I
don't think that it directly follows from tcp order + state machine.
It matters whether commands are committed to the log according to program
order. For example, here's an implementation that seems
to be doing this:
http://atomix.io/copycat/docs/client-interaction/#preserving-program-order In
any case, this is probably not the right forum
for Raft questions :)

I'm not sure we want to do command queueing on the leader like in the link
above or maybe just reset the client session if we're missing requests. In
any case,
perhaps this is worth a discussion on a JIRA. What do others think ?
Baotiao, would you like to open one ?

Thanks,
Alex


On Tue, Nov 14, 2017 at 2:41 AM, baotiao  wrote:

> Hi Andor
>
> Another question is that if zookeeper only promises channel FIFO order, I
> think raft built upon tcp also promises FIFO client order, since the
> FIFO order is promised by the tcp connection, and almost all consensus
> algorithms apply the log to the state machine in order.
>
> If I misunderstand anything, please tell me.
>
>
>
> 
>
> Github: https://github.com/baotiao
> Blog: http://baotiao.github.io/
> Stackoverflow: http://stackoverflow.com/users/634415/baotiao
> Linkedin: http://www.linkedin.com/profile/view?id=145231990
>
> > On 14 Nov 2017, at 13:08, Andor Molnar  wrote:
> >
> > Oh, I see your problem now.
> > Abe is right, you can find the best answer in the book and the short
> > answer is, yes, it only promises channel fifo order.
> >
> > Regards,
> > Andor
> >
> >
> > On Tue, Nov 14, 2017 at 4:04 AM, baotiao  wrote:
> >
> >> Hello Abraham
> >>
> >> right, exactly.
> >>
> >> my confusion is whether the client FIFO order is for a client or only
> >> for a tcp connection
> >> 
> >>
> >> Github: https://github.com/baotiao
> >> Blog: http://baotiao.github.io/
> >> Stackoverflow: http://stackoverflow.com/users/634415/baotiao
> >> Linkedin: http://www.linkedin.com/profile/view?id=145231990
> >>
> >>> On 14 Nov 2017, at 08:12, Abraham Fine  wrote:
> >>>
> >>> Hello-
> >>>
> >>> My understanding is that the question is about the possibility of a
> race
> >>> condition between two client requests. I would take a look at the
> >>> section "Order in the Presence of Connection Loss" in the "ZooKeeper:
> >>> Distributed Process Coordination" book for the best answer to this
> >>> question.
> >>>
> >>> Thanks,
> >>> Abe
> >>>
> >>> On Mon, Nov 13, 2017, at 06:17, Andor Molnar wrote:
>  Hi baotiao,
> 
>  First, requests are acknowledged back to the client once the leader has
>  accepted and written them to its transaction log, which guarantees that
>  in case of a crash, pending transactions can be processed on restart.
>  Transaction IDs (zxid) are incremental and generated by the leader.
>  Second, Zab guarantees that if the leader broadcasts T and T' in that
>  order, each server must commit T before committing T'.
> 
>  With these 2 promises, I believe that FIFO is guaranteed by Zookeeper.
> 
>  Would you please clarify what you mean by "set b=1 operation is on
>  the way"?
> 
>  If "set b=1" is accepted by the leader, the client won't have to
> resend
>  it
>  on reconnect.
> 
>  Regards,
>  Andor
> 
> 
>  On Mon, Nov 13, 2017 at 5:01 AM, 陈宗志  wrote:
> 
> > I want to know in the following situation, how zookeeper promise
> client
> > FIFO order.
> >
> > the client sent three operation to server, set a = 1, set b = 1, set
> >> ready
> > = true.
> >
> > is it possible to this situation that the set a = 1 is process by the
> > leader, then there is something wrong with this tcp connection, this
> >> client
> > reconnect a new tcp connection to the leader, but the set b = 1
> >> operation
> > is on the way. then the client will use the new tcp connection to
> sent
> >> set
> > ready = true operation. so the set a = 1 is operated, set b = 1 is
> not
> >> and
> > set ready = true is operated too.
> >
> > the question is how zab promise client FIFO order?
> >
> > zab can resend all the operation that hasn't be replied from the
> >> leader.
> > then in this situation, when the client reconnect to the leader, it
> >> will
> > resent the operation set b = 1, set ready = true.
> >
> > is this the way the zab used to primise FIFO order?
> >
> > Thank you all
> >
> > --
> > ---
> > Blog: http://www.chenzongzhi.info
> > Twitter: https://twitter.com/baotiao  baotiao
> >>>
> > Git: https://github.com/baotiao
> >
> >>
> >>
>
>


Re: Any way to get information about cluster in CLI mode

2017-11-09 Thread Alexander Shraer
In 3.5 there is also the "config" CLI command described here:
https://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html#sc_reconfig_retrieving


Alex

On Tue, Nov 7, 2017 at 11:34 AM, Abraham Fine  wrote:

> Hi Pavel-
>
> The ZooKeeper CLI does not have a way to get information about the
> cluster. Although, there are other ways to get that information. You can
> use four letter words
> (https://zookeeper.apache.org/doc/r3.4.10/zookeeperAdmin.
> html#sc_zkCommands)
> or JMX (https://zookeeper.apache.org/doc/r3.4.10/zookeeperJMX.html).
>
> If you are running a version >= 3.5.0 you can use the AdminServer
> (https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperAdmin.html#sc_
> adminserver),
> which provides the same functionality as four letter words.
>
> Regarding your second question, it is ok to not include all of the nodes
> on the cluster. Just be aware that a client will only be aware of the
> nodes provided in the connection string and will not be able to connect
> if all of the nodes provided are unavailable.
>
> Hope this helps.
>
> Thanks,
> Abe
>
> On Tue, Nov 7, 2017, at 05:58, Pavel Drankov wrote:
> > Hi,
> >
> > I'd like to know: is there any way to get information about the cluster
> > (e.g. the number of nodes) from the Zookeeper CLI? If not, what should
> > be used instead?
> >
> > One more question: when providing the connectString (comma-separated
> > addresses), is it okay not to include all nodes of the cluster in the
> > string?
> >
> > Best wishes,
> > Pavel
>
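
For example, the 'srvr' four-letter word reports per-server state, including
the server's role in the ensemble (output abridged; host and values are
illustrative):

$ echo srvr | nc localhost 2181
Zookeeper version: 3.4.10-...
Latency min/avg/max: 0/1/12
Received: 4234
Sent: 4233
Connections: 2
Outstanding: 0
Zxid: 0x100000042
Mode: follower
Node count: 27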


Re: Zookeeper 3.5.3 reconfig blocked by ACL

2017-10-17 Thread Alexander Shraer
Hi,

Please look for "sc_reconfig_access_control"
Here:
https://github.com/apache/zookeeper/blob/master/docs/zookeeperReconfig.html

Thanks,
Alex
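
The gist of that section: reconfig operations need write permission on
/zookeeper/config, so either set an ACL granting it to your admin identity
or go through a superuser. A sketch of the superuser route (the digest value
is a placeholder you generate yourself with DigestAuthenticationProvider):

# Start each server with a super-user digest, e.g. via SERVER_JVMFLAGS:
-Dzookeeper.DigestAuthenticationProvider.superDigest=super:<base64-digest>

# Then authenticate in the CLI before reconfiguring:
[zk: localhost:2181(CONNECTED) 0] addauth digest super:<password>
[zk: localhost:2181(CONNECTED) 1] reconfig -remove 2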

On Tue, Oct 17, 2017 at 3:18 AM, oo4load  wrote:

> I have a 3.5.3 cluster where I am trying out the reconfig command. I am
> running with reconfigEnabled=true.
> When I try reconfig I run into an issue with ACL.
>
> [zk: localhost:2181(CONNECTED) 9] reconfig -remove 2
> Authentication is not valid :
>
> The config node is protected:
> [zk: localhost:2181(CONNECTED) 6] getAcl /zookeeper/config
> 'world,'anyone
> : r
>
>
> The way this is set up, it seems only a superuser-enabled cluster can use
> the reconfig command. Is that true, or am I missing something? The
> documentation never mentioned it.
>
>
>
>
> --
> Sent from: http://zookeeper-user.578899.n2.nabble.com/
>


Re: ZooKeeper Time Synchronization

2017-07-21 Thread Alexander Shraer
The general idea is to use time for availability but not correctness. So a
leader could be suspected as failed which may make the system unavailable
until a new one is elected but consistency is not affected.

Alex
On Fri, Jul 21, 2017 at 1:56 PM Michael Han  wrote:

> One clarification on "System Time" here - ZK uses two types of time/clock
>
> * The wall-clock time, which is recorded as part of zNode stats such as
> mtime and is exposed to users.
> * The monotonic clock which ZK uses in various uses (e.g. failure
> detection) to measure intervals. Note in 3.4 ZK still uses wall-clock for
> interval measuring so you may see interesting behavior when your system
> time changes, but this will be fixed in the coming 3.4.11 release.
>
> On Fri, Jul 21, 2017 at 11:48 AM, Sandeep Singh 
> wrote:
>
> > Adding to Amr's question.
> > A few things which I want to add:
> >
> > Does zookeeper use System Time for the things below?
> > 1) Leader election
> > 2) Deciding a slave is available/alive or not.
> > 3) Deciding leader/master is alive or not.
> > 4) Deciding a transaction timeout etc.
> > 5) Ordering the transaction etc.
> >
> > regards,
> > Sandeep.
> >
> >
> >
> > --
> > View this message in context: http://zookeeper-user.578899.
> > n2.nabble.com/ZooKeeper-Time-Synchronization-tp7583217p7583223.html
> > Sent from the zookeeper-user mailing list archive at Nabble.com.
> >
>
>
>
> --
> Cheers
> Michael.
>


Re: ZooKeeper Time Synchronization

2017-07-21 Thread Alexander Shraer
As far as I understand:

1) no
2) yes
3) yes
4) yes
5) no, except for the sync command (there is a jira open for that)

Others please correct me if I'm wrong


Thanks
Alex

On Fri, Jul 21, 2017 at 11:52 AM Sandeep Singh 
wrote:

> Adding to Amr's question.
> A few things which I want to add:
>
> Does zookeeper use System Time for the things below?
> 1) Leader election
> 2) Deciding a slave is available/alive or not.
> 3) Deciding leader/master is alive or not.
> 4) Deciding a transaction timeout etc.
> 5) Ordering the transaction etc.
>
> regards,
> Sandeep.
>
>
>
> --
> View this message in context:
> http://zookeeper-user.578899.n2.nabble.com/ZooKeeper-Time-Synchronization-tp7583217p7583223.html
> Sent from the zookeeper-user mailing list archive at Nabble.com.
>


Re: gracefully remove a node from the ensamble

2017-07-14 Thread Alexander Shraer
Well, first of all you need to bootstrap a system - so all the nodes should
know of each other. This hasn't changed in 3.5.
When you add a new server, you also need to bootstrap its config file with
something (there are a few suggestions in the manual) - it doesn't need to
be the latest config but it has to include the leader and it must be
specified in a way that avoids a split brain. Once the new server talks
with the leader,
it syncs the latest configuration (something like what you're saying). Then
you can issue a command to formally add it to the cluster.
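
That last step is a single CLI command, e.g. (id, host and ports
illustrative):

[zk: localhost:2181(CONNECTED) 0] reconfig -add server.4=newhost:2888:3888:participant;2181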


On Fri, Jul 14, 2017 at 2:09 PM, Luigi Tagliamonte <
luigi.tagliamont...@gmail.com> wrote:

> Thank you for the reply
> I was thinking that this whole automatic reconfiguration was something like
> in Cassandra... you have a seed node and when a new node boots it gets the
> info from the seed. Is something like that available?
> Regards
> L.
>


Re: gracefully remove a node from the ensamble

2017-07-14 Thread Alexander Shraer
> java.lang.RuntimeException: My id 2 not in the peer list

if the server's id is 2, a line for server 2 should be in the config file.
More generally, the dynamic config file should be the same at both servers
and include both servers. The documentation should be helpful.

https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html
https://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html

-zoo_replicated1.cfg.dynamic:
server.1=zook.mydomain.com:2888:3888;
server.2=zook.mydomain.com::;

-zoo_replicated2.cfg.dynamic:
server.1=zook.mydomain.com:2888:3888;
server.2=zook.mydomain.com::;




On Fri, Jul 14, 2017 at 12:45 PM, Luigi Tagliamonte <
luigi.tagliamont...@gmail.com> wrote:

> Thank Alexander,
> I'm giving a shot to 3.5.3.
> I have 2 servers, the first one has:
>
> -zoo.cfg :
> tickTime=2000
> initLimit=10
> syncLimit=5
> dataDir=/var/lib/zookeeper/data
> reconfigEnabled=true
> standaloneEnabled=false
> dynamicConfigFile=/etc/zookeeper/bin/conf/zoo_replicated1.cfg.dynamic
>
> -zoo_replicated1.cfg.dynamic:
> server.1=zook.mydomain.com:2888:3888
>
> - myid: 1
>
> On the second server I'm using the same zoo.cfg and
> zoo_replicated1.cfg.dynamic
> and I only changed the id to 2.
> I'm getting the following in the logs:
>
> 2017-07-14 19:43:25,515 - INFO  [main:QuorumPeerConfig@117] - Reading
> configuration from: /etc/zookeeper/zoo.cfg
> 2017-07-14 19:43:25,518 - INFO  [main:QuorumPeerConfig@317] - clientPort
> is
> not set
> 2017-07-14 19:43:25,519 - INFO  [main:QuorumPeerConfig@331] -
> secureClientPort is not set
> 2017-07-14 19:43:25,579 - WARN  [main:QuorumPeerConfig@590] - No server
> failure will be tolerated. You need at least 3 servers.
> 2017-07-14 19:43:25,583 - INFO  [main:DatadirCleanupManager@78] -
> autopurge.snapRetainCount set to 3
> 2017-07-14 19:43:25,583 - INFO  [main:DatadirCleanupManager@79] -
> autopurge.purgeInterval set to 0
> 2017-07-14 19:43:25,583 - INFO  [main:DatadirCleanupManager@101] - Purge
> task is not scheduled.
> 2017-07-14 19:43:25,584 - INFO  [main:ManagedUtil@46] - Log4j found with
> jmx enabled.
> 2017-07-14 19:43:25,594 - INFO  [main:QuorumPeerMain@138] - Starting
> quorum
> peer
> 2017-07-14 19:43:25,617 - INFO  [main:Log@186] - Logging initialized
> @388ms
> 2017-07-14 19:43:25,661 - WARN  [main:ContextHandler@1339] -
> o.e.j.s.ServletContextHandler@6d78f375{/,null,null} contextPath ends with
> /*
> 2017-07-14 19:43:25,661 - WARN  [main:ContextHandler@1350] - Empty
> contextPath
> 2017-07-14 19:43:25,673 - INFO  [main:QuorumPeer@1349] - Local sessions
> disabled
> 2017-07-14 19:43:25,673 - INFO  [main:QuorumPeer@1360] - Local session
> upgrading disabled
> 2017-07-14 19:43:25,673 - INFO  [main:QuorumPeer@1327] - tickTime set to
> 2000
> 2017-07-14 19:43:25,673 - INFO  [main:QuorumPeer@1371] - minSessionTimeout
> set to 4000
> 2017-07-14 19:43:25,674 - INFO  [main:QuorumPeer@1382] - maxSessionTimeout
> set to 4
> 2017-07-14 19:43:25,674 - INFO  [main:QuorumPeer@1397] - initLimit set to
> 10
> 2017-07-14 19:43:25,685 - ERROR [main:QuorumPeerMain@98] - Unexpected
> exception, exiting abnormally
> java.lang.RuntimeException: My id 2 not in the peer list
> at org.apache.zookeeper.server.quorum.QuorumPeer.start(
> QuorumPeer.java:770)
> at
> org.apache.zookeeper.server.quorum.QuorumPeerMain.
> runFromConfig(QuorumPeerMain.java:185)
> at
> org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(
> QuorumPeerMain.java:120)
> at
> org.apache.zookeeper.server.quorum.QuorumPeerMain.main(
> QuorumPeerMain.java:79)
>
> What am I doing wrong? Should the second server reach the first one, get
> the list of the other servers in the ensemble and join it?
> Or I have to implement an automation on top of this?
> Regards
> L.
>
>
> On Fri, Jul 14, 2017 at 11:07 AM, Alexander Shraer <shra...@gmail.com>
> wrote:
>
> > I'd suggest to use 3.5.3. ZK only officially supports a Java and C client
> > as far as I know. I know these two support it,
> > not sure if anyone ported it to other clients.
> >
> > Alex
> >
> >
> > On Fri, Jul 14, 2017 at 11:04 AM, Luigi Tagliamonte <
> > luigi.tagliamont...@gmail.com> wrote:
> >
> > > Hello again Alexander,
> > > so only Java and C clients support the new zk node discovery? right?
> > > Is there any specific version to use to be able to use this feature?
> > > Regards
> > > L.
> > >
> > > On Fri, Jul 14, 2017 at 10:37 AM, Luigi Tagliamonte <
> > > luigi.tagliamont...@gmail.com> wrote:
> > >
> > > > Hello Alexander,
> > > > thank you for the link I read the comment and the white paper and it
> > > seems
> > > > really promising.
> > > I found though that Kafka isn't yet able to automatically reconfigure
> > > its zk nodes list... do you happen to know differently?
> > > > Regards
> > > > L.
> > > >
> > >
> >
>


Re: gracefully remove a node from the ensamble

2017-07-14 Thread Alexander Shraer
I'd suggest to use 3.5.3. ZK only officially supports a Java and C client
as far as I know. I know these two support it,
not sure if anyone ported it to other clients.

Alex


On Fri, Jul 14, 2017 at 11:04 AM, Luigi Tagliamonte <
luigi.tagliamont...@gmail.com> wrote:

> Hello again Alexander,
> so only Java and C clients support the new zk node discovery? right?
> Is there any specific version to use to be able to use this feature?
> Regards
> L.
>
> On Fri, Jul 14, 2017 at 10:37 AM, Luigi Tagliamonte <
> luigi.tagliamont...@gmail.com> wrote:
>
> > Hello Alexander,
> > thank you for the link I read the comment and the white paper and it
> seems
> > really promising.
> > I found though that Kafka isn't yet able to automatically reconfigure its
> > zk nodes list... do you happen to know differently?
> > Regards
> > L.
> >
>


Re: gracefully remove a node from the ensamble

2017-07-14 Thread Alexander Shraer
Well, I totally understand you. But on the other hand - I know that the
dynamic membership code has been running in production since 2012: link
<https://issues.apache.org/jira/browse/ZOOKEEPER-107?focusedCommentId=13566886=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13566886>
Of course it might have problems, but its been out for a while and I don't
think its less stable than 3.4. That's my personal opinion though :)

Cheers,
Alex

On Fri, Jul 14, 2017 at 9:26 AM, Luigi Tagliamonte <
luigi.tagliamont...@gmail.com> wrote:

> Thank you, Alexander!!!
> I'm wondering if it would be a good idea to use 3.5 instead of 3.4... but
> since it is a beta I'm afraid to use it in production.
> I'm a Cassandra user and I'm basically looking for the same level of
> reliability and orchestration I have there.
> Thank you!!
> Regards
> L.
>
> On Thu, Jul 13, 2017 at 6:19 PM, Alexander Shraer <shra...@gmail.com>
> wrote:
>
> > Hi Luigi,
> >
> > In 3.5.X yes: https://zookeeper.apache.org/doc/trunk/zookeeperReconfig.
> > html
> >
> > For previous releases (3.4 etc) you would need to do a rolling restart,
> > where for each server you change the config file to exclude that member
> > and bounce the server. Preferably do this one server at a time, and let
> the
> > ensemble be operational before bouncing the next server. And bounce
> > the leader last. I wouldn't call this gracefully though :)
> >
> >
> > Alex
> >
> > On Thu, Jul 13, 2017 at 3:22 PM, Luigi Tagliamonte <
> > luigi.tagliamont...@gmail.com> wrote:
> >
> > > Hello all!
> > > is there any document that describes how to remove a zk node from the
> > > ensemble?
> > >
> >
>


Re: gracefully remove a node from the ensamble

2017-07-13 Thread Alexander Shraer
Hi Luigi,

In 3.5.X yes: https://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html

For previous releases (3.4 etc) you would need to do a rolling restart,
where for each server you change the config file to exclude that member
and bounce the server. Preferably do this one server at a time, and let the
ensemble be operational before bouncing the next server. And bounce
the leader last. I wouldn't call this gracefully though :)


Alex

On Thu, Jul 13, 2017 at 3:22 PM, Luigi Tagliamonte <
luigi.tagliamont...@gmail.com> wrote:

> Hello all!
> is there any document that describes how to remove a zk node from the
> ensemble?
>


Re: New to zookeeper

2017-07-12 Thread Alexander Shraer
Just a small comment - 3.5.3 is in beta. The getConfig API returns a list
of servers in the cluster, including their ports and roles in the ensemble.


Alex
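
A minimal sketch of that API with the 3.5 Java client (the returned bytes
are the same server.N lines the CLI 'config' command prints):

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ShowEnsemble {
    // Reads the cluster membership via the client API; 'false' = no watch.
    static void printMembers(ZooKeeper zk) throws Exception {
        Stat stat = new Stat();
        byte[] config = zk.getConfig(false, stat);
        System.out.println(new String(config));
    }
}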

On Wed, Jul 12, 2017 at 7:53 AM, Washko, Daniel  wrote:

> I speak strictly from my experience with Zookeeper and not an any official
> capacity of the project or of exhibitor.
>
> Exhibitor works great and allows you to easily automate clustering
> zookeeper nodes into an ensemble and discovering the individual nodes in
> the ensemble via an http call. We ran into a problem, though, after we
> implemented Exhibitor across our infrastructure. Every so often our
> Zookeeper ensembles lost the data they stored. While I cannot say this was
> caused by Exhibitor, we have Solr clouds where Exhibitor was not used and
> they never had this problem. My suspicion is that there was a problem with
> a zookeeper node and Exhibitor removed that node from the ensemble then did
> a rolling restart. When that node recovered for some reason the data was
> corrupted or lost. Exhibitor pulled that node back into the ensemble and
> did a rolling restart. That node became leader and when the others joined
> synced from that. Those nodes then dumped their data stored to be in sync
> with the leader. This is my speculation, I have had a very hard time
> replicating this and have not heard of anyone else having this problem.
> Again, I am not definitively saying Exhibitor is the cause of this but
> since we removed Exhibitor this problem has not occurred.
>
> Zookeeper 3.5.x branch adds discovery functionality and does automated
> clustering. It’s great, but from what I understand is still in alpha.
>
> Prior to the 3.5.x branch I know of no way to discover what nodes are
> actually in the ensemble. The 4 letter commands will tell you whether a
> node is in an ensemble, whether it is a leader or follower, but it will not
> tell you what ensemble it is in or list any other node information. If
> someone has a way to do this please post, because I have looked all over.
>
> We make use of Scalr and that adds an additional layer to automation. I
> run orchestration scripts in Scalr that discover the other running
> zookeeper nodes in (what Scalr calls) the same Farm Role. This script
> configures each node with the information for the other nodes and does a
> restart of Zookeeper to bring them into an ensemble. Then it collects this
> information and stores the IP addresses into a Global Variable in scalr
> that is available then to Solr. Changes to the ensemble are reflected in
> this variable that is then passed to the Solr cloud where a restart of the
> service will update the zookeeper information in Solr. We are working
> towards moving this functionality to Consul where it will register the
> zookeeper ensemble information allowing Solr to pull it from Consul as
> opposed to relying on Global Variables. What I am getting at is that
> outside the 3.5.x branch, automating this takes a bit of work.
>
>
> --
> Daniel S Washko
> Solutions Architect
>
>
>
> dwas...@gannett.com  
>
> On 7/11/17, 6:58 PM, "Luigi Tagliamonte" 
> wrote:
>
> Hello, Zookeeper Users!
> I'm currently configuring/exploring zookeeper.
> I'm reading a lot about ensembles and scaling and I have some questions
> that I'd like to submit to an expert audience.
> I need zookeeper as a Kafka dependency, so my deployment goal is ensemble
> reliability, especially because the latest Kafka version uses zookeeper
> only to store the leader partition.
>
> Here are my questions:
>
> - To manage the ensemble I decided to use exhibitor - what do you think
> about it? Should I look at something else?
>
> - Is there a way to discover all the servers of an ensemble apart from
> using 4LTR? I wonder if it is possible to do something like in Cassandra,
> where you contact one node and can get the whole cluster info from it.
> Should I configure just a DNS entry per zookeeper server? This doesn't
> scale well in a dynamic env like servers in autoscaling.
>
> - is there any white paper that shows a real scalable and reliable
> Zookeeper installation? Any resources are welcome!
>
> Thank you all in advance!
> Regards
>
>
>


Re: New PMC Member: Michael Han

2017-06-27 Thread Alexander Shraer
congrats Michael!!

On Tue, Jun 27, 2017 at 6:04 PM, Gaurav Sharma  wrote:

> Congrats Michael!
>
> On Tue, Jun 27, 2017 at 09:48 Flavio Junqueira  wrote:
>
> > I'm very happy to announce that the Apache ZooKeeper PMC has voted to
> > invite Michael Han to join the PMC and Michael accepted. Michael has done
> > outstanding work in the community over the recent past and we felt it was
> > time for Michael to deepen his level of engagement by joining the PMC.
> >
> > Please join me in congratulating Michael for his achievement.
> > Congratulations, Michael!
> >
> > -Flavio
> >
> >
> >
>


Re: How to add nodes to a Zookeeper 3.5.3-beta ensemble with reconfigEnabled=false

2017-06-23 Thread Alexander Shraer
> 2. Even with the reconfig CLI, if there is no quorum, it is not possible
> to re-configue the ensemble, so one has to fall back to modify the
> ensemble through modification of zoo.cfg and restart.

Just like any other update operation in ZK, reconfig isn't available when
you lose quorum.
In your scenario reconfiguring from 5 to 2 without talking with the
disconnected 3 servers may
cause data loss. But I can understand that sometimes this may be needed. I
think you should
be able to change the config files to do that if you wanted to, but it may
be tricky without restarting
both at the same time since we use configuration ids to understand which
config is more up-to-date
and if you restart just one server and remove that id, the other server may
push its own config to that
second server during leader election...

On Fri, Jun 23, 2017 at 9:12 AM, Michael Han <h...@cloudera.com> wrote:

> Guillermo, thank you for reporting the issue and sharing your findings on
> the workaround.
>
> I filed https://issues.apache.org/jira/browse/ZOOKEEPER-2819 which should
> provide the expected behavior - when reconfigEnabled=false you should be
> able to do rolling restarts the old way, once that JIRA is fixed.
>
>
> On Fri, Jun 23, 2017 at 7:18 AM, Guillermo Vega-Toro <gvega...@us.ibm.com>
> wrote:
>
> > Thanks all for looking at this.
> >
> > Here is what I've found to make ensemble config changes work with
> > 3.5.3-beta and reconfigEnabled=false:
> >
> > 1. All servers must be stopped.
> > 2. In one server, make the desired changes to zoo.cfg (group, server.x,
> > weight.x), and delete the dynamicConfigFile property.
> > 3. Start the one server above. You will notice that when the server
> > starts, no dynamic file is created, as if the config is "on hold".
> > 4. In another server, make the same changes to zoo.cfg, and start the
> > server. If there is no quorum, no dynamic file will be created.
> > 5. Repeat step 4 for all servers
> >
> > Once a quorum is reached, the proposed config changes are applied, and a
> > dynamic config file will appear on the servers. If you run a 'config'
> > command on the zkCLI, the desired configuration will show.
> >
> > Also, once quorum is reached, further ensemble changes to zoo.cfg
> followed
> > by restart of a single server are ignored. It is necessary to stop all
> > servers and do the steps above to make any changes to the ensemble.
> >
> > Thanks,
> >
> > Alexander Shraer <shra...@gmail.com> wrote on 06/23/2017 01:20:47 AM:
> >
> > > From: Alexander Shraer <shra...@gmail.com>
> > > To: user@zookeeper.apache.org
> > > Date: 06/23/2017 01:21 AM
> > > Subject: Re: How to add nodes to a Zookeeper 3.5.3-beta ensemble
> > > with reconfigEnabled=false
> > >
> > > I'm not sure it's necessary for backward compatibility since rolling
> > > restarts for config changes are not really an api the system provides.
> > >
> > > I'd think the ACL control and admin only API should be sufficient for
> > > security and would prefer to get rid of the flag. But if you must have
> > it,
> > > we have to prevent both in memory config updates (most important) and
> > > config file updates if reconfig is disabled. This sounds like a small
> > > change in quorumpeer, but perhaps I'm forgetting something.
> > >
> > > Cheers
> > > Alex
> > >
> > >
> > > On Thu, Jun 22, 2017 at 11:06 PM Michael Han <h...@cloudera.com>
> wrote:
> > >
> > > > Hi Alex, thanks for clarification!
> > > >
> > > > It makes sense to me that users should use reconfig instead of
> rolling
> > > > upgrade moving forward. The only concern is backward compatibility
> now
> > we
> > > > drop the old rolling upgrade support: since 3.5.x needs to be
> backward
> > > > compatible with 3.4.x [1], I think we probably need support rolling
> > upgrade
> > > > for 3.5 branch.
> > > >
> > > > As for this flag - I believe it's there and set to false because
> > reconfig
> > > > is a security sensitive feature and for such features user has to opt
> > in
> > > > explicitly. Our security team here also has similar recommendations
> > when I
> > > > talked with them about what this feature could do. There are also
> some
> > > > discussions around this flag / why it's there in ZOOKEEPER-2014.
> > > >
> > > > [1]
> > > >
> > https://cwiki.apache.org/confluence/display/ZOOKEEPER

Re: How to add nodes to a Zookeeper 3.5.3-beta ensemble with reconfigEnabled=false

2017-06-23 Thread Alexander Shraer
I'm not sure it's necessary for backward compatibility, since rolling
restarts for config changes are not really an API the system provides.

I'd think the ACL control and admin-only API should be sufficient for
security, and I would prefer to get rid of the flag. But if you must have it,
we have to prevent both in-memory config updates (most important) and
config file updates when reconfig is disabled. This sounds like a small
change in QuorumPeer, but perhaps I'm forgetting something.
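A minimal sketch of such a guard (method and helper names are
illustrative, not an actual patch):

    // illustrative only: when reconfig is disabled, ignore any
    // configuration pushed by the leader, so neither the in-memory
    // view nor the on-disk config files are touched
    boolean processReconfig(QuorumVerifier proposed) {
        if (!reconfigEnabled) {
            return false;
        }
        applyAndPersist(proposed); // hypothetical helper
        return true;
    }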

Cheers
Alex


On Thu, Jun 22, 2017 at 11:06 PM Michael Han <h...@cloudera.com> wrote:

> Hi Alex, thanks for clarification!
>
> It makes sense to me that users should use reconfig instead of rolling
> upgrade moving forward. The only concern is backward compatibility now we
> drop the old rolling upgrade support: since 3.5.x needs to be backward
> compatible with 3.4.x [1], I think we probably need support rolling upgrade
> for 3.5 branch.
>
> As for this flag - I believe it's there and set to false because reconfig
> is a security sensitive feature and for such features user has to opt in
> explicitly. Our security team here also has similar recommendations when I
> talked with them about what this feature could do. There are also some
> discussions around this flag / why it's there in ZOOKEEPER-2014.
>
> [1]
> https://cwiki.apache.org/confluence/display/ZOOKEEPER/ReleaseManagement
>
>
> On Thu, Jun 22, 2017 at 10:39 PM, Alexander Shraer <shra...@gmail.com>
> wrote:
>
> > Hi Michael,
> >
> > The described behavior is the intended one - in 3.5 configuration is part
> > of the synced state and is updated
> > when the server syncs with the leader. The only rolling upgrade I tested
> > was to upgrade the software version
> > of the servers - this should still work. But I didn't try to support
> > rolling upgrade for upgrading the configuration,
> > since this should be done through reconfig.
> >
> > I'm still not sure what's the purpose of this flag btw. Why would someone
> > want to do rolling restarts which are prone
> > to inconsistencies and data loss, when they can use reconfig ?
> >
> > Alex
> >
> >
> >
> >
> > On Thu, Jun 22, 2017 at 10:18 PM, Michael Han <h...@cloudera.com> wrote:
> >
> > > reconfigEnabled only disables the reconfig command when
> > > reconfigEnabled=false; it does not disable the feature by muting all
> > > code paths of the reconfig feature introduced in ZOOKEEPER-107. So
> > > regardless of the value of reconfigEnabled, 3.5.x ZK will create a
> > > static config file and a dynamic config file in any case.
> > >
> > > This might create a problem for users who want to do a rolling upgrade
> > > the old way - because now the critical config information is not stored
> > > in zoo.cfg anymore, and modifying the cfg.dynamic file manually will not
> > > work because a reconfig needs to go through the quorum processors. I
> > > think this is the problem described in the thread.
> > >
> > > Alex, is reconfig compatible with rolling upgrade? I don't find anything
> > > mentioned in ZOOKEEPER-107 about this. Currently I think the answer is
> > > no, which means for 3.5.x the only way to change membership of the
> > > cluster is through reconfig. Could you confirm this conclusion? If that
> > > is the case, we need to patch reconfigEnabled so it completely disables
> > > all code paths of the reconfig feature, leaving the static zoo.cfg
> > > intact.
> > >
> > >
> > > On Thu, Jun 22, 2017 at 9:35 PM, Alexander Shraer <shra...@gmail.com>
> > > wrote:
> > >
> > > > This sounds like a bug in the implementation of reconfigEnabled.
> > > > Could you please open a JIRA with the description you provided ?
> > > >
> > > > Out of curiosity, why do you disable reconfig ? It is intended
> exactly
> > > > to perform the changes you're trying to make, in a simple and correct
> > > way.
> > > >
> > > > Thanks,
> > > > Alex
> > > >
> > > > On Thu, Jun 22, 2017 at 3:17 PM, Guillermo Vega-Toro <
> > > gvega...@us.ibm.com>
> > > > wrote:
> > > >
> > > > > I'm still unable to make configuration changes when
> > > reconfigEnabled=false
> > > > > by updating zoo.cfg and restarting the servers.
> > > > >
> > > > > For example, I want to change the weight of one of my servers. I
> edit
> > > > > zoo.cfg on the server I want t

Re: How to add nodes to a Zookeeper 3.5.3-beta ensemble with reconfigEnabled=false

2017-06-22 Thread Alexander Shraer
Hi Michael,

The described behavior is the intended one - in 3.5 the configuration is
part of the synced state and is updated when the server syncs with the
leader. The only rolling upgrade I tested was upgrading the software
version of the servers - this should still work. But I didn't try to
support rolling upgrade for upgrading the configuration, since this
should be done through reconfig.

I'm still not sure what the purpose of this flag is, btw. Why would
someone want to do rolling restarts, which are prone to inconsistencies
and data loss, when they can use reconfig?

Alex




On Thu, Jun 22, 2017 at 10:18 PM, Michael Han <h...@cloudera.com> wrote:

> reconfigEnabled only disables the reconfig command when
> reconfigEnabled=false;
> it does not disable the feature by muting all code paths of the reconfig
> feature introduced in ZOOKEEPER-107. So regardless of the value of
> reconfigEnabled,
> 3.5.x ZK will create a static config file and a dynamic config file in
> any case.
>
> This might create a problem for users who want to do a rolling upgrade the
> old way - because now the critical config information is not stored in
> zoo.cfg anymore, and modifying the cfg.dynamic file manually will not work
> because a reconfig needs to go through the quorum processors. I think this
> is the problem described in the thread.
>
> Alex, is reconfig compatible with rolling upgrade? I don't find anything
> mentioned in ZOOKEEPER-107 about this. Currently I think the answer is no,
> which means for 3.5.x the only way to change membership of the cluster is
> through reconfig. Could you confirm this conclusion? If that is the case,
> we need to patch reconfigEnabled so it completely disables all code paths
> of the reconfig feature, leaving the static zoo.cfg intact.
>
>
> On Thu, Jun 22, 2017 at 9:35 PM, Alexander Shraer <shra...@gmail.com>
> wrote:
>
> > This sounds like a bug in the implementation of reconfigEnabled.
> > Could you please open a JIRA with the description you provided ?
> >
> > Out of curiosity, why do you disable reconfig ? It is intended exactly
> > to perform the changes you're trying to make, in a simple and correct
> way.
> >
> > Thanks,
> > Alex
> >
> > On Thu, Jun 22, 2017 at 3:17 PM, Guillermo Vega-Toro <
> gvega...@us.ibm.com>
> > wrote:
> >
> > > I'm still unable to make configuration changes when
> reconfigEnabled=false
> > > by updating zoo.cfg and restarting the servers.
> > >
> > > For example, I want to change the weight of one of my servers. I edit
> > > zoo.cfg on the server I want to change, and specify the group,
> server.x,
> > > and weight.x properties for all servers. I also remove the
> > > dynamicConfigFile property and delete the dynamic config file. I then
> > > restart the server. As soon as the server starts, the dynamic config
> file
> > > re-appears, and it has the last configuration, as if the changes I made
> > in
> > > zoo.cfg were ignored. The dynamic configuration file on the other
> servers
> > > also stays the same.
> > >
> > > What would be the correct way to achieve this (change a server's
> weight,
> > > or role) when reconfigEnabled=false and the CLI reconfig command cannot
> > be
> > > used?
> > >
> > > Thanks
> > >
> > >
> >
>
>
>
> --
> Cheers
> Michael.
>


Re: How to add nodes to a Zookeeper 3.5.3-beta ensemble with reconfigEnabled=false

2017-06-22 Thread Alexander Shraer
This sounds like a bug in the implementation of reconfigEnabled.
Could you please open a JIRA with the description you provided?

Out of curiosity, why do you disable reconfig? It is intended exactly
to perform the changes you're trying to make, in a simple and correct way.

Thanks,
Alex

On Thu, Jun 22, 2017 at 3:17 PM, Guillermo Vega-Toro 
wrote:

> I'm still unable to make configuration changes when reconfigEnabled=false
> by updating zoo.cfg and restarting the servers.
>
> For example, I want to change the weight of one of my servers. I edit
> zoo.cfg on the server I want to change, and specify the group, server.x,
> and weight.x properties for all servers. I also remove the
> dynamicConfigFile property and delete the dynamic config file. I then
> restart the server. As soon as the server starts, the dynamic config file
> re-appears, and it has the last configuration, as if the changes I made in
> zoo.cfg were ignored. The dynamic configuration file on the other servers
> also stays the same.
>
> What would be the correct way to achieve this (change a server's weight,
> or role) when reconfigEnabled=false and the CLI reconfig command cannot be
> used?
>
> Thanks
>
>


Re: [ANNOUNCE] Apache ZooKeeper 3.5.3-beta

2017-04-20 Thread Alexander Shraer
The issue Patrick was referring to is described here:
https://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html#ch_reconfig_upgrade

On Thu, Apr 20, 2017 at 9:22 AM, Patrick Hunt  wrote:

> On Thu, Apr 20, 2017 at 9:12 AM, Michael Han  wrote:
>
> > Just to clarify, 4LW feature is not removed in latest releases (3.4.10
> and
> > 3.5.3-beta). The feature is still there, it's just disabled by default.
> You
> > can enable the feature if you need it (details in the admin documents).
> > Because of the compatibility guarantees provided by ZooKeeper, we
> wouldn't
> > just remove a feature lightly.
> >
> >
> 3.5 also adds json support through jetty/http, a significant improvement
> over 4lw.
>
> Patrick
>
>
> > On Thu, Apr 20, 2017 at 6:53 AM, Ben Sherman 
> wrote:
> >
> > > Thanks for the 4lw warning - I was going to upgrade to 3.4.10 today but
> > > didn't expect features to be removed.  It's a shame they are going
> away,
> > > human readable output from a one line script was a nice feature to have
> > by
> > > default.
> > >
> > > On Wed, Apr 19, 2017 at 6:02 PM, Michael Han 
> wrote:
> > >
> > > > >> pitfalls coming from 3.4.9 (or .10) to the 3.5.x release?
> > > > If coming from 3.4.9, one note is that all four-letter words except
> > > > srvr are disabled by default in 3.5.3, so your devops tools, if they
> > > > depend on 4lw, will stop working (one user already reported this on
> > > > JIRA), which is expected. In this case you can either update the
> > > > configuration to enable the subset of 4lw you need, or use modern
> > > > monitoring primitives provided by ZK (JMX / Jetty admin server). If
> > > > coming from 3.4.10 then it's fine, since 3.4.10 made the same change
> > > > to 4lw (disable by default).
> > > >
> > > > >> make a change to 3.4.x (x>0) in order to maintain backward compat
> > > > >> with a change that we made to 3.5.
> > > > Not sure if it's ZOOKEEPER-1633. Basically a rolling upgrade would not
> > > > work from 3.4.x to 3.5.y if x < 6.
> > > >
> > > > On Wed, Apr 19, 2017 at 5:16 PM, Patrick Hunt 
> > wrote:
> > > >
> > > > > I remember we had to make a change to 3.4.x (x>0) in order to
> > maintain
> > > > > backward compat with a change that we made to 3.5. I searched but I
> > > can't
> > > > > remember the specific jira or the specific release, it was some
> time
> > > ago.
> > > > > The issue would be that if you try and do a rolling upgrade from
> > > 3.4.x-1
> > > > to
> > > > > 3.5.y it had the potential to fail. Perhaps one of the other
> > community
> > > > > folks will remember. Other than that I'm not aware of anything. The
> > on
> > > > disk
> > > > > formats are the same and the communication protocols should be b/w
> > > > compat.
> > > > > I tried running 3.4 client against 3.5.3 during the last release
> and
> > it
> > > > > worked ok for me. Not sure if anyone has been testing at the quorum
> > > > level.
> > > > >
> > > > > If anyone does find something (or tests and finds it works) please
> > let
> > > us
> > > > > know so that we can document it.
> > > > >
> > > > > Patrick
> > > > >
> > > > > On Wed, Apr 19, 2017 at 4:14 PM, Ben Sherman  >
> > > > wrote:
> > > > >
> > > > > > Great news, are there any docs written yet or any known pitfalls
> > > coming
> > > > > > from 3.4.9 (or .10) to the 3.5.x release?
> > > > > >
> > > > > > On Mon, Apr 17, 2017 at 10:48 AM, Michael Han  >
> > > > wrote:
> > > > > >
> > > > > > > The Apache ZooKeeper team is proud to announce Apache ZooKeeper
> > > > version
> > > > > > > *3.5.3-beta*.
> > > > > > >
> > > > > > > ZooKeeper is a high-performance coordination service for
> > > distributed
> > > > > > > applications. It exposes common services - such as naming,
> > > > > > > configuration management, synchronization, and group services -
> > in
> > > a
> > > > > > > simple interface so you don't have to write them from scratch.
> > You
> > > > can
> > > > > > > use it off-the-shelf to implement consensus, group management,
> > > leader
> > > > > > > election, and presence protocols. And you can build on it for
> > your
> > > > > > > own, specific needs.
> > > > > > >
> > > > > > > For ZooKeeper release details and downloads, visit:
> > > > > > > https://zookeeper.apache.org/releases.html
> > > > > > >
> > > > > > > ZooKeeper 3.5.3-beta Release Notes are at:
> > > > > > > https://zookeeper.apache.org/doc/r3.5.3-beta/releasenotes.html
> > > > > > >
> > > > > > > We would like to thank the contributors that made the release
> > > > possible.
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > The ZooKeeper Team
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Cheers
> > > > Michael.
> > > >
> > >
> >
> >
> >
> > --
> > Cheers
> > Michael.
> >
>


Re: Zookeeper Ensemble Automation

2017-01-05 Thread Alexander Shraer
Since configuration info is stored in a znode, you could access it using a
simple get operation. The getconfig operation is basically doing just that.
So if you have a 3.5 server and a 3.4 client, the client should be able to
read the list of servers and get notified when the list changes by setting
a watch.

The 3.5 client has an updateServerList operation, which allows you to
create a ZK handle with one list and later update the list.
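A minimal Java sketch of the read-and-watch pattern (the znode path is the
standard /zookeeper/config; the watcher wiring and error handling are
illustrative):

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // re-reads the membership znode whenever it changes,
    // re-arming the watch on every read
    class ConfigWatcher implements Watcher {
        private final ZooKeeper zk;

        ConfigWatcher(ZooKeeper zk) { this.zk = zk; }

        void readConfig() throws Exception {
            Stat stat = new Stat();
            byte[] data = zk.getData("/zookeeper/config", this, stat);
            System.out.println("servers (version " + stat.getVersion()
                    + "): " + new String(data));
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
                try { readConfig(); } catch (Exception e) { /* log/retry */ }
            }
        }
    }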


Thanks,
Alex

On Thu, Jan 5, 2017 at 12:41 PM, Shawn Heisey  wrote:

> On 1/5/2017 11:19 AM, Washko, Daniel wrote:
> > Thanks for the reply Shawn. I would like to clarify something though.
> > Right now, the Dynamic Reconfiguration of Zookeeper works for
> > Zookeeper – that is adding/removing nodes automatically without having
> > to reconfigure each zookeeper node manually. Once Zookeeper is out of
> > Alpha then Solr will be updated to take advantage of the Dynamic
> > Reconfiguration capability of Zookeeper and auto-discover any changes.
> > Is that correct?
>
> I am not sure whether my understanding is correct, but if it is, then I
> don't think a zookeeper 3.4.x client (like the one in Solr) will notice
> that the server list (with servers running 3.5.x) has changed.
> Depending on exactly how the membership changed, the SolrCloud instance
> might not be able to maintain a viable ZK quorum.  If it loses quorum,
> SolrCloud goes read-only.
>
> After ZK 3.5 goes through the beta phase and reaches stable, then Solr
> will get the upgrade, and we will make sure that the dynamic
> reconfiguration works.  It's a feature that we definitely want, though
> we may wait for the second or third stable release before we upgrade to
> be absolutely certain that it's solid.
>
> There are a couple of questions I do not know the answer to:  1) Whether
> any code changes will be required in Solr to take advantage of dynamic
> reconfiguration after the dependency upgrade.  2) Whether a Solr
> instance with the 3.5 client could be told about only one ZK server,
> then discover the whole cluster and connect to all the servers.  Can a
> more knowledgeable member of this community answer these questions for me?
>
> Thanks,
> Shawn
>
>


Re: november meetup at facebook (take 2)

2016-09-30 Thread Alexander Shraer
+1 for me too, thanks!

On Fri, Sep 30, 2016 at 3:18 PM, Ryan Zhang 
wrote:

> +1. My coworkers in twitter would be interested.
>
> > On Sep 30, 2016, at 2:35 PM, Raúl Gutiérrez Segalés 
> wrote:
> >
> > +1 (probably bringing along some people from Pinterest as well).
> >
> > -rgs
> >
> > On Sep 30, 2016 2:26 PM, "Marshall McMullen" <
> marshall.mcmul...@gmail.com>
> > wrote:
> >
> > +1. I would love to attend along with a few of my coworkers and this date
> > works for us.
> >
> > On Fri, Sep 30, 2016 at 3:10 PM, Benjamin Reed  wrote:
> >
> >> facebook would like to host a zookeeper meetup in our offices in menlo
> >> park, ca on november 17th (a thursday). before sending out an official
> >> invitation with details about logistics, i thought i would first do a
> > quick
> >> date check and make sure that there isn't a big scheduling conflict that
> > we
> >> didn't notice (like a big election or something like that...). it's a
> bit
> >> tricky to book facilities here, so we don't have a lot of options on
> > dates.
> >>
> >> would this date work for most people?
> >>
> >> thanx
> >> ben
> >>
> >> ps - should i cross post to dev@? i assume that most subscribers of
> user@
> >> also subscribe to dev@
> >>
>
>


Re: Error Start ZK 3.5.1 a second time

2016-07-14 Thread Alexander Shraer
Definitely sounds like a bug :) Could you please open a JIRA?
And if you can upload a patch, that would be very appreciated. The
relevant code should be in QuorumPeerConfig.java:

String dynamicConfigFilePath = PathUtils.normalizeFileSystemPath
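For what it's worth, the stripped separators are exactly what
java.util.Properties escape handling produces; a standalone demo
(illustrative, not the ZooKeeper code path):

    import java.io.StringReader;
    import java.util.Properties;

    public class PropsEscapeDemo {
        public static void main(String[] args) throws Exception {
            Properties p = new Properties();
            // Properties.load() treats '\' as an escape character, so
            // unescaped Windows separators are silently dropped on load
            p.load(new StringReader(
                "dynamicConfigFile=D:\\zookeeper1-3.5.1\\conf\\zoo.cfg.dynamic.1"));
            System.out.println(p.getProperty("dynamicConfigFile"));
            // prints: D:zookeeper1-3.5.1confzoo.cfg.dynamic.1
        }
    }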

Thanks,
Alex

On Thu, Jul 14, 2016 at 8:36 AM, Cantrell, Curtis 
wrote:

> Ok.  I'm on a windows machine.   All I had to do was to open the zoo.cfg
> that had been written by zookeeper when the backup was created and turn the
> slashes around, and now the server comes up...
>
> dynamicConfigFile=D:/zookeeper1-3.5.1/conf/zoo.cfg.dynamic.1
>
> It looks like there is something that cannot handle a forward slash and
> zookeeper is writing the forward slash itself.
>
> Thank you,
> All solved  but should I open a ticket?  Is this a bug?
>
> Thank you,
> Curtis
>
>
> -Original Message-
> From: Cantrell, Curtis [mailto:curtis.cantr...@bkfs.com]
> Sent: Thursday, July 14, 2016 11:06 AM
> To: user@zookeeper.apache.org
> Subject: RE: Error Start ZK 3.5.1 a second time
>
> I looks like a file separator issue when reading where the
> dynamicConfigFile is location from the zoo.cfg
>
> This is what is written in the zoo.cfg
>
>
>  dynamicConfigFile=D:\zookeeper3-3.5.1\conf\zoo.cfg.dynamic.1
>
> But this is the complaint of the FileNotFoundException on startup.
>
>Caused by: java.io.FileNotFoundException:
> D:zookeeper1-3.5.1confzoo.cfg.dynamic.1 (The system cannot find the
> file specified)
>
> Is there a file separator problem?
>
> Thank you,
> Curtis
>
>
>
> The information contained in this message is proprietary and/or
> confidential. If you are not the intended recipient, please: (i) delete the
> message and all copies; (ii) do not disclose, distribute or use the message
> in any manner; and (iii) notify the sender immediately. In addition, please
> be aware that any message addressed to our domain is subject to archiving
> and review by persons other than the intended recipient. Thank you.
> The information contained in this message is proprietary and/or
> confidential. If you are not the intended recipient, please: (i) delete the
> message and all copies; (ii) do not disclose, distribute or use the message
> in any manner; and (iii) notify the sender immediately. In addition, please
> be aware that any message addressed to our domain is subject to archiving
> and review by persons other than the intended recipient. Thank you.
>


Re: how is zookeeper deploy at multi datacenter?

2016-06-29 Thread Alexander Shraer
our recent paper may be relevant:
https://www.usenix.org/conference/atc16/technical-sessions/presentation/lev-ari

On Wed, Jun 29, 2016 at 10:04 PM, chen dongming 
wrote:

> How many ways are there to deploy at multiple datacenters for backup?
>
>  From my point of view:
>
> 1. use observers
>
>  use only 1 ensemble
>
>  one datacenter as the main datacenter, with the leader and followers
>
>  other datacenters only with 3 observers
>
>  When the main datacenter crashes, select one datacenter as the new main
> datacenter, and convert its observers to leader/followers manually.
>
> 2. sync data at the app level
>
>  use multiple ensembles, one per datacenter
>
>  sync data at the app level, and the app makes sure there are no data
> conflicts between ensembles.
>
> Is there any other way to deploy multiple datacenters for backup?
>
> At last, I noticed that issue ZOOKEEPER-892 was discontinued - why? And is
> zoorepl suitable for multi-datacenter backup?
>
>
>
>


Re: read under transaction

2016-06-28 Thread Alexander Shraer
But these writes can be conditional (on the version of the data), which
could probably be used to achieve what you need.
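A minimal Java sketch of that pattern (the path and the update logic are
illustrative):

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // optimistic read-modify-write: the setData succeeds only if the
    // znode still has the version we read; otherwise re-read and retry
    static void casUpdate(ZooKeeper zk, String path) throws Exception {
        while (true) {
            Stat stat = new Stat();
            byte[] current = zk.getData(path, false, stat);
            byte[] updated = transform(current); // hypothetical app logic
            try {
                zk.setData(path, updated, stat.getVersion());
                return;
            } catch (KeeperException.BadVersionException e) {
                // someone else wrote in between; loop and re-read
            }
        }
    }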

On Tue, Jun 28, 2016 at 11:33 AM, Patrick Hunt  wrote:

> Multi is more of an atomic operation than a "transaction" in the typical
> sense. See https://issues.apache.org/jira/browse/ZOOKEEPER-965 for some
> background. I don't believe the original use case involved reading multiple
> znodes, rather updating multiple.
>
> Patrick
>
> On Mon, Jun 20, 2016 at 2:33 PM, Denis Samoilov 
> wrote:
>
> > hi,
> > I see that there is multi() function to write data under transaction. But
> > it accepts only mutation operations. Is it possible to read under
> > transaction somehow (so data will be consistent)?
> >
> > Thank you!
> >
>


Re: observer changing to participant when there is no quorum

2016-06-15 Thread Alexander Shraer
ZooKeeper only works when a majority of the participants are up.
Since 2 out of 3 participants in your ensemble are down, ZooKeeper won't
allow you to issue any commands, including a reconfiguration.
You should have enough participants such that a situation where a majority
is simultaneously down doesn't happen.
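(Once enough participants are back up to form a quorum, the conversion
itself is a remove followed by an add from the CLI; the server id, host,
and ports below are illustrative:

    reconfig -remove 4
    reconfig -add server.4=host4:2888:3888:participant;2181
)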


On Wed, Jun 15, 2016 at 4:33 PM, Nomar Morado <j316servi...@icloud.com>
wrote:

> Corrected typos
>
> Printing e-mails wastes valuable natural resources. Please don't print
> this message unless it is absolutely necessary. Thank you for thinking
> green!
>
> Sent from my iPhone
>
> > On Jun 15, 2016, at 9:31 AM, Nomar Morado <j316servi...@icloud.com>
> wrote:
> >
> > This is what I have done so far:
> >
> > A,B,C are participants
> > C,D are observers
> >
> > B,C are offline (crashed)
> >
> > I am trying to:
> >
> > 1. Remove C, D
> > 2. Add C,D back as participants
> >
> > Will this work?
> >
> > At least in my testing (I might be doing it wrong) I am getting this
> > error on the first step and hence can't move forward:
> >
> > Client could not connect to reestablished  quorum: giving up after 30+
> seconds
> >
> >
> > I am passing the original server configure string to zk's reconfig
> method.
> >
> >
> >
> > Thanks
> >
> >
> > Printing e-mails wastes valuable natural resources. Please don't print
> this message unless it is absolutely necessary. Thank you for thinking
> green!
> >
> > Sent from my iPhone
> >
> >> On Jun 14, 2016, at 10:55 PM, Alexander Shraer <shra...@gmail.com>
> wrote:
> >>
> >> Right, a quorum of participants from the old config is required to
> process
> >> any command, including reconfig,
> >> and a quorum of participants from the new config is required for the
> >> reconfig to even start. If there's no such connected
> >> quorum an error NewConfigNoQuorum will be thrown.
> >>
> >> But there is one slightly confusing case where the error is thrown,
> which
> >> is explained in the doc: when you are
> >> converting an observer to a participant and there is no quorum in the
> new
> >> config without counting that "future" participant.
> >> So the server is connected, but since its not a participant we get the
> >> error above.  In that case, one first needs to
> >> remove the observer and then add it back as a participant. The
> >> detailed explanation is in the doc, look for
> >> "Changing an observer into a follower".
> >>
> >> On Wed, Jun 15, 2016 at 1:17 AM, Camille Fournier <cami...@apache.org>
> >> wrote:
> >>
> >>> I'm finding the documentation quite confusing. I was under the
> impression
> >>> that quorum of some sort was needed to do a reconfig. Can you reconfig
> when
> >>> there is no quorum?
> >>>
> >>> *Progress guarantees:* Up to the invocation of the reconfig operation,
> a
> >>> quorum of the old configuration is required to be available and
> connected
> >>> for ZooKeeper to be able to make progress. Once reconfig is invoked, a
> >>> quorum of both the old and of the new configurations must be available.
> >>>
> >>> *Adding servers:* Before a reconfiguration is invoked, the
> administrator
> >>> must make sure that a quorum (majority) of participants from the new
> >>> configuration are already connected and synced with the current leader.
> >>>
> >>>
> >>>
> >>> On Tue, Jun 14, 2016 at 5:35 PM, Alexander Shraer <shra...@gmail.com>
> >>> wrote:
> >>>
> >>>> This is needed only in case the target config doesn't have a quorum
> which
> >>>> are already followers in the old config
> >>>> and are up. We need agreement from a quorum of the target config, but
> >>>> observers aren't participating
> >>>> in the voting protocol.
> >>>>
> >>>>> On Tue, Jun 14, 2016 at 7:35 PM, Michael Han <h...@cloudera.com>
> wrote:
> >>>>>
> >>>>> This might help:
> >>>>> https://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html
> section
> >>>>> '*Changing
> >>>>> an observer into a follower:'*
> >>>>> "first invoke a reconfig to remove D from the configuration and then
> >>>> invoke
> >>>>> a second command to add it back as a participant (follower)."
> >>>>>
> >>>>>
> >>>>> On Tue, Jun 14, 2016 at 8:53 AM, Nomar Morado <
> nomar.mor...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi
> >>>>>>
> >>>>>> I was trying to promote an observer into participant when ZK loses
> >>>>> quorum -
> >>>>>> but it seems that it does not allow to.
> >>>>>>
> >>>>>> Would you know how this can be accomplished without having to
> recycle
> >>>> ZK?
> >>>>>>
> >>>>>> I am using 3.5.0-alpha
> >>>>>>
> >>>>>>
> >>>>>> Thanks.
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Cheers
> >>>>> Michael.
> >>>
>


Re: Zookeeper 3.5.1 dynamic configuration fails with EOFException

2016-06-10 Thread Alexander Shraer
Even if you start 2 as a follower, it may restart leader election and drop
a connection, since it learns about a more up-to-date configuration.
We didn't optimize such restarts, for simplicity.

On Fri, Jun 10, 2016 at 9:16 PM, Alexander Shraer <shra...@gmail.com> wrote:

> In this specific case, the initial failure could be explained since server
> 1 will push its config to server 2; then server 2 finds out that instead
> of an observer it must be a "non-voting follower", which causes it to
> throw an exception, load a different stack of protocols, and restart
> leader election. This may explain the failure. Then they connect normally
> and 2 becomes FOLLOWING and gets an UPTODATE message, as you see in the
> log. This looks ok.
> At the end, 2 isn't a real follower; in order to add it to the ensemble
> you need to issue a reconfig command.
>
> On Fri, Jun 10, 2016 at 3:18 PM, Sebastian Mattheis <
> sebastian.matth...@bmw-carit.de> wrote:
>
>> Zookeeper 3.5.1 (
>> https://github.com/apache/zookeeper/releases/tag/release-3.5.1) dynamic
>> configuration fails with two servers that are started one after the other
>> throwing an EOFException. This is the same, if server 2 is configured as
>> observer or participant. Is the usage wrong? Is this a known bug?
>>
>> # server.1:
>>
>> ## /opt/zookeeper/var/myid
>> 1
>>
>> ## /opt/zookeeper/conf/zoo.cfg
>> autopurge.purgeInterval=1
>> initLimit=10
>> syncLimit=5
>> autopurge.snapRetainCount=4
>> tickTime=2000
>> dataDir=/opt/zookeeper/var
>> dataLogDir=/opt/zookeeper/logs
>> standaloneEnabled=false
>> dynamicConfigFile=/opt/zookeeper/conf/zoo.cfg.dynamic
>>
>> ## /opt/zookeeper/conf/zoo.cfg.dynamic
>> server.1=192.168.99.100:2888:3888;2181
>>
>> ## Output (on connection of server.2 after start-up):
>>
>> …
>> 2016-06-09 09:22:32,309 [myid:1] - INFO
>> [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
>> :NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection
>> from /192.168.99.101:43222
>> 2016-06-09 09:22:32,378 [myid:1] - INFO
>> [NIOWorkerThread-1:ZooKeeperServer@964] - Client attempting to establish
>> new session at /192.168.99.101:43222
>> 2016-06-09 09:22:32,382 [myid:1] - INFO  [SyncThread:1:FileTxnLog@200] -
>> Creating new log file: log.10001
>> 2016-06-09 09:22:32,392 [myid:1] - INFO
>> [CommitProcWorkThread-1:ZooKeeperServer@678] - Established session
>> 0x10001af5f77 with negotiated timeout 3 for client /
>> 192.168.99.101:43222
>> 2016-06-09 09:22:32,410 [myid:1] - WARN
>> [NIOWorkerThread-1:NIOServerCnxn@365] - Unable to read additional data
>> from client sessionid 0x10001af5f77, likely client has closed socket
>> 2016-06-09 09:22:32,411 [myid:1] - INFO
>> [NIOWorkerThread-1:MBeanRegistry@119] - Unregister MBean
>> [org.apache.ZooKeeperService:name0=ReplicatedServer_id1,name1=replica.1,name2=Leader,name3=Connections,name4=192.168.99.101,name5=0x10001af5f77]
>> 2016-06-09 09:22:32,412 [myid:1] - INFO
>> [NIOWorkerThread-1:NIOServerCnxn@606] - Closed socket connection for
>> client /192.168.99.101:43222 which had sessionid 0x10001af5f77
>> 2016-06-09 09:22:33,133 [myid:1] - INFO  [/192.168.99.100:3888
>> :QuorumCnxManager$Listener@637] - Received connection request /
>> 192.168.99.101:44396
>> 2016-06-09 09:22:33,154 [myid:1] - WARN
>> [RecvWorker:2:QuorumCnxManager$RecvWorker@917] - Connection broken for
>> id 2, my id = 1, error =
>> java.io.EOFException
>> at java.io.DataInputStream.readInt(DataInputStream.java:392)
>> at
>> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:902)
>> 2016-06-09 09:22:33,158 [myid:1] - WARN
>> [RecvWorker:2:QuorumCnxManager$RecvWorker@920] - Interrupting SendWorker
>> 2016-06-09 09:22:33,159 [myid:1] - WARN
>> [SendWorker:2:QuorumCnxManager$SendWorker@834] - Interrupted while
>> waiting for message on queue
>> java.lang.InterruptedException
>> at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>> at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>> at
>> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>> at
>> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:986)
>> at
>> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.

Re: zookeeper deployment strategy for multi data centers

2016-06-03 Thread Alexander Shraer
> Is there any settings to override the quorum rule? Would you know the
rationale behind it?

The rule comes from a theoretical impossibility result saying that you
must have n > 2f replicas to tolerate f failures, for any algorithm trying
to solve consensus while being able to handle periods of asynchrony
(unbounded message delays, processing times, etc.).
The earliest proof is probably here: paper.
ZooKeeper is assuming this model, so the bound applies to it.

The intuition is what's called a 'partition argument'. Essentially, if
only 2f replicas were sufficient, you could arbitrarily divide them into 2
sets of f replicas and create a situation where each set of f must go on
independently, without coordinating with the other set (split brain), when
the links between the two sets are slow (i.e., a network partition),
simply because the other set could also be down (the algorithm tolerates f
failures) and it can't distinguish the two situations.
When n > 2f this can be avoided, since one of the sets will have a
majority while the other won't.
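A concrete instance, using the two data center scenario discussed here
(numbers are illustrative):

    n = 5 servers, split 3 + 2 across two data centers
    majority quorum = floor(5/2) + 1 = 3
    larger DC down  -> 2 survivors < 3: no quorum, no writes
    smaller DC down -> 3 survivors = 3: quorum maintained

No split of 5 voters across 2 sites survives the loss of the larger site;
tolerating the failure of either site requires a third site (e.g., 2+2+1).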

The key here is that the links between the two data centers can
arbitrarily delay messages, so an automatic 'fail-over' where one data
center decides that the other one is down is usually considered unsafe. If
in your system you have a reliable way to know that the other data center
is really in fact down (this is a synchrony assumption), you could do as
Camille suggested and reconfigure the system to only include the remaining
data center. This would still be very tricky to do, since this
reconfiguration would have to involve manually changing configuration
files and rebooting servers, while somehow making sure that you're not
losing committed state. So not recommended.



On Fri, Jun 3, 2016 at 11:30 PM, Camille Fournier 
wrote:

> 2 servers is the same as 1 server wrt fault tolerance, so yes, you are
> correct. If they want fault tolerance, they have to run 3 (or more).
>
> On Fri, Jun 3, 2016 at 4:25 PM, Shawn Heisey  wrote:
>
> > On 6/3/2016 1:44 PM, Nomar Morado wrote:
> > > Is there any settings to override the quorum rule? Would you know the
> > > rationale behind it? Ideally, you will want to operate the application
> > > even if at least one data center is up.
> >
> > I do not know if the quorum rule can be overridden, or whether your
> > application can tell the difference between a loss of quorum and
> > zookeeper going down entirely.  I really don't know anything about
> > zookeeper client code or zookeeper internals.
> >
> > From what I understand, majority quorum is the only way to be
> > *completely* sure that cluster software like SolrCloud or your
> > application can handle write operations with confidence that they are
> > applied correctly.  If you lose quorum, which will happen if only one DC
> > is operational, then your application should go read-only.  This is what
> > SolrCloud does.
> >
> > I am a committer on the Apache Solr project, and Solr uses zookeeper
> > when it is running in SolrCloud mode.  The cloud code is handled by
> > other people -- I don't know much about it.
> >
> > I joined this list because I wanted to have the ZK devs include a
> > clarification in zookeeper documentation -- oddly enough, related to the
> > very thing we are discussing.  I wanted to be sure that the
> > documentation explicitly mentioned that three servers are required for a
> > fault-tolerant setup.  Some SolrCloud users don't want to accept this as
> > a fact, and believe that two servers should be enough.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: sharing a post on ZAB architecture

2016-06-01 Thread Alexander Shraer
And here's another explanation of Zab we wrote for the reconfiguration
paper, which explains ZAB in more abstract terms (without various
optimizations),
and in a way that relates it to Paxos: Section 2 in
https://www.usenix.org/system/files/conference/atc12/atc12-final74.pdf

On Wed, Jun 1, 2016 at 3:12 PM, Patrick Hunt  wrote:

> Linking from the cwiki would be great. If you send your cwiki id to
> me/flavio one of us will give you edit privs.
> https://cwiki.apache.org/confluence/display/ZOOKEEPER/Index
> there is an existing articles page:
> https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeperArticles
>
> Regards,
>
> Patrick
>
> ps. one bit of feedback - please do make sure that you highlight that the
> project is "Apache ZooKeeper" - something as simple as the first reference
> to ZooKeeper, outside the title, being "Apache ZooKeeper" with a link to
> the home page is typically sufficient.
>
> On Tue, May 31, 2016 at 11:36 AM, Michael Han  wrote:
>
> > Sounds a great idea to me.
> >
> > BTW here is another post that might be useful for those interested in
> ZAB:
> >
> >
> http://www.easonliao.org/an-implementation-of-zookeeper-atomic-broadcast-protocol/
> >
> > On Sun, May 29, 2016 at 7:21 AM, Flavio Junqueira 
> wrote:
> >
> > > This is great, Guy, thanks for sharing! I think it would be great to
> link
> > > it from the project wiki. How do others feel about it?
> > >
> > > -Flavio
> > >
> > > > On 28 May 2016, at 16:24, Guy Moshkowich 
> > > wrote:
> > > >
> > > > I would like to share a post I wrote on ZAB architecture:
> > > >
> > > >
> > >
> >
> https://distributedalgorithm.wordpress.com/2015/06/20/architecture-of-zab-zookeeper-atomic-broadcast-protocol/
> > > <
> > >
> >
> https://distributedalgorithm.wordpress.com/2015/06/20/architecture-of-zab-zookeeper-atomic-broadcast-protocol/
> > > >
> > > >
> > > > I think it can help others who are interested in understanding ZAB
> and
> > > propose to link it to the community official documentation.
> > > > Any thoughts on this proposal?
> > > >
> > > > Guy
> > > >
> > > >
> > >
> > >
> >
> >
> > --
> > Cheers
> > Michael.
> >
>


Re: how to make a server be leader permanently

2016-05-02 Thread Alexander Shraer
If you're interested to work on something like that, a good starting point
could be
implementing a leader handoff API: ZOOKEEPER-2076

On Mon, May 2, 2016 at 4:19 AM, Flavio P JUNQUEIRA  wrote:

> We don't have this kind of behavior enabled because it'd affect
> availability. If your single leader fails, then the zookeeper ensemble
> becomes unavailable until the server comes back and might require manual
> intervention if the server is permanently down.
>
> Also, upon recovery, the configured leader might not have the most recent
> committed state,  which could cause you to lose data.
>
> The bottom line is that you can't really force this kind of behavior
> currently with zookeeper.
>
> -Flavio
> On 2 May 2016 11:53, "WangYQ"  wrote:
>
> i want to make a server with lower load be the zookeeper leader
> permanently. is there any method or configuration?
>


Re: Zookeeper with SSL release date

2016-04-01 Thread Alexander Shraer
Hi Shawn,

My proposal was in the following context - Flavio suggested adding new
flag(s)
to disable reconfig in order not to surprise users with new security
vulnerabilities
that arise from dynamic reconfiguration. My point was that we already have
such
a mechanism we could use - ACLs. But if we need to do that while also
allowing
unprotected use of reconfig for some users, perhaps a flag is a better
alternative.

I think we have some flexibility here since reconfig is a new feature, so
we could
choose to be conservative and release it first only to people who do use
ACLs, but
I don't feel strongly about it, either way.

What do you think ?  Flavio, Patrick, what's your opinion on this ?

Cheers,
Alex

On Fri, Apr 1, 2016 at 10:16 AM, Shawn Heisey  wrote:

>
> This is a potential worry even without reconfig -- a malicious person
> could change or delete the entire database ... yet many people
> (including me) run without ACLs.
>
> My ZK ensemble is in a network location that unauthorized people can't
> reach without finding and exploiting some vulnerability that has not yet
> reached my awareness.
>
> If somebody can gain access to the ZK machines, at least one of my
> public-facing servers is already compromised.  ZK will be very low on my
> list of things to worry about.  Chances


Re: Zookeeper with SSL release date

2016-04-01 Thread Alexander Shraer
Because, when using reconfig without ACLs, any client can remove the
servers (or replace them with a different set of servers, or change their
configuration parameters) and break the system.

On Fri, Apr 1, 2016 at 8:59 AM, Jason Rosenberg <j...@squareup.com> wrote:

> I think these orthogonal concerns.  Why limit reconfig to ACL users only?
>
> On Thu, Mar 31, 2016 at 11:37 PM, Alexander Shraer <shra...@gmail.com>
> wrote:
>
> > Citing Patrick:
> >
> > > If you're running zk w/o security turned on and suddenly folks can do
> > reconfig
> > > operations it's going to potentially be a problem.
> > ...
> > > Rather than force people to turn on kerberos (etc...) we could instead
> > > have the feature off
> >
> > From this I understood that the concern is mostly about users that DON'T
> > use ACLs. My proposal is to disable
> > reconfig/getconfig for all such users, forcing users who want reconfig to
> > also use ACLs. Users who do use ACLs
> > don't have to use reconfig and will have to set the ACLs on the config
> > znode before they can use it.
> >
> > In preprequestprocessor where acls are checked for reconfig operation we
> > can check that:
> >
> > skipACL = false && nodeRecord.acl != null && nodeRecord.acl.size() != 0
> >
> > meaning you're using ACLs, and have actually set ACLs on the config node.
> >
> > For getConfig its a bit trickier since its just a getData on the server
> > side (for efficiency
> > of reads, we avoided checking whether path == config znode). What we
> could
> > do is before sending
> > the operation to the server check skipACL = false and maybe also issue a
> > getACL call to check that
> > nodeRecord.acl != null && nodeRecord.acl.size() != 0
> > and only then issue a getData. This part is not air tight but its
> probably
> > sufficient.
> >
> > And of course we can emphasize the need for ACLs on this znode in the
> > release.
> >
> >
> > On Thu, Mar 31, 2016 at 1:11 PM, Flavio Junqueira <f...@apache.org>
> wrote:
> >
> > > I think Jason is saying that this is orthogonal in the following sense.
> > > You set ACLs because you care about authentication/authorization in
> your
> > > cluster, but you may not want reconfig enabled, it just happened that
> you
> > > wanted to use ACLs.
> > >
> > > Perhaps you can elaborate a bit on how you think we can perform this
> ACL
> > > check? What would you check precisely?
> > >
> > > -Flavio
> > >
> > > > On 24 Mar 2016, at 21:19, Alexander Shraer <shra...@gmail.com>
> wrote:
> > > >
> > > > I'm not so sure its orthogonal. The question is whether someone would
> > > ever
> > > > want to use reconfig without ACLs,
> > > > as this allows any client to reconfigure the servers away or add a
> > bunch
> > > of
> > > > servers that shouldn't be there :) and whether we should facilitate
> > this
> > > > knowing its insecure.
> > > >
> > > > Requiring ACLs solves the security concern for both reconfig and
> > > getconfig.
> > > > For example, if you don't want your clients to know the list of
> > servers,
> > > > limit their read permissions on the configuration znode.
> > > >
> > > > On Thu, Mar 24, 2016 at 2:11 PM, Jason Rosenberg <j...@squareup.com>
> > > wrote:
> > > >
> > > >> seems like an orthogonal requirement?
> > > >>
> > > >> On Thu, Mar 24, 2016 at 3:37 PM, Alexander Shraer <
> shra...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> How about a simpler alternative to the proposed flag for reconfig:
> a
> > > >> check
> > > >>> in the code that requires ACLs to be set.
> > > >>> If people want to use reconfig, they should use ACLs too.
> > > >>>
> > > >>> What do you think ?
> > > >>>
> > > >>> Alex
> > > >>>
> > > >>> On Mon, Mar 21, 2016 at 9:58 PM, Patrick Hunt <ph...@apache.org>
> > > wrote:
> > > >>>
> > > >>>> I would say if in doubt add a safety. (a config parameter to turn
> it
> > > >>>> off). Cost is almost zero and worst case it will just give us
> peace
> > of
> > > >>>> mind. ;-)
> > > >>>>
> > &g

Re: automatic update of server set at the client on reconfig

2016-03-31 Thread Alexander Shraer
Hi,

Please see update_addrs() function of the C client, and the following link:
https://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html#ch_reconfig_rebalancing

It could be automated further (e.g., ZOOKEEPER-2016) but there hasn't been
enough progress on this. Any contributions are very appreciated!
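For completeness, the 3.5 Java client exposes the analogous call; a
one-line sketch against an existing handle (the connect string is
illustrative):

    // re-point the client at the post-reconfig ensemble; the session is
    // kept, and the client rebalances its connection probabilistically
    zk.updateServerList("host1:2181,host4:2181,host5:2181");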


Cheers,
Alex

On Thu, Mar 31, 2016 at 4:54 PM, Pramod Srinivasan 
wrote:

> Hello Folks
>
> I am playing with reconfig to grow the zookeeper cluster dynamically, what
> I observed is that the C client library (don't know about the Java client)
> does not automatically reconfigure to the new server set after reconfig.
> So if I go from Zookeeper server set [a, b, c] -> [a] -> [a, d, f] -> [d,
> f, g], the client who was connected to server [a, b, c] will lose
> connectivity to zookeeper and the session will close once we reach [d, f,
> g]. If my application monitors the server config changes and feeds the
> client library with the new server set using zoo_set_servers, the session
> continues to be in the connected state. Is this observation correct?
>
> Any reason why the C client library should not automatically reconfigure
> itself with the server set by monitoring the zookeeper config path?
>
> Thanks,
> Pramod
>
>


Re: Zookeeper with SSL release date

2016-03-31 Thread Alexander Shraer
Citing Patrick:

> If you're running zk w/o security turned on and suddenly folks can do
reconfig
> operations it's going to potentially be a problem.
...
> Rather than force people to turn on kerberos (etc...) we could instead
> have the feature off

From this I understood that the concern is mostly about users that DON'T
use ACLs. My proposal is to disable
reconfig/getconfig for all such users, forcing users who want reconfig to
also use ACLs. Users who do use ACLs
don't have to use reconfig and will have to set the ACLs on the config
znode before they can use it.

In preprequestprocessor where acls are checked for reconfig operation we
can check that:

skipACL = false && nodeRecord.acl != null && nodeRecord.acl.size() != 0

meaning you're using ACLs, and have actually set ACLs on the config node.
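Roughly, as a sketch (the surrounding request-processing code and the
choice of exception are illustrative, not an actual patch):

    // reject reconfig unless ACLs are in use and actually set on the
    // config znode
    boolean aclsInUse = !skipACL
            && nodeRecord.acl != null
            && !nodeRecord.acl.isEmpty();
    if (!aclsInUse) {
        // illustrative error choice
        throw new KeeperException.InvalidACLException(path);
    }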

For getConfig it's a bit trickier, since it's just a getData on the server
side (for efficiency
of reads, we avoided checking whether path == config znode). What we could
do is, before sending
the operation to the server, check skipACL = false and maybe also issue a
getACL call to check that
nodeRecord.acl != null && nodeRecord.acl.size() != 0
and only then issue a getData. This part is not airtight but it's probably
sufficient.

And of course we can emphasize the need for ACLs on this znode in the
release.


On Thu, Mar 31, 2016 at 1:11 PM, Flavio Junqueira <f...@apache.org> wrote:

> I think Jason is saying that this is orthogonal in the following sense.
> You set ACLs because you care about authentication/authorization in your
> cluster, but you may not want reconfig enabled, it just happened that you
> wanted to use ACLs.
>
> Perhaps you can elaborate a bit on how you think we can perform this ACL
> check? What would you check precisely?
>
> -Flavio
>
> > On 24 Mar 2016, at 21:19, Alexander Shraer <shra...@gmail.com> wrote:
> >
> > I'm not so sure its orthogonal. The question is whether someone would
> ever
> > want to use reconfig without ACLs,
> > as this allows any client to reconfigure the servers away or add a bunch
> of
> > servers that shouldn't be there :) and whether we should facilitate this
> > knowing its insecure.
> >
> > Requiring ACLs solves the security concern for both reconfig and
> getconfig.
> > For example, if you don't want your clients to know the list of servers,
> > limit their read permissions on the configuration znode.
> >
> > On Thu, Mar 24, 2016 at 2:11 PM, Jason Rosenberg <j...@squareup.com>
> wrote:
> >
> >> seems like an orthogonal requirement?
> >>
> >> On Thu, Mar 24, 2016 at 3:37 PM, Alexander Shraer <shra...@gmail.com>
> >> wrote:
> >>
> >>> How about a simpler alternative to the proposed flag for reconfig: a
> >> check
> >>> in the code that requires ACLs to be set.
> >>> If people want to use reconfig, they should use ACLs too.
> >>>
> >>> What do you think ?
> >>>
> >>> Alex
> >>>
> >>> On Mon, Mar 21, 2016 at 9:58 PM, Patrick Hunt <ph...@apache.org>
> wrote:
> >>>
> >>>> I would say if in doubt add a safety. (a config parameter to turn it
> >>>> off). Cost is almost zero and worst case it will just give us peace of
> >>>> mind. ;-)
> >>>>
> >>>> Patrick
> >>>>
> >>>> On Mon, Mar 21, 2016 at 9:41 PM, Alexander Shraer <shra...@gmail.com>
> >>>> wrote:
> >>>>> ok, thanks for the suggestion, I'll look into it. For reconfig I
> >> think
> >>>> its
> >>>>> pretty clear that its an admin
> >>>>> functionality. I just always imagined that its controlled via acls,
> >>> but I
> >>>>> understand
> >>>>> the concerns now.
> >>>>>
> >>>>> getConfig returns the dynamic config (list of all servers with all
> >>> ports
> >>>>> and quorum system if defined)
> >>>>> and has an option to filter that info and just return the server
> >>>> connection
> >>>>> string (server and client port only).
> >>>>>
> >>>>>
> >>>>> On Mon, Mar 21, 2016 at 9:32 PM, Patrick Hunt <ph...@apache.org>
> >>> wrote:
> >>>>>
> >>>>>> On Mon, Mar 21, 2016 at 9:14 PM, Alexander Shraer <
> >> shra...@gmail.com>
> >>>>>> wrote:
> >>>>>>> I don't think that getConfig should be an admin functionality. It
> >> is

Re: Zookeeper with SSL release date

2016-03-24 Thread Alexander Shraer
I'm not so sure it's orthogonal. The question is whether someone would ever
want to use reconfig without ACLs,
as this allows any client to reconfigure the servers away or add a bunch of
servers that shouldn't be there :) and whether we should facilitate this
knowing it's insecure.

Requiring ACLs solves the security concern for both reconfig and getconfig.
For example, if you don't want your clients to know the list of servers,
limit their read permissions on the configuration znode.
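For instance, from zkCli (assuming SASL authentication is configured; the
scheme and id are placeholders):

    setAcl /zookeeper/config sasl:admin:cdrwa
    getAcl /zookeeper/config

After that, only clients authenticated as 'admin' can read the membership
or invoke reconfig against it.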

On Thu, Mar 24, 2016 at 2:11 PM, Jason Rosenberg <j...@squareup.com> wrote:

> seems like an orthogonal requirement?
>
> On Thu, Mar 24, 2016 at 3:37 PM, Alexander Shraer <shra...@gmail.com>
> wrote:
>
> > How about a simpler alternative to the proposed flag for reconfig: a
> check
> > in the code that requires ACLs to be set.
> > If people want to use reconfig, they should use ACLs too.
> >
> > What do you think ?
> >
> > Alex
> >
> > On Mon, Mar 21, 2016 at 9:58 PM, Patrick Hunt <ph...@apache.org> wrote:
> >
> > > I would say if in doubt add a safety. (a config parameter to turn it
> > > off). Cost is almost zero and worst case it will just give us peace of
> > > mind. ;-)
> > >
> > > Patrick
> > >
> > > On Mon, Mar 21, 2016 at 9:41 PM, Alexander Shraer <shra...@gmail.com>
> > > wrote:
> > > > ok, thanks for the suggestion, I'll look into it. For reconfig I
> think
> > > its
> > > > pretty clear that its an admin
> > > > functionality. I just always imagined that its controlled via acls,
> > but I
> > > > understand
> > > > the concerns now.
> > > >
> > > > getConfig returns the dynamic config (list of all servers with all
> > ports
> > > > and quorum system if defined)
> > > > and has an option to filter that info and just return the server
> > > connection
> > > > string (server and client port only).
> > > >
> > > >
> > > > On Mon, Mar 21, 2016 at 9:32 PM, Patrick Hunt <ph...@apache.org>
> > wrote:
> > > >
> > > >> On Mon, Mar 21, 2016 at 9:14 PM, Alexander Shraer <
> shra...@gmail.com>
> > > >> wrote:
> > > >> > I don't think that getConfig should be an admin functionality. It
> is
> > > >> > essential for client-side re-balancing
> > > >> > that we implemented (all clients should be able to detect
> > > configuration
> > > >> > changes via getConfig). It could
> > > >> > be hidden somewhat by defining higher-level re-balancing
> > > >> > policies (ZOOKEEPER-2016)
> > > >> > but there hasn't been enough progress on that. Perhaps instead
> > > getConfig
> > > >> > should be controlled
> > > >> > by a separate flag ?
> > > >> >
> > > >>
> > > >> I believe that the Hadoop community has something we could use:
> > > >>
> > > >>
> > >
> >
> https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html
> > > >> (whether through annotations or just documenting it in the API
> > javadoc)
> > > >>
> > > >> e.g. we could list getConfig as public/unstable for example and
> still
> > > >> ship it as GA. That would mark it as something that could change re
> > > >> API policy.
> > > >>
> > > >> Is the entire config exposed through getConfig? If so then we might
> > > >> want to enable/disable it with a flag similar to reconfig. Might be
> > > >> safer to just do that if we're not sure.
> > > >>
> > > >>
> > > >> Re classification - we could do the same thing with reconfig, but I
> > > >> think that would be a mistake. If we feel strongly where it should
> > > >> live long term we should just move it now.
> > > >>
> > > >> Patrick
> > > >>
> > > >> >
> > > >> > On Mon, Mar 21, 2016 at 9:04 PM, Patrick Hunt <ph...@apache.org>
> > > wrote:
> > > >> >
> > > >> >> On Mon, Mar 21, 2016 at 8:52 PM, Alexander Shraer <
> > shra...@gmail.com
> > > >
> > > >> >> wrote:
> > > >> >> > Hi Patrick, Flavio,
> > > >> >> >
> > > >> >> > Since there seems to be 

Re: Zookeeper with SSL release date

2016-03-24 Thread Alexander Shraer
How about a simpler alternative to the proposed flag for reconfig: a check
in the code that requires ACLs to be set.
If people want to use reconfig, they should use ACLs too.

What do you think ?

Alex

On Mon, Mar 21, 2016 at 9:58 PM, Patrick Hunt <ph...@apache.org> wrote:

> I would say if in doubt add a safety. (a config parameter to turn it
> off). Cost is almost zero and worst case it will just give us peace of
> mind. ;-)
>
> Patrick
>
> On Mon, Mar 21, 2016 at 9:41 PM, Alexander Shraer <shra...@gmail.com>
> wrote:
> > ok, thanks for the suggestion, I'll look into it. For reconfig I think
> its
> > pretty clear that its an admin
> > functionality. I just always imagined that its controlled via acls, but I
> > understand
> > the concerns now.
> >
> > getConfig returns the dynamic config (list of all servers with all ports
> > and quorum system if defined)
> > and has an option to filter that info and just return the server
> connection
> > string (server and client port only).
> >
> >
> > On Mon, Mar 21, 2016 at 9:32 PM, Patrick Hunt <ph...@apache.org> wrote:
> >
> >> On Mon, Mar 21, 2016 at 9:14 PM, Alexander Shraer <shra...@gmail.com>
> >> wrote:
> >> > I don't think that getConfig should be an admin functionality. It is
> >> > essential for client-side re-balancing
> >> > that we implemented (all clients should be able to detect
> configuration
> >> > changes via getConfig). It could
> >> > be hidden somewhat by defining higher-level re-balancing
> >> > policies (ZOOKEEPER-2016)
> >> > but there hasn't been enough progress on that. Perhaps instead
> getConfig
> >> > should be controlled
> >> > by a separate flag ?
> >> >
> >>
> >> I believe that the Hadoop community has something we could use:
> >>
> >>
> https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html
> >> (whether through annotations or just documenting it in the API javadoc)
> >>
> >> e.g. we could list getConfig as public/unstable for example and still
> >> ship it as GA. That would mark it as something that could change re
> >> API policy.
> >>
> >> Is the entire config exposed through getConfig? If so then we might
> >> want to enable/disable it with a flag similar to reconfig. Might be
> >> safer to just do that if we're not sure.
> >>
> >>
> >> Re classification - we could do the same thing with reconfig, but I
> >> think that would be a mistake. If we feel strongly where it should
> >> live long term we should just move it now.
> >>
> >> Patrick
> >>
> >> >
> >> > On Mon, Mar 21, 2016 at 9:04 PM, Patrick Hunt <ph...@apache.org>
> wrote:
> >> >
> >> >> On Mon, Mar 21, 2016 at 8:52 PM, Alexander Shraer <shra...@gmail.com
> >
> >> >> wrote:
> >> >> > Hi Patrick, Flavio,
> >> >> >
> >> >> > Since there seems to be consensus on this, I can add this flag,
> unless
> >> >> > someone else wants to. I assume that getConfig should still work
> >> >> regardless
> >> >> > of the flag ? is there a security concern with clients knowing the
> >> list
> >> >> of
> >> >> > servers?
> >> >> >
> >> >>
> >> >> We've always hidden that detail from users. We don't even expose
> which
> >> >> server you're connected to today. I remember Ben (and perhaps
> Flavio?)
> >> >> highlighting this as important to maintain although I'm not super
> >> >> familiar with the specifics on why. It made sense to me though from
> >> >> the perspective that we don't want clients to make assumptions that they
> >> >> probably shouldn't.
> >> >>
> >> >> My thinking is that we should 1) add a config option to enable
> >> >> reconfig (off by default), 2) move reconfig specific functionality
> >> >> from ZooKeeper.java (including getconfig) into an "admin" package,
> >> >> within say a class ZooKeeperAdmin, 3) document/test use of ACLs for
> >> >> when folks do want to enable reconfig and are also worried about
> auth.
> >> >> (e.g. turn on kerb)
> >> >>
> >> >> Again, I don't see any of this as a quality issue personally. As such
> >> >> I don't see
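
For reference on how this eventually landed: in the 3.5 line reconfig did move behind a server-side flag and into a separate admin class. A minimal sketch, assuming a server started with reconfigEnabled=true and a digest superuser; the addresses, ports and credentials here are illustrative:

    import org.apache.zookeeper.admin.ZooKeeperAdmin;
    import org.apache.zookeeper.data.Stat;

    public class ReconfigSketch {
        public static void main(String[] args) throws Exception {
            ZooKeeperAdmin admin =
                new ZooKeeperAdmin("127.0.0.1:2791", 30000, event -> { });
            // reconfig is an admin operation: authenticate before calling it
            admin.addAuthInfo("digest", "super:secret".getBytes());
            // incremental mode: add server 3, remove nothing
            byte[] newConfig = admin.reconfigure(
                "server.3=localhost:2761:2771:participant;localhost:2793",
                null,   // leaving servers
                null,   // full new membership (bulk mode only)
                -1,     // expected config version; -1 = unconditional
                new Stat());
            System.out.println(new String(newConfig));
            admin.close();
        }
    }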

Re: Zookeeper with SSL release date

2016-03-21 Thread Alexander Shraer
ok, thanks for the suggestion, I'll look into it. For reconfig I think it's
pretty clear that it's an admin
functionality. I just always imagined that it's controlled via ACLs, but I
understand
the concerns now.

getConfig returns the dynamic config (list of all servers with all ports
and quorum system if defined)
and has an option to filter that info and just return the server connection
string (server and client port only).


On Mon, Mar 21, 2016 at 9:32 PM, Patrick Hunt <ph...@apache.org> wrote:

> On Mon, Mar 21, 2016 at 9:14 PM, Alexander Shraer <shra...@gmail.com>
> wrote:
> > I don't think that getConfig should be an admin functionality. It is
> > essential for client-side re-balancing
> > that we implemented (all clients should be able to detect configuration
> > changes via getConfig). It could
> > be hidden somewhat by defining higher-level re-balancing
> > policies (ZOOKEEPER-2016)
> > but there hasn't been enough progress on that. Perhaps instead getConfig
> > should be controlled
> > by a separate flag ?
> >
>
> I believe that the Hadoop community has something we could use:
>
> https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html
> (whether through annotations or just documenting it in the API javadoc)
>
> e.g. we could list getConfig as public/unstable for example and still
> ship it as GA. That would mark it as something that could change re
> API policy.
>
> Is the entire config exposed through getConfig? If so then we might
> want to enable/disable it with a flag similar to reconfig. Might be
> safer to just do that if we're not sure.
>
>
> Re classification - we could do the same thing with reconfig, but I
> think that would be a mistake. If we feel strongly where it should
> live long term we should just move it now.
>
> Patrick
>
> >
> > On Mon, Mar 21, 2016 at 9:04 PM, Patrick Hunt <ph...@apache.org> wrote:
> >
> >> On Mon, Mar 21, 2016 at 8:52 PM, Alexander Shraer <shra...@gmail.com>
> >> wrote:
> >> > Hi Patrick, Flavio,
> >> >
> >> > Since there seems to be consensus on this, I can add this flag, unless
> >> > someone else wants to. I assume that getConfig should still work
> >> regardless
> >> > of the flag ? is there a security concern with clients knowing the
> list
> >> of
> >> > servers?
> >> >
> >>
> >> We've always hidden that detail from users. We don't even expose which
> >> server you're connected to today. I remember Ben (and perhaps Flavio?)
> >> highlighting this as important to maintain although I'm not super
> >> familiar with the specifics on why. It made sense to me though from
> >> the perspective that we don't want clients to make assumptions that they
> >> probably shouldn't.
> >>
> >> My thinking is that we should 1) add a config option to enable
> >> reconfig (off by default), 2) move reconfig specific functionality
> >> from ZooKeeper.java (including getconfig) into an "admin" package,
> >> within say a class ZooKeeperAdmin, 3) document/test use of ACLs for
> >> when folks do want to enable reconfig and are also worried about auth.
> >> (e.g. turn on kerb)
> >>
> >> Again, I don't see any of this as a quality issue personally. As such
> >> I don't see why any of this (1-3) should hold up a 3.5.2-alpha if we
> >> were interested in doing such a release. Adjusting the API should be
> >> done before we move to "beta" though. Although that seems like a
> >> pretty mechanical (eclipse/idea) type refactoring?
> >>
> >> Patrick
> >>
> >> > Cheers,
> >> > Alex
> >> > On Mar 21, 2016 8:34 PM, "Patrick Hunt" <ph...@apache.org> wrote:
> >> >
> >> >> On Thu, Mar 17, 2016 at 4:08 PM, Flavio Junqueira <f...@apache.org>
> >> wrote:
> >> >> > I gotta say that I'm not super excited about this option, but for
> some
> >> >> reason I ended up carrying the flag. To recap, I just raised this
> option
> >> >> because it seems that there are folks interested in features in 3.5
> like
> >> >> SSL and not necessarily in reconfiguration. SSL is important and to
> take
> >> >> Kafka as an example, it sucks that we can't have a whole set up using
> >> SSL.
> >> >> For ZK, the real questions are:
> >> >> >
> >> >> > 1- how fast can we make 3.5 stable?
> >> >> > 2- wo
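
A minimal sketch of the getConfig side, assuming the 3.5 Java client: watch the dynamic config, filter it down to the host:clientPort connection string, and hand that back to the client for re-balancing. The filtering here is hand-rolled for illustration:

    import java.io.IOException;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    class RebalanceSketch {
        static void refreshServerList(ZooKeeper zk)
                throws KeeperException, InterruptedException, IOException {
            byte[] raw = zk.getConfig(true, new Stat()); // watch=true: hear about changes
            StringBuilder hosts = new StringBuilder();
            for (String line : new String(raw).split("\n")) {
                // server.1=localhost:2721:2731:participant;localhost:2791
                int semi = line.indexOf(';');
                if (semi < 0) continue;                  // skips the version=... line
                if (hosts.length() > 0) hosts.append(',');
                hosts.append(line.substring(semi + 1));  // keep host:clientPort only
            }
            zk.updateServerList(hosts.toString());       // client re-balances connections
        }
    }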

Re: Zookeeper with SSL release date

2016-03-21 Thread Alexander Shraer
another thing - shouldn't things like setting quotas also be part of the
admin API ? how does that
work now ?

Alex

On Mon, Mar 21, 2016 at 9:14 PM, Alexander Shraer <shra...@gmail.com> wrote:

> I don't think that getConfig should be an admin functionality. It is
> essential for client-side re-balancing
> that we implemented (all clients should be able to detect configuration
> changes via getConfig). It could
> be hidden somewhat by defining higher-level re-balancing
> policies (ZOOKEEPER-2016)
> but there hasn't been enough progress on that. Perhaps instead getConfig
> should be controlled
> by a separate flag ?
>
> Alex
>
> On Mon, Mar 21, 2016 at 9:04 PM, Patrick Hunt <ph...@apache.org> wrote:
>
>> On Mon, Mar 21, 2016 at 8:52 PM, Alexander Shraer <shra...@gmail.com>
>> wrote:
>> > Hi Patrick, Flavio,
>> >
>> > Since there seems to be consensus on this, I can add this flag, unless
>> > someone else wants to. I assume that getConfig should still work
>> regardless
>> > of the flag ? is there a security concern with clients knowing the list
>> of
>> > servers?
>> >
>>
>> We've always hidden that detail from users. We don't even expose which
>> server you're connected to today. I remember Ben (and perhaps Flavio?)
>> highlighting this as important to maintain although I'm not super
>> familiar with the specifics on why. It made sense to me though from
>> the perspective that we don't want clients to make assumptions that they
>> probably shouldn't.
>>
>> My thinking is that we should 1) add a config option to enable
>> reconfig (off by default), 2) move reconfig specific functionality
>> from ZooKeeper.java (including getconfig) into an "admin" package,
>> within say a class ZooKeeperAdmin, 3) document/test use of ACLs for
>> when folks do want to enable reconfig and are also worried about auth.
>> (e.g. turn on kerb)
>>
>> Again, I don't see any of this as a quality issue personally. As such
>> I don't see why any of this (1-3) should hold up a 3.5.2-alpha if we
>> were interested in doing such a release. Adjusting the API should be
>> done before we move to "beta" though. Although that seems like a
>> pretty mechanical (eclipse/idea) type refactoring?
>>
>> Patrick
>>
>> > Cheers,
>> > Alex
>> > On Mar 21, 2016 8:34 PM, "Patrick Hunt" <ph...@apache.org> wrote:
>> >
>> >> On Thu, Mar 17, 2016 at 4:08 PM, Flavio Junqueira <f...@apache.org>
>> wrote:
>> >> > I gotta say that I'm not super excited about this option, but for
>> some
>> >> reason I ended up carrying the flag. To recap, I just raised this
>> option
>> >> because it seems that there are folks interested in features in 3.5
>> like
>> >> SSL and not necessarily in reconfiguration. SSL is important and to
>> take
>> >> Kafka as an example, it sucks that we can't have a whole set up using
>> SSL.
>> >> For ZK, the real questions are:
>> >> >
>> >> > 1- how fast can we make 3.5 stable?
>> >> > 2- would it be faster if we have a way of disabling reconfiguration?
>> >> > 3- would enough users care about a stable 3.5 that has
>> reconfiguration
>> >> disabled?
>> >> >
>> >> > It is taking a long time to get 3.5 to beta. There has been some good
>> >> activity around 3.5.2 release, which is a great step, but it is unclear
>> >> when 3.5.3 is going to come and if we will be able to make 3.5 beta
>> then.
>> >> Frankly, disabling reconfiguration sounds undesirable because it is an
>> >> important feature, but I currently don't use it in production, so from
>> a
>> >> practical point of view, I can go both ways. Whether we go through the
>> >> trouble of doing 2 depends on users interested in that option and folks
>> >> willing to implement it.
>> >> >
>> >> > To answer your question, Powell, my pseudo-proposal is kind of a
>> funny
>> >> option because once the feature is stable, then we wouldn't need a
>> switch
>> >> any longer, so there is no need for a deprecation path, we just start
>> >> ignoring it from the first beta release. Until it is beta, I'd say that
>> >> default is disabled.
>> >>
>> >> I would argue that we need this even when it does become stable. To me
>> >> this is not a quality issue so much as it is an auth issue. We want to
>

Re: Zookeeper with SSL release date

2016-03-19 Thread Alexander Shraer
Looking at the list of ~50 blocker and critical bugs in ZooKeeper, only 3-4
are related to reconfig. Given this, and the fact that it has run in
production since 2012 in multiple companies, I don't think it's more
unstable than any other part of ZooKeeper.

There are multiple reconfig-related bugs that turned out really difficult
to debug without access to the actual system or at least to the Hudson
machines where some tests are failing. In the past, Michi and I physically
went to Hortonworks to debug one such issue, but this is of course not a
good method, and we weren't able to arrange such a visit again.

Regarding making it optional - the reconfig logic has several different
parts, some would be really difficult to disable using a configuration
parameter. But the actual dynamic expansion of the cluster is triggered by
the reconfig command, so it should not affect users who don't invoke it.

On Wed, Mar 16, 2016 at 1:09 PM, Flavio P JUNQUEIRA  wrote:

> I suppose we could give it a try. How do other folks feel about it?
>
> -Flavio
> On 16 Mar 2016 19:52, "Jason Rosenberg"  wrote:
>
> > So, you could enable the dynamic reconfiguration feature behind a config
> > option, and document that it should only be enabled experimentally, use
> at
> > your own risk.  Keep it off by default.  Allow only static config by
> > default, until it's stable?
> >
> > Jason
> >
> > On Wed, Mar 16, 2016 at 3:34 PM, Flavio Junqueira 
> wrote:
> >
> > > Hi Jason,
> > >
> > > The consumer in Kafka is pretty independent from the core (brokers),
> > > that's how that project manages to make such a separation. We don't
> have
> > > the same with ZooKeeper as the feature we are talking about is part of
> > the
> > > server and the only way I see of doing what you say is to turn off
> > > features. More specifically, we'd need to disable the reconfig API and
> do
> > > not allow any change to the configuration, even though the code is
> there.
> > >
> > > Reconfig here refers to the ability of changing the configuration of an
> > > ensemble (e.g., changing the set of servers).
> > >
> > > -Flavio
> > >
> > > > On 16 Mar 2016, at 19:14, Jason Rosenberg  wrote:
> > > >
> > > > So, it would seem sensible to me to have a release where all features
> > are
> > > > stable, except where noted.  E.g. mark certain features as only
> 'alpha
> > > > quality', e.g. the 're-config feature'.  (I assume it's possible to
> > > happily
> > > > use 3.5.X without exercising the unstable re-config bits?).
> > > >
> > > > There's precedent for doing this sort of thing in other projects,
> e.g.
> > in
> > > > Kafka, they've had several releases where a new "Consumer API" is
> > shipped
> > > > that is available for beta-testing, but you can still just use the
> > older
> > > > stable consumer api, etc.
> > > >
> > > > Jason
> > > >
> > > > On Wed, Mar 16, 2016 at 2:01 PM, powell molleti
> > >  > > >> wrote:
> > > >
> > > >> Hi Doug,
> > > >> Is 3.5 being an alpha release preventing you from using it? Or have
> > you
> > > >> run into issues with it? In general perhaps ZK 3.5 being labeled as
> > > alpha
> > > >> might not be fair, since it is far more stable than what most people
> > > >> expect an alpha release to be.
> > > >> Perhaps if you do not use the re-config feature, maybe it will just work
> > for
> > > >> you?
> > > >> There are many examples of 3.5.X being used in production, from my
> > > limited
> > > >> knowledge.
> > > >> Thanks, Powell.
> > > >>
> > > >>On Wednesday, March 16, 2016 2:44 AM, Flavio Junqueira <
> > > f...@apache.org>
> > > >> wrote:
> > > >>
> > > >>
> > > >> None of us expected the reconfig changes to take this long to
> > stabilize.
> > > >> Until we get there, I don't think we can do anything else with 3.5.
> > The
> > > >> best bet we have is to work harder to bring 3.5 to a stable release rather
> > > than
> > > >> trying to work around it.
> > > >>
> > > >> There are lots of people interested in seeing 3.5 stable, and if we
> > get
> > > >> everyone to contribute more patches and code reviews, we should be
> > able
> > > to
> > > >> do it sooner. After all, it is a community based effort, so the
> > > community
> > > >> shouldn't rely on just 2-3 people doing the work.
> > > >>
> > > >> -Flavio
> > > >>
> > > >>> On 15 Mar 2016, at 17:28, Chris Nauroth 
> > > >> wrote:
> > > >>>
> > > >>> Doug, I forgot to respond to your question about 3.4.  Since 3.4 is
> > the
> > > >>> stable maintenance line, we are very conservative about
> back-porting
> > to
> > > >>> it.  Our policy is to limit back-ports to critical bug fixes and
> not
> > > >>> introduce any new features in the 3.4 line.  This is a matter of
> > > managing
> > > >>> risk.
> > > >>>
> > > >>> Jason, your question about release cadence is a fair one.  If it's
> > any
> > > >>> consolation, we are now taking the approach of trying to 

Re: Zookeeper with SSL release date

2016-03-19 Thread Alexander Shraer
Here is a link for bugs marked as 3.5.2:
https://issues.apache.org/jira/browse/ZOOKEEPER/fixforversion/12331981/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-issues-panel

The API issue Flavio mentioned is
https://issues.apache.org/jira/browse/ZOOKEEPER-2014 personally I don't
think this issue
is significant enough to block the release, but I may be wrong. ZooKeeper
supports ACLs and these can be used to solve the
issue described in the JIRA, at least until a better solution is in place.


Alex


On Wed, Mar 16, 2016 at 9:33 PM, Jason Rosenberg <j...@squareup.com> wrote:

> Forgive me, as I have not long been an active member of the zookeeper
> community (other than as a grateful user over the last 3 years or so).
>
> If I understand correctly, 3.5.X has been alpha for several years or so
> now?  I think if there isn't a plan to have a stable release soon (say
> within another year), it's a problem.  It should be about having a regular
> release cycle, with the hope that new features and bug fixes become
> available in a reasonable time.  If one feature is just not stable, then it
> shouldn't block other features, etc.  Saying a feature is a major part of
> 3.5 doesn't quite make sense in this formulation.  Instead releases
> incorporate features, and if features get delayed, they can be postponed to
> a subsequent release.
>
> The issue is that we have people saying they don't want to fix things in
> 3.4.X (or back port fixes from 3.5.X to 3.4.X).  But if 3.5.X is still
> literally years away (after having been under development for years),
> we should re-evaluate, no?
>
> Jason
>
> On Wed, Mar 16, 2016 at 8:46 PM, Patrick Hunt <ph...@apache.org> wrote:
>
> > I'm not a huge fan of turning it off to be honest. Also just turning
> > it off at the API level wouldn't be enough, we'd need to turn it off
> > at the protocol level (otw it could still be accessed).
> >
> > I'd rather see us address it than kick it down the road. It's a major
> > feature of 3.5.
> >
> > Patrick
> >
> > On Wed, Mar 16, 2016 at 2:46 PM, Flavio Junqueira <f...@apache.org>
> wrote:
> > > The main issue to sort out is stability of the API. There is a security
> > concern around the fact that clients can freely reconfigure the ensemble.
> > If we follow the plan that Pat proposed some time ago:
> > >
> > >
> >
> https://mail-archives.apache.org/mod_mbox/zookeeper-dev/201407.mbox/%3CCANLc_9KG6-Dhm=wwfuwzniogk70pg+ihmhpigyfjdslf9-e...@mail.gmail.com%3E
> > <
> >
> https://mail-archives.apache.org/mod_mbox/zookeeper-dev/201407.mbox/%3CCANLc_9KG6-Dhm=wwfuwzniogk70pg+ihmhpigyfjdslf9-e...@mail.gmail.com%3E
> > >
> > >
> > > Locking the API is the main step to move it to beta. Sorting out bugs
> is
> > definitely necessary, but it isn't the main thing that is keeping 3.5 in
> > alpha.
> > >
> > > About making it experimental, I was raising the option of having a
> > switch that disables the API calls, not the code. The reason why that
> could
> > work is that anyone using 3.5 who uses the "experimental" API must
> explicitly
> > turn on the switch and enable the calls. If they do it, they need to be
> > aware that the API can change.
> > >
> > >  I must say that I haven't really looked closely into doing it, and I'm
> > not even entirely convinced that this is a good idea, but since Jason
> > raised the point, I'm exploring options.
> > >
> > > -Flavio
> > >
> > >> On 16 Mar 2016, at 20:59, Alexander Shraer <shra...@gmail.com> wrote:
> > >>
> > >> Looking at the list of ~50 blocker and critical bugs in ZooKeeper,
> only
> > 3-4
> > >> are related to reconfig. Given this, and the fact that it has run in
> > >> production since 2012 in multiple companies, I don't think it's more
> > >> unstable than any other part of ZooKeeper.
> > >>
> > >> There are multiple reconfig-related bugs that turned out really
> > difficult
> > >> to debug without access to the actual system or at least to the Hudson
> > >> machines where some tests are failing. In the past, Michi and I
> > physically
> > >> went to Hortonworks to debug one such issue, but this is of course
> not a
> > >> good method, and we weren't able to arrange such a visit again.
> > >>
> > >> Regarding making it optional - the reconfig logic has several
> different
> > >> parts, some would be really difficult to disable using a configuration
> > >> parameter. But the actu
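
A minimal sketch of the ACL route mentioned above, on the assumption that reconfig permission is enforced via the config znode (/zookeeper/config) and that the digest scheme is in use; the identity and password are illustrative:

    import java.util.Arrays;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.ACL;
    import org.apache.zookeeper.data.Id;
    import org.apache.zookeeper.server.auth.DigestAuthenticationProvider;

    class ConfigAclSketch {
        static void lockDownConfig(ZooKeeper zk) throws Exception {
            // only the "admin" digest identity may change the config;
            // everyone can still read it, so getConfig keeps working
            Id admin = new Id("digest",
                DigestAuthenticationProvider.generateDigest("admin:secret"));
            zk.setACL("/zookeeper/config",
                Arrays.asList(
                    new ACL(ZooDefs.Perms.ALL, admin),
                    new ACL(ZooDefs.Perms.READ, ZooDefs.Ids.ANYONE_ID_UNSAFE)),
                -1);  // -1: don't condition on the current ACL version
        }
    }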

Re: zookeeper client session write-read consistency

2016-03-07 Thread Alexander Shraer
The server to which the client is connected will buffer the read until the
write is executed and applied to its state, so the read will necessarily
return a value at least as recent as the one written by the write in your
example. ZK guarantees that async operations are executed in order of
invocation.
On Mar 6, 2016 23:57, "wayne"  wrote:

Thanks Chris! I appreciate the answer a lot!

What you said made perfect sense in the case that requests are sent
synchronously (which was my assumption :)). What if the requests are sent
asynchronously? e.g. If I call AsyncWrite, AsyncRead within a session, when
the AsyncRead is executed, the previous AsyncWrite's result might not have
been returned to the client yet, then there is no way for the client to know
the previous AsyncWrite's zxid, correct? In that case, could the situation I
mentioned in my previous post happen?



--
View this message in context:
http://zookeeper-user.578899.n2.nabble.com/zookeeper-client-session-write-read-consistency-tp7579330p7582099.html
Sent from the zookeeper-user mailing list archive at Nabble.com.
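
A minimal sketch of the async case, assuming the standard Java client: neither call has completed when the second is issued, yet because async operations execute in invocation order the read cannot return a value older than the write:

    import org.apache.zookeeper.ZooKeeper;

    class AsyncOrderSketch {
        static void writeThenRead(ZooKeeper zk) {
            zk.setData("/x", "v1".getBytes(), -1,
                (rc, path, ctx, stat) -> { /* write completed */ }, null);
            // queued behind the write on the same session, so it sees "v1" or newer
            zk.getData("/x", false,
                (rc, path, ctx, data, stat) -> {
                    if (rc == 0) System.out.println("read: " + new String(data));
                }, null);
        }
    }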


Re: ZooKeeper transaction properties (partial read)

2016-02-02 Thread Alexander Shraer
Hi,

I think this situation shouldn't happen - the result of step 3 implies that
the multi happened, so 4 should see it (multi should be atomic). Could you
please open a high priority bug for this ?

Thanks,
Alex

On Mon, Feb 1, 2016 at 2:40 PM, Whitney Sorenson <wsoren...@hubspot.com>
wrote:

> 1. Starting state : { /foo = , /bar =  }
> 2. In a multi, write: { /foo = A, /bar = B}
> 3. Read /foo as A
> 4. Read /bar as 
>
> #3 and #4 are issued 100% sequentially.
>
> It is not known at what point during #2, #3 starts.
>
> - Whitney
>
>
> On Mon, Feb 1, 2016 at 5:08 PM, Jared Cantwell <jared.cantw...@gmail.com>
> wrote:
>
> > Am I understanding you correctly that you are observing the following?
> >
> >1. Starting state {/foo = A , /bar = B}
> >2. In a multi, write {/foo->A', /bar = B'}
> >3. Read /foo as A'
> >4. Read /bar as B
> >
> > #3 & #4 start after #2 completes entirely, right?  And #3 & #4 are issued
> > 100% sequentially?
> >
> > ~Jared
> >
> > On Mon, Feb 1, 2016 at 2:54 PM, Alexander Shraer <shra...@gmail.com>
> > wrote:
> >
> > > Reading the 965 JIRA what you're describing sounds like a bug.
> > >
> > > Alex
> > >
> > > On Mon, Feb 1, 2016 at 10:41 AM, Whitney Sorenson <
> wsoren...@hubspot.com
> > >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > In searching through the ZK documentation, this list,
> > > > https://issues.apache.org/jira/browse/ZOOKEEPER-965, and curator
> > > > documentation (which we're using to talk to ZK) I can't find anything
> > > > definitive explaining the guarantees around using transactions.
> > > >
> > > > I am now beginning to wonder if I wrongly assumed that partial reads
> > were
> > > > not possible during transactions, because I have now observed this
> > > behavior
> > > > twice.
> > > >
> > > > I'd just like to confirm that this is the expected behavior:
> > > >
> > > > ZK 3.4.6
> > > > Curator 2.8.0
> > > >
> > > > I write 2 nodes in a transaction, however, I am able to see one of
> the
> > > > nodes without seeing the other in 2 subsequent reads.
> > > >
> > > > Eventually, I am able to see both nodes. When I am talking about
> seeing
> > > the
> > > > nodes I am talking about reads from the same client which issued the
> > > > transaction talking to the same ZK server.
> > > >
> > > > - Whitney
> > > >
> > >
> >
>
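
For reference, a minimal sketch of the multi under discussion, assuming the plain Java client (Curator's transaction API wraps this same primitive): both writes commit as one atomic transaction, so a subsequent reader should never see one without the other:

    import java.util.Arrays;
    import org.apache.zookeeper.Op;
    import org.apache.zookeeper.ZooKeeper;

    class MultiSketch {
        static void atomicPairWrite(ZooKeeper zk) throws Exception {
            zk.multi(Arrays.asList(
                Op.setData("/foo", "A".getBytes(), -1),
                Op.setData("/bar", "B".getBytes(), -1)));
        }
    }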


Re: ZooKeeper transaction properties (partial read)

2016-02-01 Thread Alexander Shraer
Reading the 965 JIRA what you're describing sounds like a bug.

Alex

On Mon, Feb 1, 2016 at 10:41 AM, Whitney Sorenson 
wrote:

> Hi,
>
> In searching through the ZK documentation, this list,
> https://issues.apache.org/jira/browse/ZOOKEEPER-965, and curator
> documentation (which we're using to talk to ZK) I can't find anything
> definitive explaining the guarantees around using transactions.
>
> I am now beginning to wonder if I wrongly assumed that partial reads were
> not possible during transactions, because I have now observed this behavior
> twice.
>
> I'd just like to confirm that this is the expected behavior:
>
> ZK 3.4.6
> Curator 2.8.0
>
> I write 2 nodes in a transaction, however, I am able to see one of the
> nodes without seeing the other in 2 subsequent reads.
>
> Eventually, I am able to see both nodes. When I am talking about seeing the
> nodes I am talking about reads from the same client which issued the
> transaction talking to the same ZK server.
>
> - Whitney
>


Re: Apache ZooKeeper Meetup - Jan 27, Cloudera HQ

2016-01-28 Thread Alexander Shraer
and here's my presentation: https://goo.gl/xrDCZT,
please feel free to comment, feedback is very appreciated.

Alex

On Thu, Jan 28, 2016 at 6:33 PM, powell molleti <powell...@yahoo.com.invalid
> wrote:

> Here is the link to the presentation. Feel free to comment; I have enabled
> comments.
> https://goo.gl/BPWtzc
> Thanks, Powell.
>
> On Thursday, January 28, 2016 7:45 AM, Patrick Hunt <ph...@apache.org>
> wrote:
>
>
>  Thanks to everyone who attended last night! Including our remote
> folks. Sorry to have to kick everyone out at 9pm but I had to head
> home. ;-)
>
> We had some great discussions and some great presentations. The work
> that Powell showed off around leader election sparked quite a debate!
> I'm interested to see how that will develop, looks very positive.
> Alexander's presentation on the work Kfir Lev-Ari has been doing with
> ZOOKEEPER-2024 to improve performance of mixed workloads got a lot of
> interest from the participants.
>
> Here's my short presentation on 3.5 status. Hopefully Powell and Alex
> can post links to their presentations as well.
>
>
> https://docs.google.com/presentation/d/1xZULXnEbNxOfgnGGw_PFHWCHqqzcf-4Ww66Y5BERSYM/edit?usp=sharing
>
> Patrick
>
> On Sun, Jan 24, 2016 at 3:12 PM, Patrick Hunt <ph...@apache.org> wrote:
> > I believe Webex should work, I set up the following and will try to get
> > this working with my laptop for the duration of the meetup:
> >
> > --
> > Apache ZooKeeper Meetup Jan 27 2016
> > Wednesday, January 27, 2016
> > 6:00 pm | Pacific Standard Time (San Francisco, GMT-08:00) | 2 hrs
> >
> >
> > JOIN WEBEX MEETING
> >
> https://cloudera.webex.com/cloudera/j.php?MTID=md8052b07bfa53ab5e72d0a94260c9800
> > Meeting number: 622 396 146 Meeting password: zk2016
> >
> >
> > JOIN BY PHONE 1-650-479-3208 Call-in toll number (US/Canada) Access
> > code: 622 396 146 Global call-in
> > numbers:
> https://cloudera.webex.com/cloudera/globalcallin.php?serviceType=MC=443047232=0
> > Add this meeting to your calendar (Cannot add from mobile devices):
> >
> https://cloudera.webex.com/cloudera/j.php?MTID=m346c19d259b1c8f6de7b0ed350ef8cd1
> > Can't join the meeting? Contact support here:
> > https://cloudera.webex.com/cloudera/mc
> > IMPORTANT NOTICE: Please note that this WebEx service allows audio and
> > other information sent during the session to be recorded, which may be
> > discoverable in a legal matter. You should inform all meeting
> > attendees prior to recording if you intend to record the meeting.
> >
> > On Thu, Jan 21, 2016 at 11:31 AM, Alexander Shraer <shra...@gmail.com>
> wrote:
> >> Thanks for organizing!
> >>
> >> If possible, I'd like to give a short presentation (10 min ?) about
> Kfir's
> >> work on ZOOKEEPER-2024.
> >> I think it's a very important improvement and we should get this in 3.5
> >>
> >> Cheers,
> >> Alex
> >>
> >> On Thu, Jan 21, 2016 at 9:31 AM, Rakesh Radhakrishnan <
> >> rakeshr.apa...@gmail.com> wrote:
> >>
> >>> Thank you for organizing this Flavio!  I'm interested to attend
> remotely.
> >>>
> >>> Best Regards,
> >>> Rakesh
> >>>
> >>> On Wed, Jan 20, 2016 at 11:19 PM, Raúl Gutiérrez Segalés <
> >>> r...@itevenworks.net> wrote:
> >>>
> >>> > Thanks for setting this up Flavio! See you all there!
> >>> >
> >>> > On 20 January 2016 at 08:35, Flavio Junqueira <f...@apache.org>
> wrote:
> >>> >
> >>> > > Hello!
> >>> > >
> >>> > > We are organizing a meetup in the Bay Area next week, and I'd love
> to
> >>> see
> >>> > > everyone who is in the area there. Please check the event page and
> don't
> >>> > > forget to RSVP:
> >>> > >
> >>> > >
> >>>
> https://www.eventbrite.com/e/apache-zookeeper-meetup-tickets-20906479844
> >>> > <
> >>> > >
> >>>
> https://www.eventbrite.com/e/apache-zookeeper-meetup-tickets-20906479844
> >>> > >
> >>> > >
> >>> > > Also, it'd be great to have folks speaking about stuff around ZK.
> If
> >>> > > you're interested, let me know and I'll add you to the agenda.
> >>> > >
> >>> > > See you there!
> >>> > >
> >>> > > -Flavio
> >>> >
> >>>
>
>
>


Re: Apache ZooKeeper Meetup - Jan 27, Cloudera HQ

2016-01-21 Thread Alexander Shraer
Thanks for organizing!

If possible, I'd like to give a short presentation (10 min ?) about Kfir's
work on ZOOKEEPER-2024.
I think it's a very important improvement and we should get this in 3.5

Cheers,
Alex

On Thu, Jan 21, 2016 at 9:31 AM, Rakesh Radhakrishnan <
rakeshr.apa...@gmail.com> wrote:

> Thank you for organizing this Flavio!  I'm interested to attend remotely.
>
> Best Regards,
> Rakesh
>
> On Wed, Jan 20, 2016 at 11:19 PM, Raúl Gutiérrez Segalés <
> r...@itevenworks.net> wrote:
>
> > Thanks for setting this up Flavio! See you all there!
> >
> > On 20 January 2016 at 08:35, Flavio Junqueira  wrote:
> >
> > > Hello!
> > >
> > > We are organizing a meetup in the Bay Area next week, and I'd love to
> see
> > > everyone who is in the area there. Please check the event page and don't
> > > forget to RSVP:
> > >
> > >
> https://www.eventbrite.com/e/apache-zookeeper-meetup-tickets-20906479844
> > <
> > >
> https://www.eventbrite.com/e/apache-zookeeper-meetup-tickets-20906479844
> > >
> > >
> > > Also, it'd be great to have folks speaking about stuff around ZK. If
> > > you're interested, let me know and I'll add you to the agenda.
> > >
> > > See you there!
> > >
> > > -Flavio
> >
>


Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)

2016-01-13 Thread Alexander Shraer
I may be wrong but I don't think that being idempotent gives you what you
said. Just because f(f(x))=f(x) doesn't mean that f(g(f(x))) = g(f(x)) --
this was my example. But if your system can detect that X was already
executed (or if the operations are conditional on state) my scenario indeed
can't happen.


On Wed, Jan 13, 2016 at 2:08 AM, singh.janmejay <singh.janme...@gmail.com>
wrote:

> @Alexander: In that scenario, the write of X will be attempted by A, but
> the external system will not act upon write-X because that operation has
> already been acted upon in the past. This is guaranteed by the
> idempotent-operations invariant. But it does point out another problem,
> which I hadn't handled in my original algorithm. Problem: If X and Y have
> both not been issued yet, and Y is issued before X towards the external
> system, because neither operation has executed yet, it'll overwrite
> Y with X. I need another constraint: the master should only issue 1
> operation on a certain external-system at a time and must issue
> operations in the order of operation-id (sequential-znode sequence
> number). So we need the following invariants:
> - order of issuing operation being fixed (matching order of creation
> of operations)
> - concurrency of operation fixed to 1
> - idempotent execution on external-system side
>
> @Powell: I'm kind of doing the same thing, except the loop doesn't run
> on the consumer; instead it runs on the master, which is assigning work to
> consumers. So the triggerWork function is basically changed to issueWork,
> which is RPC + triggerWork. The replay of history is basically just a
> replay of 1 operation per operand-node (in this thread we are calling
> it the external-system), so it's as if triggerWork failed, in which case we
> need to re-execute triggerWork. Idempotency also follows from that
> requirement. If triggerWork fails in the last step, and all the
> desired effect that was necessary has happened, we will still need to
> run triggerWork again, but we need awareness that the actual work has been
> done, which is why idempotency is necessary.
>
> Btw, thanks for continuing to spare time for this, I really appreciate
> this feedback/validation.
>
> Thoughts?
>
> On Wed, Jan 13, 2016 at 3:47 AM, powell molleti
> <powell...@yahoo.com.invalid> wrote:
> > Wouldn't a distributed queue recipe for the consumer work? One needs
> to add extra logic, something like this:
> >
> > with lock() {
> >     p = queue.peek()
> >     if triggerWork(p) is Done:
> >         queue.pop()
> > }
> >
> > With this a consumer that worked on it but crashed before popping the
> queue would result in another consumer processing the same work.
> >
> > I am not sure of the details of where you are getting the work from
> or what the scale of it is, but producers (the leader) could write to this queue.
> > Since there is a guarantee of read after write, the producer could delete from
> its local queue the work that was successfully queued. Hence again a new
> producer could re-send the last entry of work, so one has to handle that.
> Without more details on the origin of the work etc. it's hard to design end to
> end.
> >
> > I do not see a need to do a total replay of past history etc. when using a
> ZK-like system, because ZK is built on the idea of a serialized and replicated
> log; hence if you are using ZK then your design should be much simpler, i.e.
> fail and re-start from the last known transaction.
> >
> > Powell.
> >
> >
> >
> > On Tuesday, January 12, 2016 11:51 AM, Alexander Shraer <
> shra...@gmail.com> wrote:
> > Hi,
> >
> > With your suggestion, the following scenario seems possible: master A is
> > about to write value X to an external system so it logs it to ZK, then
> > freezes for some time, ZK suspects that it failed, another master B is
> > elected, writes X (completing what A wanted to do)
> > then starts doing something else and writes Y. Then leader A "wakes up"
> and
> > re-executes the old operation writing X which is now stale.
> >
> > perhaps if your external system supports conditional updates this can be
> > avoided - a write of X only works if the current state is as expected.
> >
> > Alex
> >
> >
> > On Tue, Jan 5, 2016 at 1:00 AM, singh.janmejay <singh.janme...@gmail.com
> >
> > wrote:
> >
> >> Thanks for the replies everyone, most of it was very useful.
> >>
> >> @Alexander: The section of the Chubby paper you pointed me to tries to
> >> address exactly what I was looking for. That clearly is one good way
> >> of doing it. I'm also thinking of an alternative way and can use a
> >> review o
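
A minimal sketch of the "detect that X was already executed" condition mentioned above, assuming the external system can persist the id of the last operation it applied; all names here are illustrative:

    class ExternalSystemSketch {
        private long lastAppliedOpId = -1;
        private String state;

        // operation ids come from the sequential-znode sequence numbers, so a
        // stale master replaying op 10 after op 11 ran is rejected as a no-op
        synchronized boolean apply(long opId, String value) {
            if (opId <= lastAppliedOpId) {
                return false;            // already executed (or superseded)
            }
            lastAppliedOpId = opId;
            state = value;
            return true;
        }
    }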

Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)

2016-01-12 Thread Alexander Shraer
Hi,

With your suggestion, the following scenario seems possible: master A is
about to write value X to an external system so it logs it to ZK, then
freezes for some time, ZK suspects that it failed, another master B is
elected, writes X (completing what A wanted to do)
then starts doing something else and writes Y. Then leader A "wakes up" and
re-executes the old operation writing X which is now stale.

perhaps if your external system supports conditional updates this can be
avoided - a write of X only works if the current state is as expected.

Alex

On Tue, Jan 5, 2016 at 1:00 AM, singh.janmejay <singh.janme...@gmail.com>
wrote:

> Thanks for the replies everyone, most of it was very useful.
>
> @Alexander: The section of the Chubby paper you pointed me to tries to
> address exactly what I was looking for. That clearly is one good way
> of doing it. I'm also thinking of an alternative way and can use a
> review or some feedback on that.
>
> @Powell: About x509 auth on intra-cluster communication, I don't have a
> blocking need for it, as it can be achieved by setting up firewall
> rules to accept only from desired hosts. It may be a good-to-have
> thing though, because in cloud-based scenarios where IP addresses are
> re-used, a recycled IP can still talk to a secure zk-cluster unless
> config is changed to remove the old peer IP and replace it with the
> new one. It's clearly a corner-case though.
>
> Here is the approach I'm thinking of:
> - Implement all operations (at least master-triggered operations) on
> operand machines idempotently
> - Have master journal these operations to ZK before issuing RPC
> - In case master fails with some of these operations in flight, the
> newly elected master will need to read all issued (but not retired
> yet) operations and issue them again.
> - Existing master(before failure or after failure) can retry and
> retire operations according to whatever the retry policy and
> success-criterion is.
>
> Why am I thinking of this as opposed to going with chubby sequencer
> passing:
> - I need to implement idempotency regardless, because recovery-path
> involving master-death after successful execution of operation but
> before writing ack to coordination service requires it. So idempotent
> implementation complexity is here to stay.
> - I need to increase surface-area of the architecture which is exposed
> to coordination-service for sequencer validation. Which may bring
> verification RPC in data-plane in some cases.
> - The sequencer may expire after verification but before ack, in which
> case new master may not recognize the operation as consistent with its
> decisions (or previous decision path).
>
> Thoughts? Suggestions?
>
>
>
> On Sun, Jan 3, 2016 at 2:18 PM, Alexander Shraer <shra...@gmail.com>
> wrote:
> > regarding atomic multi-znode updates -- check out "multi" updates
> > <
> http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html
> >
> > .
> >
> > On Sat, Jan 2, 2016 at 10:45 PM, Alexander Shraer <shra...@gmail.com>
> wrote:
> >
> >> for 1, see the chubby paper
> >> <
> http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf
> >,
> >> section 2.4.
> >> for 2, I'm not sure I fully understand the question, but essentially, ZK
> >> guarantees that even during failures
> >> consistency of updates is preserved. The user doesn't need to do
> anything
> >> in particular to guarantee this, even
> >> during leader failures. In such case, some suffix of operations executed
> >> by the leader may be lost if they weren't
> >> previously acked by a majority. However, none of these operations could
> >> have been visible
> >> to reads.
> >>
> >> On Fri, Jan 1, 2016 at 12:29 AM, powell molleti <
> >> powell...@yahoo.com.invalid> wrote:
> >>
> >>> Hi Janmejay,
> >>> Regarding question 1, if a node takes a lock and the lock has timed-out
> >>> from system perspective then it can mean that other nodes are free to
> take
> >>> the lock and work on the resource. Hence the history could be well
> into the
> >>> future when the previous node discovers the time-out. The question of
> >>> rollback in the specific context depends on the implementation
> details, is
> >>> the lock holder updating some common area?, then there could be
> corruption
> >>> since other nodes are free to write in parallel to the first one?. In
> the
> >>> usual sense a time-out of lock held means the node which held the lock
> is
> >>> dead. 
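
A minimal sketch of the conditional-update idea in ZK's own terms, assuming the shared state lives in a znode (the path is illustrative); the version check plays the role of "the current state is as expected":

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    class ConditionalWriteSketch {
        static boolean writeIfUnchanged(ZooKeeper zk, byte[] newValue)
                throws KeeperException, InterruptedException {
            Stat stat = new Stat();
            zk.getData("/task/state", false, stat);
            try {
                // succeeds only if nobody wrote since we read, so a stale
                // master's replay of an old value fails here
                zk.setData("/task/state", newValue, stat.getVersion());
                return true;
            } catch (KeeperException.BadVersionException e) {
                return false;  // a newer master has moved the state on
            }
        }
    }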

Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)

2016-01-03 Thread Alexander Shraer
regarding atomic multi-znode updates -- check out "multi" updates
<http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html>
.

On Sat, Jan 2, 2016 at 10:45 PM, Alexander Shraer <shra...@gmail.com> wrote:

> for 1, see the chubby paper
> <http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf>,
> section 2.4.
> for 2, I'm not sure I fully understand the question, but essentially, ZK
> guarantees that even during failures
> consistency of updates is preserved. The user doesn't need to do anything
> in particular to guarantee this, even
> during leader failures. In such case, some suffix of operations executed
> by the leader may be lost if they weren't
> previously acked by a majority. However, none of these operations could
> have been visible
> to reads.
>
> On Fri, Jan 1, 2016 at 12:29 AM, powell molleti <
> powell...@yahoo.com.invalid> wrote:
>
>> Hi Janmejay,
>> Regarding question 1, if a node takes a lock and the lock has timed-out
>> from system perspective then it can mean that other nodes are free to take
>> the lock and work on the resource. Hence the history could be well into the
>> future when the previous node discovers the time-out. The question of
>> rollback in the specific context depends on the implementation details, is
>> the lock holder updating some common area?, then there could be corruption
>> since other nodes are free to write in parallel to the first one?. In the
>> usual sense a time-out of lock held means the node which held the lock is
>> dead. It is up to the implementation to ensure this case and, using this
>> primitive, if there is a timeout which means other nodes are sure that no
>> one else is working on the resource and hence can move forward.
>> Question 2 seems to imply the assumption that leader has significant work
>> to do and leader change is quite common, which seems contrary to common
>> implementation pattern. If the work can be broken down into smaller chunks
>> which need serialization separately then each chunk/work type can have a
>> different leader.
>> For question 3, ZK does support auth and encryption for client
>> connections but not for inter ZK node channels. Do you have requirement to
>> secure inter ZK nodes, can you let us know what your requirements are so we
>> can implement a solution to fit all needs?.
>> For question 4, the official implementation is C; people tend to wrap that
>> with C++, and there should be projects that use ZK doing that. You can look them
>> up and see if you can separate it out and use them.
>> Hope this helps. Powell.
>>
>>
>>
>> On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <
>> edward.capri...@huffingtonpost.com> wrote:
>>
>>
>>  Q: What is the best way of handling distributed-lock expiry? The owner
>> of the lock managed to acquire it and may be in the middle of some
>> computation when the session or the lock expires
>>
>> If you are using Java a way I can see doing this is by using the
>> ExecutorCompletionService
>>
>> https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html
>> .
>> It allows you to keep your workers in a group. You can poll the group and
>> provide cancel semantics as needed.
>> An example of that service is here:
>>
>> https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java
>> where I am issuing multiple reads and I want to abandon the process if
>> they
>> do not finish in a while. Many async/promise frameworks do this by
>> launching two tasks, a ComputationTask and a TimeoutTask that returns in 10
>> seconds. Then they ask the completion service to poll. If the service is
>> given the TimeoutTask after the timeout, that means the Computation did not
>> finish in time.
>>
>> Do people generally take action in the middle of the computation (abort it and
>> do it in a clever way such that the effect appears atomic, so the abort is
>> not really
>> visible; if so, what are some of those clever ways)?
>>
>> The base issue is Java's synchronized/AtomicReference do not have a
>> rollback.
>>
>> There are a few ways I know to work around this. Clojure has STM (Software
>> Transactional Memory) such that if an exception is thrown inside a dosync,
>> all of the changes inside the critical block never happened. This assumes
>> you're using all Clojure structures, which you are probably not.
>> A way co-workers have done this is as follows. Move your entire
>> transactional s

Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)

2016-01-03 Thread Alexander Shraer
for 1, see the chubby paper
,
section 2.4.
for 2, I'm not sure I fully understand the question, but essentially, ZK
guarantees that even during failures
consistency of updates is preserved. The user doesn't need to do anything
in particular to guarantee this, even
during leader failures. In such case, some suffix of operations executed by
the leader may be lost if they weren't
previously acked by a majority. However, none of these operations could have
been visible
to reads.

On Fri, Jan 1, 2016 at 12:29 AM, powell molleti  wrote:

> Hi Janmejay,
> Regarding question 1, if a node takes a lock and the lock has timed-out
> from system perspective then it can mean that other nodes are free to take
> the lock and work on the resource. Hence the history could be well into the
> future when the previous node discovers the time-out. The question of
> rollback in the specific context depends on the implementation details, is
> the lock holder updating some common area?, then there could be corruption
> since other nodes are free to write in parallel to the first one?. In the
> usual sense a time-out of lock held means the node which held the lock is
> dead. It is up to the implementation to ensure this case and, using this
> primitive, if there is a timeout which means other nodes are sure that no
> one else is working on the resource and hence can move forward.
> Question 2 seems to imply the assumption that leader has significant work
> to do and leader change is quite common, which seems contrary to common
> implementation pattern. If the work can be broken down into smaller chunks
> which need serialization separately then each chunk/work type can have a
> different leader.
> For question 3, ZK does support auth and encryption for client connections
> but not for inter ZK node channels. Do you have requirement to secure inter
> ZK nodes, can you let us know what your requirements are so we can
> implement a solution to fit all needs?.
> For question 4, the official implementation is C; people tend to wrap that
> with C++, and there should be projects that use ZK doing that. You can look them
> up and see if you can separate it out and use them.
> Hope this helps. Powell.
>
>
>
> On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <
> edward.capri...@huffingtonpost.com> wrote:
>
>
>  Q: What is the best way of handling distributed-lock expiry? The owner
> of the lock managed to acquire it and may be in the middle of some
> computation when the session or the lock expires
>
> If you are using Java a way I can see doing this is by using the
> ExecutorCompletionService
>
> https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html
> .
> It allows you to keep your workers in a group. You can poll the group and
> provide cancel semantics as needed.
> An example of that service is here:
>
> https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java
> where I am issuing multiple reads and I want to abandon the process if they
> do not finish in a while. Many async/promise frameworks do this by
> launching two tasks, a ComputationTask and a TimeoutTask that returns in 10
> seconds. Then they ask the completion service to poll. If the service is
> given the TimeoutTask after the timeout, that means the Computation did not
> finish in time.
>
> Do people generally take action in the middle of the computation (abort it and
> do it in a clever way such that the effect appears atomic, so the abort is not really
> visible; if so, what are some of those clever ways)?
>
> The base issue is Java's synchronized/AtomicReference do not have a
> rollback.
>
> There are a few ways I know to work around this. Clojure has STM (Software
> Transactional Memory) such that if an exception is thrown inside a dosync,
> all of the changes inside the critical block never happened. This assumes
> you're using all Clojure structures, which you are probably not.
> A way co-workers have done this is as follows. Move your entire
> transactional state into a SINGLE big object that you can
> copy/mutate/compare and swap. You never need to roll back each piece because
> you're changing the clone up until the point you commit it.
> Writing reversal code is possible depending on the problem. There are
> questions to ask like "what if the reversal somehow fails?"
>
>
>
>
> On Thu, Dec 31, 2015 at 3:10 AM, singh.janmejay 
> wrote:
>
> > Hi,
> >
> > Was wondering if there are any reference designs, patterns on handling
> > common operations involving distributed coordination.
> >
> > I have a few questions and I guess they must have been asked before, I
> > am unsure what to search for to surface the right answers. It'll be
> > really valuable if someone can provide links to relevant
> > "best-practices guide" or "suggestions" per question 

Re: ZK + dynamic config + EC2

2015-11-22 Thread Alexander Shraer
Yes, unique ids require consensus (or an admin), but ZK gives you this. You
could use, for example, the sequential flag or alternatively a set
conditional on a version.
On Nov 22, 2015 3:34 PM, "Oboturov, Artem" <artem.obotu...@zalando.de>
wrote:

> There is one more thing to it: ZK has a byte range for server ID values -
> 255 in total - how could you allocate those to instances without having a
> central registry for IDs? It could be based on IPs, but it seems there are
> no other idempotent/reliable ways to get them assigned?
>
> On 21 November 2015 at 20:42, Alexander Shraer <shra...@gmail.com> wrote:
>
>> The only issue I see is that if the new server has the same id as the old
>> one you're replacing, I think you should first remove the old one and then
>> in a separate command add the new one. Intuitively this way you avoid
>> having the newly joining server act as someone who knows the current state
> >> of the system (which it doesn't), a situation that may cause you to lose
>> transactions.
>>
>> Notice that there are two reconfig interfaces  -- incremental and bulk
>> (see manual
>> <https://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html>).
>>
>> On Sat, Nov 21, 2015 at 1:16 PM, Oboturov, Artem <
>> artem.obotu...@zalando.de> wrote:
>>
>>> Hi
>>>
>>> I was looking at the ZK 3.5.x series new feature - dynamic configuration.
>>> As an example, we could have an EC2 auto scaling group for 3 ZK nodes.
>>> When
>>> one of them goes down, a new one would be spawned, but its IP could be
>>> different. We could query EC2 to get all instances from the group and
>>> generate
>>> a config for ZK to take all currently running servers as part of ZK
>>> cluster, and then run an update of cluster configs for all existing ones
>>> using dynamic config feature. Would this strategy work? Are there any
>>> alternatives?
>>>
>>> --
>>> Regards
>>> Artem Oboturov
>>>
>>
>>
>
>
> --
> Regards
> Artem Oboturov
>
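
A minimal sketch of the sequential-flag approach, assuming a pre-created /ids parent znode; the sequence suffix gives a unique number, and the caller still has to keep the result within ZK's 1-255 server-id range:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class IdAllocSketch {
        static int allocateServerId(ZooKeeper zk) throws Exception {
            // each create gets a unique, monotonically increasing suffix
            String path = zk.create("/ids/member-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
            // "/ids/member-0000000004" -> 4; +1 maps into the 1..255 range
            int seq = Integer.parseInt(path.substring(path.lastIndexOf('-') + 1));
            return seq + 1;
        }
    }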


Re: ZK + dynamic config + EC2

2015-11-21 Thread Alexander Shraer
The only issue I see is that if the new server has the same id as the old
one you're replacing, I think you should first remove the old one and then
in a separate command add the new one. Intuitively this way you avoid
having the newly joining server act as someone who knows the current state
of the system (which it doesn't), a situation that may cause you to lose
transactions.

Notice that there are two reconfig interfaces  -- incremental and bulk (see
manual ).

On Sat, Nov 21, 2015 at 1:16 PM, Oboturov, Artem 
wrote:

> Hi
>
> I was looking at the ZK 3.5.x series new feature - dynamic configuration.
> As an example, we could have an EC2 auto scaling group for 3 ZK nodes. When
> one of them goes down, a new one would be spawned, but its IP could be
> different. We could query EC2 to get all instances from the group and generate
> a config for ZK to take all currently running servers as part of ZK
> cluster, and then run an update of cluster configs for all existing ones
> using dynamic config feature. Would this strategy work? Are there any
> alternatives?
>
> --
> Regards
> Artem Oboturov
>
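
A minimal sketch of the remove-then-add sequence through the incremental interface, assuming the 3.5 ZooKeeperAdmin client; the id and addresses are illustrative:

    import org.apache.zookeeper.admin.ZooKeeperAdmin;
    import org.apache.zookeeper.data.Stat;

    class ReplaceServerSketch {
        static void replaceServer2(ZooKeeperAdmin admin) throws Exception {
            // step 1: drop the dead server's id from the ensemble
            admin.reconfigure(null, "2", null, -1, new Stat());
            // step 2: add the replacement (same id, new address) separately,
            // so it joins as a blank slate and syncs from the leader
            admin.reconfigure(
                "server.2=10.0.0.7:2888:3888:participant;10.0.0.7:2181",
                null, null, -1, new Stat());
        }
    }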


Re: Migrate Cluster

2015-10-01 Thread Alexander Shraer
assuming you're using the 3.4 release and doing reboots to add/remove servers,
option 2 doesn't seem safe. For example,
if you have servers A, B, C and you're adding D and E, note that it's
possible that C isn't fully up to date since A and B can make progress
without C's acks (2 out of 3). When you reboot all servers it's possible
that C forms a quorum with D and E (3 out of 5) and is elected leader.
Then, anything more recent A and B had will be lost.

If I'm not mistaken option 1 shouldn't have this problem since any quorum
after each step always intersects any quorum before the step.

If you're using 3.5 release you should be able to do the whole migration in
a single reconfiguration step.

On Tue, Sep 29, 2015 at 11:31 AM, snair 123  wrote:

> Hello Experts
> I need to migrate an existing 3-node ZK ensemble to a different set of
> servers in a different network, so servers A, B, C move to D, E, F. What is
> the best way to do that without affecting the service?
> This is what we have been planning:
> 1. Add one node from the new network
> 2. Join and sync up
> 3. Remove one from the old network
> 4. Do this for all the 3 nodes
> OR
> 1. Add 2 nodes in the new network
> 2. Join and sync up
> 3. Remove 2 nodes from the old network
> 4. Add the 3rd node in the new network
> 5. Make sure it is the leader (a restart maybe?)
> 6. Remove the last remaining node from the old cluster
> The second option I would hope would ensure the leader stays in the
> network with most nodes and avoids problems due to network partitions?
>
>
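
A minimal sketch of the single-step migration through the bulk (non-incremental) interface, assuming the 3.5 ZooKeeperAdmin client; the ids and addresses for D, E, F are illustrative:

    import org.apache.zookeeper.admin.ZooKeeperAdmin;
    import org.apache.zookeeper.data.Stat;

    class MigrateSketch {
        static void migrate(ZooKeeperAdmin admin) throws Exception {
            // the third argument is the complete new membership; the leader
            // commits the switch only once a quorum of the new config has synced
            admin.reconfigure(null, null,
                "server.4=D:2888:3888:participant;2181,"
                    + "server.5=E:2888:3888:participant;2181,"
                    + "server.6=F:2888:3888:participant;2181",
                -1, new Stat());
        }
    }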


Re: Zab Failure scenario

2015-09-28 Thread Alexander Shraer
A reconfiguration is treated similarly to other proposals for recovery
purposes (of course commit is different
in that it changes the configuration). You can see the paper

for details on how recovery works in principle, and if you have a specific
question please feel free to ask.

On Mon, Sep 28, 2015 at 10:54 AM, Ibrahim El-sanosi (PGR) <
i.s.el-san...@newcastle.ac.uk> wrote:

> Yes, I am thinking of mixing an in-flight reconfiguration request with the
> crashing servers example that you gave, not about how proposals, acks,
> commits (i.e.: ZAB proper) work.
>
> Thank you
>
> -Original Message-
> From: Raúl Gutiérrez Segalés [mailto:r...@itevenworks.net]
> Sent: Monday, September 28, 2015 02:56 ص
> To: user@zookeeper.apache.org
> Subject: Re: Zab Failure scenario
>
> On 27 September 2015 at 10:12, Ibrahim El-sanosi (PGR) <
> i.s.el-san...@newcastle.ac.uk> wrote:
>
> > Thank you Flavio for explanation. It really makes sense for me.
> >
> > > I'm not sure why you are assuming 3.4.6,  though. Why is it relevant
> > > for
> > this question?
> >
> > I am assuming 3.4.6 because first I use this version, second I do not
> > know about dynamic configuration 3.5.0 as it may have different
> > solution for mentioned scenario.
> >
>
> I don't think dynamic reconfiguration changes anything about how
> proposals, acks, commits (i.e.: ZAB proper) work. Unless you are thinking
> of mixing an in-flight reconfiguration request with the crashing servers
> example that you gave
>
>
> -rgs
>


Re: 3-server Zab cluster

2015-09-28 Thread Alexander Shraer
Committing locally when sending an ACK at a server would lead to loss of
consistency - it is possible that this is the only
server that acks, e.g., this server is temporarily disconnected from the
leader, the leader gets re-elected and the operation is truncated from logs
at other servers. It's ok to ACK it but it's not ok to commit, since this
exposes it to users as a committed operation that they can see.

On Mon, Sep 28, 2015 at 4:19 AM, Ibrahim El-sanosi (PGR) <
i.s.el-san...@newcastle.ac.uk> wrote:

> In Zab, assume we have a cluster consisting of 3 servers. To deliver a write
> request, it must run 3 communication steps: proposal, acknowledgement and
> commit.
> As Zab uses reliable FIFO, it is possible to remove the commit round. As soon
> as a follower receives a proposal, it logs, sends an ACK and commits
> locally. Upon receiving an ACK from any follower, the leader commits a proposal
> locally; no COMMIT message needs to be sent to followers. In this case, all
> servers commit a proposal in two round-trips, reducing latency,
> particularly at followers.
>
> Note that this optimization can only work in a 3-server cluster (a follower
> reaches a majority as soon as it acks).
> Does anyone see any problems with such (small) optimization?
> Ibrahim
>


Re: [ANNOUNCE] New committer: Chris Nauroth

2015-09-28 Thread Alexander Shraer
Congrats Chris, and welcome!

On Mon, Sep 28, 2015 at 9:52 AM, Rakesh Radhakrishnan <
rakeshr.apa...@gmail.com> wrote:

> Welcome Chris, thanks for all your great work and congrats!
>
> -Rakesh
>
> On Mon, Sep 28, 2015 at 8:11 PM, Flavio Junqueira  wrote:
>
> > The Apache ZooKeeper PMC is pleased to announce that Chris Nauroth has
> > accepted to become a committer. Chris has been a great contributor and
> very
> > active in the community.
> >
> > Congrats, Chris!
> >
> > -Flavio
>


Re: 3-server Zab cluster

2015-09-28 Thread Alexander Shraer
I'm not 100% sure whether operations that were pending on the leader are
sent out during sync when this leader loses quorum and is re-elected. If so,
then maybe you're right. But in any case, this would not work for 5 or more
servers...

On Mon, Sep 28, 2015 at 3:51 PM, Ibrahim El-sanosi (PGR) <
i.s.el-san...@newcastle.ac.uk> wrote:

> Thank you Alex for replying.
>
> When you said "the leader gets re-elected and the operation is truncated
> from logs at other servers", I thought the new leader would sync its logs
> with the other followers (synchronization phase), resulting in the
> operation being committed by the new quorum.  Let me write the scenario as
> steps:
>
> 1. Leader (L) sends a proposal P with zxid = 10 to F1 and F2.
> 2. F1 logs it, sends an ACK, commits, replies to clients and crashes. F2
> crashes before receiving P10. L has not received any ACKs.
>
> Possible solution (1)
> The leader will move to the LOOKING phase as there is no quorum supporting
> its leadership. Now assume F2 wakes up. F2 forms a quorum with L (the
> previous leader); L becomes the new leader again as it has the latest zxid
> (10) in its log. L syncs its state with F2; as a result L, F1 (before
> crashing) and F2 commit P10.  Is that correct?
>
> Possible solution (2)
> The leader will move to the LOOKING phase as there is no quorum supporting
> its leadership. Now assume F1 (with zxid = 10 committed) wakes up. I am not
> sure who should be the leader (F1 with zxid = 10 committed, or L (the
> previous leader) with zxid = 10 logged); I think F1 becomes the new leader
> as it has zxid = 10 committed. F1 forms a quorum with L (the previous
> leader) and becomes the new leader as it has the latest zxid (10). F1 (the
> new leader) syncs its state with L (the previous leader, now a follower);
> as a result zxid 10 is committed by the new quorum.  Is that correct?
>
> What do you think?
>
> Ibrahim
>
>
>
>
>
> -Original Message-
> From: Alexander Shraer [mailto:shra...@gmail.com]
> Sent: Monday, September 28, 2015 07:27 PM
> To: user@zookeeper.apache.org
> Cc: d...@zookeeper.apache.org
> Subject: Re: 3-server Zab cluster
>
> Committing locally when sending an ACK at a server would lead to loss of
> consistency - it is possible that this is the only server that acks, e.g.,
> this server is temporarily disconnected from the leader, the leader gets
> re-elected and the operation is truncated from the logs at other servers.
> It's OK to ACK it, but it's not OK to commit, since this exposes it to
> users as a committed operation that they can see.
>
> On Mon, Sep 28, 2015 at 4:19 AM, Ibrahim El-sanosi (PGR) <
> i.s.el-san...@newcastle.ac.uk> wrote:
>
> > In Zab, assume we have a cluster consisting of 3 servers. To deliver a
> > write request, it must run 3 communication steps: proposal,
> > acknowledgement and commit.
> > As Zab uses reliable FIFO, it is possible to remove the commit round. As
> > soon as a follower receives a proposal, it logs it, sends an ACK and
> > commits locally. Upon receiving an ACK from any follower, the leader
> > commits the proposal locally; no COMMIT message needs to be sent to
> > followers. In this case, all servers commit a proposal in two
> > round-trips, reducing latency, particularly at followers.
> >
> > Note that this optimization can only work in a 3-server cluster
> > (a follower reaches a majority as soon as it acks).
> > Does anyone see any problems with such a (small) optimization?
> > Ibrahim
> >
>


Re: Uninvited ZK joins the cluster

2015-09-09 Thread Alexander Shraer
Hi,

There were some thoughts to send and check the database id (if I'm not
mistaken it's called dbid) when servers connect to each other, which should
be different for different ZooKeeper ensembles. It shouldn't be difficult to
add, if you'd like to work on it.

Alex

On Wed, Sep 9, 2015 at 11:04 AM, Benjamin Jaton 
wrote:

> Hi,
>
> First I built a 3-node cluster (A,B,C) with zk.reconfig commands.
>
> Now we stop all the ZK servers, and we make a new cluster (A,B,D) configured
> with the same ports.
>
> In that scenario, if you start C, it will join the ensemble, becoming
> (A,B,C,D).
> The problem is that it's not the same ensemble, C shouldn't have been
> allowed to join the new ensemble.
>
> Is there a way to prevent this from happening? (even if it's hacky)
>
> Thanks,
> Ben
>


Re: Doubts about libzookeeper

2015-08-04 Thread Alexander Shraer
Maybe 1 or 2 synctimes is enough, given what you said about syncs - after 1
synctime
we know that either server1 disconnected (and will have to bootstrap its
state from the leader
if it ever reconnects) or the request got to the leader. But since synctime
may not be measured
exactly from our request submission, it may be that 2 synctimes are needed.
Would need to look
deeper into pings and synctime to tell for sure.

On Tue, Aug 4, 2015 at 2:05 PM, Camille Fournier cami...@apache.org wrote:

 That's true. I spent some time trying to think about when and how that
 would be possible, and didn't get very far. We have guarantees about how
 far out of sync a quorum member can be before it's booted, so I would think
 that there's some way to timebound this potentially to prevent it, a la
 your suggestion about 3X synctime.

 C


 On Tue, Aug 4, 2015 at 4:58 PM, Alexander Shraer shra...@gmail.com
 wrote:

  Yes, I checked and you're right. It gets queued at the leader until all
  previously proposed requests at the leader
  are committed. But still if the request is only on its way between
 server 1
  and the leader sync won't immediately help, right ?
 
 
  On Tue, Aug 4, 2015 at 11:39 AM, Camille Fournier cami...@apache.org
  wrote:
 
   I thought that sync forced a flush of the queued events on a quorum
  member
   before completing/got it in the path of events from the leader, so that
  it
   won't return until all of the pending leader events before it have been
   seen by this quorum member. Is that not correct?
  
   On Tue, Aug 4, 2015 at 2:20 PM, Alexander Shraer shra...@gmail.com
   wrote:
  
It seems that since the delete may be in-flight (between server 1 and
the leader, or still being proposed by the leader) when the client connects
to server 2, doing a sync right away may not help
since the operation hasn't been committed yet. Perhaps the client should
wait some multiple of synclimit time (3x ?) before invoking the sync to
allow the delete to commit or disappear for sure. This is all related to
https://issues.apache.org/jira/browse/ZOOKEEPER-22, which is still open
unfortunately...
   
On Tue, Aug 4, 2015 at 10:15 AM, Camille Fournier 
 cami...@apache.org
wrote:
   
 True, I'm not sure when the xid increments. If that is the case,
 you
   can
 force a sync before the read of the path, to prevent reading stale
   data.
So
 that would be the solve for that edge case although it's an
 expensive
 solve.

 C

 On Tue, Aug 4, 2015 at 12:52 PM, Alexander Shraer 
 shra...@gmail.com
  
 wrote:

  Hi Camille,
 
  if the client received a response for the delete then sure it
   shouldn't
 be
  able to connect
  to servers that didn't see it. But if it disconnected before
 seeing
   the
  response the example seems possible to me.
  I haven't checked the code to see when exactly the transaction
  number
is
  incremented at
  the client, so I may be wrong, but suppose for example that
   zkserver-1
  crashes before
  sending the delete request to the leader. Then, the request is
 gone
  forever. If you don't let the client
  connect to another server that hasn't seen the delete, the client
   will
  never be able to connect.
  So it seems quite possible that it connects, then the request is
executed
  (if zkserver-1 hasn't crashed
  after all) and the znode disappears.
 
  Alex
 
 
  On Tue, Aug 4, 2015 at 8:33 AM, Camille Fournier 
  cami...@apache.org
   
  wrote:
 
   ZooKeeper provides a session-coherent single system image
   guarantee.
 Any
   request from the same session will see the results of all of
 its
 writes,
   regardless of which server it connects to. See:
  
  
 

   
  
 
 http://zookeeper.apache.org/doc/r3.4.6/zookeeperProgrammers.html#ch_zkGuarantees
  
   So, if your session deletes, and the delete is successfully
   processed
 by
   the quorum, you will not see the path that you have deleted no
   matter
  what
   server your session connects to. I believe in practice that
 this
means
  that
   the ZK servers that might be behind your session (say server 2
 is
 lagging
   behind a few commits) will refuse to allow your session to
  connect
   to
 it,
   so that you will not see stale data.
  
   This means that the example Lokesh gave:
  
   1. Quorum leader has forwarded request to zkserver-2 for
 delete
 /path.
   2. If your client connects to zkserver-2 after step 1 is
  executed
 (get
   /path). Then your /path will not be available.
   3. If your client connects to zkserver-2 before step1 is
  executed
 (get
   /path) then your /path would be available and after some time
   your
 path
   would not be available (after zkserver-2 is synched

Re: Doubts about libzookeeper

2015-08-04 Thread Alexander Shraer
Hi Camille,

if the client received a response for the delete then sure it shouldn't be
able to connect
to servers that didn't see it. But if it disconnected before seeing the
response the example seems possible to me.
I haven't checked the code to see when exactly the transaction number is
incremented at
the client, so I may be wrong, but suppose for example that zkserver-1
crashes before
sending the delete request to the leader. Then, the request is gone
forever. If you don't let the client
connect to another server that hasn't seen the delete, the client will
never be able to connect.
So it seems quite possible that it connects, then the request is executed
(if zkserver-1 hasn't crashed
after all) and the znode disappears.

Alex


On Tue, Aug 4, 2015 at 8:33 AM, Camille Fournier cami...@apache.org wrote:

 ZooKeeper provides a session-coherent single system image guarantee. Any
 request from the same session will see the results of all of its writes,
 regardless of which server it connects to. See:

 http://zookeeper.apache.org/doc/r3.4.6/zookeeperProgrammers.html#ch_zkGuarantees

 So, if your session deletes, and the delete is successfully processed by
 the quorum, you will not see the path that you have deleted no matter what
 server your session connects to. I believe in practice that this means that
 the ZK servers that might be behind your session (say server 2 is lagging
 behind a few commits) will refuse to allow your session to connect to it,
 so that you will not see stale data.

 This means that the example Lokesh gave:

 1. Quorum leader has forwarded request to zkserver-2 for delete /path.
 2. If your client connects to zkserver-2 after step 1 is executed (get
 /path). Then your /path will not be available.
 3. If your client connects to zkserver-2 before step1 is executed (get
 /path) then your /path would be available and after some time your path
 would not be available (after zkserver-2 is synched with the leader)

 Cannot happen, so long as you are in the same session.

 C

 On Tue, Aug 4, 2015 at 6:49 AM, Lokesh Shrivastava 
 lokesh.shrivast...@gmail.com wrote:

  I think it depends on whether your request reaches zkserver-1 and whether
  it is able to send the request to quorum leader. Considering that delete
  /path request has reached the quorum leader then following may happen
 
  1. Quorum leader has forwarded request to zkserver-2 for delete /path.
  2. If your client connects to zkserver-2 after step 1 is executed (get
  /path). Then your /path will not be available.
  3. If your client connects to zkserver-2 before step1 is executed (get
  /path) then your /path would be available and after some time your path
  would not be available (after zkserver-2 is synched with the leader)
 
  Others can correct me if this is not how it works.
 
  Thanks.
  Lokesh
 
  On 4 August 2015 at 12:09, liangdon...@baidu.com liangdon...@baidu.com
  wrote:
 
   Hi,
   I'm thinking about a program design with libzookeeper; here are my
   doubts:

   1) First, I connect to zkserver-1, and there exists the path /path.
   2) I send delete /path; the request reaches (may not, I don't know
   about that) zkserver-1, I don't know whether this took effect, and then
   I lost the connection before the response returned.
   3) I reconnect the same session to zkserver-2, and I send get /path.

   Which one will the get /path possibly return:
   1, not exists
   2, exists and always exists
   3, exists and not exists afterwards

   My biggest problem is whether 3) will occur or not, thanks!
  
  
  
  
   liangdon...@baidu.com
  
 



Re: Doubts about libzookeeper

2015-08-04 Thread Alexander Shraer
It seems that since the delete may be in-flight (between server 1 and
leader, or still being proposed by the leader)
when the client connects to server 2, doing a sync right away may not help
since the operation hasn't been committed yet. Perhaps the client should
wait some multiple of synclimit time (3x ?) before invoking the sync to
allow the delete to commit or disappear for sure. This is all related to
https://issues.apache.org/jira/browse/ZOOKEEPER-22, which is still open
unfortunately...
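
For reference, the sync-before-read pattern being discussed looks roughly
like this in the Java client (a sketch; the path is hypothetical, rc error
handling is omitted, and as noted above it only helps once the delete has
actually reached the leader):

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.ZooKeeper;

    static byte[] syncThenRead(ZooKeeper zk, String path) throws Exception {
        // sync is asynchronous-only in the Java API; block until it completes
        CountDownLatch synced = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
        synced.await();
        // this read reflects everything the leader had committed when the
        // sync was processed
        return zk.getData(path, false, null);
    }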

On Tue, Aug 4, 2015 at 10:15 AM, Camille Fournier cami...@apache.org
wrote:

 True, I'm not sure when the xid increments. If that is the case, you can
 force a sync before the read of the path, to prevent reading stale data. So
 that would be the solve for that edge case although it's an expensive
 solve.

 C

 On Tue, Aug 4, 2015 at 12:52 PM, Alexander Shraer shra...@gmail.com
 wrote:

  Hi Camille,
 
  if the client received a response for the delete then sure it shouldn't
 be
  able to connect
  to servers that didn't see it. But if it disconnected before seeing the
  response the example seems possible to me.
  I haven't checked the code to see when exactly the transaction number is
  incremented at
  the client, so I may be wrong, but suppose for example that zkserver-1
  crashes before
  sending the delete request to the leader. Then, the request is gone
  forever. If you don't let the client
  connect to another server that hasn't seen the delete, the client will
  never be able to connect.
  So it seems quite possible that it connects, then the request is executed
  (if zkserver-1 hasn't crashed
  after all) and the znode disappears.
 
  Alex
 
 
  On Tue, Aug 4, 2015 at 8:33 AM, Camille Fournier cami...@apache.org
  wrote:
 
   ZooKeeper provides a session-coherent single system image guarantee.
 Any
   request from the same session will see the results of all of its
 writes,
   regardless of which server it connects to. See:
  
  
 
 http://zookeeper.apache.org/doc/r3.4.6/zookeeperProgrammers.html#ch_zkGuarantees
  
   So, if your session deletes, and the delete is successfully processed
 by
   the quorum, you will not see the path that you have deleted no matter
  what
   server your session connects to. I believe in practice that this means
  that
   the ZK servers that might be behind your session (say server 2 is
 lagging
   behind a few commits) will refuse to allow your session to connect to
 it,
   so that you will not see stale data.
  
   This means that the example Lokesh gave:
  
   1. Quorum leader has forwarded request to zkserver-2 for delete
 /path.
   2. If your client connects to zkserver-2 after step 1 is executed
 (get
   /path). Then your /path will not be available.
   3. If your client connects to zkserver-2 before step1 is executed
 (get
   /path) then your /path would be available and after some time your
 path
   would not be available (after zkserver-2 is synched with the leader)
  
   Cannot happen, so long as you are in the same session.
  
   C
  
   On Tue, Aug 4, 2015 at 6:49 AM, Lokesh Shrivastava 
   lokesh.shrivast...@gmail.com wrote:
  
I think it depends on whether your request reaches zkserver-1 and
  whether
it is able to send the request to quorum leader. Considering that
  delete
/path request has reached the quorum leader then following may
 happen
   
1. Quorum leader has forwarded request to zkserver-2 for delete
  /path.
2. If your client connects to zkserver-2 after step 1 is executed
  (get
/path). Then your /path will not be available.
3. If your client connects to zkserver-2 before step1 is executed
  (get
/path) then your /path would be available and after some time your
  path
would not be available (after zkserver-2 is synched with the leader)
   
Others can correct me if this is not how it works.
   
Thanks.
Lokesh
   
On 4 August 2015 at 12:09, liangdon...@baidu.com 
  liangdon...@baidu.com
wrote:
   
 Hi,
 I'm thinking about a program design with libzookeeper; here are my
 doubts:

 1) First, I connect to zkserver-1, and there exists the path /path.
 2) I send delete /path; the request reaches (may not, I don't know
 about that) zkserver-1, I don't know whether this took effect, and then
 I lost the connection before the response returned.
 3) I reconnect the same session to zkserver-2, and I send get /path.

 Which one will the get /path possibly return:
 1, not exists
 2, exists and always exists
 3, exists and not exists afterwards

 My biggest problem is whether 3) will occur or not, thanks!




 liangdon...@baidu.com

   
  
 



Re: Doubts about libzookeeper

2015-08-04 Thread Alexander Shraer
Yes, I checked and you're right. It gets queued at the leader until all
previously proposed requests at the leader
are committed. But still, if the request is only on its way between server 1
and the leader, sync won't immediately help, right?


On Tue, Aug 4, 2015 at 11:39 AM, Camille Fournier cami...@apache.org
wrote:

 I thought that sync forced a flush of the queued events on a quorum member
 before completing/got it in the path of events from the leader, so that it
 won't return until all of the pending leader events before it have been
 seen by this quorum member. Is that not correct?

 On Tue, Aug 4, 2015 at 2:20 PM, Alexander Shraer shra...@gmail.com
 wrote:

  It seems that since the delete may be in-flight (between server 1 and
  leader, or still being proposed by the leader)
  when the client connects to server 2, doing a sync right away may not help
  since the operation hasn't been committed yet. Perhaps the client should
  wait some multiple of synclimit time (3x ?) before invoking the sync to
  allow the delete to commit or disappear for sure. This is all related to
  https://issues.apache.org/jira/browse/ZOOKEEPER-22, which is still open
  unfortunately...
 
  On Tue, Aug 4, 2015 at 10:15 AM, Camille Fournier cami...@apache.org
  wrote:
 
   True, I'm not sure when the xid increments. If that is the case, you
 can
   force a sync before the read of the path, to prevent reading stale
 data.
  So
   that would be the solve for that edge case although it's an expensive
   solve.
  
   C
  
   On Tue, Aug 4, 2015 at 12:52 PM, Alexander Shraer shra...@gmail.com
   wrote:
  
Hi Camille,
   
if the client received a response for the delete then sure it
 shouldn't
   be
able to connect
to servers that didn't see it. But if it disconnected before seeing
 the
response the example seems possible to me.
I haven't checked the code to see when exactly the transaction number
  is
incremented at
the client, so I may be wrong, but suppose for example that
 zkserver-1
crashes before
sending the delete request to the leader. Then, the request is gone
forever. If you don't let the client
connect to another server that hasn't seen the delete, the client
 will
never be able to connect.
So it seems quite possible that it connects, then the request is
  executed
(if zkserver-1 hasn't crashed
after all) and the znode disappears.
   
Alex
   
   
On Tue, Aug 4, 2015 at 8:33 AM, Camille Fournier cami...@apache.org
 
wrote:
   
 ZooKeeper provides a session-coherent single system image
 guarantee.
   Any
 request from the same session will see the results of all of its
   writes,
 regardless of which server it connects to. See:


   
  
 
 http://zookeeper.apache.org/doc/r3.4.6/zookeeperProgrammers.html#ch_zkGuarantees

 So, if your session deletes, and the delete is successfully
 processed
   by
 the quorum, you will not see the path that you have deleted no
 matter
what
 server your session connects to. I believe in practice that this
  means
that
 the ZK servers that might be behind your session (say server 2 is
   lagging
 behind a few commits) will refuse to allow your session to connect
 to
   it,
 so that you will not see stale data.

 This means that the example Lokesh gave:

 1. Quorum leader has forwarded request to zkserver-2 for delete
   /path.
 2. If your client connects to zkserver-2 after step 1 is executed
   (get
 /path). Then your /path will not be available.
 3. If your client connects to zkserver-2 before step1 is executed
   (get
 /path) then your /path would be available and after some time
 your
   path
 would not be available (after zkserver-2 is synched with the
 leader)

 Cannot happen, so long as you are in the same session.

 C

 On Tue, Aug 4, 2015 at 6:49 AM, Lokesh Shrivastava 
 lokesh.shrivast...@gmail.com wrote:

  I think it depends on whether your request reaches zkserver-1 and
whether
  it is able to send the request to quorum leader. Considering that
delete
  /path request has reached the quorum leader then following may
   happen
 
  1. Quorum leader has forwarded request to zkserver-2 for delete
/path.
  2. If your client connects to zkserver-2 after step 1 is
 executed
(get
  /path). Then your /path will not be available.
  3. If your client connects to zkserver-2 before step1 is
 executed
(get
  /path) then your /path would be available and after some time
  your
path
  would not be available (after zkserver-2 is synched with the
  leader)
 
  Others can correct me if this is not how it works.
 
  Thanks.
  Lokesh
 
  On 4 August 2015 at 12:09, liangdon...@baidu.com 
liangdon...@baidu.com
  wrote:
 
   Hi,
I'm thinking about a program design with libzookeeper

Re: Doubts about libzookeeper

2015-08-04 Thread Alexander Shraer
 just do it again once reconnected

right, the whole discussion is unnecessarily complex for a delete op  :)

On Tue, Aug 4, 2015 at 2:29 PM, Flavio Junqueira f...@apache.org wrote:

 Touché!

 -Flavio

  On 04 Aug 2015, at 22:21, Jordan Zimmerman jor...@jordanzimmerman.com
 wrote:
 
  If the client isn't sure that the delete has gone through, just do it
 again once reconnected (to server 2 in the scenario described). Whatever
 response you get for the delete should determine what you need to do.
 
 
   FYI - this is why Curator has "guaranteed deletes". So many recipes
  depend on the delete succeeding.
 
  -JZ
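
A rough sketch of that retry-until-confirmed idea in the Java client (the
shape of the pattern, not Curator's actual implementation):

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    static void deleteUntilConfirmed(ZooKeeper zk, String path)
            throws KeeperException, InterruptedException {
        while (true) {
            try {
                zk.delete(path, -1);  // -1 matches any version
                return;               // the server confirmed the delete
            } catch (KeeperException.NoNodeException e) {
                return;               // already gone - an earlier attempt won
            } catch (KeeperException.ConnectionLossException e) {
                Thread.sleep(100);    // outcome unknown; retry once reconnected
            }
        }
    }

This works because delete becomes idempotent once NoNode is treated as
success; a session expiry still surfaces as a KeeperException for the
caller to handle.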
 




Re: starting a ZK cluster one node at a time

2015-07-24 Thread Alexander Shraer
Hi,

When you're adding a node, its config file should contain the current set of
servers + itself. This will allow it to boot and connect to the cluster
(once it does, its config file is overwritten automatically with the latest
config of the cluster, which doesn't include the new node). Then, you
should execute reconfig to logically add it to the cluster. This will add
it to the config files at all servers. If you don't do this, the new node
will not have voting rights. For example, with 2 servers, when you're adding
a 3rd: if you don't run a reconfig command, then even though you have 3
servers, any failure of the original two servers will make your
ensemble unavailable. You can read the reconfig manual for details.


Cheers,
Alex

On Fri, Jul 24, 2015 at 3:53 PM, Emmanuel ele...@msn.com wrote:

 Hello,
 I am setting up ZK in Docker. One of the issues is that I don't know on what
 host a node will be deployed, and what its IP will be, so I need to
 configure dynamically.
 With 3.5.0 it seems like I can update the quorum dynamically. Just wanted
 to confirm this flow would work:
 - start a container and configure the ZK node with its IP/ports in the
   dynamic config file, then start the zk node
 - start a second, linked container, configure it with the first node's and
   the new node's info, then start the second zk node
 - start a 3rd container, configure it with the previous nodes' info and
   start the zk server
 Question: will the first node update its config when the second/third node
 joins? Or does it need to receive some kind of signal (i.e. run reconfig or
 some other command)?
 I haven't tried it, but before I spend the time writing this, I'd like to
 confirm it's possible. Will the config file alone get the ZK nodes to find
 each other, or do I need to run the "reconfig -file newconfig.cfg" or
 "reconfig -members server.1=125.23.63.23:2780
 :2783:participant;2791,server.2=125.23.63.24:2781
 :2784:participant;2792,server.3=125.23.63.25:2782:2785:participant;2793"
 type of commands each time I add a node?
 I hope this is a straightforward question to answer, or that there is a
 'recommended' way to proceed when the IP of the node is only known at
 launch.
 Thanks for the help.



Re: starting a ZK cluster one node at a time

2015-07-24 Thread Alexander Shraer
yes, see examples here:
http://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html

 reconfig -add server.5=125.23.63.23:1234:1235;1236

On Fri, Jul 24, 2015 at 4:08 PM, Emmanuel ele...@msn.com wrote:

 Thanks Alex

 Reconfigure is something I can do through command line, right?


 Emmanuel



  Original message 
 From: Alexander Shraer shra...@gmail.com
 Date:07/24/2015  4:04 PM  (GMT-08:00)
 To: user@zookeeper.apache.org
 Subject: Re: starting a ZK cluster one node at a time

 Hi,

 When you're adding a node, its config file should contain the current set of
 servers + itself. This will allow it to boot and connect to the cluster
 (once it does, its config file is overwritten automatically with the latest
 config of the cluster, which doesn't include the new node). Then, you
 should execute reconfig to logically add it to the cluster. This will add
 it to the config files at all servers. If you don't do this, the new node
 will not have voting rights. For example, with 2 servers, when you're adding
 a 3rd: if you don't run a reconfig command, then even though you have 3
 servers, any failure of the original two servers will make your
 ensemble unavailable. You can read the reconfig manual for details.


 Cheers,
 Alex

 On Fri, Jul 24, 2015 at 3:53 PM, Emmanuel ele...@msn.com wrote:

  Hello,
  I am setting up ZK in docker. One of the issue is I don't know on what
  host the node will be deployed, and what the IP will be, so I need to
  configure dynamically.
  with 3.5.0 it seems like i can update the quorum dynamically. Just wanted
  to confirm this flow would work:
  - start container and configure the ZK node with its IP/ports in the
  dynamic config file - start zk node- start a second container linked
  container, configure with the first node and the new node's info - start
  second zk node. - start 3rd container, configure with previous nodes info
  and start zk server.
  question: = will the first node update its config when the second
  node/third node join? or does it need to receive some kind of signal
 (i.e.
  run reconfig or some other command?)
  I haven't tried but before I spent the time writing this, i'd like to
  confirm it's possible.Will the config file alone get the ZK nodes to find
  each other, or do I need to run the  reconfig -file newconfig.cfg or 
  reconfig -members server.1=125.23.63.23:2780
  :2783:participant;2791,server.2=125.23.63.24:2781
  :2784:participant;2792,server.3=125.23.63.25:2782:2785:participant;2793
  type of commands each time i add a node?
  I hope this is a straight forward question to answer, or that there is a
  'recommended' way to proceed when the IP of the node is only known at
  launch.
  Thanks for help.
 



Re: new paper on optimizing replication config

2015-07-15 Thread Alexander Shraer
thanks Edward!!

On Wed, Jul 15, 2015 at 6:06 PM, Edward Ribeiro edward.ribe...@gmail.com
wrote:

 Congratulations, Alex! :)

 VLDB is a top notch conference, and the subject is very interesting, so
 your paper certainly deserves a closer look. Thanks for sharing!

 Edward
 Em 15/07/2015 21:06, Alexander Shraer shra...@gmail.com escreveu:

  Our paper http://www.cs.technion.ac.il/~shralex/p2309-shraer.pdf on
  optimizing the configuration of distributed storage was recently accepted
  to the International Conference on Very Large Databases (VLDB). It
  basically shows that reconfiguration can be used to significantly improve
  latency.
 
  It isn't directly related to ZooKeeper, but I thought that it may
 interest
  some people on this list.
 
  Best Regards,
  Alex
 



Re: locking/leader election and dealing with session loss

2015-07-15 Thread Alexander Shraer
+1 to what Camille is saying and the suggestion to use generations

On Wed, Jul 15, 2015 at 12:04 PM, Camille Fournier skami...@gmail.com
wrote:

 If client A does a full GC, long enough to lose the lock, immediately
 before sending a message, it will send the message out of order. You
 cannot guarantee exclusive access without verification at the locked
 resource.

 C
 On Jul 15, 2015 3:02 PM, Jordan Zimmerman jor...@jordanzimmerman.com
 wrote:

  I don’t see how there’s a chance of multiple writers. Assuming a
  reasonable session timeout:
 
  * Client A gets the lock
  * Client B watches Client A’s lock node
  * Client A gets a network partition
  * Client A will get a SysDisconnected before the session times out
  * Client A must immediately assume it no longer has the lock
  * Client A’s session times out
  * Client A’s ephemeral node is deleted
  * Client B’s watch fires
  * Client B takes the lock
  * Client A reconnects and gets SESSION_EXPIRED
 
  Where’s the problem? This is how everyone uses ZooKeeper. There is 0
  chance of multiple writers in this scenario.
 
 
 
  On July 15, 2015 at 1:56:37 PM, Vikas Mehta (vikasme...@gmail.com)
 wrote:
 
  Camille, I don't have a central message store/processor that can
 guarantee
  single writer (if I had one, it would reduce (still useful in reducing
 lock
  contention, etc) the need/value of using zookeeper) and hence I am trying
  to
  minimize the chances of multiple writers (more or less trying to
 guarantee
  this) while maximizing availability (not trying to solve CAP theorem), by
  solving some specific issues that affect availability.
 
 
 
  --
  View this message in context:
 
 http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581284.html
  Sent from the zookeeper-user mailing list archive at Nabble.com.
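
One way to get the generations Camille is referring to out of plain
ZooKeeper is to use the lock znode's creation zxid as a fencing token that
the protected resource checks on every write. A minimal sketch, assuming a
hypothetical /locks/resource path and omitting the contention handling,
watches and null checks a real lock recipe needs:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    static long acquireWithFence(ZooKeeper zk) throws Exception {
        // take the lock; this throws NodeExists if someone else holds it
        zk.create("/locks/resource", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        Stat stat = zk.exists("/locks/resource", false);
        // czxid strictly increases across successive holders of this lock,
        // so the resource can reject writes carrying a stale token
        return stat.getCzxid();
    }

The resource side remembers the largest token it has accepted and rejects
anything smaller, which is what makes a late message from a GC-paused
ex-holder harmless.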
 



Re: locking/leader election and dealing with session loss

2015-07-15 Thread Alexander Shraer
This property may hold if you make a lot of timing/synchrony assumptions --
agreeing on who holds the lock in an asynchronous distributed system with
failures is impossible; this is the FLP impossibility result.

But even if it holds, this property is not very useful if the ZK client
itself doesn't have the application data. So one has to consider whether it
is possible that the application sees a messages from two clients that both
think are the leader in an order which contradicts the lock acquisition
order.

On Wed, Jul 15, 2015 at 1:26 PM, Jordan Zimmerman 
jor...@jordanzimmerman.com wrote:

 I think we may be talking past each other here. My contention (and the ZK
 docs agree BTW) is that, properly written and configured, “at any
 snapshot in time no two clients think they hold the same lock”. How your
 application acts on that fact is another thing. You might need sequence
 numbers, you might not.

 -Jordan


 On July 15, 2015 at 3:15:16 PM, Alexander Shraer (shra...@gmail.com)
 wrote:

 Jordan, as Camille suggested, please read Sec 2.4 in the Chubby paper:
 link
 
 http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf


 it suggests 2 ways in which the storage can support lock generations and
 proposes an alternative for the case where the storage can't be made aware
 of lock generations.

 On Wed, Jul 15, 2015 at 1:08 PM, Jordan Zimmerman 
 jor...@jordanzimmerman.com wrote:

  Ivan, I just read the blog and I still don’t see how this can happen.
  Sorry if I’m being dense. I’d appreciate a discussion on this. In your
 blog
  you state: “when ZooKeeper tells you that you are leader, there’s no
  guarantee that there isn’t another node that 'thinks' it’s the leader.”
  However, given a long enough session time — I usually recommend 30–60
  seconds, I don’t see how this can happen. The client itself determines
 that
  there is a network partition when there is no heartbeat success. The
  heartbeat is a fraction of the session timeout. Once the heartbeat
 fails,
  the client must assume it no longer has the lock. Another client cannot
  take over the lock until, at minimum, session timeout. So, how then can
  there be two leaders?
 
  -Jordan
 
  On July 15, 2015 at 2:23:12 PM, Ivan Kelly (iv...@apache.org) wrote:
 
  I blogged about this exact problem a couple of weeks ago [1]. I give an
  example of how split brain can happen in a resource under a zk lock
 (Hbase
  in this case). As Camille says, sequence numbers ftw. I'll add that the
  data store has to support them though, which not all do (in fact I've
 yet
  to see one in the wild that does). I've implemented a prototype that
 works
  with hbase[2] if you want to see what it looks like.
 
  -Ivan
 
  [1]
 
 
 https://medium.com/@ivankelly/reliable-table-writer-locks-for-hbase-731024295215
  [2] https://github.com/ivankelly/hbase-exclusive-writer
 
  On Wed, Jul 15, 2015 at 9:16 PM Vikas Mehta vikasme...@gmail.com
 wrote:
 
   Jordan, I mean the client gives up the lock and stops working on the
  shared
   resource. So when zookeeper is unavailable, no one is working on any
  shared
   resource (because they cannot distinguish network partition from
  zookeeper
   DEAD scenario).
  
  
  
   --
   View this message in context:
  
 
 http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581293.html
   Sent from the zookeeper-user mailing list archive at Nabble.com.
  
 




Re: locking/leader election and dealing with session loss

2015-07-15 Thread Alexander Shraer
I disagree, ZooKeeper itself actually doesn't rely on timing for safety -
it won't get into an inconsistent state even if all timing assumptions fail
(except for the sync operation, which is then not guaranteed to return the
latest value, but that's a known issue that needs to be fixed).




On Wed, Jul 15, 2015 at 2:13 PM, Jordan Zimmerman 
jor...@jordanzimmerman.com wrote:

 This property may hold if you make a lot of timing/synchrony assumptions

 These assumptions and timing are intrinsic to using ZooKeeper. So, of
 course I’m making these assumptions.

 -Jordan



 On July 15, 2015 at 3:57:12 PM, Alexander Shraer (shra...@gmail.com)
 wrote:

 This property may hold if you make a lot of timing/synchrony assumptions
 -- agreeing on who holds the lock in an asynchronous distributed system
 with failures is impossible; this is the FLP impossibility result.

 But even if it holds, this property is not very useful if the ZK client
 itself doesn't have the application data. So one has to consider whether it
 is possible that the application sees messages from two clients that both
 think are the leader in an order which contradicts the lock acquisition
 order.

 On Wed, Jul 15, 2015 at 1:26 PM, Jordan Zimmerman 
 jor...@jordanzimmerman.com wrote:

  I think we may be talking past each other here. My contention (and the
 ZK docs agree BTW) is that, properly written and configured, “at any
 snapshot in time no two clients think they hold the same lock”. How your
 application acts on that fact is another thing. You might need sequence
 numbers, you might not.

 -Jordan


 On July 15, 2015 at 3:15:16 PM, Alexander Shraer (shra...@gmail.com)
 wrote:

  Jordan, as Camille suggested, please read Sec 2.4 in the Chubby paper:
 link
 
 http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf
 

 it suggests 2 ways in which the storage can support lock generations and
 proposes an alternative for the case where the storage can't be made aware
 of lock generations.

 On Wed, Jul 15, 2015 at 1:08 PM, Jordan Zimmerman 
 jor...@jordanzimmerman.com wrote:

  Ivan, I just read the blog and I still don’t see how this can happen.
  Sorry if I’m being dense. I’d appreciate a discussion on this. In your
 blog
  you state: “when ZooKeeper tells you that you are leader, there’s no
  guarantee that there isn’t another node that 'thinks' it’s the leader.”
  However, given a long enough session time — I usually recommend 30–60
  seconds, I don’t see how this can happen. The client itself determines
 that
  there is a network partition when there is no heartbeat success. The
  heartbeat is a fraction of the session timeout. Once the heartbeat
 fails,
  the client must assume it no longer has the lock. Another client cannot
  take over the lock until, at minimum, session timeout. So, how then can
  there be two leaders?
 
  -Jordan
 
  On July 15, 2015 at 2:23:12 PM, Ivan Kelly (iv...@apache.org) wrote:
 
  I blogged about this exact problem a couple of weeks ago [1]. I give an
  example of how split brain can happen in a resource under a zk lock
 (Hbase
  in this case). As Camille says, sequence numbers ftw. I'll add that the
  data store has to support them though, which not all do (in fact I've
 yet
  to see one in the wild that does). I've implemented a prototype that
 works
  with hbase[2] if you want to see what it looks like.
 
  -Ivan
 
  [1]
 
 
 https://medium.com/@ivankelly/reliable-table-writer-locks-for-hbase-731024295215
  [2] https://github.com/ivankelly/hbase-exclusive-writer
 
  On Wed, Jul 15, 2015 at 9:16 PM Vikas Mehta vikasme...@gmail.com
 wrote:
 
   Jordan, I mean the client gives up the lock and stops working on the
  shared
   resource. So when zookeeper is unavailable, no one is working on any
  shared
   resource (because they cannot distinguish network partition from
  zookeeper
   DEAD scenario).
  
  
  
   --
   View this message in context:
  
 
 http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581293.html
   Sent from the zookeeper-user mailing list archive at Nabble.com.
  
 





Re: locking/leader election and dealing with session loss

2015-07-15 Thread Alexander Shraer
Jordan, as Camille suggested, please read Sec 2.4 in the Chubby paper:
link
http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf

it suggests 2 ways in which the storage can support lock generations and
proposes an alternative for the case where the storage can't be made aware
of lock generations.

On Wed, Jul 15, 2015 at 1:08 PM, Jordan Zimmerman 
jor...@jordanzimmerman.com wrote:

 Ivan, I just read the blog and I still don’t see how this can happen.
 Sorry if I’m being dense. I’d appreciate a discussion on this. In your blog
 you state: “when ZooKeeper tells you that you are leader, there’s no
 guarantee that there isn’t another node that 'thinks' it’s the leader.”
 However, given a long enough session time — I usually recommend 30–60
 seconds, I don’t see how this can happen. The client itself determines that
 there is a network partition when there is no heartbeat success. The
 heartbeat is a fraction of the session timeout. Once the heartbeat fails,
 the client must assume it no longer has the lock. Another client cannot
 take over the lock until, at minimum, session timeout. So, how then can
 there be two leaders?

 -Jordan

 On July 15, 2015 at 2:23:12 PM, Ivan Kelly (iv...@apache.org) wrote:

 I blogged about this exact problem a couple of weeks ago [1]. I give an
 example of how split brain can happen in a resource under a zk lock (Hbase
 in this case). As Camille says, sequence numbers ftw. I'll add that the
 data store has to support them though, which not all do (in fact I've yet
 to see one in the wild that does). I've implemented a prototype that works
 with hbase[2] if you want to see what it looks like.

 -Ivan

 [1]

 https://medium.com/@ivankelly/reliable-table-writer-locks-for-hbase-731024295215
 [2] https://github.com/ivankelly/hbase-exclusive-writer

 On Wed, Jul 15, 2015 at 9:16 PM Vikas Mehta vikasme...@gmail.com wrote:

  Jordan, I mean the client gives up the lock and stops working on the
 shared
  resource. So when zookeeper is unavailable, no one is working on any
 shared
  resource (because they cannot distinguish network partition from
 zookeeper
  DEAD scenario).
 
 
 
  --
  View this message in context:
 
 http://zookeeper-user.578899.n2.nabble.com/locking-leader-election-and-dealing-with-session-loss-tp7581277p7581293.html
  Sent from the zookeeper-user mailing list archive at Nabble.com.
 



new paper on optimizing replication config

2015-07-15 Thread Alexander Shraer
Our paper http://www.cs.technion.ac.il/~shralex/p2309-shraer.pdf on
optimizing the configuration of distributed storage was recently accepted
to the International Conference on Very Large Databases (VLDB). It
basically shows that reconfiguration can be used to significantly improve
latency.

It isn't directly related to ZooKeeper, but I thought that it may interest
some people on this list.

Best Regards,
Alex


Re: Is myid actually limited to [1, 255]?

2015-07-13 Thread Alexander Shraer
negative ids could break stuff, such as here:

https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L321

On Mon, Jul 13, 2015 at 11:07 AM, Raúl Gutiérrez Segalés 
r...@itevenworks.net wrote:

 Hi,

 On 13 July 2015 at 10:43, Benjamin Anderson b...@banjiewen.net wrote:

  Hi there - I've observed that the documentation[1] suggests that each
  node's myid should be an integer in the range [1, 255]. Is that
  limitation codified anywhere? A quick perusal of the source suggests
  that myid is parsed into a Long and passed around as such through the
  codebase.
 
  For context, I'm working on automating a ZK deployment and having a
  larger range for myid values would make my life easier.
 

 Yeah, definitely wrong (I've used 0, -1, ..). Mind opening a JIRA to get it
 fixed (a patch would be
 most welcomed too!). Otherwise I'll get to it a bit later. Thanks!


 -rgs



Re: ZooKeeper ensemble. Size and Impact ?

2015-07-13 Thread Alexander Shraer
In 3.4 releases you can't connect an observer to a standalone ZooKeeper
server, but in 3.5.0,
if you set standaloneEnabled=false, your server will run in distributed
mode even if it's the only one, and
you'll be able to have observers or reconfigure to add more servers later
if needed.
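
For reference, a sketch of what attaching an observer can look like in the
3.5-style configuration (hosts and ports here are hypothetical). On the
observer itself, in addition to the usual settings:

    peerType=observer

and in the membership list known to all servers:

    server.1=host1:2888:3888:participant;2181
    server.2=host2:2888:3888:observer;2181

Observers sync with the leader and serve reads but don't vote, so adding
them doesn't change the quorum arithmetic.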

On Mon, Jul 13, 2015 at 5:34 AM, Rakesh R rake...@huawei.com wrote:


  Is it so that only the ensemble would be down, but other functions would
 be up and running, like data-sync...?
 Say a ZooKeeper server loses connection with the quorum. It will
 shut down all its services and try to rejoin the quorum by starting the
 internal election algorithm. There is also a special read-only server type:
 on connection loss, it will automatically transition to r-o mode and serve
 only the requests from r-o clients. Please visit
 http://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html for more
 details about the r-o feature.

  My need is to run only 2 ZKS, as I'm OK with having +1 copy of the data.
 Is there a way to run a dummy ZKS in any of the instances?
 There is an 'Observer' server mode which will act as an observer
 and only sync up data with the Leader server, but I'm not really sure
 whether it will work along with a Standalone server. I haven't tried yet;
 you can probably give it a try:
 http://zookeeper.apache.org/doc/r3.5.0-alpha/zookeeperObservers.html

 To begin with, you can run both as Participants, and later, if you want to
 change servers, you can use the reconfig feature:
 http://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html
 In a 1+1 deployment, the tolerated failure count is 0, and you should
 ensure both servers are up and running for the availability of the
 ZooKeeper service. One advantage of this approach is that you have a backup
 'dataDir', which an administrator can use if one is lost.


 -Rakesh
 -Original Message-
 From: Srinivasan Veerapandian [mailto:srinivasan.veerapand...@ericsson.com
 ]
 Sent: 13 July 2015 15:01
 To: user@zookeeper.apache.org
 Subject: RE: ZooKeeper ensemble. Size and Impact ?

 Rakesh & Garry,



 Thanks for the information and details. From both of your responses I can
 see that more failures will cause the quorum to drop automatically.

 Is it so that only the ensemble would be down, but other functions would be
 up and running, like data-sync...? Sorry if this is a very basic question.



 I see the note below; does this mean we can form an ensemble with
 leaderServes turned ON?

 Turning on leader selection is highly recommended when you have more than
 three ZooKeeper servers in an ensemble.
 http://zookeeper.apache.org/doc/r3.3.2/zookeeperAdmin.html

 My need is to run only 2 ZKS, as I'm OK with having +1 copy of the data. Is
 there a way to run a dummy ZKS in any of the instances?



 Thanks,

 Srini

 -Original Message-
 From: Rakesh R [mailto:rake...@huawei.com]
 Sent: Monday, July 13, 2015 1:43 PM
 To: user@zookeeper.apache.org; Srinivasan Veerapandian
 Subject: RE: ZooKeeper ensemble. Size and Impact ?



 Hi Srini,



 The ZooKeeper service will be available if a 'quorum' number of servers
 are running (simple majority voting).



 One of the reasons to require a majority vote is to avoid the
 split-brain problem. In a network failure we don't want the two parts of
 the system to continue as usual. We need only one part to continue and the
 other to understand that it is out of the cluster and keep quiet.



 The main reason for suggesting an odd number is that an even number won't
 add much benefit to the tolerated failures in terms of majority. With 3 and
 4 servers, the majority is 2 and 3 respectively, but in both cases the
 tolerated number of failures is 1.



 Quorum = Leader + Followers,

 (2n+1) nodes can tolerate failure of 'n' nodes.



 For example,

   servers   majority   min servers to form quorum   tolerated failures
   1         -          standalone, no quorum majority
   2         2          2                             0
   3         2          2                             1
   4         3          3                             1
   5         3          3                             2
   6         4          4                             2
   7         4          4                             3
   8         5          5                             3

 In each case, losing more than the tolerated number of servers will drop
 the quorum automatically.
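
The arithmetic above is easy to sanity-check with a few lines of Java (a
throwaway sketch):

    public class QuorumMath {
        public static void main(String[] args) {
            for (int servers = 1; servers <= 8; servers++) {
                int majority = servers / 2 + 1;    // min voters to form a quorum
                int tolerated = (servers - 1) / 2; // failures survivable
                System.out.printf("%d servers: majority %d, tolerates %d failure(s)%n",
                                  servers, majority, tolerated);
            }
        }
    }

Note that an even-sized ensemble never tolerates more failures than the
odd-sized one below it, which is why odd sizes are recommended.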





 -Rakesh



 -Original Message-

 From: Srinivasan Veerapandian 
