Re: Leader election logging during reconfiguration

2019-07-30 Thread Michael Han
>> we should measure the total time more accurately +1 - it would be good to have a new metric to measure reconfiguration time, and leave the existing LE time metric dedicated to measuring the conventional FLE time. Mixing both (as of today) will provide confusing insights on how long the

Re: Leader election logging during reconfiguration

2019-07-29 Thread Alexander Shraer
Please see comments inline. Thanks, Alex On Mon, Jul 29, 2019 at 5:29 PM Karolos Antoniadis wrote: > Hi ZooKeeper developers, > > ZooKeeper seems to be logging a "*LEADER ELECTION TOOK*" message even > though no leader election takes place during a reconfiguration. > > This can be reproduced

Re: Leader election

2018-12-12 Thread Michael Han
>> Can we reduce this time by configuring "syncLimit" and "tickTime" to, let's say, 5 seconds? Can we have a strong guarantee on this time bound? It's not possible to guarantee the time bound, because of the FLP impossibility result (reliable failure detection is not possible in an asynchronous environment). Though
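For illustration, a configuration along the lines being asked about might look as follows (the values are only an assumption, not a recommendation): with tickTime=1000 and syncLimit=5, followers give up on a silent leader after roughly syncLimit * tickTime = 5 seconds and start a new election, but as noted above this is a detection bound under normal conditions, not a hard guarantee.

    # zoo.cfg (sketch)
    tickTime=1000
    initLimit=10
    syncLimit=5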

Re: Leader election

2018-12-11 Thread Michael Borokhovich
Thanks a lot for sharing the design, Ted. It is very helpful. Will check what is applicable to our case and let you know in case of questions. On Mon, Dec 10, 2018 at 23:37 Ted Dunning wrote: > One very useful way to deal with this is the method used in MapR FS. The > idea is that ZK should

Re: Leader election

2018-12-10 Thread Ted Dunning
One very useful way to deal with this is the method used in MapR FS. The idea is that ZK should only be used rarely and short periods of two leaders must be tolerated, but other data has to be written with absolute consistency. The method that we chose was to associate an epoch number with every
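A minimal sketch of the epoch-fencing idea described here (hypothetical names, not the MapR implementation): the newly elected leader obtains a monotonically increasing epoch, stamps every write with it, and the store rejects any write carrying an epoch older than the newest one it has seen, so a deposed leader's writes are fenced off.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical fenced store: writes stamped with a stale epoch are rejected.
    class EpochFencedStore {
        private long highestEpochSeen = 0;
        private final Map<String, byte[]> data = new HashMap<>();

        synchronized boolean write(long epoch, String key, byte[] value) {
            if (epoch < highestEpochSeen) {
                return false;              // stale leader: fenced off
            }
            highestEpochSeen = epoch;      // remember the newest epoch
            data.put(key, value);
            return true;
        }
    }

The epoch itself can come from something the election already provides, e.g. a counter or znode version the winning candidate bumps when it takes over.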

Re: Leader election

2018-12-10 Thread Michael Borokhovich
Thanks, Maciej. That sounds good. We will try playing with the parameters and have at least a known upper limit on the inconsistency interval. On Fri, Dec 7, 2018 at 2:11 AM Maciej Smoleński wrote: > On Fri, Dec 7, 2018 at 3:03 AM Michael Borokhovich > wrote: > > > We are planning to run

Re: Leader election

2018-12-10 Thread Michael Borokhovich
Yes, I agree, our system should be able to tolerate two leaders for a short and bounded period of time. Thank you for the help! On Thu, Dec 6, 2018 at 11:09 AM Jordan Zimmerman wrote: > > it seems like the > > inconsistency may be caused by the partition of the Zookeeper cluster > > itself > >

Re: Leader election

2018-12-10 Thread Michael Borokhovich
Makes sense. Thanks, Ted. We will design our system to cope with the short periods where we might have two leaders. On Thu, Dec 6, 2018 at 11:03 PM Ted Dunning wrote: > ZK is able to guarantee that there is only one leader for the purposes of > updating ZK data. That is because all commits have

Re: Leader election

2018-12-07 Thread Maciej Smoleński
On Fri, Dec 7, 2018 at 3:03 AM Michael Borokhovich wrote: > We are planning to run Zookeeper nodes embedded with the client nodes. > I.e., each client runs also a ZK node. So, network partition will > disconnect a ZK node and not only the client. > My concern is about the following statement

Re: Leader election

2018-12-06 Thread Ted Dunning
ZK is able to guarantee that there is only one leader for the purposes of updating ZK data. That is because all commits have to originate with the current quorum leader and then be acknowledged by a quorum of the current cluster. IF the leader can't get enough acks, then it has de facto lost

Re: Leader election

2018-12-06 Thread Michael Borokhovich
We are planning to run ZooKeeper nodes embedded with the client nodes. I.e., each client also runs a ZK node. So, a network partition will disconnect a ZK node and not only the client. My concern is about the following statement from the ZK documentation: "Timeliness: The clients view of the system

Re: Leader election

2018-12-06 Thread Michael Han
Tweaking timeouts is tempting, as your solution might work most of the time yet fail in certain cases (which others have pointed out). If the goal is absolute correctness then we should avoid timeouts, which do not guarantee correctness and only make the problem harder to manifest. Fencing is the

Re: Leader election

2018-12-06 Thread Jordan Zimmerman
> Old service leader will detect network partition max 15 seconds after it > happened. If the old service leader is in a very long GC it will not detect the partition. In the face of VM pauses, etc. it's not possible to avoid 2 leaders for a short period of time. -JZ

Re: Leader election

2018-12-06 Thread Maciej Smoleński
Hello, Ensuring reliability requires using consensus directly in your service or changing the service to use a distributed log/journal (e.g. BookKeeper). However, the following idea is simple and in many situations good enough. If you configure the session timeout to 15 seconds - then the ZooKeeper client will
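A minimal sketch of that approach with the plain ZooKeeper client (the connection string, znode path and 15-second timeout are illustrative; error handling is elided): the ephemeral node below is removed when its creator's session expires, so after a partition or crash the other contenders see the leadership slot free up within roughly the session timeout, while the partitioned old leader itself may learn about it later.

    import org.apache.zookeeper.*;

    public class LeaderFlag {
        public static void main(String[] args) throws Exception {
            // 15s session timeout: the ephemeral node below disappears at most
            // ~15s (plus notification delay) after this client is partitioned.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> { });
            try {
                zk.create("/leader", "host-a".getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                // We hold leadership until the session expires or the handle is closed.
            } catch (KeeperException.NodeExistsException e) {
                // Someone else is leader; watch /leader and retry when it is deleted.
            }
        }
    }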

Re: Leader election

2018-12-06 Thread Jordan Zimmerman
> it seems like the > inconsistency may be caused by the partition of the Zookeeper cluster > itself Yes - there are many ways in which you can end up with 2 leaders. However, if properly tuned and configured, it will be for a few seconds at most. During a GC pause no work is being done anyway.

Re: Leader election

2018-12-06 Thread Michael Borokhovich
Thanks Jordan, Yes, I will try Curator. Also, beyond the problem described in the Tech Note, it seems like the inconsistency may be caused by a partition of the ZooKeeper cluster itself. E.g., if a "leader" client is connected to the partitioned ZK node, it may not be notified at the same time

Re: Leader election

2018-12-06 Thread Jordan Zimmerman
It is not possible to achieve the level of consistency you're after in an eventually consistent system such as ZooKeeper. There will always be an edge case where two ZooKeeper clients will believe they are leaders (though for a short period of time). In terms of how it affects Apache Curator,

Reply: Re: Leader election

2018-12-06 Thread 毛蛤丝
tor.java#L340 it can guarantee exactly one leader all the time (EPHEMERAL_SEQUENTIAL zk-node), which does not have much correlation with network partitions of the zk ensemble itself. I guess, haha! - Original Message - From: Michael Borokhovich To: dev@zookeeper.apache.org, maoling199210...@sina.com
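For reference, a minimal sketch of the Curator recipe being pointed to (the connection string, znode path and leader work are placeholders): LeaderSelector queues contenders on EPHEMERAL_SEQUENTIAL znodes under the given path, and the caveats raised earlier in the thread still apply - a paused or partitioned leader may keep believing it is the leader for a short while.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderSelector;
    import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class LeaderExample {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            LeaderSelector selector = new LeaderSelector(client, "/my-service/leader",
                    new LeaderSelectorListenerAdapter() {
                        @Override
                        public void takeLeadership(CuratorFramework cf) throws Exception {
                            // Leadership is held only while this method is executing;
                            // returning (or a connection loss) relinquishes it.
                            Thread.sleep(Long.MAX_VALUE); // placeholder for real leader work
                        }
                    });
            selector.autoRequeue();  // rejoin the election after losing leadership
            selector.start();

            Thread.currentThread().join(); // keep the process alive for the example
        }
    }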

Re: Leader election

2018-12-05 Thread Enrico Olivelli
Michael, Leader election is not enough. You must have some mechanism to fence off the partitioned leader. If you are building a replicated state machine, Apache ZooKeeper + Apache BookKeeper can be a good choice. See this, just as an example: https://github.com/ivankelly/bookkeeper-tutorial This is

Re: Leader election

2018-12-05 Thread Michael Borokhovich
Thanks, I will check it out. However, do you know if it gives any better guarantees? Can it happen that we end up with 2 leaders or 0 leaders for some period of time (for example, during network delays/partitions)? On Wed, Dec 5, 2018 at 10:54 PM 毛蛤丝 wrote: > suggest you use the ready-made