[ https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928171#action_12928171 ]
Flavio Junqueira commented on ZOOKEEPER-917:
--------------------------------------------

The program I was using to open your logs was hiding some of the messages for some reason unknown to me. I now understand why the leader was elected in your case, and the behavior is legitimate. Let me try to explain.

We currently repeat the last notification sent to a given server upon reconnecting to it. This avoids problems with partially sent messages, and, assuming no further bugs, the protocol is resilient to duplicate messages. At the same time, a server A decides to follow another server B if it receives a message from B saying that B is leading, and messages from a quorum saying that they are following B, even if A is in a later election epoch. This mechanism is there to avoid A being locked out of the ensemble in case it partitions away and comes back later.

From your logs, what happens is:

# Fresh server 2 receives previous notifications from 0 and 1, and decides to lead;
# Server 1 receives the last message from server 0 saying that it is following 2 (which was the previous leader), and the notification from 2 saying that it is leading. Server 1 consequently decides to follow 2;
# Server 0 receives the last message from server 1 saying that it is following 2 (which was the previous leader), and the notification from 2 saying that it is leading. Server 0 consequently decides to follow 2.

Now the main problem I see is that the followers accept the snapshot from the leader, and they shouldn't, given that they have moved to a later epoch. I suspect that we currently allow a server to go back to an epoch it has been in before, again to avoid having a server locked out after being partitioned away and healing, but I need to do some further inspection.

My overall take is that your case is unfortunately not legitimate, meaning that we don't currently provision for configuration changes.
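The acceptance rule described above (follow B on seeing LEADING from B plus a quorum of FOLLOWING notifications, with no epoch check) can be sketched roughly as follows. This is a hypothetical simplification for illustration only; the names and structures do not match ZooKeeper's actual FastLeaderElection code:

```python
# Hypothetical sketch of the follow-decision rule; simplified, NOT the real
# FastLeaderElection implementation.
from collections import namedtuple

Notification = namedtuple("Notification", ["sender", "state", "leader", "epoch"])

def should_follow(my_epoch, candidate, received, quorum_size):
    """A decides to follow `candidate` if it sees a LEADING claim from
    `candidate` and a quorum of servers following `candidate`."""
    leading = any(n.sender == candidate and n.state == "LEADING"
                  and n.leader == candidate
                  for n in received)
    followers = {n.sender for n in received
                 if n.state == "FOLLOWING" and n.leader == candidate}
    followers.add(candidate)  # the leader counts toward its own quorum
    # Note: my_epoch is deliberately not compared here. This is what lets a
    # server that partitioned away (and advanced its epoch) rejoin the
    # ensemble -- and also what permits the scenario in this issue.
    return leading and len(followers) >= quorum_size

# Scenario from the logs: fresh server 2 claims LEADING, and a stale
# (repeated) notification from server 0 says it follows 2, the id of the
# previous leader.
received = [
    Notification(sender=2, state="LEADING", leader=2, epoch=1),
    Notification(sender=0, state="FOLLOWING", leader=2, epoch=1),
]
# Server 1, although in a later election epoch, decides to follow 2.
print(should_follow(my_epoch=2, candidate=2, received=received, quorum_size=2))
```

With a 3-server ensemble (quorum of 2), the stale FOLLOWING notification plus the fresh server's LEADING claim are enough for the decision to go through.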
The case you expose in general constitutes a loss of quorum, and that violates one of our core assumptions. In more detail, a quorum supporting a leader must have a non-empty intersection with the quorum of servers that accepted requests in the previous epoch. Wiping out the state of server 2 by replacing it with a fresh server leads to a situation in which just one server contains all transactions accepted by a quorum (and possibly committed). If you hadn't replaced server 2 with a fresh server, then either server 2 would have been elected again just the same, which would be fine because it was previously the leader, or it wouldn't have been elected because the previous leader was another server and the last notifications of 0 and 1 would be supporting that other server.

On reconfiguration, we have talked about it (http://wiki.apache.org/hadoop/ZooKeeper/ClusterMembership), but we haven't made enough progress recently and it is currently not implemented. It would be great to get some help here. Let me know if this analysis makes any sense to you, please.

> Leader election selected incorrect leader
> -----------------------------------------
>
> Key: ZOOKEEPER-917
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-917
> Project: Zookeeper
> Issue Type: Bug
> Components: leaderElection, server
> Affects Versions: 3.2.2
> Environment: Cloudera distribution of zookeeper (patched to never cache DNS entries)
> Debian lenny
> Reporter: Alexandre Hardy
> Priority: Critical
> Fix For: 3.3.3, 3.4.0
>
> Attachments: zklogs-20101102144159SAST.tar.gz
>
>
> We had three nodes running zookeeper:
> * 192.168.130.10
> * 192.168.130.11
> * 192.168.130.14
> 192.168.130.11 failed, and was replaced by a new node 192.168.130.13 (automated startup). The new node had not participated in any zookeeper quorum previously. The node 192.168.130.11 was permanently removed from service and could not contribute to the quorum any further (powered off).
> DNS entries were updated for the new node to allow all the zookeeper servers to find the new node.
> The new node 192.168.130.13 was selected as the LEADER, despite the fact that it had not seen the latest zxid.
> This particular problem has not been verified with later versions of zookeeper, and no attempt has been made to reproduce this problem as yet.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
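The quorum-intersection assumption in the comment above can be made concrete with a small sketch. This is an editorial illustration, not ZooKeeper code; the server ids follow the three-node scenario in this issue:

```python
# Illustration of the quorum-intersection assumption: any two majority
# quorums of the same ensemble share at least one server.
from itertools import combinations

servers = {0, 1, 2}
quorum_size = len(servers) // 2 + 1  # majority: 2 of 3

quorums = [set(q) for q in combinations(servers, quorum_size)]
# Core assumption: every pair of majority quorums intersects.
assert all(a & b for a in quorums for b in quorums)

# Suppose a transaction was accepted by the quorum {1, 2} in the previous
# epoch. Replacing server 2 with a fresh, empty-state server leaves only
# one server still holding that transaction -- the loss-of-quorum situation
# described in the comment.
has_txn = {1, 2}
has_txn_after_wipe = has_txn - {2}
print(has_txn_after_wipe)  # a single server holds all accepted transactions
```

Intersection is what guarantees that any newly elected leader's supporting quorum contains at least one server that saw the previous epoch's accepted transactions; wiping a server's state silently breaks that guarantee.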