[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928171#action_12928171
 ] 

Flavio Junqueira commented on ZOOKEEPER-917:
--------------------------------------------

The program I was using to open your logs was hiding some of the messages for 
some reason unknown to me. I now understand why the leader was elected in your 
case and the behavior is legitimate. Let me try to explain.

We currently repeat the last notification sent to a given server upon 
reconnecting to it. This is to avoid problems with partially sent messages, 
and, assuming no further bugs, the protocol is resilient to duplicate 
messages. At the same time, a server A decides to follow another server B if 
it receives a message from B saying that B is leading and messages from a 
quorum saying that they are following B, even if A is in a later election 
epoch. This mechanism is there to avoid A being locked out of the ensemble in 
case it partitions away and comes back later.
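
To make that rule concrete, here is a rough sketch in Java of the decision 
described above. It is illustrative only, not the actual FastLeaderElection 
code; the names (Notification, shouldFollow) and the simple-majority quorum 
check are assumptions made for the example.

{code:java}
import java.util.*;

public class FollowDecisionSketch {
    enum State { LOOKING, FOLLOWING, LEADING }

    // Hypothetical notification: sender id, sender state, and the leader it supports.
    record Notification(long from, State state, long leader) {}

    // True if the notifications justify following 'candidate' in an ensemble of
    // 'ensembleSize' servers, assuming simple-majority quorums. Note that the
    // receiver's own election epoch plays no role in this decision.
    static boolean shouldFollow(long candidate, Collection<Notification> seen, int ensembleSize) {
        boolean candidateClaimsLeading = seen.stream()
                .anyMatch(n -> n.from() == candidate && n.state() == State.LEADING);
        long followersOfCandidate = seen.stream()
                .filter(n -> n.state() == State.FOLLOWING && n.leader() == candidate)
                .map(Notification::from)
                .distinct()
                .count();
        // The candidate itself counts toward the quorum.
        return candidateClaimsLeading && followersOfCandidate + 1 > ensembleSize / 2;
    }

    public static void main(String[] args) {
        // Server A's view: server 2 claims LEADING and server 0's repeated last
        // notification says FOLLOWING 2, so A follows 2 regardless of its own epoch.
        List<Notification> seen = List.of(
                new Notification(2, State.LEADING, 2),
                new Notification(0, State.FOLLOWING, 2));
        System.out.println(shouldFollow(2, seen, 3)); // prints: true
    }
}
{code}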

From your logs, what happens is:

# Fresh server 2 receives the previous notifications from 0 and 1, and decides 
to lead (see the sketch after this list);
# Server 1 receives the last message from server 0 saying that it is following 
2 (which was the previous leader), and the notification from 2 saying that it 
is leading. Server 1 consequently decides to follow 2;
# Server 0 receives the last message from server 1 saying that it is following 
2 (which was the previous leader), and the notification from 2 saying that it 
is leading. Server 0 consequently decides to follow 2.
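
Step 1 is the surprising one, so here is a second illustrative sketch, again 
assuming simple-majority quorums, of how the stale, re-sent notifications give 
the fresh server 2 a quorum for itself. The vote bookkeeping is made up for 
the example and is not the actual election code.

{code:java}
import java.util.*;

public class FreshServerLeadsSketch {
    public static void main(String[] args) {
        int ensembleSize = 3;
        long self = 2;

        // Votes as seen by the fresh server 2: its own initial vote for itself plus
        // the repeated last notifications from 0 and 1, both still naming 2 as leader.
        Map<Long, Long> lastVoteBySender = new HashMap<>();
        lastVoteBySender.put(self, self); // server 2 starts by voting for itself
        lastVoteBySender.put(0L, 2L);     // stale, re-sent notification from server 0
        lastVoteBySender.put(1L, 2L);     // stale, re-sent notification from server 1

        long votesForSelf = lastVoteBySender.values().stream()
                .filter(v -> v == self)
                .count();
        boolean hasQuorum = votesForSelf > ensembleSize / 2;
        System.out.println(hasQuorum
                ? "server 2 sees a quorum for itself and declares itself LEADING"
                : "server 2 keeps LOOKING");
    }
}
{code}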

Now the main problem I see is that the followers accept the snapshot from the 
leader, and they shouldn't, given that they have moved to a later epoch. I 
suspect that we currently allow a server to go back to an epoch it has been in 
before, again to avoid having a server locked out after being partitioned 
away and healing, but I need to do some further inspection.
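
In other words, the missing check is roughly the following. This is a sketch 
of what I would expect, not what the follower code currently does; the method 
name and the exception are made up for the example.

{code:java}
public class EpochGuardSketch {
    // Hypothetical check run when a follower connects to the claimed leader:
    // refuse to load a snapshot from a leader still announcing an earlier epoch.
    static void checkLeaderEpoch(long followerEpoch, long leaderEpoch) {
        if (leaderEpoch < followerEpoch) {
            throw new IllegalStateException(
                "Leader epoch " + leaderEpoch + " is behind follower epoch "
                + followerEpoch + "; refusing to load its snapshot");
        }
    }

    public static void main(String[] args) {
        checkLeaderEpoch(3, 3);     // fine: same epoch
        try {
            checkLeaderEpoch(4, 3); // follower has already moved to a later epoch
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
{code}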

My overall take is that your case is unfortunately not one we support, meaning 
that we don't currently provision for configuration changes. The case you 
expose in general constitutes a loss of quorum, and that violates one of our 
core assumptions. In more detail, a quorum supporting a leader must have a 
non-empty intersection with the quorum of servers that accepted requests in 
the previous epoch. Wiping out the state of server 2 by replacing it with a 
fresh server leads to a situation in which just one server contains all 
transactions accepted by a quorum (and possibly committed). If you hadn't 
replaced server 2 with a fresh server, then either server 2 would have been 
elected again just the same, which would be fine because it was previously the 
leader, or it wouldn't have been elected, because the leader was previously 
another server and the last notifications of 0 and 1 would be supporting that 
other server.
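
To illustrate the intersection argument with made-up data (the specific 
quorum {0, 2} below is an assumption for the example, not taken from your 
logs):

{code:java}
import java.util.*;

public class QuorumIntersectionSketch {
    static boolean intersects(Set<Long> a, Set<Long> b) {
        return a.stream().anyMatch(b::contains);
    }

    public static void main(String[] args) {
        // Suppose the last committed transactions were accepted by quorum {0, 2}.
        Set<Long> previousAcceptors = Set.of(0L, 2L);

        // Any majority quorum of {0, 1, 2} intersects {0, 2}, so normally at least
        // one member of a new leader's quorum holds the latest state.
        Set<Long> newLeaderQuorum = Set.of(1L, 2L);
        System.out.println(intersects(previousAcceptors, newLeaderQuorum)); // true

        // Wiping server 2 (replacing it with a fresh node) removes it from the set
        // of servers that still hold that state; only server 0 does, and a quorum
        // such as {1, 2} no longer contains any server with the latest state.
        Set<Long> serversStillHoldingState = Set.of(0L);
        System.out.println(intersects(serversStillHoldingState, newLeaderQuorum)); // false
    }
}
{code}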

As for reconfiguration, we have talked about it 
(http://wiki.apache.org/hadoop/ZooKeeper/ClusterMembership), but we haven't 
made enough progress recently and it is currently not implemented. It would be 
great to get some help here.

Let me know if this analysis makes any sense to you, please.

> Leader election selected incorrect leader
> -----------------------------------------
>
>                 Key: ZOOKEEPER-917
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-917
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>    Affects Versions: 3.2.2
>         Environment: Cloudera distribution of zookeeper (patched to never 
> cache DNS entries)
> Debian lenny
>            Reporter: Alexandre Hardy
>            Priority: Critical
>             Fix For: 3.3.3, 3.4.0
>
>         Attachments: zklogs-20101102144159SAST.tar.gz
>
>
> We had three nodes running zookeeper:
>   * 192.168.130.10
>   * 192.168.130.11
>   * 192.168.130.14
> 192.168.130.11 failed, and was replaced by a new node 192.168.130.13 
> (automated startup). The new node had not participated in any zookeeper 
> quorum previously. The node 192.168.130.11 was permanently removed from 
> service and could not contribute to the quorum any further (powered off).
> DNS entries were updated for the new node to allow all the zookeeper servers 
> to find the new node.
> The new node 192.168.130.13 was selected as the LEADER, despite the fact that 
> it had not seen the latest zxid.
> This particular problem has not been verified with later versions of 
> zookeeper, and no attempt has been made to reproduce this problem as yet.
