[ https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928137#action_12928137 ]
Alexandre Hardy commented on ZOOKEEPER-917:
-------------------------------------------

The excerpts are extracted from {{hbase-0.20/hbase*.log}}, so the information should be readily available. The tar file contents should be as follows:

{noformat}
drwxr-xr-x ah/users        0 2010-11-02 14:42 192.168.130.10/
drwxr-xr-x ah/users        0 2010-11-03 13:33 192.168.130.10/hbase-0.20/
-rw-r--r-- ah/users        0 2010-11-02 14:42 192.168.130.10/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-d3.out
-rw-r--r-- ah/users 62922921 2010-11-02 14:42 192.168.130.10/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-d3.log
drwxr-xr-x ah/users        0 2010-11-02 14:42 192.168.130.12/
drwxr-xr-x ah/users        0 2010-11-03 13:27 192.168.130.12/hbase-0.20/
drwxr-xr-x ah/users        0 2010-11-02 14:42 192.168.130.13/
drwxr-xr-x ah/users        0 2010-11-03 13:27 192.168.130.13/hbase-0.20/
-rw-r--r-- ah/users 65903411 2010-11-02 14:42 192.168.130.13/hbase-0.20/hbase--zookeeper-e0-cb-4e-65-4d-4e.log
-rw-r--r-- ah/users        0 2010-11-02 14:42 192.168.130.13/hbase-0.20/hbase--zookeeper-e0-cb-4e-65-4d-4e.out
drwxr-xr-x ah/users        0 2010-11-02 14:42 192.168.130.14/
drwxr-xr-x ah/users        0 2010-11-03 13:27 192.168.130.14/hbase-0.20/
-rw-r--r-- ah/users        0 2010-11-02 14:42 192.168.130.14/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-a8.out
-rw-r--r-- ah/users 62835121 2010-11-02 14:42 192.168.130.14/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-a8.log
{noformat}

The only logs that are missing are those for .11, but that should not influence the analysis of the leader election (I hope).

We are using monitoring software which determines when a zookeeper instance is no longer reachable and automatically starts a fresh zookeeper instance as a replacement. This software can detect the failure and start a new zookeeper instance fairly rapidly. Would it be better to delay the start of a fresh zookeeper instance to allow the existing instances to elect a new leader? If so, do you have any guidelines regarding this delay? (We are considering this approach, but would like to avoid it if possible.)

{quote}
In your case, I'm still not sure why it happens because the initial zxid of node 1 is 4294967742 according to your excerpt.
{quote}

That is indeed the key question that I am trying to find an answer for! :-)
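For reference, assuming the usual ZooKeeper layout of a 64-bit zxid as a 32-bit epoch followed by a 32-bit counter, 4294967742 decodes to epoch 1, counter 446. A minimal sketch of that decomposition:

{code:java}
// Sketch: decode the zxid quoted above, assuming the 32-bit epoch / 32-bit counter split.
public class ZxidDecode {
    public static void main(String[] args) {
        long zxid = 4294967742L;            // zxid reported for node 1 in the excerpt
        long epoch = zxid >>> 32;           // high 32 bits: leader epoch
        long counter = zxid & 0xffffffffL;  // low 32 bits: counter within that epoch
        System.out.printf("zxid 0x%x -> epoch %d, counter %d%n", zxid, epoch, counter);
        // prints: zxid 0x1000001be -> epoch 1, counter 446
    }
}
{code}

On the question of delaying the replacement: one approach we could take is to have the monitoring software wait until one of the surviving ensemble members reports itself as leader before starting the fresh instance. A rough sketch follows; the hostnames, client port and timeout are assumptions, and the {{stat}} four-letter command is only used to read the reported mode:

{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;

public class WaitForLeader {
    // Ask a ZooKeeper server for its "stat" output and check whether it reports Mode: leader.
    static boolean reportsLeader(String host, int port) {
        try (Socket s = new Socket(host, port)) {
            s.setSoTimeout(3000);
            s.getOutputStream().write("stat".getBytes());
            s.getOutputStream().flush();
            BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
            for (String line; (line = in.readLine()) != null; ) {
                if (line.startsWith("Mode:") && line.contains("leader")) {
                    return true;
                }
            }
        } catch (IOException e) {
            // Unreachable or not serving yet (e.g. still in leader election); treat as "no leader".
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        String[] survivors = {"192.168.130.10", "192.168.130.14"}; // remaining ensemble members (assumed)
        long deadline = System.currentTimeMillis() + 60000;        // assumed 60 s upper bound on the delay
        while (System.currentTimeMillis() < deadline) {
            for (String host : survivors) {
                if (reportsLeader(host, 2181)) {                   // 2181 = default client port (assumed)
                    System.out.println(host + " reports Mode: leader; safe to start the replacement");
                    return;
                }
            }
            Thread.sleep(1000); // poll once a second
        }
        System.out.println("No leader observed before the deadline; starting the replacement anyway");
    }
}
{code}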
> Leader election selected incorrect leader
> -----------------------------------------
>
>                 Key: ZOOKEEPER-917
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-917
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>    Affects Versions: 3.2.2
>        Environment: Cloudera distribution of zookeeper (patched to never cache DNS entries)
>                     Debian lenny
>            Reporter: Alexandre Hardy
>            Priority: Critical
>             Fix For: 3.3.3, 3.4.0
>
>         Attachments: zklogs-20101102144159SAST.tar.gz
>
>
> We had three nodes running zookeeper:
> * 192.168.130.10
> * 192.168.130.11
> * 192.168.130.14
> 192.168.130.11 failed, and was replaced by a new node 192.168.130.13 (automated startup). The new node had not participated in any zookeeper quorum previously. The node 192.168.130.11 was permanently removed from service and could not contribute to the quorum any further (powered off). DNS entries were updated for the new node to allow all the zookeeper servers to find the new node.
> The new node 192.168.130.13 was selected as the LEADER, despite the fact that it had not seen the latest zxid.
> This particular problem has not been verified with later versions of zookeeper, and no attempt has been made to reproduce this problem as yet.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.