[ https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928137#action_12928137 ]

Alexandre Hardy commented on ZOOKEEPER-917:
-------------------------------------------

The excerpts are extracted from {{hbase-0.20/hbase*.log}}, so the information 
should be readily available.
The tar file contents should be as follows:
{noformat}
drwxr-xr-x ah/users          0 2010-11-02 14:42 192.168.130.10/
drwxr-xr-x ah/users          0 2010-11-03 13:33 192.168.130.10/hbase-0.20/
-rw-r--r-- ah/users          0 2010-11-02 14:42 192.168.130.10/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-d3.out
-rw-r--r-- ah/users   62922921 2010-11-02 14:42 192.168.130.10/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-d3.log
drwxr-xr-x ah/users          0 2010-11-02 14:42 192.168.130.12/
drwxr-xr-x ah/users          0 2010-11-03 13:27 192.168.130.12/hbase-0.20/
drwxr-xr-x ah/users          0 2010-11-02 14:42 192.168.130.13/
drwxr-xr-x ah/users          0 2010-11-03 13:27 192.168.130.13/hbase-0.20/
-rw-r--r-- ah/users   65903411 2010-11-02 14:42 192.168.130.13/hbase-0.20/hbase--zookeeper-e0-cb-4e-65-4d-4e.log
-rw-r--r-- ah/users          0 2010-11-02 14:42 192.168.130.13/hbase-0.20/hbase--zookeeper-e0-cb-4e-65-4d-4e.out
drwxr-xr-x ah/users          0 2010-11-02 14:42 192.168.130.14/
drwxr-xr-x ah/users          0 2010-11-03 13:27 192.168.130.14/hbase-0.20/
-rw-r--r-- ah/users          0 2010-11-02 14:42 192.168.130.14/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-a8.out
-rw-r--r-- ah/users   62835121 2010-11-02 14:42 192.168.130.14/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-a8.log
{noformat}

The only logs missing are those for 192.168.130.11, but that should not affect 
the analysis of the leader election (I hope).

We are using monitoring software that detects when a zookeeper instance is no 
longer reachable and automatically starts a fresh zookeeper instance as a 
replacement. It can detect the failure and start the new instance fairly 
rapidly. Would it be better to delay the start of the fresh zookeeper instance 
so that the existing instances can elect a new leader first? If so, do you 
have any guidelines on how long that delay should be? (We are considering this 
approach, but would like to avoid it if possible.)
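
For concreteness, here is a minimal sketch of the delayed-replacement idea we 
are weighing. It assumes the monitoring hook can run a script before launching 
the replacement; the host list, client port, and grace period are illustrative 
values, not recommendations. The probe uses the standard {{stat}} four-letter 
word, which only reports a {{Mode:}} line once a server is actually serving in 
a quorum.
{code}
# Minimal sketch (illustrative): hold back the replacement instance until
# the surviving ensemble members report that they are serving again,
# i.e. a leader election has completed, or a grace period expires.
import socket
import time

SURVIVORS = ["192.168.130.10", "192.168.130.14"]  # remaining ensemble members
CLIENT_PORT = 2181            # hypothetical client port
ELECTION_GRACE_SECS = 30      # hypothetical; tune to initLimit * tickTime

def is_serving(host, port=CLIENT_PORT, timeout=5.0):
    """Send the 'stat' four-letter word; a server that has joined a quorum
    replies with server statistics including a 'Mode:' line."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"stat")
            data = b""
            while True:
                chunk = s.recv(1024)
                if not chunk:
                    break
                data += chunk
        return b"Mode:" in data
    except OSError:
        return False

def wait_before_replacement():
    """Give the surviving instances a chance to elect a leader among
    themselves before an empty replacement node joins the ensemble."""
    deadline = time.time() + ELECTION_GRACE_SECS
    while time.time() < deadline:
        if all(is_serving(h) for h in SURVIVORS):
            return
        time.sleep(1)

if __name__ == "__main__":
    wait_before_replacement()
    # ...hand control back to the monitoring software to start the
    # replacement zookeeper instance...
{code}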

{quote}
In your case, I'm still not sure why it happens because the initial zxid of 
node 1 is 4294967742 according to your excerpt. 
{quote}
That is indeed the key question I am trying to answer! :-)
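
For reference while digging into this: a zxid packs the leader epoch into the 
high 32 bits and a per-epoch transaction counter into the low 32 bits, so 
4294967742 decodes to epoch 1, counter 446. The snippet below shows that 
decoding plus my reading of how election votes ought to be ordered (higher 
zxid wins, server id breaks ties); it is a sketch of the expected ordering, 
not the actual FastLeaderElection code.
{code}
# Decode the zxid from the excerpt: high 32 bits = leader epoch,
# low 32 bits = per-epoch transaction counter.
ZXID = 4294967742  # initial zxid of node 1, per the log excerpt

epoch = ZXID >> 32            # -> 1
counter = ZXID & 0xFFFFFFFF   # -> 446
print("epoch=%d counter=%d" % (epoch, counter))

# Sketch of the expected vote ordering (not the actual election code):
# a proposed leader should displace the current choice only if it has
# seen more history (higher zxid), or ties on zxid with a higher sid.
def vote_wins(new_zxid, new_sid, cur_zxid, cur_sid):
    return new_zxid > cur_zxid or (new_zxid == cur_zxid and new_sid > cur_sid)

# By this ordering, a fresh node at zxid 0 should never beat a node at
# zxid 4294967742, which is exactly why the observed result is puzzling.
print(vote_wins(0, 13, ZXID, 1))  # -> False
{code}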

> Leader election selected incorrect leader
> -----------------------------------------
>
>                 Key: ZOOKEEPER-917
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-917
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>    Affects Versions: 3.2.2
>         Environment: Cloudera distribution of zookeeper (patched to never 
> cache DNS entries)
> Debian lenny
>            Reporter: Alexandre Hardy
>            Priority: Critical
>             Fix For: 3.3.3, 3.4.0
>
>         Attachments: zklogs-20101102144159SAST.tar.gz
>
>
> We had three nodes running zookeeper:
>   * 192.168.130.10
>   * 192.168.130.11
>   * 192.168.130.14
> 192.168.130.11 failed, and was replaced by a new node 192.168.130.13 
> (automated startup). The new node had not participated in any zookeeper 
> quorum previously. The node 192.168.130.11 was permanently removed from
> service and could not contribute to the quorum any further (powered off).
> DNS entries were updated for the new node to allow all the zookeeper servers 
> to find the new node.
> The new node 192.168.130.13 was selected as the LEADER, despite the fact that 
> it had not seen the latest zxid.
> This particular problem has not been verified with later versions of 
> zookeeper, and no attempt has been made to reproduce this problem as yet.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
