We will retry with the new election algorithm and let you know the results.

Thanks for getting back so quickly.

Austin

On Sep 2, 2008, at 10:22 AM, Benjamin Reed wrote:

I think there is a race condition that is probably easy to get into with
the old leader election and a large number of servers:

1) Leader dies
2) Followers start looking for a new leader before all Followers have
abandoned the Leader
3) The Followers looking for a new leader see votes of Followers still
following the (now dead) Leader and start voting for the dead Leader
4) The dead Leader gets reelected.

For the old leader election, a server should not vote for another server
that is not nominating itself.
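
To make that rule concrete, here is a minimal sketch in Java. It is not
the ZooKeeper implementation: the class name, the `tally` method, and the
map-of-votes representation are made up for illustration. Each entry maps
a responding server's id to the id it currently votes for, and a vote is
counted only if the candidate itself responded and votes for itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not code from ZooKeeper.
class SelfNominationTally {

    /**
     * votes: id of each server that answered -> id it currently votes for.
     * Returns the winning id, or -1 if no vote could be counted.
     */
    static long tally(Map<Long, Long> votes) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long candidate : votes.values()) {
            // Count a vote only if the candidate answered and nominates itself.
            Long own = votes.get(candidate);
            if (own == null || own != candidate) {
                continue;
            }
            counts.merge(candidate, 1, Integer::sum);
        }
        long winner = -1;
        int best = 0;
        for (Map.Entry<Long, Integer> e : counts.entrySet()) {
            // Ties go to the higher id (an arbitrary choice for this sketch).
            if (e.getValue() > best
                    || (e.getValue() == best && e.getKey() > winner)) {
                best = e.getValue();
                winner = e.getKey();
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        Map<Long, Long> votes = new HashMap<>();
        votes.put(1L, 7L); // still following the dead leader 7
        votes.put(2L, 7L);
        votes.put(3L, 7L);
        votes.put(4L, 8L);
        votes.put(8L, 8L); // 8 nominates itself
        System.out.println(tally(votes));
    }
}
```

With the votes in `main`, three followers still point at server 7, which
never answered. Those votes are dropped, and server 8, which nominates
itself, wins with two votes.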

I'll open a Jira.

ben

-----Original Message-----
From: Mahadev Konar [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 02, 2008 10:06 AM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Leader election stalled

Hi Austin,
Did you kill the leader process? It looks like you didn't kill the
server, since it's responding to ruok. Is that true?

mahadev


On 9/2/08 9:56 AM, "Austin Shoemaker" <[EMAIL PROTECTED]> wrote:

Hi,

We have run into a situation where killing the leader results in
followers perpetually trying to reelect that leader.

We have 11 ZooKeeper (2.2.1 from SF.net) servers and 256 clients
connecting at random. We kill the leader and observe the impact,
monitoring a script that repeatedly prints the responses to "ruok" and
"stat". All servers except the killed leader respond with "imok" and
"ZooKeeperServer not running", respectively.
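
For anyone who wants to reproduce the probe, the sketch below shows one
way to send a four-letter command such as "ruok" or "stat" and read the
reply. It is not the script used above: the class name, the `send`
helper, and the timeout parameter are assumptions, and the host and
client port must be supplied for your own deployment.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not the monitoring script from this thread.
class FourLetterProbe {

    /** Sends cmd to host:port and returns everything the server writes back. */
    static String send(String host, int port, String cmd, int timeoutMs)
            throws IOException {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            s.setSoTimeout(timeoutMs); // bound each read as well as the connect
            s.getOutputStream().write(cmd.getBytes(StandardCharsets.US_ASCII));
            s.getOutputStream().flush();

            // Read until the server closes the connection.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            InputStream in = s.getInputStream();
            byte[] chunk = new byte[512];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.US_ASCII).trim();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FourLetterProbe <host> <clientPort>
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        System.out.println("ruok: " + send(host, port, "ruok", 2000));
        System.out.println("stat: " + send(host, port, "stat", 2000));
    }
}
```

A killed server shows up here as a `ConnectException` or a timeout rather
than a reply, so a wrapper loop would need to catch `IOException` per
server.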

About half of the time, each remaining server gets into a loop of
failing to connect to the killed leader and then reelecting the killed
leader.

Here is an example log, which is representative of similar logs on the
other servers. We additionally logged connectivity during leader
election. If anyone would like complete logs, let me know.

Thanks,

Austin Shoemaker

WARN  - [QuorumPeer:[EMAIL PROTECTED] - FOLLOWING
*WARN  - [QuorumPeer:[EMAIL PROTECTED] - Following /10.50.65.22:2889*
ERROR - [QuorumPeer:[EMAIL PROTECTED] - FIXMSG
java.net.ConnectException: Connection refused
*.... cont'd ....*

ERROR - [QuorumPeer:[EMAIL PROTECTED] - FIXMSG
java.lang.Exception: shutdown Follower
       at com.yahoo.zookeeper.server.quorum.Follower.shutdown(Follower.java:364)
       at com.yahoo.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:403)
WARN  - [QuorumPeer:[EMAIL PROTECTED] - LOOKING
WARN  - [QuorumPeer:[EMAIL PROTECTED] - ----> Sending election packet to /10.50.65.22:2888
WARN - [QuorumPeer:[EMAIL PROTECTED] - ----> Received response from /10.50.65.22:2888
WARN  - [QuorumPeer:[EMAIL PROTECTED] - ----> Sending election packet to /10.50.65.21:2888
WARN - [QuorumPeer:[EMAIL PROTECTED] - ----> Received response from /10.50.65.21:2888
WARN  - [QuorumPeer:[EMAIL PROTECTED] - ----> Sending election packet to /10.50.65.12:2888
WARN - [QuorumPeer:[EMAIL PROTECTED] - ----> Received response from /10.50.65.12:2888
WARN  - [QuorumPeer:[EMAIL PROTECTED] - ----> Sending election packet to /10.50.65.11:2888
WARN - [QuorumPeer:[EMAIL PROTECTED] - ----> Received response from /10.50.65.11:2888
WARN  - [QuorumPeer:[EMAIL PROTECTED] - ----> Sending election packet to /10.50.65.12:2890
WARN - [QuorumPeer:[EMAIL PROTECTED] - ----> Received response from /10.50.65.12:2890
WARN  - [QuorumPeer:[EMAIL PROTECTED] - ----> Sending election packet to /10.50.65.11:2890
WARN - [QuorumPeer:[EMAIL PROTECTED] - ----> Received response from /10.50.65.11:2890
WARN  - [QuorumPeer:[EMAIL PROTECTED] - ----> Sending election packet to /10.50.65.22:2889
*WARN  - [QuorumPeer:[EMAIL PROTECTED] - ----> Exception occurred when sending / receiving packet to / from /10.50.65.22:2889
java.net.SocketTimeoutException: Receive timed out*
WARN  - [QuorumPeer:[EMAIL PROTECTED] - ----> Sending election packet to /10.50.65.21:2890
WARN - [QuorumPeer:[EMAIL PROTECTED] - ----> Received response from /10.50.65.21:2890
WARN  - [QuorumPeer:[EMAIL PROTECTED] - ----> Sending election packet to /10.50.65.21:2889
WARN - [QuorumPeer:[EMAIL PROTECTED] - ----> Received response from /10.50.65.21:2889
WARN  - [QuorumPeer:[EMAIL PROTECTED] - ----> Sending election packet to /10.50.65.12:2889
WARN - [QuorumPeer:[EMAIL PROTECTED] - ----> Received response from /10.50.65.12:2889
WARN  - [QuorumPeer:[EMAIL PROTECTED] - ----> Sending election packet to /10.50.65.11:2889
WARN - [QuorumPeer:[EMAIL PROTECTED] - ----> Received response from /10.50.65.11:2889
WARN  - [QuorumPeer:[EMAIL PROTECTED] - Election tally:
WARN  - [QuorumPeer:[EMAIL PROTECTED] - 8 -> 1
WARN  - [QuorumPeer:[EMAIL PROTECTED] - 4 -> 1
WARN  - [QuorumPeer:[EMAIL PROTECTED] - 7 -> 8
WARN  - [QuorumPeer:[EMAIL PROTECTED] - ----> Election complete, result.winner = 7
*WARN  - [QuorumPeer:[EMAIL PROTECTED] - ----> Election complete, address = /10.50.65.22:2889
WARN  - [QuorumPeer:[EMAIL PROTECTED] - FOLLOWING
WARN  - [QuorumPeer:[EMAIL PROTECTED] - Following /10.50.65.22:2889
ERROR - [QuorumPeer:[EMAIL PROTECTED] - FIXMSG
java.net.ConnectException: Connection refused*
       at java.net.PlainSocketImpl.socketConnect(Native Method)
       at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
       at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
       at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
       at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
       at java.net.Socket.connect(Socket.java:519)
       at com.yahoo.zookeeper.server.quorum.Follower.followLeader(Follower.java:133)
       at com.yahoo.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:399)

