[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888834#action_12888834
 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-790:
--------------------------------------------------

Thank you both for all the information. I have been able to reproduce and find 
the source of the bug, but I don't have a patch yet. The problem is deeper than 
I thought originally. Let me show you what is going on:

I'm including an excerpt of logs from two runs: one good and one bad. The first 
run is good (see the excerpt below). I have killed the follower and restarted 
it as Vishal suggested. When it comes back, it declares itself as leader, also 
as Vishal and Travis observed.  However, different from what Vishal and Travis 
observed, it drops leadership and follows happily the leader right after.

{noformat}
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:quorump...@654] - LEADING
INFO  - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3 
(n.leader), 0 (n.zxid), 1 (n.round), LEADING (n.state), 3 (n.sid), LOOKING (my 
state)
INFO  - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3 
(n.leader), 0 (n.zxid), 1 (n.round), LEADING (n.state), 3 (n.sid), LEADING (my 
state)
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:lea...@54] - TCP NoDelay set to: true
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:zookeeper.version=3.4.0--1, built on 07/15/2010 10:36 GMT
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:host.name=XXXXXXXXXX.com
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:java.version=1.6.0_04
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:java.vendor=Sun Microsystems Inc.
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:java.home=/usr/java/jdk1.6.0_04/jre
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:java.class.path=.XXXXXXXXX
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:java.library.path= XXXXXXXXX
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:java.io.tmpdir=/tmp
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:java.compiler=<NA>
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:os.name=Linux
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:os.arch=amd64
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:os.version=2.6.18-53.1.21.el5
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:user.name=XXXXXXXXX
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:user.home=XXXXXXXXX
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:user.dir=XXXXXXXXX
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:zookeeperser...@151] - Created 
server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 60000 
datadir /XXXXXXXXX/zookeeper/version-2 snapdir /XXXXXXXX/zookeeper/version-2
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:files...@82] - Reading snapshot 
/XXXXXXXX/zookeeper/version-2/snapshot.100113340
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:files...@82] - Reading snapshot 
/XXXXXXXX/zookeeper/version-2/snapshot.100113340
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:filetxnsnap...@208] - Snapshotting: 
10011f748
INFO  - [SessionTracker:zookeeperser...@315] - Expiring session 
0x229d6a9e0ca0000, timeout of 10000ms exceeded
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:lea...@394] - Shutdown called
java.lang.Exception: shutdown Leader! reason: Waiting for a quorum of 
followers, only synced with: 2: 
        at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:394)
        at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:317)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:657)
INFO  - [Thread-10:leader$learnercnxaccep...@243] - exception while shutting 
down acceptor: java.net.SocketException: Socket closed
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:finalrequestproces...@378] - 
shutdown of request processor complete
INFO  - [SyncThread:2:syncrequestproces...@151] - SyncRequestProcessor exited!
INFO  - [CommitProcessor:2:commitproces...@148] - CommitProcessor exited loop!
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:quorump...@620] - LOOKING
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:files...@82] - Reading snapshot 
/XXXXXXXXX/zookeeper/version-2/snapshot.10011f748
INFO  - [SessionTracker:sessiontrackeri...@158] - SessionTrackerImpl exited 
loop!
INFO  - [ProcessThread:-1:preprequestproces...@385] - Processed session 
termination for sessionid: 0x229d6a9e0ca0000
ERROR - [ProcessThread:-1:nioservercnxn$factor...@87] - Thread 
Thread[ProcessThread:-1,5,main] died
java.lang.NullPointerException
        at 
org.apache.zookeeper.server.quorum.ProposalRequestProcessor.processRequest(ProposalRequestProcessor.java:71)
        at 
org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:435)
        at 
org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:fastleaderelect...@663] - New 
election. My id =  2, Proposed zxid = 4296144712
INFO  - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 2 
(n.leader), 4296144712 (n.zxid), 2 (n.round), LOOKING (n.state), 2 (n.sid), 
LOOKING (my state)
INFO  - [WorkerSender Thread:quorumcnxmana...@162] - Have smaller server 
identifier, so dropping the connection: (3, 2)
INFO  - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3 
(n.leader), 0 (n.zxid), 1 (n.round), LEADING (n.state), 3 (n.sid), LOOKING (my 
state)
INFO  - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3 
(n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 1 (n.sid), LOOKING 
(my state)
INFO  - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3 
(n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 1 (n.sid), LOOKING 
(my state)
INFO  - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3 
(n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 1 (n.sid), LOOKING 
(my state)
INFO  - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3 
(n.leader), 0 (n.zxid), 1 (n.round), LEADING (n.state), 3 (n.sid), LOOKING (my 
state)
INFO  - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3 
(n.leader), 0 (n.zxid), 1 (n.round), LEADING (n.state), 3 (n.sid), LOOKING (my 
state)
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:quorump...@642] - FOLLOWING
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:lear...@72] - TCP NoDelay set to: 
true
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:zookeeperser...@151] - Created 
server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 60000 
datadir /XXXXXXXXX/zookeeper/version-2 snapdir /XXXXXXXXX/zookeeper/version-2
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:lear...@285] - Getting a snapshot 
from leader
FATAL - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:lear...@314] - Setting leader epoch 1
WARN  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:follo...@116] - Got zxid 0x100162711 
expected 0x1
WARN  - [SyncThread:2:filetxn...@196] - Creating new log file: 100162711
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:filetxnsnap...@208] - Snapshotting: 
100164c43
{noformat}


Now, the following run is bad. I followed the same procedure, but this time we 
get the epoch problem. Looking carefully at the logs, you'll see that the main 
difference is that this time the zookeeper server created a new log file: 
log.200000001. This is bad because the zookeeper server will use the zxid in 
the latest txn log file to determine its last epoch. Given that the current 
leader is in epoch 1, it will correctly refuse to talk to the leader as we 
observe below.

{noformat}
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:quorump...@654] - LEADING
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:lea...@54] - TCP NoDelay set to: true
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:zookeeper.version=3.4.0--1, built on 07/15/2010 10:36 GMT
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:host.name=XXXXXXXX.com
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:java.version=1.6.0_05
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:java.vendor=Sun Microsystems Inc.
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:java.home=/usr/java/jdk1.6.0_05/jre
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:java.class.path=XXXXXXXXX
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:java.library.path=XXXXXXXXX
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:java.io.tmpdir=/tmp
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:java.compiler=<NA>
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:os.name=Linux
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:os.arch= amd64
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:os.version=2.6.18-92.1.13.el5.perfctr.2.6.36
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:user.name=XXXXXXXXX
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:user.home=XXXXXXXX
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:environm...@97] - Server 
environment:user.dir=XXXXXXXXX
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:zookeeperser...@151] - Created 
server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 60000 
datadir /xxxxxxx/zookeeper/version-2 snapdir /XXXXXXX/zookeeper/version-2
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:files...@82] - Reading snapshot 
/XXXXXXX/zookeeper/version-2/snapshot.1000469b4
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:files...@82] - Reading snapshot 
/XXXXXXX/zookeeper/version-2/snapshot.1000469b4
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:filetxnsnap...@208] - Snapshotting: 
100055cec
INFO  - [SessionTracker:zookeeperser...@315] - Expiring session 
0x129d6b61b5b0000, timeout of 10000ms exceeded
INFO  - [ProcessThread:-1:preprequestproces...@385] - Processed session 
termination for sessionid: 0x129d6b61b5b0000
WARN  - [SyncThread:1:filetxn...@196] - Creating new log file: log.200000001
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:lea...@394] - Shutdown called
java.lang.Exception: shutdown Leader! reason: Waiting for a quorum of 
followers, only synced with: 1: 
        at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:394)
        at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:317)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:657)
INFO  - [ProcessThread:-1:preprequestproces...@119] - PrepRequestProcessor 
exited loop!
INFO  - [CommitProcessor:1:commitproces...@148] - CommitProcessor exited loop!
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:finalrequestproces...@378] - 
shutdown of request processor complete
INFO  - [SyncThread:1:syncrequestproces...@151] - SyncRequestProcessor exited!
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:quorump...@620] - LOOKING
INFO  - [Thread-8:leader$learnercnxaccep...@243] - exception while shutting 
down acceptor: java.net.SocketException: Socket closed
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:files...@82] - Reading snapshot 
/XXXXXXXX/zookeeper/version-2/snapshot.100055cec
INFO  - [SessionTracker:sessiontrackeri...@158] - SessionTrackerImpl exited 
loop!
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:fastleaderelect...@663] - New 
election. My id =  1, Proposed zxid = 8589934593
INFO  - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 1 
(n.leader), 8589934593 (n.zxid), 2 (n.round), LOOKING (n.state), 1 (n.sid), 
LOOKING (my state)
INFO  - [WorkerSender Thread:quorumcnxmana...@162] - Have smaller server 
identifier, so dropping the connection: (2, 1)
INFO  - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3 
(n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 2 (n.sid), LOOKING 
(my state)
INFO  - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3 
(n.leader), 0 (n.zxid), 1 (n.round), LEADING (n.state), 3 (n.sid), LOOKING (my 
state)
INFO  - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3 
(n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 2 (n.sid), LOOKING 
(my state)
INFO  - [WorkerReceiver Thread:fastleaderelect...@496] - Notification: 3 
(n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 2 (n.sid), LOOKING 
(my state)
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:quorump...@642] - FOLLOWING
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:lear...@72] - TCP NoDelay set to: 
true
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:zookeeperser...@151] - Created 
server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 60000 
datadir /xxxxxxx/zookeeper/version-2 snapdir /xxxxxxx/zookeeper/version-2
FATAL - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:follo...@71] - Leader epoch 1 is 
less than our epoch 2
WARN  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:follo...@82] - Exception when 
following the leader
java.io.IOException: Error: Epoch of leader is lower
        at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
INFO  - [QuorumPeer:/0:0:0:0:0:0:0:0:10218:follo...@166] - shutdown called
java.lang.Exception: shutdown Follower
        at 
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
{noformat}

Fixing it is not as simple as setting the current zxid to 200000001 after we 
hear from a quorum because the request processor pipeline starts complaining 
about an attempt to commit a future proposal.

> Last processed zxid set prematurely while establishing leadership
> -----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-790
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.1
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Flavio Paiva Junqueira
>            Priority: Blocker
>             Fix For: 3.3.2, 3.4.0
>
>         Attachments: ZOOKEEPER-790.patch, ZOOKEEPER-790.travis.log.bz2
>
>
> The leader code is setting the last processed zxid to the first of the new 
> epoch even before connecting to a quorum of followers. Because the leader 
> code sets this value before connecting to a quorum of followers 
> (Leader.java:281) and the follower code throws an IOException 
> (Follower.java:73) if the leader epoch is smaller, we have that when the 
> false leader drops leadership and becomes a follower, it finds a smaller 
> epoch and kills itself.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to