[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-11-20 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934202#action_12934202
 ] 

Flavio Junqueira commented on ZOOKEEPER-335:


Radu, It sounds like the problem you mention has been resolved in 
ZOOKEEPER-790. I'm not sure which version you're using, but perhaps you should 
consider moving to 3.3.2.

 zookeeper servers should commit the new leader txn to their logs.
 -

 Key: ZOOKEEPER-335
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-335
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.1.0
Reporter: Mahadev konar
Assignee: Mahadev konar
Priority: Blocker
 Fix For: 3.4.0

 Attachments: faultynode-vishal.txt, zk.log.gz, zklogs.tar.gz, 
 ZOOKEEPER-790.travis.log.bz2


 Currently the ZooKeeper followers do not commit the new leader election. This 
 will cause problems in failure scenarios where a follower acks the same 
 leader txn id twice, possibly to two different intermittent leaders, 
 allowing them to propose two different txns with the same zxid.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-07-14 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888684#action_12888684
 ] 

Travis Crawford commented on ZOOKEEPER-335:
---

Unfortunately I still observed the Leader epoch issue and needed to manually 
force a leader election for the cluster to recover. This test was performed 
with the following base+patches, applied in the order listed.

Zookeeper 3.3.1
ZOOKEEPER-744
ZOOKEEPER-790


{code}
2010-07-15 02:43:57,181 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] 
- Reading snapshot /data/zookeeper/version-2/snapshot.231ac2
2010-07-15 02:43:57,384 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id 
=  1, Proposed zxid = 154618826848
2010-07-15 02:43:57,385 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 
154618826848, 4, 1, LOOKING, LOOKING, 1
2010-07-15 02:43:57,385 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 2, 
146030952153, 3, 1, LOOKING, LEADING, 2
2010-07-15 02:43:57,385 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 2, 
146030952153, 3, 1, LOOKING, FOLLOWING, 3
2010-07-15 02:43:57,385 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
2010-07-15 02:43:57,385 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with 
tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir 
/data/zookeeper/txlog/version-2 snapdir /data/zookeeper/version-2
2010-07-15 02:43:57,387 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] 
- Leader epoch 23 is less than our epoch 24
2010-07-15 02:43:57,387 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] 
- Exception when following the leader 
java.io.IOException: Error: Epoch of leader is lower
at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
2010-07-15 02:43:57,387 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] 
- shutdown called 
java.lang.Exception: shutdown Follower
at 
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
{code}


I followed the recipe @vishal provided for reproducing the issue.

(a) Stop one follower in a three node cluster
(b) Get some tea while it falls behind
(c) Start the node stopped in (a).


These timestamps show when the follower was stopped and when it was turned 
back on.

{code}
2010-07-15 02:35:36,398 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@1661] - Established session 
0x229aa13cfc6276b with negotiated timeout 1 for client /10.209.45.114:34562
2010-07-15 02:39:18,907 - INFO  [main:quorumpeercon...@90] - Reading 
configuration from: /etc/zookeeper/conf/zoo.cfg
{code}


This timestamp is the first ``Leader epoch`` line. Everything between these two 
points will be the interesting bits.

{code}
2010-07-15 02:39:43,339 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] 
- Leader epoch 23 is less than our epoch 24
{code}
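
As a reading aid for the epochs reported above: a zxid is a 64-bit value whose upper 32 bits hold the epoch and whose lower 32 bits hold a counter within that epoch. The sketch below splits the zxids taken from the election notifications in the first log excerpt. The class and method names are hypothetical (they are not part of ZooKeeper), and the comparison with the FATAL message assumes that message prints epochs in hexadecimal.

```java
// Illustrative helper, not ZooKeeper code: splits a zxid into (epoch, counter).
final class ZxidParts {
    /** Upper 32 bits of the zxid: the leader epoch. */
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    /** Lower 32 bits of the zxid: the counter within the epoch. */
    static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    static String describe(long zxid) {
        return "epoch 0x" + Long.toHexString(epoch(zxid))
                + ", counter 0x" + Long.toHexString(counter(zxid));
    }

    public static void main(String[] args) {
        // zxid proposed by the restarted node (My id = 1) in the log above
        System.out.println(describe(154618826848L));
        // zxid carried in the notifications received from servers 2 and 3
        System.out.println(describe(146030952153L));
    }
}
```

Under that assumption, the restarted node's own zxid carries epoch 0x24, while the zxid in the notifications from the other two servers carries epoch 0x22.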




[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-22 Thread Flavio Paiva Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881168#action_12881168
 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-335:
--

Thanks for the detailed assessment, Vishal. In Step b, the fact that the process 
believes it is the leader is not a problem, and it happens because we queue 
notification messages during leader election. 

The real issue is that the leader code sets the last processed zxid to the 
first zxid of the new epoch (Leader.java:281) before connecting to a quorum of 
followers, while the follower code throws an IOException (Follower.java:73) if 
the leader epoch is smaller than its own. As a result, when the false leader 
drops leadership and becomes a follower, it finds a leader with a smaller epoch 
and kills itself.

I noticed that this follower check was not there before (it is not present in 
the 3.0 branch), and it might have been introduced when we did the observer 
reorganization. For now I propose that we move line Leader.java:281 to 
Leader.java:470. This simply changes the point at which we set the last 
processed zxid to one at which we know that a quorum of followers supports the 
leader. I reasoned a bit about it and verified that the tests pass.

A patch for the change I'm proposing is trivial, but a unit test will require 
some work, so I'd rather hear opinions first. Also, please note that this 
problem is not related to the topic of this jira, so we might consider working 
on a different jira from this point on. 
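
To make the ordering issue concrete, here is a minimal, simplified model of the two pieces of logic described in this comment. It is not the ZooKeeper source: the class, field, and method names are hypothetical. It models only the follower-side check (reject a leader whose epoch is lower than ours) and the point at which a would-be leader advances its last processed zxid.

```java
import java.io.IOException;

// Simplified model, not ZooKeeper code. All names are hypothetical.
final class EpochOrderingSketch {
    /** Last processed zxid of this peer (epoch in the upper 32 bits). */
    long lastProcessedZxid;

    EpochOrderingSketch(long lastProcessedZxid) {
        this.lastProcessedZxid = lastProcessedZxid;
    }

    /** First zxid of the epoch that follows the epoch of the given zxid. */
    static long firstZxidOfNextEpoch(long zxid) {
        return ((zxid >>> 32) + 1) << 32;
    }

    /**
     * Models a leadership attempt. If bumpBeforeQuorum is true, the zxid is
     * advanced even when no quorum is known (the behaviour discussed above);
     * otherwise it is advanced only once a quorum of followers supports us.
     */
    void lead(boolean bumpBeforeQuorum, boolean quorumReached) {
        if (bumpBeforeQuorum || quorumReached) {
            lastProcessedZxid = firstZxidOfNextEpoch(lastProcessedZxid);
        }
    }

    /** Models the follower-side check: a leader with a lower epoch is rejected. */
    void follow(long leaderZxid) throws IOException {
        if ((leaderZxid >>> 32) < (lastProcessedZxid >>> 32)) {
            throw new IOException("Error: Epoch of leader is lower");
        }
    }

    public static void main(String[] args) throws IOException {
        long realLeaderZxid = 7L << 32;

        // Proposed ordering: a failed leadership attempt leaves the epoch alone.
        EpochOrderingSketch late = new EpochOrderingSketch(7L << 32);
        late.lead(false, false);
        late.follow(realLeaderZxid); // accepted: epochs are equal

        // Early bump: the epoch moves to 8 although no quorum was reached.
        EpochOrderingSketch early = new EpochOrderingSketch(7L << 32);
        early.lead(true, false);
        try {
            early.follow(realLeaderZxid);
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this model, setting the zxid only after a quorum is known keeps a failed leadership attempt from leaving the peer with an epoch higher than the real leader's.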






[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-22 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881236#action_12881236
 ] 

Patrick Hunt commented on ZOOKEEPER-335:


Vishal, if Flavio provides you with a patch could you apply it and verify with 
your configuration?

Flavio, please provide an initial patch that people could use to verify. We'll 
hold off on a release until you add the test(s), but this would be great to 
start with.

Thanks all for helping to track this down!

I'd like to fast track a 3.3.2 release, so if possible please make this a 
priority.




[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-22 Thread Flavio Paiva Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881244#action_12881244
 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-335:
--

I have created a new jira for this issue: ZOOKEEPER-790. There is a patch there.




[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-22 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881280#action_12881280
 ] 

Vishal K commented on ZOOKEEPER-335:


I will try out the patch. FYI I am using 3.3.0.




[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-21 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880917#action_12880917
 ] 

Patrick Hunt commented on ZOOKEEPER-335:


vishal comment on list:


I might be wrong here, but let me try to chip in my few cents.

I think the problem is in LearnerHandler.java at the leader of this
Follower.

/* see what other packets from the proposal
 * and tobeapplied queues need to be sent
 * and then decide if we can just send a DIFF
 * or we actually need to send the whole snapshot
 */
long leaderLastZxid = leader.startForwarding(this, updates);
// --- this leaderLastZxid returned is probably incorrect.
// a special case when both the ids are the same
if (peerLastZxid == leaderLastZxid) {
    packetToSend = Leader.DIFF;
    zxidToSend = leaderLastZxid;
}

QuorumPacket newLeaderQP = new QuorumPacket(Leader.NEWLEADER,
        leaderLastZxid, null, null);
oa.writeRecord(newLeaderQP, "packet");
bufferedOutput.flush();





[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-21 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880918#action_12880918
 ] 

Patrick Hunt commented on ZOOKEEPER-335:


vishal comment on list:


Nevermind. I am on the wrong track. Flavio's earlier mail did clarify that
the follower received the epoch before restart.





Re: [jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-21 Thread Patrick Hunt
Please use the JIRA for followups, otherwise it's hard to track 
progress/status. Thanks.


Patrick

On 06/18/2010 04:45 PM, Vishal K wrote:

Hi Flavio,

I have 3 sets of logs and they all seem to indicate two problems on the
misbehaving follower:

Problem 1: Expected zxid is incorrect
=0[QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x30002 expected
0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x30002 expected
0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x40001 expected
0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x40001 expected
0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x50001 expected
0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x50001 expected
0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x60001 expected
0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x60001 expected
0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x70001 expected
0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x70001 expected
0x1

Note expected zxid is always 0x1 (lastQueued is always 0?)

Problem 2: While joining the cluster expected epoch is 1 higher than seen
earlier
=14991 [QuorumPeer:/0.0.0.0:2181] FATAL
org.apache.zookeeper.server.quorum.Learner  - Leader epoch 7 is less than
our epoch 8

-Vishal


[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-21 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880919#action_12880919
 ] 

Patrick Hunt commented on ZOOKEEPER-335:


Vishal comment on list:


Hi Flavio,

I have 3 sets of logs and they all seem to indicate two problems on the
misbehaving follower:

Problem 1: Expected zxid is incorrect
=0[QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x30002 expected
0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x30002 expected
0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x40001 expected
0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x40001 expected
0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x50001 expected
0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x50001 expected
0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x60001 expected
0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x60001 expected
0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x70001 expected
0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x70001 expected
0x1

Note expected zxid is always 0x1 (lastQueued is always 0?)

Problem 2: While joining the cluster expected epoch is 1 higher than seen
earlier
=14991 [QuorumPeer:/0.0.0.0:2181] FATAL
org.apache.zookeeper.server.quorum.Learner  - Leader epoch 7 is less than
our epoch 8

-Vishal
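
For context on Problem 1, a warning of this shape can be modelled as a check that each proposal's zxid is exactly one greater than the last zxid queued. The sketch below is a simplified illustration with hypothetical names; it is not copied from the Learner class, and it assumes the expected value is computed as lastQueued + 1. Under that assumption, "expected 0x1" corresponds to lastQueued being 0, which is what the note in this message guesses.

```java
// Simplified illustration, not ZooKeeper code. Names are hypothetical.
final class ProposalOrderSketch {
    private long lastQueued; // zxid of the last proposal queued; 0 if none

    /** Returns a warning if zxid is not lastQueued + 1, otherwise null. */
    String onProposal(long zxid) {
        String warning = null;
        long expected = lastQueued + 1;
        if (zxid != expected) {
            warning = "Got zxid 0x" + Long.toHexString(zxid)
                    + " expected 0x" + Long.toHexString(expected);
        }
        lastQueued = zxid;
        return warning;
    }

    public static void main(String[] args) {
        ProposalOrderSketch learner = new ProposalOrderSketch();
        // First proposal seen while lastQueued is still 0: a warning is produced.
        System.out.println(learner.onProposal(0x30002L));
        // The next proposal in sequence produces no warning (null is printed).
        System.out.println(learner.onProposal(0x30003L));
    }
}
```

In this model the warning can only keep reporting "expected 0x1" if lastQueued is 0 each time a proposal arrives, for example because it was reset or never updated.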





[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-21 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881028#action_12881028
 ] 

Vishal K commented on ZOOKEEPER-335:


Hi,

I enabled tracing and did some more debugging. It looks like the restarted peer 
(the one trying to join the cluster) determines that it is the leader and 
increments its epoch. However, the rest of the nodes don't acknowledge this node 
as the leader, and hence have an older epoch. I will attach the log. 
Unfortunately, I don't have traces from the other nodes. I will repeat the 
experiment later and attach logs from the other nodes. 

Scenario:
- Form a 3 node cluster. This is not just a ZK cluster. It also involves our 
application cluster that uses ZK.
- Kill one of the followers
- After a minute or so, restart the follower
- The follower rejects the leader with "Leader epoch y is less than our epoch y + 1"

From logs:

a) Peer X restarts and starts leader election.
a) For a small window of time, X thinks that it is the new leader! During this 
window, for some reason, rest of the nodes tell X that they are also trying to 
find a leader. I.e., all 3 nodes are in LOOKING state. After seeing that all 3 
nodes are in LOOKING state, X decides to be a leader?

2010-06-20 23:22:46,421 - DEBUG [WorkerSender 
Thread:quorumcnxmana...@346] - Opening channel to server 1
2010-06-20 23:22:46,423 - DEBUG [WorkerReceiver 
Thread:fastleaderelection$messenger$workerrecei...@214] - Receive new 
notification message. My id = 0
2010-06-20 23:22:46,424 - INFO  
[QuorumPeer:/0.0.0.0:2181:fastleaderelect...@689] - Notification: 0, 
77309411393, 1, 0, LOOKING, LOOKING, 0
2010-06-20 23:22:46,424 - DEBUG 
[QuorumPeer:/0.0.0.0:2181:fastleaderelect...@495] - id: 0, proposed id: 0, 
zxid: 77309411393, proposed zxid: 77309411393
2010-06-20 23:22:46,424 - DEBUG 
[QuorumPeer:/0.0.0.0:2181:fastleaderelect...@717] - Adding vote: From = 0, 
Proposed leader = 0, Porposed zxid = 77309411393, Proposed epoch = 1
2010-06-20 23:22:46,426 - INFO  [WorkerSender 
Thread:quorumcnxmana...@162] - Have smaller server identifier, so dropping the 
connection: (1, 0)
2010-06-20 23:22:46,426 - DEBUG [WorkerSender 
Thread:quorumcnxmana...@346] - Opening channel to server 2
2010-06-20 23:22:46,427 - DEBUG [Thread-1:quorumcnxmanager$liste...@445] 
- Connection request /192.168.1.182:46701
2010-06-20 23:22:46,427 - DEBUG [Thread-1:quorumcnxmanager$liste...@448] 
- Connection request: 0
2010-06-20 23:22:46,428 - DEBUG 
[Thread-1:quorumcnxmanager$sendwor...@504] - Address of remote peer: 1
2010-06-20 23:22:46,428 - INFO  [WorkerSender 
Thread:quorumcnxmana...@162] - Have smaller server identifier, so dropping the 
connection: (2, 0)
2010-06-20 23:22:46,431 - DEBUG [WorkerReceiver 
Thread:fastleaderelection$messenger$workerrecei...@214] - Receive new 
notification message. My id = 0
2010-06-20 23:22:46,432 - INFO  
[QuorumPeer:/0.0.0.0:2181:fastleaderelect...@689] - Notification: 1, 
77309411372, 1, 0, LOOKING, LOOKING, 1
2010-06-20 23:22:46,432 - DEBUG 
[QuorumPeer:/0.0.0.0:2181:fastleaderelect...@495] - id: 1, proposed id: 0, 
zxid: 77309411372, proposed zxid: 77309411393
2010-06-20 23:22:46,432 - DEBUG 
[QuorumPeer:/0.0.0.0:2181:fastleaderelect...@717] - Adding vote: From = 1, 
Proposed leader = 1, Porposed zxid = 77309411372, Proposed epoch = 1
2010-06-20 23:22:46,436 - DEBUG [Thread-1:quorumcnxmanager$liste...@445] 
- Connection request /192.168.1.183:44310
2010-06-20 23:22:46,436 - DEBUG [Thread-1:quorumcnxmanager$liste...@448] 
- Connection request: 0
2010-06-20 23:22:46,436 - DEBUG 
[Thread-1:quorumcnxmanager$sendwor...@504] - Address of remote peer: 2
2010-06-20 23:22:46,440 - DEBUG [WorkerReceiver 
Thread:fastleaderelection$messenger$workerrecei...@214] - Receive new 
notification message. My id = 0
2010-06-20 23:22:46,440 - INFO  
[QuorumPeer:/0.0.0.0:2181:fastleaderelect...@689] - Notification: 2, 
7301097, 1, 0, LOOKING, LOOKING, 2
2010-06-20 23:22:46,440 - DEBUG 
[QuorumPeer:/0.0.0.0:2181:fastleaderelect...@495] - id: 2, proposed id: 0, 
zxid: 7301097, proposed zxid: 77309411393
2010-06-20 23:22:46,441 - DEBUG 
[QuorumPeer:/0.0.0.0:2181:fastleaderelect...@717] - Adding vote: From = 2, 
Proposed leader = 2, Porposed zxid = 7301097, Proposed epoch = 1
2010-06-20 23:22:46,441 - INFO  
[QuorumPeer:/0.0.0.0:2181:quorump...@647] - LEADING

b) As a result, X increments its epoch. Worse, since this node decided to be a 
leader, it starts doing transactions. The first set of transactions starts 
removing all ephemeral nodes. But these transactions are only done locally. 
Other peers do not ack these transactions since they know that this peer is not 
the leader.

c) After a few seconds (8 secs), X relinquishes leadership since it does not 
receive any ack from rest of 
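
The DEBUG lines in the log excerpt of this comment compare each incoming vote (id, zxid) with the vote X currently proposes. A minimal sketch of such a comparison follows, assuming the rule is "higher zxid wins, ties go to the higher server id". The class and method names are hypothetical, and this is a simplification rather than the FastLeaderElection source.

```java
// Simplified sketch, not ZooKeeper code. Names are hypothetical.
final class VoteOrderSketch {
    /** True if the new vote (newId, newZxid) should replace the current one. */
    static boolean supersedes(long newId, long newZxid, long curId, long curZxid) {
        return newZxid > curZxid || (newZxid == curZxid && newId > curId);
    }

    public static void main(String[] args) {
        long myId = 0;
        long myZxid = 77309411393L; // X's own proposal in the log excerpt
        // Votes received from servers 1 and 2 in the log excerpt
        System.out.println(supersedes(1, 77309411372L, myId, myZxid));
        System.out.println(supersedes(2, 7301097L, myId, myZxid));
    }
}
```

Under this rule neither incoming vote replaces X's own proposal, because X holds the highest zxid of the three.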

[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-18 Thread Flavio Paiva Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880202#action_12880202
 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-335:
--

Mike, There is one thing I don't understand. From the logs, it looks like 
servers 1 and 3 are proposing a zxid of 0 (second field of notification) during 
election, which makes me think that they had no state at all:

{noformat}
2010-06-17 14:35:40,714 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 2, 
8589934884, 2, 2, LOOKING, LOOKING, 2
2010-06-17 14:35:40,714 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 0, 
1, 2, LOOKING, FOLLOWING, 1
2010-06-17 14:35:40,714 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 0, 
1, 2, LOOKING, LEADING, 3
{noformat}

Server 2 on the other hand had accepted updates based on the zxid it proposes. 
Were they supposed to have no state at all? Have you deleted your logs and 
snapshots before restarting the servers? 
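
As a reading aid, the notification lines quoted in this comment can be split into fields. Only the meaning of the second field (the proposed zxid) is stated here; the other field names in the sketch below are assumptions inferred from the log excerpts in this thread, and the class itself is hypothetical rather than part of ZooKeeper.

```java
// Reading aid, not ZooKeeper code. Field names other than zxid are assumptions.
final class NotificationLine {
    final long proposedLeader; // assumed: first field
    final long zxid;           // second field: proposed zxid
    final long electionRound;  // assumed: third field
    final String senderState;  // assumed: sixth field
    final long senderId;       // assumed: seventh field

    NotificationLine(String line) {
        String marker = "Notification:";
        int start = line.indexOf(marker);
        if (start < 0) {
            throw new IllegalArgumentException("not a notification line: " + line);
        }
        String[] f = line.substring(start + marker.length()).trim().split("\\s*,\\s*");
        if (f.length < 7) {
            throw new IllegalArgumentException("expected 7 fields: " + line);
        }
        proposedLeader = Long.parseLong(f[0]);
        zxid = Long.parseLong(f[1]);
        electionRound = Long.parseLong(f[2]);
        senderState = f[5];
        senderId = Long.parseLong(f[6]);
    }

    /** A zxid of 0 suggests the sender had no logged state. */
    boolean hasNoState() {
        return zxid == 0;
    }

    public static void main(String[] args) {
        NotificationLine a = new NotificationLine(
                "Notification: 2, 8589934884, 2, 2, LOOKING, LOOKING, 2");
        NotificationLine b = new NotificationLine(
                "Notification: 3, 0, 1, 2, LOOKING, LEADING, 3");
        System.out.println("sender " + a.senderId + " hasNoState=" + a.hasNoState());
        System.out.println("sender " + b.senderId + " hasNoState=" + b.hasNoState());
    }
}
```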




[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-18 Thread Flavio Paiva Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880320#action_12880320
 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-335:
--

Guys, I don't see enough information in these logs to determine what's going 
on. Let me tell you what I'm seeing so that perhaps other folks can help me out 
here. 

One part of the log that is suspicious is this one:

{noformat}
=6693 [QuorumPeer:/0.0.0.0:2181] WARN  
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x30001 expected 0x1
=6693 [QuorumPeer:/0.0.0.0:2181] WARN  
org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x30001 expected 0x1
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor30]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor27]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor22]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor23]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor18]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor20]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor19]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor31]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor21]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor26]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor25]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor33]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor29]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor28]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor24]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor32]

* NODE RESTARTED HERE **
{noformat}

Before being restarted, the bad node receives a proposal with zxid 3,1 and it 
expects 0,1. Next in the logs, after being restarted, I can see that it is 
complaining that it has epoch 4 and the leader has epoch 3. Something strange 
apparently happened during the restart. It also seems to be the case that the 
node was able to talk to the others (first entries in the log before the 
excerpt above).

Do you guys see anything I'm overlooking?




Re: [jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-18 Thread Vishal K
I might be wrong here, but let me try to chip in my few cents.

I think the problem is in LearnerHandler.java at the leader of this
Follower.

/* see what other packets from the proposal
 * and tobeapplied queues need to be sent
 * and then decide if we can just send a DIFF
 * or we actually need to send the whole snapshot
 */
long leaderLastZxid = leader.startForwarding(this, updates);
// --- this leaderLastZxid returned is probably incorrect.
// a special case when both the ids are the same
if (peerLastZxid == leaderLastZxid) {
    packetToSend = Leader.DIFF;
    zxidToSend = leaderLastZxid;
}

QuorumPacket newLeaderQP = new QuorumPacket(Leader.NEWLEADER,
        leaderLastZxid, null, null);
oa.writeRecord(newLeaderQP, "packet");
bufferedOutput.flush();


On Fri, Jun 18, 2010 at 4:49 PM, Flavio Paiva Junqueira (JIRA) 
j...@apache.org wrote:


[
 https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880320#action_12880320]

 Flavio Paiva Junqueira commented on ZOOKEEPER-335:
 --

 Guys, I don't see enough information in these logs to determine what's
 going on. Let me tell you what I'm seeing so that perhaps other folks can
 help me out here.

 One part of the log that is suspicious is this one:

 {noformat}
 =6693 [QuorumPeer:/0.0.0.0:2181] WARN
  org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x30001 expected
 0x1
 =6693 [QuorumPeer:/0.0.0.0:2181] WARN
  org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x30001 expected
 0x1
 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor30]
 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor27]
 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor22]
 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor23]
 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor18]
 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor20]
 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor19]
 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor31]
 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor21]
 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor26]
 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor25]
 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor33]
 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor29]
 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor28]
 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor24]
 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor32]

 * NODE RESTARTED HERE **
 {noformat}

 Before being restarted, the bad node receives a proposal with zxid 3,1
 while it expects 0,1. Next in the logs, after the restart, I can see it
 complaining that it has epoch 4 while the leader has epoch 3. Something
 strange apparently happened during the restart. It also seems that the
 node was able to talk to the others (first entries in the log before the
 excerpt above).

 Do you guys see anything I'm overlooking?





Re: [jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-18 Thread Vishal K
Never mind, I am on the wrong track. Flavio's earlier mail did clarify that
the follower received the epoch before the restart.



Re: [jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-18 Thread Vishal K
Hi Flavio,

I have three sets of logs, and they all seem to indicate two problems on the
misbehaving follower:

Problem 1: Expected zxid is incorrect
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x30002 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x30002 expected 0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x40001 expected 0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x40001 expected 0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x50001 expected 0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x50001 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x60001 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x60001 expected 0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x70001 expected 0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x70001 expected 0x1

Note that the expected zxid is always 0x1 (is lastQueued always 0?)
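
To illustrate the note above: a zxid is a 64-bit value with the epoch in the high 32 bits and a counter in the low 32 bits. The sketch below assumes the warning is printed whenever an incoming zxid is not lastQueued + 1, which is what the "expected 0x1" text suggests if lastQueued is still 0. ZxidExpectation and its methods are hypothetical names, and the sample zxid is an illustrative value, not one copied from the logs.

```java
// Hypothetical sketch; not the actual Learner code.
final class ZxidExpectation {
    /** Epoch: the high 32 bits of the zxid. */
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    /** Counter: the low 32 bits of the zxid. */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** Assumed rule: the next zxid must be exactly lastQueued + 1. */
    static boolean inSequence(long lastQueued, long zxid) {
        return zxid == lastQueued + 1;
    }

    /** Builds a warning in the same shape as the log lines above. */
    static String warning(long lastQueued, long zxid) {
        return "Got zxid 0x" + Long.toHexString(zxid)
                + " expected 0x" + Long.toHexString(lastQueued + 1);
    }

    public static void main(String[] args) {
        long lastQueued = 0L;           // never updated on this follower
        long incoming = 0x300000001L;   // illustrative: epoch 3, counter 1
        if (!inSequence(lastQueued, incoming)) {
            System.out.println(warning(lastQueued, incoming));
        }
        System.out.println("epoch=" + epochOf(incoming)
                + " counter=" + counterOf(incoming));
    }
}
```

Under this assumption, a lastQueued that stays at 0 always yields "expected 0x1", whatever epoch the leader is in.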

Problem 2: When joining the cluster, the follower's own epoch is 1 higher than
the epoch seen earlier
=14991 [QuorumPeer:/0.0.0.0:2181] FATAL org.apache.zookeeper.server.quorum.Learner  - Leader epoch 7 is less than our epoch 8

-Vishal


[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-17 Thread Mike Solomon (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879967#action_12879967
 ] 

Mike Solomon commented on ZOOKEEPER-335:


I am having this exact issue, but I am not upgrading. I am merely restarting 
the cluster.

I have a cluster of three. I took down host1 and verified that my application 
remained and reconnected to host2 and host3.

With host1 back online, I took down host2. I noticed that the java process was 
spinning over 100% CPU and realized it had not come back up.

This is running the 3.3.0 JAR release on a dual proc, quad-core Intel box. I'm 
running SuSE 10.3, 64-bit, with this version of java:

java version 1.6.0_10
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) Server VM (build 11.0-b15, mixed mode)

I will attach a log file.




[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-17 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880001#action_12880001
 ] 

Patrick Hunt commented on ZOOKEEPER-335:


Thanks for the log Mike. This issue does seem similar to what Charity reported:

2010-06-17 14:35:34,263 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch 1 is less than our epoch 2

Unfortunately the attached log shows information only after the problem 
occurred. Any chance you could upload the logs during the initial event? (what 
I mean is when the problem originally started) Also the logs from the other 
servers in the ensemble (again, at the time that the problem originally 
occurred) would really help. Thanks.

Have you been able to clear the problem? It's fairly straightforward to
resolve. Charity resolved it by: 1) bringing down the failing server,
2) clearing the data directory of that server (only), 3) starting that server.
You only want to do this for the server that's unable to rejoin the quorum,
i.e. the one that's outputting "Leader epoch 1 is less than our epoch 2",
_not_ for all servers in the ensemble.




[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-16 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879520#action_12879520
 ] 

Vishal K commented on ZOOKEEPER-335:


Hi,

We are running into this bug very often (roughly a 60-75% hit rate) while
testing our newly developed application on top of ZK.
This is almost a blocker for us. Would the fix be simpler if backward
compatibility were not an issue?

Thanks.




[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-16 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879548#action_12879548
 ] 

Patrick Hunt commented on ZOOKEEPER-335:


We are unable to reproduce this issue. If you can provide the server logs (all 
servers) and attach them to this jira it would be very helpful. Some detail on 
the approx time of the issue so we can correlate to the logs would help too 
(summary of what you did/do to cause it, etc... anything that might help us 
nail this one down).

Detail on ZK version, OS, Java version, HW info, etc... would also be of use to 
us. 




[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-02 Thread Charity Majors (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12874888#action_12874888
 ] 

Charity Majors commented on ZOOKEEPER-335:
--

I ran into this bug this morning, but it also seemed to put my cluster into an 
unusable state.  The cluster stopped accepting all connections, until I 
restarted node one.  After node one departed the cluster, nodes two and three 
formed a quorum and started serving again.  Node one was unable to rejoin, and 
had this error:

2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch a is less than our epoch b
2010-06-02 17:04:56,486 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following the leader
java.io.IOException: Error: Epoch of leader is lower

until I cleared the data directory and restarted again.
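
The error above is consistent with a follower refusing a leader whose epoch is lower than its own. Here is a minimal sketch of such a check, assuming the epoch is the high 32 bits of a zxid. EpochCheck and verifyLeaderEpoch are hypothetical names; only the message texts are taken from the log above.

```java
import java.io.IOException;

// Hypothetical sketch; not the actual Follower code.
final class EpochCheck {
    /** Epoch: the high 32 bits of the zxid. */
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    /**
     * Rejects a leader whose epoch is lower than ours.
     * Both arguments are full 64-bit zxids.
     */
    static void verifyLeaderEpoch(long leaderZxid, long ownZxid) throws IOException {
        long leaderEpoch = epochOf(leaderZxid);
        long ownEpoch = epochOf(ownZxid);
        if (leaderEpoch < ownEpoch) {
            System.err.println("Leader epoch " + Long.toHexString(leaderEpoch)
                    + " is less than our epoch " + Long.toHexString(ownEpoch));
            throw new IOException("Error: Epoch of leader is lower");
        }
    }

    public static void main(String[] args) {
        try {
            // Leader at epoch 0xa, follower already at epoch 0xb.
            verifyLeaderEpoch(0xaL << 32, 0xbL << 32);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a check like this, a follower whose stored epoch has run ahead of the quorum keeps failing on every attempt to rejoin, which would match the behaviour seen until the data directory was cleared.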
