[jira] Commented: (ZOOKEEPER-712) Bookie recovery

2010-06-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880785#action_12880785
 ] 

Hadoop QA commented on ZOOKEEPER-712:
-

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12447283/ZOOKEEPER-712.patch
  against trunk revision 953041.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/118/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/118/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/118/console

This message is automatically generated.

 Bookie recovery
 ---

 Key: ZOOKEEPER-712
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-712
 Project: Zookeeper
  Issue Type: New Feature
  Components: contrib-bookkeeper
Reporter: Flavio Paiva Junqueira
Assignee: Erwin Tam
 Attachments: ZOOKEEPER-712.patch


 Recover the ledger fragments of a bookie once it crashes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-21 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880917#action_12880917
 ] 

Patrick Hunt commented on ZOOKEEPER-335:


vishal comment on list:


I might be wrong here, but let me try to chip in my few cents.

I think the problem is in LearnerHandler.java at the leader of this
Follower.

/* see what other packets from the proposal
 * and tobeapplied queues need to be sent
 * and then decide if we can just send a DIFF
 * or we actually need to send the whole snapshot
 */
long leaderLastZxid = leader.startForwarding(this, updates);
// <--- this leaderLastZxid returned is probably incorrect.
// a special case when both the ids are the same
if (peerLastZxid == leaderLastZxid) {
    packetToSend = Leader.DIFF;
    zxidToSend = leaderLastZxid;
}

QuorumPacket newLeaderQP = new QuorumPacket(Leader.NEWLEADER,
        leaderLastZxid, null, null);
oa.writeRecord(newLeaderQP, packet);
bufferedOutput.flush();


 zookeeper servers should commit the new leader txn to their logs.
 -

 Key: ZOOKEEPER-335
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-335
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.1.0
Reporter: Mahadev konar
Assignee: Mahadev konar
Priority: Blocker
 Fix For: 3.4.0

 Attachments: zk.log.gz, zklogs.tar.gz


 Currently the ZooKeeper followers do not commit the new leader election. This 
 will cause problems in failure scenarios: a follower may ack the same 
 leader txn id twice, possibly for two different intermittent leaders, 
 allowing them to propose two different txns with the same zxid.




[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-21 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880918#action_12880918
 ] 

Patrick Hunt commented on ZOOKEEPER-335:


vishal comment on list:


Nevermind. I am on the wrong track. Flavio's earlier mail did clarify that
the follower received the epoch before restart.





Re: [jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-21 Thread Patrick Hunt
Please use the JIRA for followups; otherwise it's hard to track 
progress/status. Thanks.


Patrick

On 06/18/2010 04:45 PM, Vishal K wrote:

Hi Flavio,

I have 3 sets of logs and they all seem to indicate two problems on the
misbehaving follower:

Problem 1: Expected zxid is incorrect
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x30002 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x30002 expected 0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x40001 expected 0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x40001 expected 0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x50001 expected 0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x50001 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x60001 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x60001 expected 0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x70001 expected 0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x70001 expected 0x1

Note expected zxid is always 0x1 (lastQueued is always 0?)

Problem 2: While joining the cluster, the expected epoch is 1 higher than seen
earlier
=14991 [QuorumPeer:/0.0.0.0:2181] FATAL org.apache.zookeeper.server.quorum.Learner  - Leader epoch 7 is less than our epoch 8

-Vishal


[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-21 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880919#action_12880919
 ] 

Patrick Hunt commented on ZOOKEEPER-335:


Vishal comment on list:


Hi Flavio,

I have 3 sets of logs and they all seem to indicate two problems on the
misbehaving follower:

Problem 1: Expected zxid is incorrect
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x30002 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x30002 expected 0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x40001 expected 0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x40001 expected 0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x50001 expected 0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x50001 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x60001 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x60001 expected 0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x70001 expected 0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x70001 expected 0x1

Note expected zxid is always 0x1 (lastQueued is always 0?)

Problem 2: While joining the cluster, the expected epoch is 1 higher than seen
earlier
=14991 [QuorumPeer:/0.0.0.0:2181] FATAL org.apache.zookeeper.server.quorum.Learner  - Leader epoch 7 is less than our epoch 8

-Vishal
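
A minimal sketch of the comparison behind the FATAL line in Problem 2, assuming
(as ZooKeeper documents) that the epoch is the high 32 bits of a zxid. The class
and method names (EpochCheck, epochOf, leaderIsStale, describe) are hypothetical
and are not taken from Learner.java; the message text is modeled on the log line
quoted above.

```java
/** Hypothetical illustration of a follower-side epoch sanity check. */
final class EpochCheck {
    private EpochCheck() {}

    /** Extracts the epoch, i.e. the high 32 bits of a zxid. */
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    /** True when the leader's epoch is older than the one the follower holds. */
    static boolean leaderIsStale(long leaderZxid, long followerEpoch) {
        return epochOf(leaderZxid) < followerEpoch;
    }

    /** Builds a message in the style of the FATAL log line (epochs in hex). */
    static String describe(long leaderZxid, long followerEpoch) {
        return "Leader epoch " + Long.toHexString(epochOf(leaderZxid))
                + " is less than our epoch " + Long.toHexString(followerEpoch);
    }

    public static void main(String[] args) {
        // Illustrative values only: a leader zxid in epoch 7 seen by a
        // follower that believes the current epoch is 8.
        long leaderZxid = (7L << 32) | 1L;
        long followerEpoch = 8L;
        if (leaderIsStale(leaderZxid, followerEpoch)) {
            System.out.println(describe(leaderZxid, followerEpoch));
        }
    }
}
```

With a leader zxid in epoch 7 and a follower epoch of 8, the check fires, which
is the situation a follower would be in if it had bumped its own epoch on its own.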





[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-21 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881028#action_12881028
 ] 

Vishal K commented on ZOOKEEPER-335:


Hi,

I enabled tracing and did some more debugging. It looks like the restarted peer 
(which is trying to rejoin the cluster) determines that it is the leader and 
increments its epoch. However, the rest of the nodes don't acknowledge this node 
as the leader and hence have an older epoch. I will attach the log. 
Unfortunately, I don't have traces from the other nodes. I will repeat the 
experiment later and attach logs from the other nodes. 

Scenario:
- Form a 3-node cluster. This is not just a ZK cluster; it also involves our 
application cluster that uses ZK.
- Kill one of the followers.
- After a minute or so, restart the follower.
- The follower rejects the leader with "Leader epoch y is less than our epoch y + 1".

From logs:

a) Peer X restarts and starts leader election. For a small window of time, X 
thinks that it is the new leader! During this window, for some reason, the rest 
of the nodes tell X that they are also trying to find a leader, i.e., all 3 
nodes are in the LOOKING state. After seeing that all 3 nodes are in the 
LOOKING state, X decides to be the leader?

2010-06-20 23:22:46,421 - DEBUG [WorkerSender Thread:quorumcnxmana...@346] - Opening channel to server 1
2010-06-20 23:22:46,423 - DEBUG [WorkerReceiver Thread:fastleaderelection$messenger$workerrecei...@214] - Receive new notification message. My id = 0
2010-06-20 23:22:46,424 - INFO  [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@689] - Notification: 0, 77309411393, 1, 0, LOOKING, LOOKING, 0
2010-06-20 23:22:46,424 - DEBUG [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@495] - id: 0, proposed id: 0, zxid: 77309411393, proposed zxid: 77309411393
2010-06-20 23:22:46,424 - DEBUG [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@717] - Adding vote: From = 0, Proposed leader = 0, Porposed zxid = 77309411393, Proposed epoch = 1
2010-06-20 23:22:46,426 - INFO  [WorkerSender Thread:quorumcnxmana...@162] - Have smaller server identifier, so dropping the connection: (1, 0)
2010-06-20 23:22:46,426 - DEBUG [WorkerSender Thread:quorumcnxmana...@346] - Opening channel to server 2
2010-06-20 23:22:46,427 - DEBUG [Thread-1:quorumcnxmanager$liste...@445] - Connection request /192.168.1.182:46701
2010-06-20 23:22:46,427 - DEBUG [Thread-1:quorumcnxmanager$liste...@448] - Connection request: 0
2010-06-20 23:22:46,428 - DEBUG [Thread-1:quorumcnxmanager$sendwor...@504] - Address of remote peer: 1
2010-06-20 23:22:46,428 - INFO  [WorkerSender Thread:quorumcnxmana...@162] - Have smaller server identifier, so dropping the connection: (2, 0)
2010-06-20 23:22:46,431 - DEBUG [WorkerReceiver Thread:fastleaderelection$messenger$workerrecei...@214] - Receive new notification message. My id = 0
2010-06-20 23:22:46,432 - INFO  [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@689] - Notification: 1, 77309411372, 1, 0, LOOKING, LOOKING, 1
2010-06-20 23:22:46,432 - DEBUG [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@495] - id: 1, proposed id: 0, zxid: 77309411372, proposed zxid: 77309411393
2010-06-20 23:22:46,432 - DEBUG [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@717] - Adding vote: From = 1, Proposed leader = 1, Porposed zxid = 77309411372, Proposed epoch = 1
2010-06-20 23:22:46,436 - DEBUG [Thread-1:quorumcnxmanager$liste...@445] - Connection request /192.168.1.183:44310
2010-06-20 23:22:46,436 - DEBUG [Thread-1:quorumcnxmanager$liste...@448] - Connection request: 0
2010-06-20 23:22:46,436 - DEBUG [Thread-1:quorumcnxmanager$sendwor...@504] - Address of remote peer: 2
2010-06-20 23:22:46,440 - DEBUG [WorkerReceiver Thread:fastleaderelection$messenger$workerrecei...@214] - Receive new notification message. My id = 0
2010-06-20 23:22:46,440 - INFO  [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@689] - Notification: 2, 7301097, 1, 0, LOOKING, LOOKING, 2
2010-06-20 23:22:46,440 - DEBUG [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@495] - id: 2, proposed id: 0, zxid: 7301097, proposed zxid: 77309411393
2010-06-20 23:22:46,441 - DEBUG [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@717] - Adding vote: From = 2, Proposed leader = 2, Porposed zxid = 7301097, Proposed epoch = 1
2010-06-20 23:22:46,441 - INFO  [QuorumPeer:/0.0.0.0:2181:quorump...@647] - LEADING

b) As a result, X increments its epoch. Worse, since this node decided to be 
the leader, it starts doing transactions. The first set of transactions starts 
removing all ephemeral nodes. But these transactions are only done locally. 
Other peers do not ack these transactions since they know that this peer is not 
the leader.

c) After a few seconds (8 secs), X relinquishes leadership since it does not 
receive any ack from rest of 
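
For reading the notifications quoted in this comment, here is a rough sketch
that splits a decimal zxid into epoch (high 32 bits) and counter (low 32 bits)
and compares two votes. The ordering used (higher zxid wins, server id breaks
ties) is an assumption about how votes are ranked, and the names VoteOrder,
epoch, counter and supersedes are hypothetical, not taken from
FastLeaderElection.

```java
/** Hypothetical helper for reading the election notifications quoted above. */
final class VoteOrder {
    private VoteOrder() {}

    /** High 32 bits of a zxid: the leader epoch. */
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    /** Low 32 bits of a zxid: the transaction counter within that epoch. */
    static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    /**
     * Assumed ordering: a new vote replaces the current one if it carries a
     * higher zxid, or the same zxid and a higher server id.
     */
    static boolean supersedes(long newId, long newZxid, long curId, long curZxid) {
        return newZxid > curZxid || (newZxid == curZxid && newId > curId);
    }

    public static void main(String[] args) {
        long zxid0 = 77309411393L; // proposed by server 0 (peer X) in the log
        long zxid1 = 77309411372L; // proposed by server 1 in the log
        System.out.println("server 0: epoch " + epoch(zxid0) + ", counter " + counter(zxid0));
        System.out.println("server 1: epoch " + epoch(zxid1) + ", counter " + counter(zxid1));
        // Would server 1's vote displace X's own proposal under the assumed ordering?
        System.out.println("server 1 supersedes server 0: " + supersedes(1, zxid1, 0, zxid0));
    }
}
```

Under these assumptions the two zxids decode to epoch 18 with counters 65 and 44,
so server 1's vote does not displace X's own proposal. That is consistent with
the "proposed id: 0" entries in the log, though it does not by itself explain
why X moved to LEADING.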

[jira] Updated: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-06-21 Thread Vishal K (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vishal K updated ZOOKEEPER-335:
---

Attachment: faultynode-vishal.txt

Apologies for multiple attachments.

 zookeeper servers should commit the new leader txn to their logs.
 -

 Key: ZOOKEEPER-335
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-335
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.1.0
Reporter: Mahadev konar
Assignee: Mahadev konar
Priority: Blocker
 Fix For: 3.4.0

 Attachments: faultynode-vishal.txt, zk.log.gz, zklogs.tar.gz


 Currently the ZooKeeper followers do not commit the new leader election. This 
 will cause problems in failure scenarios: a follower may ack the same 
 leader txn id twice, possibly for two different intermittent leaders, 
 allowing them to propose two different txns with the same zxid.
