[jira] [Commented] (ZOOKEEPER-1448) Node+Quota creation in transaction log can crash leader startup

2012-04-17 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255749#comment-13255749
 ] 

Camille Fournier commented on ZOOKEEPER-1448:
-

Good catch. Can you provide a patch for this?

 Node+Quota creation in transaction log can crash leader startup
 ---

 Key: ZOOKEEPER-1448
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1448
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.5
Reporter: Botond Hejj
 Fix For: 3.3.6


 Hi,
 I've found a bug in zookeeper related to quota creation which can shut down 
 the zookeeper leader on startup.
 Steps to reproduce:
 1. create /quota_bug
 2. setquota -n 1 /quota_bug
 3. stop the whole ensemble (the previous operations should be in the 
 transaction log)
 4. start all the servers
 5. the elected leader will shut down with an exception (Missing stat node for 
 count /zookeeper/quota/quota_bug/zookeeper_stats)
 I've debugged a bit what is happening and I found the following problem:
 On startup each server loads the last snapshot and replays the last 
 transaction log. While doing this it fills up the pTrie variable of the 
 DataTree with the paths of the nodes which have quotas.
 After the leader is elected, the leader server loads the snapshot and last 
 transaction log again but doesn't clean up the pTrie variable. This means it 
 still contains the /quota_bug path. Now when the create /quota_bug is 
 processed from the transaction log, the DataTree already thinks that the quota 
 nodes (/zookeeper/quota/quota_bug/zookeeper_limits and 
 /zookeeper/quota/quota_bug/zookeeper_stats) are created, but those node 
 creations actually come later in the transaction log. This leads to the 
 missing stat node exception.
 I think clearing the pTrie should solve this problem.
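 A minimal sketch of the suggested fix, assuming it lands in the reset path the 
 DataTree runs before the leader reloads its database (the clear() body and the 
 pTrie locking shown here are illustrative, not the committed patch):
 {code}
 // Hypothetical sketch: drop stale quota paths before replaying the
 // transaction log, so they get re-registered as the create txns arrive.
 public void clear() {
     // ... existing DataTree reset logic ...
     synchronized (pTrie) {
         pTrie.clear();  // forget quota paths from the pre-election load
     }
 }
 {code}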

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1449) Ephemeral znode not deleted after session has expired on one follower (quorum is in an inconsistent state)

2012-04-17 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255836#comment-13255836
 ] 

Camille Fournier commented on ZOOKEEPER-1449:
-

Can you reproduce it with a more recent release? 3.3.3 is a bit old at this 
point and we've fixed a few things between that and 3.3.5.

 Ephemeral znode not deleted after session has expired on one follower (quorum 
 is in an inconsistent state) 
 ---

 Key: ZOOKEEPER-1449
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1449
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Daniel Lord
 Attachments: zk.zip


 I've been running into this situation in our labs fairly regularly where 
 we'll get a single follower that will be in an inconsistent state with 
 dangling ephemeral znodes.  Here is all of the information that I have right 
 now.  Please ask if there is anything else that is useful.
 Here is a quick snapshot of the state of the ensemble where you can see it is 
 out of sync across several znodes: 
 -bash-3.2$ echo srvr | nc il23n04sa-zk001 2181
 Zookeeper version: 3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
 Latency min/avg/max: 0/7/25802
 Received: 64002
 Sent: 63985
 Outstanding: 0
 Zxid: 0x50a41
 Mode: follower
 Node count: 497
 -bash-3.2$ echo srvr | nc il23n04sa-zk002 2181
 Zookeeper version: 3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
 Latency min/avg/max: 0/13/79032
 Received: 74320
 Sent: 74276
 Outstanding: 0
 Zxid: 0x50a41
 Mode: leader
 Node count: 493
 -bash-3.2$ echo srvr | nc il23n04sa-zk003 2181
 Zookeeper version: 3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
 Latency min/avg/max: 0/2/25234
 Received: 187310
 Sent: 187320
 Outstanding: 0
 Zxid: 0x50a41
 Mode: follower
 Node count: 493
 All of the zxids match up just fine but zk001 has 4 more nodes in its node 
 count than the other two (including the leader).  When I use a zookeeper 
 client to connect directly to zk001 I can see the following znode 
 that should no longer exist: 
 [zk: localhost:2181(CONNECTED) 0] stat 
 /siri/Douroucouli/clients/il23n04sa-app004.siri.apple.com:38096
 cZxid = 0x4001a
 ctime = Mon Apr 16 11:00:47 PDT 2012
 mZxid = 0x4001a
 mtime = Mon Apr 16 11:00:47 PDT 2012
 pZxid = 0x4001a
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x236bc504cb50002
 dataLength = 0
 numChildren = 0
 This node does not exist using the client to connect to either of the other 
 two members of the ensemble.
 I searched through the logs for that session id and it looks like it was 
 established and closed cleanly.  There were several leadership/quorum 
 problems during the course of the session but it looks like it should have 
 been shut down properly.  Neither the session nor the znode show up in the 
 dump on the leader but the problem znode does show up in the dump on 
 zk001.
 2012-04-16 11:00:47,637 - INFO  [CommitProcessor:2:NIOServerCnxn@1580] - 
 Established session 0x236bc504cb50002 with negotiated timeout 15000 for 
 client /17.202.71.201:38971
 2012-04-16 11:20:59,341 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client 
 attempting to renew session 0x236bc504cb50002 at /17.202.71.201:50841
 2012-04-16 11:20:59,342 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1580] - Established 
 session 0x236bc504cb50002 with negotiated timeout 15000 for client 
 /17.202.71.201:50841
 2012-04-16 11:21:09,343 - WARN  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - 
 EndOfStreamException: Unable to read additional data from client sessionid 
 0x236bc504cb50002, likely client has closed socket
 2012-04-16 11:21:09,343 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed 
 socket connection for client /17.202.71.201:50841 which had sessionid 
 0x236bc504cb50002
 2012-04-16 11:21:20,352 - INFO  
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1435] - Closed socket 
 connection for client /17.202.71.201:38971 which had sessionid 
 0x236bc504cb50002
 2012-04-16 11:21:22,151 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client 
 attempting to renew session 0x236bc504cb50002 at /17.202.71.201:38166
 2012-04-16 11:21:22,152 - INFO  
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1580] - Established session 
 0x236bc504cb50002 with negotiated timeout 15000 for client 
 /17.202.71.201:38166
 2012-04-16 11:27:17,902 - INFO  [ProcessThread:-1:PrepRequestProcessor@387] - 
 Processed session termination for sessionid: 0x236bc504cb50002
 2012-04-16 11:27:17,904 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed 
 socket connection for client /17.202.71.201:38166 which had sessionid 
 0x236bc504cb50002

[jira] [Commented] (ZOOKEEPER-1449) Ephemeral znode not deleted after session has expired on one follower (quorum is in an inconsistent state)

2012-04-17 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255902#comment-13255902
 ] 

Camille Fournier commented on ZOOKEEPER-1449:
-

Virtualization shouldn't be a problem. It's probably one of those bugs listed 
above, but if not we'll definitely want to track it down.

 Ephemeral znode not deleted after session has expired on one follower (quorum 
 is in an inconsistent state) 
 ---

 Key: ZOOKEEPER-1449
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1449
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Daniel Lord
 Attachments: zk.zip


 I've been running into this situation in our labs fairly regularly where 
 we'll get a single follower that will be in an inconsistent state with 
 dangling ephemeral znodes.  Here is all of the information that I have right 
 now.  Please ask if there is anything else that is useful.
 Here is a quick snapshot of the state of the ensemble where you can see it is 
 out of sync across several znodes: 
 -bash-3.2$ echo srvr | nc il23n04sa-zk001 2181
 Zookeeper version: 3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
 Latency min/avg/max: 0/7/25802
 Received: 64002
 Sent: 63985
 Outstanding: 0
 Zxid: 0x50a41
 Mode: follower
 Node count: 497
 -bash-3.2$ echo srvr | nc il23n04sa-zk002 2181
 Zookeeper version: 3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
 Latency min/avg/max: 0/13/79032
 Received: 74320
 Sent: 74276
 Outstanding: 0
 Zxid: 0x50a41
 Mode: leader
 Node count: 493
 -bash-3.2$ echo srvr | nc il23n04sa-zk003 2181
 Zookeeper version: 3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
 Latency min/avg/max: 0/2/25234
 Received: 187310
 Sent: 187320
 Outstanding: 0
 Zxid: 0x50a41
 Mode: follower
 Node count: 493
 All of the zxids match up just fine but zk001 has 4 more nodes in its node 
 count than the other two (including the leader).  When I use a zookeeper 
 client to connect directly to zk001 I can see the following znode 
 that should no longer exist: 
 [zk: localhost:2181(CONNECTED) 0] stat 
 /siri/Douroucouli/clients/il23n04sa-app004.siri.apple.com:38096
 cZxid = 0x4001a
 ctime = Mon Apr 16 11:00:47 PDT 2012
 mZxid = 0x4001a
 mtime = Mon Apr 16 11:00:47 PDT 2012
 pZxid = 0x4001a
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x236bc504cb50002
 dataLength = 0
 numChildren = 0
 This node does not exist using the client to connect to either of the other 
 two members of the ensemble.
 I searched through the logs for that session id and it looks like it was 
 established and closed cleanly.  There were several leadership/quorum 
 problems during the course of the session but it looks like it should have 
 been shut down properly.  Neither the session nor the znode show up in the 
 dump on the leader but the problem znode does show up in the dump on 
 zk001.
 2012-04-16 11:00:47,637 - INFO  [CommitProcessor:2:NIOServerCnxn@1580] - 
 Established session 0x236bc504cb50002 with negotiated timeout 15000 for 
 client /17.202.71.201:38971
 2012-04-16 11:20:59,341 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client 
 attempting to renew session 0x236bc504cb50002 at /17.202.71.201:50841
 2012-04-16 11:20:59,342 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1580] - Established 
 session 0x236bc504cb50002 with negotiated timeout 15000 for client 
 /17.202.71.201:50841
 2012-04-16 11:21:09,343 - WARN  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - 
 EndOfStreamException: Unable to read additional data from client sessionid 
 0x236bc504cb50002, likely client has closed socket
 2012-04-16 11:21:09,343 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed 
 socket connection for client /17.202.71.201:50841 which had sessionid 
 0x236bc504cb50002
 2012-04-16 11:21:20,352 - INFO  
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1435] - Closed socket 
 connection for client /17.202.71.201:38971 which had sessionid 
 0x236bc504cb50002
 2012-04-16 11:21:22,151 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client 
 attempting to renew session 0x236bc504cb50002 at /17.202.71.201:38166
 2012-04-16 11:21:22,152 - INFO  
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1580] - Established session 
 0x236bc504cb50002 with negotiated timeout 15000 for client 
 /17.202.71.201:38166
 2012-04-16 11:27:17,902 - INFO  [ProcessThread:-1:PrepRequestProcessor@387] - 
 Processed session termination for sessionid: 0x236bc504cb50002
 2012-04-16 11:27:17,904 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed 
 socket connection for client /17.202.71.201:38166 which had sessionid 
 0x236bc504cb50002

[jira] [Commented] (ZOOKEEPER-1442) Uncaught exception handler should exit on a java.lang.Error

2012-04-13 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253446#comment-13253446
 ] 

Camille Fournier commented on ZOOKEEPER-1442:
-

My biggest question mark is around exiting on ThreadDeath, and I'd like to get 
a bit of community feedback before committing. But if I can get some color 
around those concerns I'm ok with the patch.

 Uncaught exception handler should exit on a java.lang.Error
 ---

 Key: ZOOKEEPER-1442
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1442
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client, server
Affects Versions: 3.4.3, 3.3.5
Reporter: Jeremy Stribling
Assignee: Jeremy Stribling
Priority: Minor
 Attachments: ZOOKEEPER-1442.patch


 The uncaught exception handler registered in NIOServerCnxnFactory and 
 ClientCnxn simply logs exceptions and lets the rest of ZooKeeper go on its 
 merry way.  However, errors such as OutOfMemoryErrors should really crash the 
 program, as they represent unrecoverable errors.  If the exception that gets 
 to the uncaught exception handler is an instance of java.lang.Error, ZK 
 should exit with an error code (in addition to logging the error).
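 A minimal sketch of the idea, assuming a logger named LOG is in scope (this is 
 an illustration, not the attached patch):
 {code}
 // Log everything; exit only for java.lang.Error (OOM etc.), which is
 // unrecoverable and should fail fast with a nonzero exit code.
 Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
     public void uncaughtException(Thread t, Throwable e) {
         LOG.error("Uncaught exception in thread " + t.getName(), e);
         if (e instanceof Error) {
             System.exit(1);
         }
     }
 });
 {code}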

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1375) SendThread is exiting after OOMError

2012-03-21 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235102#comment-13235102
 ] 

Camille Fournier commented on ZOOKEEPER-1375:
-

If your client throws an OOM error, there's no guarantee that you will be able 
to do anything at all beyond that point. It's not clear to me what you hope to 
do about it. What are the users going to do when they can't act themselves due 
to the OOM state?

 SendThread is exiting after OOMError
 

 Key: ZOOKEEPER-1375
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1375
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.0
Reporter: Rakesh R

 After reviewing the ClientCnxn code, there is still a chance of the 
 SendThread exiting without notifying the users. Say the client throws an 
 OOMError and enters the Throwable catch block. Here, while sending the 
 Disconnected event, it creates a new WatchedEvent() object. This can itself 
 throw an OOMError and cause the SendThread to exit without any Disconnected 
 event notification.
 {noformat}
 try {
     // ...
 } catch (Throwable e) {
     // ...
     cleanup();
     if (state.isAlive()) {
         eventThread.queueEvent(
             new WatchedEvent(Event.EventType.None,
                 Event.KeeperState.Disconnected, null));
     }
     // ...
 }
 {noformat}
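 One way to remove the allocation from this failure path would be to construct 
 the Disconnected event up front; a sketch only (the preallocatedDisconnectEvent 
 field is invented for illustration):
 {code}
 // Allocated once at construction time, before any memory pressure exists.
 private final WatchedEvent preallocatedDisconnectEvent =
     new WatchedEvent(Event.EventType.None, Event.KeeperState.Disconnected, null);

 // ... in the catch (Throwable e) block:
 cleanup();
 if (state.isAlive()) {
     // no 'new' here, so this step cannot itself fail with an OOMError
     eventThread.queueEvent(preallocatedDisconnectEvent);
 }
 {code}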

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1375) SendThread is exiting after OOMError

2012-03-21 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235271#comment-13235271
 ] 

Camille Fournier commented on ZOOKEEPER-1375:
-

A server ran out of memory? This ticket is for the client code, not the server 
code. More likely NIOServerCnxn than ClientCnxn as you mention.
 
OOM stuff can cause VMs to behave very strangely, which is why I generally 
think it's best to fail big and fail fast when it happens. There's not really 
any sense in trying to recover because beyond that point the behavior is 
pretty non-deterministic. Strange that the other VMs wouldn't form a quorum 
though... might be interesting to dig into. Feel free to open another ticket 
with some more info and we can dig into it more.

 SendThread is exiting after OOMError
 

 Key: ZOOKEEPER-1375
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1375
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.0
Reporter: Rakesh R

 After reviewing the ClientCnxn code, there is still a chance of the 
 SendThread exiting without notifying the users. Say the client throws an 
 OOMError and enters the Throwable catch block. Here, while sending the 
 Disconnected event, it creates a new WatchedEvent() object. This can itself 
 throw an OOMError and cause the SendThread to exit without any Disconnected 
 event notification.
 {noformat}
 try {
     // ...
 } catch (Throwable e) {
     // ...
     cleanup();
     if (state.isAlive()) {
         eventThread.queueEvent(
             new WatchedEvent(Event.EventType.None,
                 Event.KeeperState.Disconnected, null));
     }
     // ...
 }
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1407) Support GetData and GetChildren in Multi

2012-03-20 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233631#comment-13233631
 ] 

Camille Fournier commented on ZOOKEEPER-1407:
-

Zhihong, it's good if you change the state to Patch Available when you've got 
something for us to look at. We generally look at the patch available queue to 
determine what we need to review, etc. It will also trigger the automated build 
check. I've set this one to patch available.

 Support GetData and GetChildren in Multi
 

 Key: ZOOKEEPER-1407
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1407
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: Zhihong Yu
 Fix For: 3.5.0

 Attachments: 1407-v2.txt, 1407.txt


 There is use case where GetData and GetChildren would participate in Multi.
 We should add support for this case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1100) Killed (or missing) SendThread will cause hanging threads

2012-03-20 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234029#comment-13234029
 ] 

Camille Fournier commented on ZOOKEEPER-1100:
-

3.4.X and trunk, I believe. Are you seeing it in 3.4.X? We did a big refactor 
between 3.3.X and 3.4... I can look for a jira if you're interested.

 Killed (or missing) SendThread will cause hanging threads
 -

 Key: ZOOKEEPER-1100
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1100
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.3.3
 Environment: 
 http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E
Reporter: Gunnar Wagenknecht
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1100.patch, ZOOKEEPER-1100.patch


 After investigating an issue with [hanging 
 threads|http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E]
  I noticed that any java.lang.Error might silently kill the SendThread. 
 Without a SendThread any thread that wants to send something will hang 
 forever. 
 Currently nobody will recognize a SendThread that died. I think at least a 
 state should be flipped (or flag should be set) that causes all further send 
 attempts to fail or to re-spin the connection loop.
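 A sketch of the flag idea (names and signatures are simplified for 
 illustration; ClientCnxn's real queuing path takes more parameters):
 {code}
 // Mark the connection dead when SendThread exits, then fail fast.
 private volatile boolean sendThreadDead = false;

 // At the very end of SendThread.run(), e.g. in a finally block:
 //     sendThreadDead = true;

 public void queuePacket(Packet p) {
     if (sendThreadDead) {
         throw new IllegalStateException(
             "SendThread has died; this connection is unusable");
     }
     // ... normal enqueue path ...
 }
 {code}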

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1419) Leader election never settles for a 5-node cluster

2012-03-19 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233054#comment-13233054
 ] 

Camille Fournier commented on ZOOKEEPER-1419:
-

I'm gonna check this in to trunk and 3.4 tonight.

 Leader election never settles for a 5-node cluster
 --

 Key: ZOOKEEPER-1419
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1419
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.3, 3.5.0
 Environment: 64-bit Linux, all nodes running on the same machine 
 (different ports)
Reporter: Jeremy Stribling
Assignee: Flavio Junqueira
Priority: Blocker
 Fix For: 3.4.4, 3.5.0

 Attachments: ZOOKEEPER-1419-fixed2.tgz, ZOOKEEPER-1419.patch, 
 ZOOKEEPER-1419.patch, ZOOKEEPER-1419.patch


 We have a situation where it seems to my untrained eye that leader election 
 never finishes for a 5-node cluster.  In this test, all nodes are ZK 3.4.3 
 and running on the same server (listening on different ports, of course).  
 The nodes have server IDs of 0, 1, 2, 3, 4.  The test brings up the cluster 
 in different configurations, adding in a new node each time.  We embed ZK in 
 our application, so when we shut a node down and restart it with a new 
 configuration, it all happens in a single JVM process.  Here's our server 
 startup code (for the case where there's more than one node in the cluster):
 {code}
 if (servers.size() > 1) {
     _log.debug("Starting Zookeeper server in quorum server mode");
     _quorum_peer = new QuorumPeer();
     synchronized (_quorum_peer) {
         _quorum_peer.setClientPortAddress(clientAddr);
         _quorum_peer.setTxnFactory(log);
         _quorum_peer.setQuorumPeers(servers);
         _quorum_peer.setElectionType(_election_alg);
         _quorum_peer.setMyid(_server_id);
         _quorum_peer.setTickTime(_tick_time);
         _quorum_peer.setInitLimit(_init_limit);
         _quorum_peer.setSyncLimit(_sync_limit);
         QuorumVerifier quorumVerifier =
             new QuorumMaj(servers.size());
         _quorum_peer.setQuorumVerifier(quorumVerifier);
         _quorum_peer.setCnxnFactory(_cnxn_factory);
         _quorum_peer.setZKDatabase(new ZKDatabase(log));
         _quorum_peer.start();
     }
 } else {
     _log.debug("Starting Zookeeper server in single server mode");
     _zk_server = new ZooKeeperServer();
     _zk_server.setTxnLogFactory(log);
     _zk_server.setTickTime(_tick_time);
     _cnxn_factory.startup(_zk_server);
 }
 {code}
 And here's our shutdown code:
 {code}
 if (_quorum_peer != null) {
     synchronized (_quorum_peer) {
         _quorum_peer.shutdown();
         FastLeaderElection fle =
             (FastLeaderElection) _quorum_peer.getElectionAlg();
         fle.shutdown();
         try {
             _quorum_peer.getTxnFactory().commit();
         } catch (java.nio.channels.ClosedChannelException e) {
             // ignore
         }
     }
 } else {
     _cnxn_factory.shutdown();
     _zk_server.getTxnLogFactory().commit();
 }
 {code}
 The test steps through the following scenarios in quick succession:
 Run 1: Start a 1-node cluster, servers=[0]
 Run 2: Start a 2-node cluster, servers=[0,3]
 Run 3: Start a 3-node cluster, servers=[0,1,3]
 Run 4: Start a 4-node cluster, servers=[0,1,2,3]
 Run 5: Start a 5-node cluster, servers=[0,1,2,3,4]
 It appears that run 5 never elects a leader -- the nodes just keep spewing 
 messages like this (example from node 0):
 {noformat}
 2012-03-14 16:23:12,775 13308 [WorkerSender[myid=0]] DEBUG 
 org.apache.zookeeper.server.quorum.QuorumCnxManager  - There is a connection 
 already for server 2
 2012-03-14 16:23:12,776 13309 [QuorumPeer[myid=0]/127.0.0.1:2900] DEBUG 
 org.apache.zookeeper.server.quorum.FastLeaderElection  - Sending 
 Notification: 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), 3 (recipient), 0 
 (myid), 0x2 (n.peerEpoch)
 2012-03-14 16:23:12,776 13309 [WorkerSender[myid=0]] DEBUG 
 org.apache.zookeeper.server.quorum.QuorumCnxManager  - There is a connection 
 already for server 3
 2012-03-14 16:23:12,776 13309 [QuorumPeer[myid=0]/127.0.0.1:2900] DEBUG 
 org.apache.zookeeper.server.quorum.FastLeaderElection  - Sending 
 Notification: 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), 4 (recipient), 0 
 (myid), 0x2 (n.peerEpoch)
 2012-03-14 16:23:12,776 13309 [WorkerSender[myid=0]] DEBUG 
 org.apache.zookeeper.server.quorum.QuorumCnxManager  - There is a connection 
 already for server 4
 2012-03-14 16:23:12,776 13309 [WorkerReceiver[myid=0]] DEBUG 
 org.apache.zookeeper.server.quorum.FastLeaderElection  - Receive new 
 notification message. My id = 0
 2012-03-14 16:23:12,776 13309 [WorkerReceiver[myid=0]] INFO 
 org.apache.zookeeper.server.quorum.FastLeaderElection  - 

[jira] [Commented] (ZOOKEEPER-1320) Add the feature to zookeeper allow client limitations by ip.

2012-03-17 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232081#comment-13232081
 ] 

Camille Fournier commented on ZOOKEEPER-1320:
-

It doesn't look like we agree that this feature is necessary and it's not 
applying cleanly. I'm moving this out of patch available state until you get it 
into more review-ready shape.

 Add the feature to zookeeper allow client limitations by ip.
 

 Key: ZOOKEEPER-1320
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1320
 Project: ZooKeeper
  Issue Type: New Feature
  Components: server
Affects Versions: 3.3.3
 Environment: Linux version 2.6.18-164.el5 (gcc version 4.1.2 20080704 
 (Red Hat 4.1.2-46)), jdk-1.6.0_17
Reporter: Leader Ni
Assignee: Leader Ni
  Labels: client,server,limited,ipfilter
 Attachments: UserGuide-1320-iplimited.docx, 
 UserGuide-1320-iplimited.pdf, ZOOKEEPER-1320-iplimited.patch, 
 zookeeper-3.3.3.jar_iplimited

   Original Estimate: 168h
  Remaining Estimate: 168h

 Add the feature to zookeeper so that an administrator can set a list of IPs 
 that limits which nodes can connect to the zk servers and which connected 
 clients can operate on data.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1377) add support for dumping a snapshot file content (similar to LogFormatter)

2012-03-17 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232086#comment-13232086
 ] 

Camille Fournier commented on ZOOKEEPER-1377:
-

+1 looks nice. Should we consider adding this to 3.4? I realize it's a new 
feature but it is also an awfully useful utility.

 add support for dumping a snapshot file content (similar to LogFormatter)
 -

 Key: ZOOKEEPER-1377
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1377
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Reporter: Patrick Hunt
Assignee: Patrick Hunt
  Labels: newbie
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1377.patch, ZOOKEEPER-1377.patch


 We have LogFormatter but not SnapshotFormatter. I've added this, patch 
 momentarily.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1397) Remove BookKeeper documentation links

2012-03-17 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232091#comment-13232091
 ] 

Camille Fournier commented on ZOOKEEPER-1397:
-

Somehow missed 2 files in the checkin, should be fixed now.

 Remove BookKeeper documentation links
 -

 Key: ZOOKEEPER-1397
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1397
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: Flavio Junqueira
Assignee: Flavio Junqueira
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1397.patch


 BookKeeper is now a subproject and its documentation is maintained in the 
 site of the subproject. Consequently, we should remove the links in the 
 zookeeper documentation pages or at least point to the documentation of the 
 subproject site.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1419) Leader election never settles for a 5-node cluster

2012-03-17 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232097#comment-13232097
 ] 

Camille Fournier commented on ZOOKEEPER-1419:
-

I don't see why this is marked to 3.3.5, the logic there does not seem to be 
faulty at a glance.

Do we want to add a test with a 5-member quorum or do we think the unit test on 
the predicate logic is enough?

 Leader election never settles for a 5-node cluster
 --

 Key: ZOOKEEPER-1419
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1419
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.3, 3.3.5, 3.5.0
 Environment: 64-bit Linux, all nodes running on the same machine 
 (different ports)
Reporter: Jeremy Stribling
Assignee: Flavio Junqueira
Priority: Blocker
 Fix For: 3.3.6, 3.4.4, 3.5.0

 Attachments: ZOOKEEPER-1419-fixed2.tgz, ZOOKEEPER-1419.patch, 
 ZOOKEEPER-1419.patch, ZOOKEEPER-1419.patch


 We have a situation where it seems to my untrained eye that leader election 
 never finishes for a 5-node cluster.  In this test, all nodes are ZK 3.4.3 
 and running on the same server (listening on different ports, of course).  
 The nodes have server IDs of 0, 1, 2, 3, 4.  The test brings up the cluster 
 in different configurations, adding in a new node each time.  We embed ZK in 
 our application, so when we shut a node down and restart it with a new 
 configuration, it all happens in a single JVM process.  Here's our server 
 startup code (for the case where there's more than one node in the cluster):
 {code}
 if (servers.size() > 1) {
     _log.debug("Starting Zookeeper server in quorum server mode");
     _quorum_peer = new QuorumPeer();
     synchronized (_quorum_peer) {
         _quorum_peer.setClientPortAddress(clientAddr);
         _quorum_peer.setTxnFactory(log);
         _quorum_peer.setQuorumPeers(servers);
         _quorum_peer.setElectionType(_election_alg);
         _quorum_peer.setMyid(_server_id);
         _quorum_peer.setTickTime(_tick_time);
         _quorum_peer.setInitLimit(_init_limit);
         _quorum_peer.setSyncLimit(_sync_limit);
         QuorumVerifier quorumVerifier =
             new QuorumMaj(servers.size());
         _quorum_peer.setQuorumVerifier(quorumVerifier);
         _quorum_peer.setCnxnFactory(_cnxn_factory);
         _quorum_peer.setZKDatabase(new ZKDatabase(log));
         _quorum_peer.start();
     }
 } else {
     _log.debug("Starting Zookeeper server in single server mode");
     _zk_server = new ZooKeeperServer();
     _zk_server.setTxnLogFactory(log);
     _zk_server.setTickTime(_tick_time);
     _cnxn_factory.startup(_zk_server);
 }
 {code}
 And here's our shutdown code:
 {code}
 if (_quorum_peer != null) {
     synchronized (_quorum_peer) {
         _quorum_peer.shutdown();
         FastLeaderElection fle =
             (FastLeaderElection) _quorum_peer.getElectionAlg();
         fle.shutdown();
         try {
             _quorum_peer.getTxnFactory().commit();
         } catch (java.nio.channels.ClosedChannelException e) {
             // ignore
         }
     }
 } else {
     _cnxn_factory.shutdown();
     _zk_server.getTxnLogFactory().commit();
 }
 {code}
 The test steps through the following scenarios in quick succession:
 Run 1: Start a 1-node cluster, servers=[0]
 Run 2: Start a 2-node cluster, servers=[0,3]
 Run 3: Start a 3-node cluster, servers=[0,1,3]
 Run 4: Start a 4-node cluster, servers=[0,1,2,3]
 Run 5: Start a 5-node cluster, servers=[0,1,2,3,4]
 It appears that run 5 never elects a leader -- the nodes just keep spewing 
 messages like this (example from node 0):
 {noformat}
 2012-03-14 16:23:12,775 13308 [WorkerSender[myid=0]] DEBUG 
 org.apache.zookeeper.server.quorum.QuorumCnxManager  - There is a connection 
 already for server 2
 2012-03-14 16:23:12,776 13309 [QuorumPeer[myid=0]/127.0.0.1:2900] DEBUG 
 org.apache.zookeeper.server.quorum.FastLeaderElection  - Sending 
 Notification: 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), 3 (recipient), 0 
 (myid), 0x2 (n.peerEpoch)
 2012-03-14 16:23:12,776 13309 [WorkerSender[myid=0]] DEBUG 
 org.apache.zookeeper.server.quorum.QuorumCnxManager  - There is a connection 
 already for server 3
 2012-03-14 16:23:12,776 13309 [QuorumPeer[myid=0]/127.0.0.1:2900] DEBUG 
 org.apache.zookeeper.server.quorum.FastLeaderElection  - Sending 
 Notification: 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), 4 (recipient), 0 
 (myid), 0x2 (n.peerEpoch)
 2012-03-14 16:23:12,776 13309 [WorkerSender[myid=0]] DEBUG 
 org.apache.zookeeper.server.quorum.QuorumCnxManager  - There is a connection 
 already for server 4
 2012-03-14 16:23:12,776 13309 [WorkerReceiver[myid=0]] DEBUG 
 org.apache.zookeeper.server.quorum.FastLeaderElection  - 

[jira] [Commented] (ZOOKEEPER-1421) Support for hierarchical ACLs

2012-03-15 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230827#comment-13230827
 ] 

Camille Fournier commented on ZOOKEEPER-1421:
-

This would be very useful, agreed.

 Support for hierarchical ACLs
 -

 Key: ZOOKEEPER-1421
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1421
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Reporter: Thomas Weise

 Using ZK as a service, we need to restrict access to subtrees owned by 
 different tenants. Currently there is no support for hierarchical ACLs, so it 
 is necessary to configure the clients not only with their parent node, but 
 also manage the ACL for each new node created in the subtree. With support 
 for hierarchical ACLs, duplication could be avoided and the setup of the 
 parent nodes with ACL and subsequent control of the same could be split into 
 a separate administrative task.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1354) AuthTest.testBadAuthThenSendOtherCommands fails intermittently

2012-03-01 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220628#comment-13220628
 ] 

Camille Fournier commented on ZOOKEEPER-1354:
-

This looks good, I'll check it in to trunk.

 AuthTest.testBadAuthThenSendOtherCommands fails intermittently
 --

 Key: ZOOKEEPER-1354
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1354
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.4.0
Reporter: Patrick Hunt
Assignee: Patrick Hunt
 Fix For: 3.4.4, 3.5.0

 Attachments: ZOOKEEPER-1354.patch


 I'm seeing the following intermittent failure:
 {noformat}
 junit.framework.AssertionFailedError: Should have called my watcher 
 expected:<1> but was:<0>
   at 
 org.apache.zookeeper.test.AuthTest.testBadAuthThenSendOtherCommands(AuthTest.java:89)
   at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}
 The following commit introduced this test:
 bq. ZOOKEEPER-1152. Exceptions thrown from handleAuthentication can cause 
 buffer corruption issues in NIOServer. (camille via breed)
 +Assert.assertEquals("Should have called my watcher",
 +        1, authFailed.get());
 I think it's due to either a) the code is not waiting for the
 notification to be propagated, or b) the message doesn't make it back
 from the server to the client prior to the socket or the clientcnxn
 being closed.
 What do you think, should I just wait for the notification to arrive? Or do 
 you think it's b)?
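 If option a) is the cause, a latch-based wait in the test would make the 
 assertion deterministic; a sketch only (authFailedLatch is a hypothetical 
 replacement for the counter):
 {code}
 // Block until the AuthFailed event arrives instead of asserting immediately.
 final CountDownLatch authFailedLatch = new CountDownLatch(1);
 // ... the test watcher calls authFailedLatch.countDown() on AuthFailed ...
 Assert.assertTrue("Should have called my watcher",
         authFailedLatch.await(30, TimeUnit.SECONDS));
 {code}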

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1354) AuthTest.testBadAuthThenSendOtherCommands fails intermittently

2012-03-01 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220639#comment-13220639
 ] 

Camille Fournier commented on ZOOKEEPER-1354:
-

Checked in to 3.4.4 and trunk

 AuthTest.testBadAuthThenSendOtherCommands fails intermittently
 --

 Key: ZOOKEEPER-1354
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1354
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.4.0
Reporter: Patrick Hunt
Assignee: Patrick Hunt
 Fix For: 3.4.4, 3.5.0

 Attachments: ZOOKEEPER-1354.patch


 I'm seeing the following intermittent failure:
 {noformat}
 junit.framework.AssertionFailedError: Should have called my watcher 
 expected:<1> but was:<0>
   at 
 org.apache.zookeeper.test.AuthTest.testBadAuthThenSendOtherCommands(AuthTest.java:89)
   at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}
 The following commit introduced this test:
 bq. ZOOKEEPER-1152. Exceptions thrown from handleAuthentication can cause 
 buffer corruption issues in NIOServer. (camille via breed)
 +Assert.assertEquals("Should have called my watcher",
 +        1, authFailed.get());
 I think it's due to either a) the code is not waiting for the
 notification to be propagated, or b) the message doesn't make it back
 from the server to the client prior to the socket or the clientcnxn
 being closed.
 What do you think, should I just wait for the notification to arrive? Or do 
 you think it's b)?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1309) Creating a new ZooKeeper client can leak file handles

2012-02-26 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216963#comment-13216963
 ] 

Camille Fournier commented on ZOOKEEPER-1309:
-

Ran tests and they all passed, so I'm gonna check this in to 3.3.5.

 Creating a new ZooKeeper client can leak file handles
 -

 Key: ZOOKEEPER-1309
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1309
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.3.4
Reporter: Daniel Lord
Assignee: Daniel Lord
Priority: Critical
 Fix For: 3.3.5

 Attachments: zk-1309-1.patch, zk-1309-1.patch, zk-1309-1.patch, 
 zk-1309-3.patch


 If there is an IOException thrown by the constructor of ClientCnxn then file 
 handles are leaked because of the initialization of the Selector which is 
 never closed.
 final Selector selector = Selector.open();
 If there is an abnormal exit from the constructor then the Selector is not 
 closed and file handles are leaked.  You can easily see this by setting the 
 hosts string to garbage (qwerty, asdf, etc.) and then try to open a new 
 ZooKeeper connection.  I've observed the same behavior in production when 
 there were DNS issues where the host names of the ensemble can no longer be 
 resolved and the application servers quickly run out of handles attempting to 
 (re)connect to zookeeper.
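 A minimal sketch of the fix idea, assuming the remainder of the constructor 
 body may throw IOException (simplified from the actual ClientCnxn 
 initialization):
 {code}
 // Close the Selector if construction cannot complete, so the file
 // handle is released instead of leaked.
 final Selector selector = Selector.open();
 try {
     // ... rest of the constructor that may throw ...
 } catch (IOException e) {
     selector.close();
     throw e;
 }
 {code}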

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1344) ZooKeeper client multi-update command is not considering the Chroot request

2012-02-26 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216975#comment-13216975
 ] 

Camille Fournier commented on ZOOKEEPER-1344:
-

I hate to say it but this patch no longer applies. Can you please regenerate so 
that it applies to latest trunk and 3.4 branch if necessary, so we can check it 
in? Thanks.

 ZooKeeper client multi-update command is not considering the Chroot request
 ---

 Key: ZOOKEEPER-1344
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1344
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.4.0
Reporter: Rakesh R
Assignee: Rakesh R
Priority: Critical
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1344-onlytestcase.patch, 
 ZOOKEEPER-1344.1.patch, ZOOKEEPER-1344.patch


 For example: 
 I have created a ZooKeeper client with the subtree 10.18.52.144:2179/apps/X 
 as its chroot. Now I generated an Op command for the creation of znode /myid. 
 When the client creates the path /myid, the ZooKeeper server actually 
 creates the path as /myid instead of creating it as /apps/X/myid.
 Expected output: the znode should be created as /apps/X/myid
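 A hypothetical illustration of the expected behavior (the connection string 
 and watcher here are placeholders):
 {code}
 // With a chrooted client, paths inside a multi should resolve against
 // the chroot, just as they do for single operations.
 ZooKeeper zk = new ZooKeeper("10.18.52.144:2179/apps/X", 3000, watcher);
 zk.multi(Arrays.asList(
     Op.create("/myid", new byte[0],
               ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)));
 // Expected server-side path: /apps/X/myid (not /myid)
 {code}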

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1361) Leader.lead iterates over 'learners' set without proper synchronisation

2012-02-26 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216978#comment-13216978
 ] 

Camille Fournier commented on ZOOKEEPER-1361:
-

If we're going to do all these whitespace changes, it's going to make changes 
to both 3.4 and trunk difficult. I am really not fond of changing all the 
whitespace in a file for a simple checkin. Can we get this patch generated 
without the whitespace changes?

 Leader.lead iterates over 'learners' set without proper synchronisation
 ---

 Key: ZOOKEEPER-1361
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1361
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.2
Reporter: Henry Robinson
Assignee: Henry Robinson
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1361.patch


 This block:
 {code}
 HashSet<Long> followerSet = new HashSet<Long>();
 for (LearnerHandler f : learners)
     followerSet.add(f.getSid());
 {code}
 is executed without holding the lock on learners, so if there were ever a 
 condition where a new learner was added during the initial sync phase, I'm 
 pretty sure we'd see a concurrent modification exception. Certainly other 
 parts of the code are very careful to lock on learners when iterating. 
 It would be nice to use a {{ConcurrentHashMap}} to hold the learners instead, 
 but I can't convince myself that this wouldn't introduce some correctness 
 bugs. For example the following:
 Learners contains A, B, C, D
 Thread 1 iterates over learners, and gets as far as B.
 Thread 2 removes A, and adds E.
 Thread 1 continues iterating and sees a learner view of A, B, C, D, E
 This may be a bug if Thread 1 is counting the number of synced followers for 
 a quorum count, since at no point was A, B, C, D, E a correct view of the 
 quorum.
 In practice, I think this is actually ok, because I don't think ZK makes any 
 strong ordering guarantees on learners joining or leaving (so we don't need a 
 strong serialisability guarantee on learners) but I don't think I'll make 
 that change for this patch. Instead I want to clean up the locking protocols 
 on the follower / learner sets - to avoid another easy deadlock like the one 
 we saw in ZOOKEEPER-1294 - and to do less with the lock held; i.e. to copy 
 and then iterate over the copy rather than iterate over a locked set. 
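 A sketch of the "copy under the lock, then iterate the copy" protocol 
 described above (simplified from Leader.lead()):
 {code}
 HashSet<Long> followerSet = new HashSet<Long>();
 synchronized (learners) {
     // Copy while holding the lock to avoid ConcurrentModificationException.
     for (LearnerHandler f : learners) {
         followerSet.add(f.getSid());
     }
 }
 // followerSet can now be read without holding the learners lock.
 {code}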

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1361) Leader.lead iterates over 'learners' set without proper synchronisation

2012-02-26 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216990#comment-13216990
 ] 

Camille Fournier commented on ZOOKEEPER-1361:
-

I can't get it to apply to either 3.4 or trunk...

 Leader.lead iterates over 'learners' set without proper synchronisation
 ---

 Key: ZOOKEEPER-1361
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1361
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.2
Reporter: Henry Robinson
Assignee: Henry Robinson
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1361-no-whitespace.patch, ZOOKEEPER-1361.patch


 This block:
 {code}
 HashSet<Long> followerSet = new HashSet<Long>();
 for (LearnerHandler f : learners)
     followerSet.add(f.getSid());
 {code}
 is executed without holding the lock on learners, so if there were ever a 
 condition where a new learner was added during the initial sync phase, I'm 
 pretty sure we'd see a concurrent modification exception. Certainly other 
 parts of the code are very careful to lock on learners when iterating. 
 It would be nice to use a {{ConcurrentHashMap}} to hold the learners instead, 
 but I can't convince myself that this wouldn't introduce some correctness 
 bugs. For example the following:
 Learners contains A, B, C, D
 Thread 1 iterates over learners, and gets as far as B.
 Thread 2 removes A, and adds E.
 Thread 1 continues iterating and sees a learner view of A, B, C, D, E
 This may be a bug if Thread 1 is counting the number of synced followers for 
 a quorum count, since at no point was A, B, C, D, E a correct view of the 
 quorum.
 In practice, I think this is actually ok, because I don't think ZK makes any 
 strong ordering guarantees on learners joining or leaving (so we don't need a 
 strong serialisability guarantee on learners) but I don't think I'll make 
 that change for this patch. Instead I want to clean up the locking protocols 
 on the follower / learner sets - to avoid another easy deadlock like the one 
 we saw in ZOOKEEPER-1294 - and to do less with the lock held; i.e. to copy 
 and then iterate over the copy rather than iterate over a locked set. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1361) Leader.lead iterates over 'learners' set without proper synchronisation

2012-02-26 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216998#comment-13216998
 ] 

Camille Fournier commented on ZOOKEEPER-1361:
-

No, sorry, that was my mistake. OK, this is looking good, I will check it in.

 Leader.lead iterates over 'learners' set without proper synchronisation
 ---

 Key: ZOOKEEPER-1361
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1361
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.2
Reporter: Henry Robinson
Assignee: Henry Robinson
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1361-no-whitespace.patch, ZOOKEEPER-1361.patch


 This block:
 {code}
 HashSet<Long> followerSet = new HashSet<Long>();
 for (LearnerHandler f : learners)
     followerSet.add(f.getSid());
 {code}
 is executed without holding the lock on learners, so if there were ever a 
 condition where a new learner was added during the initial sync phase, I'm 
 pretty sure we'd see a concurrent modification exception. Certainly other 
 parts of the code are very careful to lock on learners when iterating. 
 It would be nice to use a {{ConcurrentHashMap}} to hold the learners instead, 
 but I can't convince myself that this wouldn't introduce some correctness 
 bugs. For example the following:
 Learners contains A, B, C, D
 Thread 1 iterates over learners, and gets as far as B.
 Thread 2 removes A, and adds E.
 Thread 1 continues iterating and sees a learner view of A, B, C, D, E
 This may be a bug if Thread 1 is counting the number of synced followers for 
 a quorum count, since at no point was A, B, C, D, E a correct view of the 
 quorum.
 In practice, I think this is actually ok, because I don't think ZK makes any 
 strong ordering guarantees on learners joining or leaving (so we don't need a 
 strong serialisability guarantee on learners) but I don't think I'll make 
 that change for this patch. Instead I want to clean up the locking protocols 
 on the follower / learner sets - to avoid another easy deadlock like the one 
 we saw in ZOOKEEPER-1294 - and to do less with the lock held; i.e. to copy 
 and then iterate over the copy rather than iterate over a locked set. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1382) Zookeeper server holds onto dead/expired session ids in the watch data structures

2012-02-15 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208792#comment-13208792
 ] 

Camille Fournier commented on ZOOKEEPER-1382:
-

This is a lot of change for a fix that seems to be really small. Can you put 
this into reviewboard for more careful review? I'm not sure we will want all 
the logging changes so you might want to go through and trim that stuff up 
before putting it up there. Thanks!

 Zookeeper server holds onto dead/expired session ids in the watch data 
 structures
 -

 Key: ZOOKEEPER-1382
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1382
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.4
Reporter: Neha Narkhede
Assignee: Neha Narkhede
 Attachments: ZOOKEEPER-1382_3.3.4.patch


 I've observed that zookeeper server holds onto expired session ids in the 
 watcher data structures. The result is the wchp command reports session ids 
 that cannot be found through cons/dump and those expired session ids sit 
 there maybe until the server is restarted. Here are snippets from the client 
 and the server logs that lead to this state, for one particular session id 
 0x134485fd7bcb26f -
 There are 4 servers in the zookeeper cluster - 223, 224, 225 (leader), 226 
 and I'm using ZkClient to connect to the cluster
 From the application log -
 application.log.2012-01-26-325.gz:2012/01/26 04:56:36.177 INFO [ClientCnxn] 
 [main-SendThread(223.prod:12913)] [application Session establishment complete 
 on server 223.prod/172.17.135.38:12913, sessionid = 0x134485fd7bcb26f, 
 negotiated timeout = 6000
 application.log.2012-01-27.gz:2012/01/27 09:52:37.714 INFO [ClientCnxn] 
 [main-SendThread(223.prod:12913)] [application] Client session timed out, 
 have not heard from server in 9827ms for sessionid 0x134485fd7bcb26f, closing 
 socket connection and attempting reconnect
 application.log.2012-01-27.gz:2012/01/27 09:52:38.191 INFO [ClientCnxn] 
 [main-SendThread(226.prod:12913)] [application] Unable to reconnect to 
 ZooKeeper service, session 0x134485fd7bcb26f has expired, closing socket 
 connection
 On the leader zk, 225 -
 zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
 [SessionTracker:ZooKeeperServer@314] - Expiring session 0x134485fd7bcb26f, 
 timeout of 6000ms exceeded
 zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
 [ProcessThread:-1:PrepRequestProcessor@391] - Processed session termination 
 for sessionid: 0x134485fd7bcb26f
 On the server, the client was initially connected to, 223 -
 zookeeper.log.2012-01-26-223.gz:2012-01-26 04:56:36,173 - INFO  
 [CommitProcessor:1:NIOServerCnxn@1580] - Established session 
 0x134485fd7bcb26f with negotiated timeout 6000 for client /172.17.136.82:45020
 zookeeper.log.2012-01-27-223.gz:2012-01-27 09:52:34,018 - INFO  
 [CommitProcessor:1:NIOServerCnxn@1435] - Closed socket connection for client 
 /172.17.136.82:45020 which had sessionid 0x134485fd7bcb26f
 Here are the log snippets from 226, which is the server, the client 
 reconnected to, before getting session expired event -
 2012-01-27 09:52:38,190 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client 
 attempting to renew session 0x134485fd7bcb26f at /172.17.136.82:49367
 2012-01-27 09:52:38,191 - INFO  
 [QuorumPeer:/0.0.0.0:12913:NIOServerCnxn@1573] - Invalid session 
 0x134485fd7bcb26f for client /172.17.136.82:49367, probably expired
 2012-01-27 09:52:38,191 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed 
 socket connection for client /172.17.136.82:49367 which had sessionid 
 0x134485fd7bcb26f
 wchp output from 226, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *226.*wchp* | wc -l
 3
 wchp output from 223, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *223.*wchp* | wc -l
 0
 cons output from 223 and 226, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *226.*cons* | wc -l
 0
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *223.*cons* | wc -l
 0
 So, what seems to have happened is that the client was able to re-register 
 the watches on the new server (226), after it got disconnected from 223, 
 in spite of having an expired session id. 
 In NIOServerCnxn, I saw that after suspecting that a session is expired, a 
 server removes the cnxn and its watches from its internal data structures. 
 But before that it allows more requests to be processed even if the session 
 is expired -
 // Now that the session is ready we can start receiving packets
 synchronized 

[jira] [Commented] (ZOOKEEPER-1390) some expensive debug code not protected by a check for debug

2012-02-14 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207891#comment-13207891
 ] 

Camille Fournier commented on ZOOKEEPER-1390:
-

Ok, I'm pretty flexible on it. Added it also as a 3.4.X issue since it's 
present there as well.

 some expensive debug code not protected by a check for debug
 

 Key: ZOOKEEPER-1390
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1390
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Reporter: Benjamin Reed
 Fix For: 3.5.0, 3.4.4

 Attachments: ZOOKEEPER-1390.patch


 there is some expensive debug code in DataTree.processTxn() that formats 
 transactions for debugging; the formatting is very expensive but the output 
 is only used when errors happen and when debugging is turned on.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1390) some expensive debug code not protected by a check for debug

2012-02-10 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205508#comment-13205508
 ] 

Camille Fournier commented on ZOOKEEPER-1390:
-

Do you think we might want to leave in those more descriptive debug strings but 
guarded by an if (LOG.isDebugEnabled())? I don't care either way but it might 
be useful.

Otherwise this looks good to me, good catch.
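
For reference, a hedged sketch of the guard being discussed, assuming an 
slf4j-style logger; the txn variable stands in for whatever is being formatted:
{noformat}
// Keep the descriptive debug string, but only pay for the expensive
// formatting when debug logging is actually enabled.
if (LOG.isDebugEnabled()) {
    LOG.debug("Processed txn: " + txn);
}
{noformat}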

 some expensive debug code not protected by a check for debug
 

 Key: ZOOKEEPER-1390
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1390
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Reporter: Benjamin Reed
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1390.patch


 there is some expensive debug code in DataTree.processTxn() that formats 
 transactions for debugging; the formatting is very expensive but the output is 
 only used when errors happen and when debugging is turned on.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1321) Add number of client connections metric in JMX and srvr

2012-02-10 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205864#comment-13205864
 ] 

Camille Fournier commented on ZOOKEEPER-1321:
-

Great. I'm going to check this in now.

 Add number of client connections metric in JMX and srvr
 ---

 Key: ZOOKEEPER-1321
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1321
 Project: ZooKeeper
  Issue Type: Improvement
Affects Versions: 3.3.4, 3.4.2
Reporter: Neha Narkhede
Assignee: Neha Narkhede
  Labels: patch
 Attachments: ZK-1321-nowhitespace.patch, ZOOKEEPER-1321_3.4.patch, 
 ZOOKEEPER-1321_trunk.patch, ZOOKEEPER-1321_trunk.patch, zk-1321-cleanup, 
 zk-1321-trunk.patch, zk-1321.patch, zookeeper-1321-trunk-v2.patch


 The related conversation on the zookeeper user mailing list is here - 
 http://apache.markmail.org/message/4jjcmooniowwugu2?q=+list:org.apache.hadoop.zookeeper-user
 It is useful to be able to monitor the number of disconnect operations on a 
 client. This is generally indicative of a client going through a large number 
 of GCs and hence disconnecting way too often from a ZooKeeper cluster. 
 Today, this information is only indirectly exposed as part of the stat 
 command, which requires counting the results. That's a lot of work for the 
 server to do just to get the connection count. 
 For monitoring purposes, it will be useful to have this exposed through JMX 
 and 4lw srvr.
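 A hedged sketch of the kind of JMX exposure being requested (names are 
 hypothetical, not the patch's actual API):
 {noformat}
 import java.util.concurrent.atomic.AtomicInteger;

 // Expose the live connection count as a JMX attribute instead of making
 // monitoring tools count the output of the stat command.
 public interface ConnectionCountMXBean {
     int getNumAliveConnections();
 }

 public class ConnectionCount implements ConnectionCountMXBean {
     private final AtomicInteger alive = new AtomicInteger();

     public void connectionOpened() { alive.incrementAndGet(); }
     public void connectionClosed() { alive.decrementAndGet(); }

     public int getNumAliveConnections() { return alive.get(); }
 }
 {noformat}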

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1383) Create update throughput quotas and add hard quota limits

2012-02-10 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205964#comment-13205964
 ] 

Camille Fournier commented on ZOOKEEPER-1383:
-

So, in short, I'm -1 on this until it stops breaking backwards compatibility. 
Might consider adding the update throughput quotas separately from hard quota 
limits.

 Create update throughput quotas and add hard quota limits
 -

 Key: ZOOKEEPER-1383
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1383
 Project: ZooKeeper
  Issue Type: New Feature
  Components: server
Reporter: Jay Shrauner
Assignee: Jay Shrauner
Priority: Minor
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1383.patch, ZOOKEEPER-1383.patch


 Quotas exist for size (node count and size in bytes); it would be useful to 
 track and support quotas on update throughput (bytes per second) as well. 
 This can be tracked both at the node/subtree level for quota support and at 
 the server level for monitoring.
 In addition, the existing quotas log a warning when they are exceeded but 
 allow the transaction to proceed (soft quotas). It would also be useful to 
 support a corresponding set of hard quota limits that fail the transaction.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1383) Create update throughput quotas and add hard quota limits

2012-02-07 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202969#comment-13202969
 ] 

Camille Fournier commented on ZOOKEEPER-1383:
-

This change is definitely going to break backwards compatibility of clients in 
a major way. I'm not sure that it can go into a 3.X release unless we can make 
it not break backwards compatibility.

 Create update throughput quotas and add hard quota limits
 -

 Key: ZOOKEEPER-1383
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1383
 Project: ZooKeeper
  Issue Type: New Feature
  Components: server
Reporter: Jay Shrauner
Assignee: Jay Shrauner
Priority: Minor
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1383.patch, ZOOKEEPER-1383.patch


 Quotas exist for size (node count and size in bytes); it would be useful to 
 track and support quotas on update throughput (bytes per second) as well. 
 This can be tracked both at the node/subtree level for quota support and at 
 the server level for monitoring.
 In addition, the existing quotas log a warning when they are exceeded but 
 allow the transaction to proceed (soft quotas). It would also be useful to 
 support a corresponding set of hard quota limits that fail the transaction.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart

2012-01-30 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13196168#comment-13196168
 ] 

Camille Fournier commented on ZOOKEEPER-1367:
-

Are we not seeing it in 3.3? It seems to me, glancing at the code, that we 
should also be vulnerable to this there.

 Data inconsistencies and unexpired ephemeral nodes after cluster restart
 

 Key: ZOOKEEPER-1367
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.2
 Environment: Debian Squeeze, 64-bit
Reporter: Jeremy Stribling
Assignee: Benjamin Reed
Priority: Blocker
 Fix For: 3.4.3

 Attachments: 1367-3.3.patch, ZOOKEEPER-1367-3.4.patch, 
 ZOOKEEPER-1367.patch, ZOOKEEPER-1367.patch, ZOOKEEPER-1367.tgz


 In one of our tests, we have a cluster of three ZooKeeper servers.  We kill 
 all three, and then restart just two of them.  Sometimes we notice that on 
 one of the restarted servers, ephemeral nodes from previous sessions do not 
 get deleted, while on the other server they do.  We are effectively running 
 3.4.2, though technically we are running 3.4.1 with the patch manually 
 applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for 
 ZOOKEEPER-1163.
 I noticed that when I connected using zkCli.sh to the first node (90.0.0.221, 
 zkid 84), I saw only one znode in a particular path:
 {quote}
 [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
 [nominee11]
 [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 {quote}
 However, when I connect zkCli.sh to the second server (90.0.0.222, zkid 251), 
 I saw three znodes under that same path:
 {quote}
 [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
 nominee06   nominee10   nominee11
 [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee10
 90.0.0.221: 
 cZxid = 0x3014c
 ctime = Thu Jan 19 07:53:42 UTC 2012
 mZxid = 0x3014c
 mtime = Thu Jan 19 07:53:42 UTC 2012
 pZxid = 0x3014c
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc22
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee06
 90.0.0.223: 
 cZxid = 0x20cab
 ctime = Thu Jan 19 08:00:30 UTC 2012
 mZxid = 0x20cab
 mtime = Thu Jan 19 08:00:30 UTC 2012
 pZxid = 0x20cab
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x5434f5074e040002
 dataLength = 16
 numChildren = 0
 {quote}
 These never went away for the lifetime of the server, for any clients 
 connected directly to that server.  Note that this cluster is configured to 
 have all three servers still, the third one being down (90.0.0.223, zkid 162).
 I captured the data/snapshot directories for the two live servers.  When 
 I start single-node servers using each directory, I can briefly see that the 
 inconsistent data is present in those logs, though the ephemeral nodes seem 
 to get (correctly) cleaned up pretty soon after I start the server.
 I will upload a tar containing the debug logs and data directories from the 
 failure.  I think we can reproduce it regularly if you need more info.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart

2012-01-27 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13195218#comment-13195218
 ] 

Camille Fournier commented on ZOOKEEPER-1367:
-

{quote}
On run8.log from 90.0.0.2, we can see that it adds the session 
(1e3516a4bb77) to the sessions list (see FileTxnSnapLog), and it got it 
from its own transaction log. But, the leader (90.0.0.1) supposedly knows of 
that session as well, otherwise it was not committed or leader election didn't 
select the right server. Checking the leader election notification messages, I 
can't see any problem. The part about the leader being aware of that session so 
that it can recreate it is the one we can't verify because we don't have 
run8.log for 90.0.0.1.
{quote}

Server 1 (90.0.0.1) is not the leader at the time that session is created; 
server 2 is the leader. Server 1 is not even in the quorum at that point; this 
is just after 2 has gained leadership with 3 as follower.

 Data inconsistencies and unexpired ephemeral nodes after cluster restart
 

 Key: ZOOKEEPER-1367
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.2
 Environment: Debian Squeeze, 64-bit
Reporter: Jeremy Stribling
Priority: Blocker
 Fix For: 3.4.3

 Attachments: ZOOKEEPER-1367.tgz


 In one of our tests, we have a cluster of three ZooKeeper servers.  We kill 
 all three, and then restart just two of them.  Sometimes we notice that on 
 one of the restarted servers, ephemeral nodes from previous sessions do not 
 get deleted, while on the other server they do.  We are effectively running 
 3.4.2, though technically we are running 3.4.1 with the patch manually 
 applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for 
 ZOOKEEPER-1163.
 I noticed that when I connected using zkCli.sh to the first node (90.0.0.221, 
 zkid 84), I saw only one znode in a particular path:
 {quote}
 [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
 [nominee11]
 [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 {quote}
 However, when I connect zkCli.sh to the second server (90.0.0.222, zkid 251), 
 I saw three znodes under that same path:
 {quote}
 [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
 nominee06   nominee10   nominee11
 [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee10
 90.0.0.221: 
 cZxid = 0x3014c
 ctime = Thu Jan 19 07:53:42 UTC 2012
 mZxid = 0x3014c
 mtime = Thu Jan 19 07:53:42 UTC 2012
 pZxid = 0x3014c
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc22
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee06
 90.0.0.223: 
 cZxid = 0x20cab
 ctime = Thu Jan 19 08:00:30 UTC 2012
 mZxid = 0x20cab
 mtime = Thu Jan 19 08:00:30 UTC 2012
 pZxid = 0x20cab
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x5434f5074e040002
 dataLength = 16
 numChildren = 0
 {quote}
 These never went away for the lifetime of the server, for any clients 
 connected directly to that server.  Note that this cluster is configured to 
 have all three servers still, the third one being down (90.0.0.223, zkid 162).
 I captured the data/snapshot directories for the two live servers.  When 
 I start single-node servers using each directory, I can briefly see that the 
 inconsistent data is present in those logs, though the ephemeral nodes seem 
 to get (correctly) cleaned up pretty soon after I start the server.
 I will upload a tar containing the debug logs and data directories from the 
 failure.  I think we can reproduce it regularly if you need more info.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1366) Zookeeper should be tolerant of clock adjustments

2012-01-23 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13191396#comment-13191396
 ] 

Camille Fournier commented on ZOOKEEPER-1366:
-

@Henry: I am fine with doing it as a separate ticket. I do think it's pretty 
trivial to rework this and get ourselves far down the road with a non-static 
impl, and I'm not sure that we need to address Thread.sleep() to get a lot of 
mileage out of the solution. But I don't think I'll have time to rework this 
patch to do that, so we might as well do it in a separate ticket if Ted doesn't 
want to worry about it.

 Zookeeper should be tolerant of clock adjustments
 -

 Key: ZOOKEEPER-1366
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1366
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Ted Dunning
Assignee: Ted Dunning
 Fix For: 3.4.3

 Attachments: ZOOKEEPER-1366-3.3.3.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch


 If you want to wreak havoc on a ZK-based system just do [date -s +1hour] 
 and watch the mayhem as all sessions expire at once.
 This shouldn't happen.  ZooKeeper could easily handle elapsed times as 
 elapsed times rather than as differences between absolute times.  The 
 absolute times are subject to adjustment when the clock is set while a timer 
 is not subject to this problem.  In Java, System.currentTimeMillis() gives 
 you absolute time while System.nanoTime() gives you time based on a timer 
 from an arbitrary epoch.
 I have done this and have been running tests now for some tens of minutes 
 with no failures.  I will set up a test machine to redo the build again on 
 Ubuntu and post a patch here for discussion.
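 A minimal sketch of the elapsed-time approach, assuming a simple 
 session-touch check (illustrative names, not the patch itself):
 {noformat}
 import java.util.concurrent.TimeUnit;

 class ElapsedTimeSketch {
     // Monotonic nanoTime() intervals are immune to wall-clock jumps such
     // as "date -s +1hour"; currentTimeMillis() differences are not.
     static boolean sessionExpired(long lastTouchNanos, long timeoutMs) {
         long elapsedMs =
             TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - lastTouchNanos);
         return elapsedMs > timeoutMs;
     }
 }
 {noformat}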

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1366) Zookeeper should be tolerant of clock adjustments

2012-01-22 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190832#comment-13190832
 ] 

Camille Fournier commented on ZOOKEEPER-1366:
-

The test in TimerTest is missing the @Test annotation, which I presume is an 
oversight.

 Zookeeper should be tolerant of clock adjustments
 -

 Key: ZOOKEEPER-1366
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1366
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Ted Dunning
 Fix For: 3.4.3

 Attachments: ZOOKEEPER-1366-3.3.3.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch


 If you want to wreak havoc on a ZK-based system just do [date -s +1hour] 
 and watch the mayhem as all sessions expire at once.
 This shouldn't happen.  ZooKeeper could easily handle elapsed times as 
 elapsed times rather than as differences between absolute times.  The 
 absolute times are subject to adjustment when the clock is set while a timer 
 is not subject to this problem.  In Java, System.currentTimeMillis() gives 
 you absolute time while System.nanoTime() gives you time based on a timer 
 from an arbitrary epoch.
 I have done this and have been running tests now for some tens of minutes 
 with no failures.  I will set up a test machine to redo the build again on 
 Ubuntu and post a patch here for discussion.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1366) Zookeeper should be tolerant of clock adjustments

2012-01-22 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190837#comment-13190837
 ] 

Camille Fournier commented on ZOOKEEPER-1366:
-

So in general, I think this is a good patch and a very good thing for us to do. 
But I feel like Henry's comment is most interesting:

{quote}
The nice thing is that this is a small step towards a properly mockable time 
API in ZK, which would a) make tests much faster and b) make tests much more 
deterministic. There's a way to go still because of Thread.sleep and other 
complications, but this is a good first step.
{quote}

We really aren't doing all that much towards that end by replacing one static 
method call with another. You still can't mock that in Mockito. So the only 
question I have here is, if we're going to touch all those places anyway, 
should we just be creating an actual thin object that wraps time and use 
non-static methods on that object to make these calls, in order to allow more 
mocking of timing issues in the future? Or should we save that for another 
patch?
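
A hedged sketch of the thin time wrapper floated above (hypothetical names, 
not an actual ZooKeeper API):
{noformat}
// Production code uses the real clock; tests substitute ManualClock and
// advance it deterministically instead of sleeping.
interface Clock {
    long nanoTime();
}

class SystemClock implements Clock {
    public long nanoTime() { return System.nanoTime(); }
}

class ManualClock implements Clock {
    private long now;
    public long nanoTime() { return now; }
    void advanceMillis(long ms) { now += ms * 1000000L; }
}
{noformat}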

 Zookeeper should be tolerant of clock adjustments
 -

 Key: ZOOKEEPER-1366
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1366
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Ted Dunning
 Fix For: 3.4.3

 Attachments: ZOOKEEPER-1366-3.3.3.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch


 If you want to wreak havoc on a ZK-based system just do [date -s +1hour] 
 and watch the mayhem as all sessions expire at once.
 This shouldn't happen.  ZooKeeper could easily handle elapsed times as 
 elapsed times rather than as differences between absolute times.  The 
 absolute times are subject to adjustment when the clock is set while a timer 
 is not subject to this problem.  In Java, System.currentTimeMillis() gives 
 you absolute time while System.nanoTime() gives you time based on a timer 
 from an arbitrary epoch.
 I have done this and have been running tests now for some tens of minutes 
 with no failures.  I will set up a test machine to redo the build again on 
 Ubuntu and post a patch here for discussion.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart

2012-01-21 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190518#comment-13190518
 ] 

Camille Fournier commented on ZOOKEEPER-1367:
-

Jeremy pretty much always brings us good bugs, Ted, I don't think he's wasting 
our time.

Jeremy, these logs are from the point at which the cluster is running with two 
members and 221 doesn't have the nodes, but 222 does, correct?

I'm noticing that in the log files I don't see a close session transaction for 
the session that created /election/zkrsm/nominee10. Just verifying: the 
cluster is accepting write requests and client connections successfully at the 
point you captured these logs, right? 

 Data inconsistencies and unexpired ephemeral nodes after cluster restart
 

 Key: ZOOKEEPER-1367
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.2
 Environment: Debian Squeeze, 64-bit
Reporter: Jeremy Stribling
Priority: Blocker
 Fix For: 3.4.3

 Attachments: ZOOKEEPER-1367.tgz


 In one of our tests, we have a cluster of three ZooKeeper servers.  We kill 
 all three, and then restart just two of them.  Sometimes we notice that on 
 one of the restarted servers, ephemeral nodes from previous sessions do not 
 get deleted, while on the other server they do.  We are effectively running 
 3.4.2, though technically we are running 3.4.1 with the patch manually 
 applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for 
 ZOOKEEPER-1163.
 I noticed that when I connected using zkCli.sh to the first node (90.0.0.221, 
 zkid 84), I saw only one znode in a particular path:
 {quote}
 [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
 [nominee11]
 [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 {quote}
 However, when I connect zkCli.sh to the second server (90.0.0.222, zkid 251), 
 I saw three znodes under that same path:
 {quote}
 [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
 nominee06   nominee10   nominee11
 [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee10
 90.0.0.221: 
 cZxid = 0x3014c
 ctime = Thu Jan 19 07:53:42 UTC 2012
 mZxid = 0x3014c
 mtime = Thu Jan 19 07:53:42 UTC 2012
 pZxid = 0x3014c
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc22
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee06
 90.0.0.223: 
 cZxid = 0x20cab
 ctime = Thu Jan 19 08:00:30 UTC 2012
 mZxid = 0x20cab
 mtime = Thu Jan 19 08:00:30 UTC 2012
 pZxid = 0x20cab
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x5434f5074e040002
 dataLength = 16
 numChildren = 0
 {quote}
 These never went away for the lifetime of the server, for any clients 
 connected directly to that server.  Note that this cluster is configured to 
 have all three servers still, the third one being down (90.0.0.223, zkid 162).
 I captured the data/snapshot directories for the two live servers.  When 
 I start single-node servers using each directory, I can briefly see that the 
 inconsistent data is present in those logs, though the ephemeral nodes seem 
 to get (correctly) cleaned up pretty soon after I start the server.
 I will upload a tar containing the debug logs and data directories from the 
 failure.  I think we can reproduce it regularly if you need more info.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart

2012-01-21 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190528#comment-13190528
 ] 

Camille Fournier commented on ZOOKEEPER-1367:
-

So I pulled up a cluster on my local machine using these logs, and the two 
machines in my cluster did correctly expire all the ephemeral nodes you show in 
the errors. I'm going to assume that when you bring up a 2-node cluster with 
your setup and these data directories, you see the bad ephemeral nodes, 
correct? If so, can you try doing it with the latest 3.4.2 jar and see if it 
still happens?

 Data inconsistencies and unexpired ephemeral nodes after cluster restart
 

 Key: ZOOKEEPER-1367
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.2
 Environment: Debian Squeeze, 64-bit
Reporter: Jeremy Stribling
Priority: Blocker
 Fix For: 3.4.3

 Attachments: ZOOKEEPER-1367.tgz


 In one of our tests, we have a cluster of three ZooKeeper servers.  We kill 
 all three, and then restart just two of them.  Sometimes we notice that on 
 one of the restarted servers, ephemeral nodes from previous sessions do not 
 get deleted, while on the other server they do.  We are effectively running 
 3.4.2, though technically we are running 3.4.1 with the patch manually 
 applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for 
 ZOOKEEPER-1163.
 I noticed that when I connected using zkCli.sh to the first node (90.0.0.221, 
 zkid 84), I saw only one znode in a particular path:
 {quote}
 [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
 [nominee11]
 [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 {quote}
 However, when I connect zkCli.sh to the second server (90.0.0.222, zkid 251), 
 I saw three znodes under that same path:
 {quote}
 [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
 nominee06   nominee10   nominee11
 [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee10
 90.0.0.221: 
 cZxid = 0x3014c
 ctime = Thu Jan 19 07:53:42 UTC 2012
 mZxid = 0x3014c
 mtime = Thu Jan 19 07:53:42 UTC 2012
 pZxid = 0x3014c
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc22
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee06
 90.0.0.223: 
 cZxid = 0x20cab
 ctime = Thu Jan 19 08:00:30 UTC 2012
 mZxid = 0x20cab
 mtime = Thu Jan 19 08:00:30 UTC 2012
 pZxid = 0x20cab
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x5434f5074e040002
 dataLength = 16
 numChildren = 0
 {quote}
 These never went away for the lifetime of the server, for any clients 
 connected directly to that server.  Note that this cluster is configured to 
 have all three servers still, the third one being down (90.0.0.223, zkid 162).
 I captured the data/snapshot directories for the two live servers.  When 
 I start single-node servers using each directory, I can briefly see that the 
 inconsistent data is present in those logs, though the ephemeral nodes seem 
 to get (correctly) cleaned up pretty soon after I start the server.
 I will upload a tar containing the debug logs and data directories from the 
 failure.  I think we can reproduce it regularly if you need more info.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart

2012-01-21 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190566#comment-13190566
 ] 

Camille Fournier commented on ZOOKEEPER-1367:
-

Hmmm, I must be confused. I thought that the test you were running resulted in 
the cluster being in this setup, with the two nodes running and a third down, 
with these data directories. But if I start the cluster with two nodes and 
these data directories, the sessions immediately expire and delete those nodes. 
On the other hand, in the logs I don't see any evidence of session expiration 
for the sessions holding the ephemerals on either machine. When you get into 
this situation, if you bounce the cluster again with the two nodes, does it fix 
the problem? 

I don't know, without checking, if there's anything relevant in 3.4.2, but it 
seems like a worthwhile sanity check to do.  

 Data inconsistencies and unexpired ephemeral nodes after cluster restart
 

 Key: ZOOKEEPER-1367
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.2
 Environment: Debian Squeeze, 64-bit
Reporter: Jeremy Stribling
Priority: Blocker
 Fix For: 3.4.3

 Attachments: ZOOKEEPER-1367.tgz


 In one of our tests, we have a cluster of three ZooKeeper servers.  We kill 
 all three, and then restart just two of them.  Sometimes we notice that on 
 one of the restarted servers, ephemeral nodes from previous sessions do not 
 get deleted, while on the other server they do.  We are effectively running 
 3.4.2, though technically we are running 3.4.1 with the patch manually 
 applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for 
 ZOOKEEPER-1163.
 I noticed that when I connected using zkCli.sh to the first node (90.0.0.221, 
 zkid 84), I saw only one znode in a particular path:
 {quote}
 [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
 [nominee11]
 [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 {quote}
 However, when I connect zkCli.sh to the second server (90.0.0.222, zkid 251), 
 I saw three znodes under that same path:
 {quote}
 [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
 nominee06   nominee10   nominee11
 [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee10
 90.0.0.221: 
 cZxid = 0x3014c
 ctime = Thu Jan 19 07:53:42 UTC 2012
 mZxid = 0x3014c
 mtime = Thu Jan 19 07:53:42 UTC 2012
 pZxid = 0x3014c
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc22
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee06
 90.0.0.223: 
 cZxid = 0x20cab
 ctime = Thu Jan 19 08:00:30 UTC 2012
 mZxid = 0x20cab
 mtime = Thu Jan 19 08:00:30 UTC 2012
 pZxid = 0x20cab
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x5434f5074e040002
 dataLength = 16
 numChildren = 0
 {quote}
 These never went away for the lifetime of the server, for any clients 
 connected directly to that server.  Note that this cluster is configured to 
 have all three servers still, the third one being down (90.0.0.223, zkid 162).
 I captured the data/snapshot directories for the two live servers.  When 
 I start single-node servers using each directory, I can briefly see that the 
 inconsistent data is present in those logs, though the ephemeral nodes seem 
 to get (correctly) cleaned up pretty soon after I start the server.
 I will upload a tar containing the debug logs and data directories from the 
 failure.  I think we can reproduce it regularly if you need more info.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart

2012-01-20 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190066#comment-13190066
 ] 

Camille Fournier commented on ZOOKEEPER-1367:
-

I'll take a look this weekend unless someone's on it now.

 Data inconsistencies and unexpired ephemeral nodes after cluster restart
 

 Key: ZOOKEEPER-1367
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.2
 Environment: Debian Squeeze, 64-bit
Reporter: Jeremy Stribling
Priority: Blocker
 Fix For: 3.4.3

 Attachments: ZOOKEEPER-1367.tgz


 In one of our tests, we have a cluster of three ZooKeeper servers.  We kill 
 all three, and then restart just two of them.  Sometimes we notice that on 
 one of the restarted servers, ephemeral nodes from previous sessions do not 
 get deleted, while on the other server they do.  We are effectively running 
 3.4.2, though technically we are running 3.4.1 with the patch manually 
 applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for 
 ZOOKEEPER-1163.
 I noticed that when I connected using zkCli.sh to the first node (90.0.0.221, 
 zkid 84), I saw only one znode in a particular path:
 {quote}
 [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
 [nominee11]
 [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 {quote}
 However, when I connect zkCli.sh to the second server (90.0.0.222, zkid 251), 
 I saw three znodes under that same path:
 {quote}
 [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
 nominee06   nominee10   nominee11
 [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee11
 90.0.0.222: 
 cZxid = 0x40027
 ctime = Thu Jan 19 08:18:24 UTC 2012
 mZxid = 0x40027
 mtime = Thu Jan 19 08:18:24 UTC 2012
 pZxid = 0x40027
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc220001
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee10
 90.0.0.221: 
 cZxid = 0x3014c
 ctime = Thu Jan 19 07:53:42 UTC 2012
 mZxid = 0x3014c
 mtime = Thu Jan 19 07:53:42 UTC 2012
 pZxid = 0x3014c
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0xa234f4f3bc22
 dataLength = 16
 numChildren = 0
 [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee06
 90.0.0.223: 
 cZxid = 0x20cab
 ctime = Thu Jan 19 08:00:30 UTC 2012
 mZxid = 0x20cab
 mtime = Thu Jan 19 08:00:30 UTC 2012
 pZxid = 0x20cab
 cversion = 0
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x5434f5074e040002
 dataLength = 16
 numChildren = 0
 {quote}
 These never went away for the lifetime of the server, for any clients 
 connected directly to that server.  Note that this cluster is configured to 
 have all three servers still, the third one being down (90.0.0.223, zkid 162).
 I captured the data/snapshot directories for the two live servers.  When 
 I start single-node servers using each directory, I can briefly see that the 
 inconsistent data is present in those logs, though the ephemeral nodes seem 
 to get (correctly) cleaned up pretty soon after I start the server.
 I will upload a tar containing the debug logs and data directories from the 
 failure.  I think we can reproduce it regularly if you need more info.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1321) Add number of client connections metric in JMX and srvr

2012-01-16 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186975#comment-13186975
 ] 

Camille Fournier commented on ZOOKEEPER-1321:
-

If one of the other committers wants to take a quick glance at the cleanup 
patch, that would be great; I can then check it in with your OK.

 Add number of client connections metric in JMX and srvr
 ---

 Key: ZOOKEEPER-1321
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1321
 Project: ZooKeeper
  Issue Type: Improvement
Affects Versions: 3.3.4, 3.4.2
Reporter: Neha Narkhede
Assignee: Neha Narkhede
  Labels: patch
 Attachments: ZOOKEEPER-1321_3.4.patch, ZOOKEEPER-1321_trunk.patch, 
 ZOOKEEPER-1321_trunk.patch, zk-1321-cleanup, zookeeper-1321-trunk-v2.patch


 The related conversation on the zookeeper user mailing list is here - 
 http://apache.markmail.org/message/4jjcmooniowwugu2?q=+list:org.apache.hadoop.zookeeper-user
 It is useful to be able to monitor the number of disconnect operations on a 
 client. This is generally indicative of a client going through large number 
 of GC and hence disconnecting way too often from a zookeeper cluster. 
 Today, this information is only indirectly exposed as part of the stat 
 command which requires counting the results. That's alot of work for the 
 server to do just to get connection count. 
 For monitoring purposes, it will be useful to have this exposed through JMX 
 and 4lw srvr.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1358) In StaticHostProviderTest.java, testNextDoesNotSleepForZero tests that hostProvider.next(0) doesn't sleep by checking that the latency of this call is less than 10s

2012-01-15 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186650#comment-13186650
 ] 

Camille Fournier commented on ZOOKEEPER-1358:
-

This looks good to me. I will check it in.

 In StaticHostProviderTest.java, testNextDoesNotSleepForZero tests that 
 hostProvider.next(0) doesn't sleep by checking that the latency of this call 
 is less than 10sec
 --

 Key: ZOOKEEPER-1358
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1358
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Alexander Shraer
Assignee: Alexander Shraer
Priority: Trivial
 Fix For: 3.2.3

 Attachments: ZOOKEEPER-1358.patch, ZOOKEEPER-1358.patch


 should check for something smaller, perhaps 1ms or 5ms
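 A hedged sketch of the tightened assertion (hostProvider is assumed from the 
 existing test fixture; imports of TimeUnit and Assert assumed as well):
 {noformat}
 // next(0) must not sleep, so it should return in well under 5ms.
 long start = System.nanoTime();
 hostProvider.next(0);
 long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
 Assert.assertTrue("next(0) took " + elapsedMs + "ms", elapsedMs < 5);
 {noformat}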

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1351) invalid test verification in MultiTransactionTest

2012-01-15 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186661#comment-13186661
 ] 

Camille Fournier commented on ZOOKEEPER-1351:
-

Looks good to me. Will check this in.

 invalid test verification in MultiTransactionTest
 -

 Key: ZOOKEEPER-1351
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1351
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.4.0
Reporter: Patrick Hunt
Assignee: Patrick Hunt
 Fix For: 3.4.3, 3.5.0

 Attachments: ZOOKEEPER-1351.patch, ZOOKEEPER-1351_br34.patch


 tests such as 
 org.apache.zookeeper.test.MultiTransactionTest.testWatchesTriggered() are 
 incorrect. Two issues I see:
 1) zk.sync is async, there is no guarantee that the watcher will be called 
 subsequent to sync returning
 {noformat}
 zk.sync("/", null, null);
 assertTrue(watcher.triggered); // incorrect assumption
 {noformat}
 The callback needs to be implemented; only once the callback is called can we 
 verify the trigger.
 2) trigger is not declared as volatile, even though it will be set in the 
 context of a different thread (the event thread)
 See 
 https://builds.apache.org/view/S-Z/view/ZooKeeper/job/ZooKeeper-trunk-solaris/91/testReport/junit/org.apache.zookeeper.test/MultiTransactionTest/testWatchesTriggered/
 for an example of a false positive failure
 {noformat}
 junit.framework.AssertionFailedError
   at 
 org.apache.zookeeper.test.MultiTransactionTest.testWatchesTriggered(MultiTransactionTest.java:236)
   at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}
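 A hedged sketch of the fix described in 1): implement the callback and wait 
 on it before asserting (zk and watcher are assumed from the existing test; 
 the latch is illustrative, and CountDownLatch/TimeUnit/AsyncCallback/Assert 
 imports are assumed):
 {noformat}
 final CountDownLatch synced = new CountDownLatch(1);
 zk.sync("/", new AsyncCallback.VoidCallback() {
     public void processResult(int rc, String path, Object ctx) {
         synced.countDown();
     }
 }, null);
 Assert.assertTrue(synced.await(10, TimeUnit.SECONDS));
 Assert.assertTrue(watcher.triggered); // with 'triggered' declared volatile
 {noformat}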

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1183) Enhance LogFormatter to output additional detail from transaction log

2012-01-08 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13182344#comment-13182344
 ] 

Camille Fournier commented on ZOOKEEPER-1183:
-

Honestly, I think you're getting a bit ambitious for this ticket. I think you 
should simply enhance the LogFormatter to a degree that makes sense; for any 
additional tooling, either open a new ticket or perhaps start a GitHub project 
for the work. 

 Enhance LogFormatter to output additional detail from transaction log
 -

 Key: ZOOKEEPER-1183
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1183
 Project: ZooKeeper
  Issue Type: Improvement
Affects Versions: 3.4.0
Reporter: kishore gopalakrishna
Assignee: kishore gopalakrishna
Priority: Minor
 Attachments: ZOOKEEPER-1183.patch


 Current LogFormatter prints the following information
 ZooKeeper Transactional Log File with dbid 0 txnlog format version 2
 8/15/11 1:55:36 PM PDT session 0x131cf1a236f0014 cxid 0x0 zxid 0xf01 
 createSession
 8/15/11 1:55:57 PM PDT session 0x131cf1a236f cxid 0x55f zxid 0xf02 setData
 8/15/11 1:56:00 PM PDT session 0x131cf1a236f0015 cxid 0x0 zxid 0xf03 
 createSession
 ...
 ..
 8/15/11 2:00:33 PM PDT session 0x131cf1a236f001c cxid 0x36 zxid 0xf6b setData
 8/15/11 2:00:33 PM PDT session 0x131cf1a236f0021 cxid 0xa1 zxid 0xf6c create
 8/15/11 2:00:33 PM PDT session 0x131cf1a236f001b cxid 0x3e zxid 0xf6d setData
 8/15/11 2:00:33 PM PDT session 0x131cf1a236f001e cxid 0x3e zxid 0xf6e setData
 8/15/11 2:00:33 PM PDT session 0x131cf1a236f001d cxid 0x41 zxid 0xf6f setData
 Though this is good information, it does not provide additional detail 
 like: 
 createSession: which IP created the session and its timeout
 set|get|delete: the path and data 
 create: the path created and the create mode along with the data
 We can add an additional parameter -detail and provide detailed output of the 
 transaction.
 Outputting data is slightly tricky since we can't print data without 
 understanding the format. We need not print this for now. 
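 A hedged sketch of the -detail idea (CreateSessionTxn is ZooKeeper's real 
 jute-generated txn record; the detail flag and surrounding wiring are 
 illustrative):
 {noformat}
 // With -detail, print extra fields for the txn types that carry them.
 if (detail && txn instanceof CreateSessionTxn) {
     CreateSessionTxn cst = (CreateSessionTxn) txn;
     System.out.println("  session timeout: " + cst.getTimeOut() + "ms");
 }
 {noformat}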

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1354) AuthTest.testBadAuthThenSendOtherCommands fails intermittently

2012-01-07 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13182112#comment-13182112
 ] 

Camille Fournier commented on ZOOKEEPER-1354:
-

Hmmm. Let me take a look.

 AuthTest.testBadAuthThenSendOtherCommands fails intermittently
 --

 Key: ZOOKEEPER-1354
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1354
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.4.0
Reporter: Patrick Hunt
 Fix For: 3.4.3, 3.5.0


 I'm seeing the following intermittent failure:
 {noformat}
 junit.framework.AssertionFailedError: Should have called my watcher 
 expected:<1> but was:<0>
   at 
 org.apache.zookeeper.test.AuthTest.testBadAuthThenSendOtherCommands(AuthTest.java:89)
   at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}
 The following commit introduced this test:
 bq. ZOOKEEPER-1152. Exceptions thrown from handleAuthentication can cause 
 buffer corruption issues in NIOServer. (camille via breed)
 +Assert.assertEquals("Should have called my watcher",
 +1, authFailed.get());
 I think it's due to either 1) the code is not waiting for the
 notification to be propagated, or 2) the message doesn't make it back
 from the server to the client prior to the socket or the clientcnxn
 being closed.
 What do you think, should I just wait for the notification to arrive? Or do 
 you think it's 2)?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1354) AuthTest.testBadAuthThenSendOtherCommands fails intermittently

2012-01-07 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13182118#comment-13182118
 ] 

Camille Fournier commented on ZOOKEEPER-1354:
-

You're getting the AuthFailed exception; the watcher code just didn't execute 
fast enough, so I think it's 1). 
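
A hedged sketch of option 1), waiting for the notification before asserting 
(authFailedLatch is illustrative; the watcher would count it down on 
AuthFailed, and the existing AtomicInteger check stays as the final assert):
{noformat}
Assert.assertTrue("AuthFailed event never arrived",
        authFailedLatch.await(10, TimeUnit.SECONDS));
Assert.assertEquals("Should have called my watcher", 1, authFailed.get());
{noformat}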

 AuthTest.testBadAuthThenSendOtherCommands fails intermittently
 --

 Key: ZOOKEEPER-1354
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1354
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.4.0
Reporter: Patrick Hunt
 Fix For: 3.4.3, 3.5.0


 I'm seeing the following intermittent failure:
 {noformat}
 junit.framework.AssertionFailedError: Should have called my watcher 
 expected:<1> but was:<0>
   at 
 org.apache.zookeeper.test.AuthTest.testBadAuthThenSendOtherCommands(AuthTest.java:89)
   at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}
 The following commit introduced this test:
 bq. ZOOKEEPER-1152. Exceptions thrown from handleAuthentication can cause 
 buffer corruption issues in NIOServer. (camille via breed)
 +Assert.assertEquals("Should have called my watcher",
 +1, authFailed.get());
 I think it's due to either 1) the code is not waiting for the
 notification to be propagated, or 2) the message doesn't make it back
 from the server to the client prior to the socket or the clientcnxn
 being closed.
 What do you think, should I just wait for the notification to arrive? Or do 
 you think it's 2)?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1294) One of the zookeeper server is not accepting any requests

2012-01-07 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13182125#comment-13182125
 ] 

Camille Fournier commented on ZOOKEEPER-1294:
-

Glancing at the code, I think you might be right. Are you planning on writing a 
test and a fix for this or should I?

 One of the zookeeper server is not accepting any requests
 -

 Key: ZOOKEEPER-1294
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1294
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
 Environment: 3 Zookeeper + 3 Observer with SuSe-11
Reporter: amith
Assignee: kavita sharma

 In zoo.cfg i have configured as
 server.1 = XX.XX.XX.XX:65175:65173
 server.2 = XX.XX.XX.XX:65185:65183
 server.3 = XX.XX.XX.XX:65195:65193
 server.4 = XX.XX.XX.XX:65205:65203:observer
 server.5 = XX.XX.XX.XX:65215:65213:observer
 server.6 = XX.XX.XX.XX:65225:65223:observer
 Like above I have configured 3 PARTICIPANTS and 3 OBSERVERS
 in the cluster of 6 zookeepers
 Steps to reproduce the defect
 1. Start all the 3 participant zookeeper
 2. Stop all the participant zookeeper
 3. Start zookeeper 1(Participant)
 4. Start zookeeper 2(Participant)
 5. Start zookeeper 4(Observer)
 6. Create a persistent node with external client and close it
 7. Stop the zookeeper 1 (Participant; now the quorum is unstable)
 8. Create a new client and try to find the node created before using the exists API 
 (will fail since the quorum is not satisfied)
 9. Start the zookeeper 1 (Participant; stabilises the quorum)
 Now check the observer using 4 letter word (Server.4)
 linux-216:/home/amith/CI/source/install/zookeeper/zookeeper2/bin # echo stat 
 | netcat localhost 65200
 Zookeeper version: 3.3.2-1031432, built on 11/05/2010 05:32 GMT
 Clients:
  /127.0.0.1:46370[0](queued=0,recved=1,sent=0)
 Latency min/avg/max: 0/0/0
 Received: 1
 Sent: 0
 Outstanding: 0
 Zxid: 0x10003
 Mode: observer
 Node count: 5
 check the participant 2 with 4 letter word
 Latency min/avg/max: 22/48/83
 Received: 39
 Sent: 3
 Outstanding: 35
 Zxid: 0x10003
 Mode: leader
 Node count: 5
 linux-216:/home/amith/CI/source/install/zookeeper/zookeeper2/bin #
 check the participant 1 with 4 letter word
 linux-216:/home/amith/CI/source/install/zookeeper/zookeeper2/bin # echo stat 
 | netcat localhost 65170
 This ZooKeeper instance is not currently serving requests
 We can see the participant1 logs filled with
 2011-11-08 15:49:51,360 - WARN  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:65170:NIOServerCnxn@642] - Exception 
 causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
 running
 The problem here is that participant1 is not responding to / accepting any requests

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1183) Enhance LogFormatter to output additional detail from transaction log

2012-01-07 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13182133#comment-13182133
 ] 

Camille Fournier commented on ZOOKEEPER-1183:
-

Kishore, are you still interested in working on this? I'm thinking of enhancing 
the LogFormatter a bit more cleanly, debating whether to work on your patch or 
start from scratch.

 Enhance LogFormatter to output additional detail from transaction log
 -

 Key: ZOOKEEPER-1183
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1183
 Project: ZooKeeper
  Issue Type: Improvement
Affects Versions: 3.4.0
Reporter: kishore gopalakrishna
Assignee: kishore gopalakrishna
Priority: Minor
 Attachments: ZOOKEEPER-1183.patch


 Current LogFormatter prints the following information
 ZooKeeper Transactional Log File with dbid 0 txnlog format version 2
 8/15/11 1:55:36 PM PDT session 0x131cf1a236f0014 cxid 0x0 zxid 0xf01 
 createSession
 8/15/11 1:55:57 PM PDT session 0x131cf1a236f cxid 0x55f zxid 0xf02 setData
 8/15/11 1:56:00 PM PDT session 0x131cf1a236f0015 cxid 0x0 zxid 0xf03 
 createSession
 ...
 ..
 8/15/11 2:00:33 PM PDT session 0x131cf1a236f001c cxid 0x36 zxid 0xf6b setData
 8/15/11 2:00:33 PM PDT session 0x131cf1a236f0021 cxid 0xa1 zxid 0xf6c create
 8/15/11 2:00:33 PM PDT session 0x131cf1a236f001b cxid 0x3e zxid 0xf6d setData
 8/15/11 2:00:33 PM PDT session 0x131cf1a236f001e cxid 0x3e zxid 0xf6e setData
 8/15/11 2:00:33 PM PDT session 0x131cf1a236f001d cxid 0x41 zxid 0xf6f setData
 Though this is good information, it does not provide additional detail 
 like: 
 createSession: which IP created the session and its timeout
 set|get|delete: the path and data 
 create: the path created and the create mode along with the data
 We can add an additional parameter -detail and provide detailed output of the 
 transaction.
 Outputting data is slightly tricky since we can't print data without 
 understanding the format. We need not print this for now. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1321) Add number of client connections metric in JMX and srvr

2011-12-28 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176673#comment-13176673
 ] 

Camille Fournier commented on ZOOKEEPER-1321:
-

Looks good modulo an unneeded import in ServerCnxnFactory. I will check this in.

 Add number of client connections metric in JMX and srvr
 ---

 Key: ZOOKEEPER-1321
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1321
 Project: ZooKeeper
  Issue Type: Improvement
Affects Versions: 3.3.4, 3.4.2
Reporter: Neha Narkhede
Assignee: Neha Narkhede
  Labels: patch
 Attachments: ZOOKEEPER-1321_trunk.patch


 The related conversation on the zookeeper user mailing list is here - 
 http://apache.markmail.org/message/4jjcmooniowwugu2?q=+list:org.apache.hadoop.zookeeper-user
 It is useful to be able to monitor the number of disconnect operations on a 
 client. This is generally indicative of a client going through a large number 
 of GCs and hence disconnecting far too often from a ZooKeeper cluster. 
 Today, this information is only indirectly exposed as part of the stat 
 command, which requires counting the results. That's a lot of work for the 
 server to do just to get a connection count. 
 For monitoring purposes, it will be useful to have this exposed through JMX 
 and 4lw srvr.
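A minimal sketch of exposing such a counter over JMX; the MBean interface, 
ObjectName, and methods below are illustrative only (the actual patch wires 
this into ServerCnxnFactory, and the srvr wiring is omitted here):

{noformat}
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical MXBean interface; the "MXBean" suffix makes JMX treat it as one.
interface NumConnectionsMXBean { int getNumAliveConnections(); }

public class ConnectionStats implements NumConnectionsMXBean {
    private final AtomicInteger alive = new AtomicInteger();

    void connectionOpened() { alive.incrementAndGet(); }   // call when a cnxn is accepted
    void connectionClosed() { alive.decrementAndGet(); }   // call when a cnxn is closed

    @Override
    public int getNumAliveConnections() { return alive.get(); }

    public static void main(String[] args) throws Exception {
        ConnectionStats stats = new ConnectionStats();
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        // Exposes NumAliveConnections as a JMX attribute for monitoring tools.
        mbs.registerMBean(stats, new ObjectName("org.example:type=ConnectionStats"));
        stats.connectionOpened();
        System.out.println(stats.getNumAliveConnections()); // prints 1
    }
}
{noformat}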

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1321) Add number of client connections metric in JMX and srvr

2011-12-28 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176676#comment-13176676
 ] 

Camille Fournier commented on ZOOKEEPER-1321:
-

Neha, if you want this in 3.4 will you make me a patch that applies to that 
branch? It's failing to apply for ZooKeeperServer. Thanks.

 Add number of client connections metric in JMX and srvr
 ---

 Key: ZOOKEEPER-1321
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1321
 Project: ZooKeeper
  Issue Type: Improvement
Affects Versions: 3.3.4, 3.4.2
Reporter: Neha Narkhede
Assignee: Neha Narkhede
  Labels: patch
 Attachments: ZOOKEEPER-1321_trunk.patch


 The related conversation on the zookeeper user mailing list is here - 
 http://apache.markmail.org/message/4jjcmooniowwugu2?q=+list:org.apache.hadoop.zookeeper-user
 It is useful to be able to monitor the number of disconnect operations on a 
 client. This is generally indicative of a client going through a large number 
 of GCs and hence disconnecting far too often from a ZooKeeper cluster. 
 Today, this information is only indirectly exposed as part of the stat 
 command, which requires counting the results. That's a lot of work for the 
 server to do just to get a connection count. 
 For monitoring purposes, it will be useful to have this exposed through JMX 
 and 4lw srvr.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1321) Add number of client connections metric in JMX and srvr

2011-12-28 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176834#comment-13176834
 ] 

Camille Fournier commented on ZOOKEEPER-1321:
-

Sounds good. Remove the TODO added in the Zab1_0Test too please! 

 Add number of client connections metric in JMX and srvr
 ---

 Key: ZOOKEEPER-1321
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1321
 Project: ZooKeeper
  Issue Type: Improvement
Affects Versions: 3.3.4, 3.4.2
Reporter: Neha Narkhede
Assignee: Neha Narkhede
  Labels: patch
 Attachments: ZOOKEEPER-1321_trunk.patch


 The related conversation on the zookeeper user mailing list is here - 
 http://apache.markmail.org/message/4jjcmooniowwugu2?q=+list:org.apache.hadoop.zookeeper-user
 It is useful to be able to monitor the number of disconnect operations on a 
 client. This is generally indicative of a client going through a large number 
 of GCs and hence disconnecting far too often from a ZooKeeper cluster. 
 Today, this information is only indirectly exposed as part of the stat 
 command, which requires counting the results. That's a lot of work for the 
 server to do just to get a connection count. 
 For monitoring purposes, it will be useful to have this exposed through JMX 
 and 4lw srvr.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1100) Killed (or missing) SendThread will cause hanging threads

2011-12-26 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175951#comment-13175951
 ] 

Camille Fournier commented on ZOOKEEPER-1100:
-

Seems like you all think this is a non-issue, so I will mark it as resolved. 
Please do feel free to re-open if you see the issue again.

 Killed (or missing) SendThread will cause hanging threads
 -

 Key: ZOOKEEPER-1100
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1100
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.3.3
 Environment: 
 http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E
Reporter: Gunnar Wagenknecht
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1100.patch, ZOOKEEPER-1100.patch


 After investigating an issue with [hanging 
 threads|http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E]
  I noticed that any java.lang.Error might silently kill the SendThread. 
 Without a SendThread, any thread that wants to send something will hang 
 forever. 
 Currently nobody will notice a SendThread that has died. I think at least a 
 state should be flipped (or a flag should be set) that causes all further 
 send attempts to fail or re-spins the connection loop.
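One shape the suggested flag-flip could take, as a hedged sketch; the class 
and method names are made up for illustration, not ZooKeeper's actual client 
internals:

{noformat}
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only: illustrates "flip a state/flag when the SendThread dies so
// later send attempts fail fast instead of hanging". Names are hypothetical.
public class SendLoop {
    private final AtomicBoolean sendThreadAlive = new AtomicBoolean(false);

    public void start() {
        Thread sender = new Thread(this::runLoop, "SendThread");
        // Even a java.lang.Error escaping runLoop lands here, so the flag
        // is cleared no matter how the thread dies.
        sender.setUncaughtExceptionHandler((t, e) -> sendThreadAlive.set(false));
        sendThreadAlive.set(true);
        sender.start();
    }

    private void runLoop() {
        try {
            // ... read/write socket, dispatch packets ...
        } finally {
            sendThreadAlive.set(false);
        }
    }

    public void send(byte[] packet) {
        if (!sendThreadAlive.get()) {
            throw new IllegalStateException("connection loop is dead; cannot send");
        }
        // ... enqueue packet for the send thread ...
    }
}
{noformat}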

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1333) NPE in FileTxnSnapLog when restarting a cluster

2011-12-21 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174244#comment-13174244
 ] 

Camille Fournier commented on ZOOKEEPER-1333:
-

The logic in FileTxnSnapLog has changed quite a bit, and I'm not sure whether 
the create check still makes sense with the new logic. The create-check logic 
was moved into DataTree, so I'm no longer entirely sure what I added the check 
in FileTxnSnapLog for.

 NPE in FileTxnSnapLog when restarting a cluster
 ---

 Key: ZOOKEEPER-1333
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1333
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.0
Reporter: Andrew McNair
Assignee: Patrick Hunt
Priority: Blocker
 Fix For: 3.4.2

 Attachments: ZOOKEEPER-1333.patch, ZOOKEEPER-1333.patch, 
 test_case.diff, test_case.diff


 I think an NPE was introduced by the fix for 
 https://issues.apache.org/jira/browse/ZOOKEEPER-1269
 Looking at DataTree.processTxn(TxnHeader header, Record txn), it seems likely 
 that if rc.err != Code.OK then rc.path will be null. 
 I'm currently working on a minimal test case for the bug; I'll attach it to 
 this issue when it's ready. 
 java.lang.NullPointerException
   at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:203)
   at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:150)
   at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
   at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:418)
   at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:410)
   at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
   at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
   at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
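A sketch of the defensive check implied by the description above, with 
simplified stand-in types rather than the actual FileTxnSnapLog code:

{noformat}
// Sketch of the failure mode described above, with simplified types:
// when rc.err != OK, rc.path can be null, so any caller that uses the
// path must guard on the error code first.
class ProcessTxnResult {
    int err;       // 0 == OK
    String path;   // may be null when err != 0
}

class TxnReplay {
    static final int OK = 0;

    void processTransaction(ProcessTxnResult rc) {
        if (rc.err != OK) {
            // A failed txn during replay: rc.path is unreliable (often null),
            // so bail out before any dereference -- the NPE in the trace came
            // from using the result of a failed txn.
            System.err.println("skipping failed txn, err=" + rc.err);
            return;
        }
        boolean absolute = rc.path.startsWith("/"); // safe: err == OK implies path != null
        System.out.println("applied txn, absolute path: " + absolute);
    }
}
{noformat}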

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1333) NPE in FileTxnSnapLog when restarting a cluster

2011-12-21 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174248#comment-13174248
 ] 

Camille Fournier commented on ZOOKEEPER-1333:
-

Ah, OK. Yeah, so if we put the create check in, we won't get that NONODE 
exception if the multi fails on it; that would be the only potential issue 
with this fix that I can see.

 NPE in FileTxnSnapLog when restarting a cluster
 ---

 Key: ZOOKEEPER-1333
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1333
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.0
Reporter: Andrew McNair
Assignee: Patrick Hunt
Priority: Blocker
 Fix For: 3.4.2

 Attachments: ZOOKEEPER-1333.patch, ZOOKEEPER-1333.patch, 
 test_case.diff, test_case.diff


 I think an NPE was introduced by the fix for 
 https://issues.apache.org/jira/browse/ZOOKEEPER-1269
 Looking at DataTree.processTxn(TxnHeader header, Record txn), it seems likely 
 that if rc.err != Code.OK then rc.path will be null. 
 I'm currently working on a minimal test case for the bug; I'll attach it to 
 this issue when it's ready. 
 java.lang.NullPointerException
   at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:203)
   at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:150)
   at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
   at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:418)
   at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:410)
   at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
   at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
   at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1333) NPE in FileTxnSnapLog when restarting a cluster

2011-12-21 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174255#comment-13174255
 ] 

Camille Fournier commented on ZOOKEEPER-1333:
-

But I'm pretty sure that NONODE exception itself was kind of a crazy sanity 
check of the "we should never reach this" sort. To get there you would have to 
be creating a child node that already exists, but with a parent that doesn't 
exist. So it's no surprise that we don't have a test for that case.

 NPE in FileTxnSnapLog when restarting a cluster
 ---

 Key: ZOOKEEPER-1333
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1333
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.0
Reporter: Andrew McNair
Assignee: Patrick Hunt
Priority: Blocker
 Fix For: 3.4.2

 Attachments: ZOOKEEPER-1333.patch, ZOOKEEPER-1333.patch, 
 test_case.diff, test_case.diff


 I think an NPE was introduced by the fix for 
 https://issues.apache.org/jira/browse/ZOOKEEPER-1269
 Looking at DataTree.processTxn(TxnHeader header, Record txn), it seems likely 
 that if rc.err != Code.OK then rc.path will be null. 
 I'm currently working on a minimal test case for the bug; I'll attach it to 
 this issue when it's ready. 
 java.lang.NullPointerException
   at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:203)
   at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:150)
   at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
   at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:418)
   at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:410)
   at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
   at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
   at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1202) Prevent certain state transitions in Java client on close(); improve exception handling and enhance client testability

2011-12-19 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172577#comment-13172577
 ] 

Camille Fournier commented on ZOOKEEPER-1202:
-

I think you might just need a longer TIMEOUT for that awaitTermination... the 
thread can sleep for up to 1s in the SendThread run loop before trying to 
reconnect, so on those slow build machines you might just need a bit more 
wiggle room. We don't see it even trying to connect until over 2s after the 
session was closed.
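The kind of headroom being suggested, sketched against a plain 
ExecutorService; the real test uses its own TIMEOUT constant, so this is only 
illustrative:

{noformat}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative only: give awaitTermination generous slack, since the send
// thread can sleep up to ~1s per reconnect attempt on a loaded build box.
public class ShutdownWait {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(() -> { /* ... client work ... */ });
        pool.shutdown();
        boolean done = pool.awaitTermination(10, TimeUnit.SECONDS); // not 1-2s
        System.out.println("terminated cleanly: " + done);
    }
}
{noformat}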

 Prevent certain state transitions in Java client on close(); improve 
 exception handling and enhance client testability
 --

 Key: ZOOKEEPER-1202
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1202
 Project: ZooKeeper
  Issue Type: Improvement
  Components: java client
Affects Versions: 3.4.0
Reporter: Matthias Spycher
Assignee: Matthias Spycher
 Attachments: ZOOKEEPER-1202.patch


 ZooKeeper.close() doesn't force the client into a CLOSED state. While the 
 closing flag ensures that the client will close, its state may end up in 
 CLOSED, CONNECTING or CONNECTED.
 I developed a patch and in the process cleaned up a few other things 
 primarily to enable testing of state transitions.
 - ClientCnxnState is new and enforces certain state transitions
 - ZooKeeper.isExpired() is new
 - ClientCnxn no longer refers to ZooKeeper, WatchManager is externalized, and 
 ClientWatchManager includes 3 new methods
 - The SendThread terminates the EventThread on a call to close() via the 
 event-of-death
 - Polymorphism is used to handle internal exceptions (SendIOExceptions)
 - The patch incorporates ZOOKEEPER-126.patch and prevents close() from 
 blocking
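A sketch of what an enum-based state machine like ClientCnxnState can enforce; 
the transition table below is invented for illustration and is not the patch's 
actual table:

{noformat}
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

// Illustrative transition table only -- not the actual ClientCnxnState.
public class StateMachine {
    enum State { CONNECTING, CONNECTED, CLOSING, CLOSED }

    private static final Map<State, EnumSet<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.CONNECTING, EnumSet.of(State.CONNECTED, State.CLOSING));
        ALLOWED.put(State.CONNECTED,  EnumSet.of(State.CONNECTING, State.CLOSING));
        // Once closing, the only legal move is CLOSED -- the invariant the
        // issue wants: close() can never land back in CONNECTING/CONNECTED.
        ALLOWED.put(State.CLOSING,    EnumSet.of(State.CLOSED));
        ALLOWED.put(State.CLOSED,     EnumSet.noneOf(State.class));
    }

    private State current = State.CONNECTING;

    synchronized void transition(State next) {
        if (!ALLOWED.get(current).contains(next)) {
            throw new IllegalStateException(current + " -> " + next + " not allowed");
        }
        current = next;
    }
}
{noformat}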

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1269) Multi deserialization issues

2011-12-04 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13162436#comment-13162436
 ] 

Camille Fournier commented on ZOOKEEPER-1269:
-

This is a reasonably big bug to leave outstanding for this long; can someone 
please review this and check it in?

 Multi deserialization issues
 

 Key: ZOOKEEPER-1269
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1269
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.0
Reporter: Camille Fournier
Assignee: Camille Fournier
 Fix For: 3.5.0, 3.4.1

 Attachments: ZOOKEEPER-1269.patch


 From the mailing list:
 FileTxnSnapLog.restore contains a code block handling a NODEEXISTS failure 
 during deserialization. The problem is explained there in a code comment. The 
 code block however is only executed for a CREATE txn, not for a multiTxn 
 containing a CREATE.
 Even if the mentioned code block were also executed for multi transactions, 
 it would need adapting for them. What if, after the first failed transaction 
 in a multi txn during deserialization, there were subsequent transactions in 
 the same multi that would also have failed? 
 We can't know, since the first failed transaction hides the information about 
 the remaining transactions.
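A sketch of the shape of the adaptation being discussed, using simplified 
stand-in types rather than the real FileTxnSnapLog/MultiTxn code: the 
NODEEXISTS special case has to run for every CREATE inside a multi, not just 
for top-level creates.

{noformat}
import java.util.List;

// Simplified stand-ins for TxnHeader/Record; not the real classes.
class Txn {
    String type;   // "create", "setData", ...
    int err;       // per-sub-txn result code; 0 == OK
}

class RestoreSketch {
    static final int NODEEXISTS = -110;   // KeeperException.Code.NODEEXISTS

    // The existing special case handles a bare create; a multi has to be
    // unpacked so the same check runs for every CREATE inside it.
    void handle(Txn txn, List<Txn> subTxns) {
        if (subTxns == null) {            // plain txn
            createCheck(txn);
            return;
        }
        for (Txn sub : subTxns) {         // multi: check each sub-txn
            createCheck(sub);
        }
    }

    private void createCheck(Txn t) {
        if ("create".equals(t.type) && t.err == NODEEXISTS) {
            // same recovery as the existing restore() code block
            System.err.println("ignoring NODEEXISTS during replay");
        }
    }
}
{noformat}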

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1239) add logging/stats to identify fsync stalls

2011-11-15 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150653#comment-13150653
 ] 

Camille Fournier commented on ZOOKEEPER-1239:
-

Are you sure we should be doing this timing using System.nanoTime?

 add logging/stats to identify fsync stalls
 --

 Key: ZOOKEEPER-1239
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1239
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Reporter: Patrick Hunt
Assignee: Patrick Hunt
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1239_br33.patch, ZOOKEEPER-1239_br34.patch


 We don't have any logging to identify fsync stalls. It's a somewhat common 
 occurrence (after gc/swap issues) when trying to diagnose pipeline stalls - 
 where outstanding requests start piling up and operational latency increases. 
 We should have some sort of logging around this, e.g. if the fsync time 
 exceeds some limit, then log a warning, something like that. 
 It would also be useful to publish stat information related to this: 
 min/avg/max latency for fsync. 
 This should also be exposed through JMX.
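A sketch of the warn-on-slow-fsync idea; the threshold and class name are 
illustrative, and elapsed time is measured with nanoTime, the point debated in 
the comments in this thread:

{noformat}
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Illustrative only: time each fsync and warn when it exceeds a threshold.
public class FsyncTimer {
    private static final long WARN_THRESHOLD_MS = 1000; // hypothetical limit
    private long minMs = Long.MAX_VALUE, maxMs, totalMs, count;

    void fsync(FileOutputStream fos) throws IOException {
        long start = System.nanoTime();   // monotonic, safe for elapsed time
        fos.getChannel().force(false);    // the actual fsync
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        // stats that could be exposed via JMX: min/avg/max fsync latency
        minMs = Math.min(minMs, elapsedMs);
        maxMs = Math.max(maxMs, elapsedMs);
        totalMs += elapsedMs;
        count++;

        if (elapsedMs > WARN_THRESHOLD_MS) {
            System.err.println("fsync took " + elapsedMs
                    + "ms, exceeding warn threshold of " + WARN_THRESHOLD_MS + "ms");
        }
    }
}
{noformat}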

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1239) add logging/stats to identify fsync stalls

2011-11-15 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150657#comment-13150657
 ] 

Camille Fournier commented on ZOOKEEPER-1239:
-

Eh, I guess the popular consensus has changed on using nanoTime for this sort 
of thing, so disregard my question. I'll put this in shortly.

 add logging/stats to identify fsync stalls
 --

 Key: ZOOKEEPER-1239
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1239
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Reporter: Patrick Hunt
Assignee: Patrick Hunt
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1239_br33.patch, ZOOKEEPER-1239_br34.patch


 We don't have any logging to identify fsync stalls. It's a somewhat common 
 occurrence (after gc/swap issues) when trying to diagnose pipeline stalls - 
 where outstanding requests start piling up and operational latency increases. 
 We should have some sort of logging around this, e.g. if the fsync time 
 exceeds some limit, then log a warning, something like that. 
 It would also be useful to publish stat information related to this: 
 min/avg/max latency for fsync. 
 This should also be exposed through JMX.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1208) Ephemeral node not removed after the client session is long gone

2011-11-14 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149714#comment-13149714
 ] 

Camille Fournier commented on ZOOKEEPER-1208:
-

Actually, I'm not sure... are these useful at all? I'd rather not see printlns 
in test output unless it's really useful, but in the case of this test I'm not 
sure I can tell...

 Ephemeral node not removed after the client session is long gone
 

 Key: ZOOKEEPER-1208
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1208
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.3.3
Reporter: kishore gopalakrishna
Assignee: Patrick Hunt
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1208_br33.patch, ZOOKEEPER-1208_br33.patch, 
 ZOOKEEPER-1208_br34.patch, ZOOKEEPER-1208_trunk.patch


 Copying from email thread.
 We found our ZK server in a state where an ephemeral node still exists after
 a client session is long gone. I used the cons command on each ZK host to
 list all connections and couldn't find the ephemeralOwner id. We are using
 ZK 3.3.3. Has anyone seen this problem?
 I got the following information from the logs.
 The node that still exists is 
 /kafka-tracking/consumers/UserPerformanceEvent-host/owners/UserPerformanceEvent/529-7
 I saw that the ephemeral owner is 86167322861045079 which is session id 
 0x13220b93e610550.
 After searching the transaction log of one of the ZK servers, I found that 
 the session expired: 
 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 
 closeSession null
 On digging further into the logs I found that there were multiple sessions 
 created in quick succession, and every session tried to create the same node. 
 But I verified that the sessions were closed and opened in order:
 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x0 zxid 0x601bd36b5 
 createSession 6000
 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 
 closeSession null
 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x0 zxid 0x601bd36f8 
 createSession 6000
 9/22/11 12:17:59 PM PDT session 0x13220b93e610551 cxid 0x74 zxid 0x601bd373a 
 closeSession null
 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x0 zxid 0x601bd373e 
 createSession 6000
 9/22/11 12:18:01 PM PDT session 0x13220b93e610552 cxid 0x6c zxid 0x601bd37a0 
 closeSession null
 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x0 zxid 0x601bd37e9 
 createSession 6000
 9/22/11 12:18:03 PM PDT session 0x13220b93e610553 cxid 0x74 zxid 0x601bd382b 
 closeSession null
 9/22/11 12:18:04 PM PDT session 0x13220b93e610554 cxid 0x0 zxid 0x601bd383c 
 createSession 6000
 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x6a zxid 0x601bd388f 
 closeSession null
 9/22/11 12:18:06 PM PDT session 0x13220b93e610555 cxid 0x0 zxid 0x601bd3895 
 createSession 6000
 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x6a zxid 0x601bd38cd 
 closeSession null
 9/22/11 12:18:10 PM PDT session 0x13220b93e610556 cxid 0x0 zxid 0x601bd38d1 
 createSession 6000
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x0 zxid 0x601bd38f2 
 createSession 6000
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x51 zxid 0x601bd396a 
 closeSession null
 Here is the log output for the sessions that tried creating the same node
 9/22/11 12:17:54 PM PDT session 0x13220b93e61054f cxid 0x42 zxid 0x601bd366b 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x42 zxid 0x601bd36ce 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x42 zxid 0x601bd3711 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x42 zxid 0x601bd3777 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x42 zxid 0x601bd3802 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x44 zxid 0x601bd385d 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x44 zxid 0x601bd38b0 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x52 zxid 0x601bd396b 
 create 
 

[jira] [Commented] (ZOOKEEPER-1208) Ephemeral node not removed after the client session is long gone

2011-11-14 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149829#comment-13149829
 ] 

Camille Fournier commented on ZOOKEEPER-1208:
-

Committed to 3.4 and trunk, will get 3.3.4 in a second. Mahadev, feel free to 
cut another 3.4 RC whenever.

 Ephemeral node not removed after the client session is long gone
 

 Key: ZOOKEEPER-1208
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1208
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.3.3
Reporter: kishore gopalakrishna
Assignee: Patrick Hunt
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1208_br33.patch, ZOOKEEPER-1208_br33.patch, 
 ZOOKEEPER-1208_br34.patch, ZOOKEEPER-1208_trunk.patch


 Copying from email thread.
 We found our ZK server in a state where an ephemeral node still exists after
 a client session is long gone. I used the cons command on each ZK host to
 list all connections and couldn't find the ephemeralOwner id. We are using
 ZK 3.3.3. Has anyone seen this problem?
 I got the following information from the logs.
 The node that still exists is 
 /kafka-tracking/consumers/UserPerformanceEvent-host/owners/UserPerformanceEvent/529-7
 I saw that the ephemeral owner is 86167322861045079 which is session id 
 0x13220b93e610550.
 After searching the transaction log of one of the ZK servers, I found that 
 the session expired: 
 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 
 closeSession null
 On digging further into the logs I found that there were multiple sessions 
 created in quick succession, and every session tried to create the same node. 
 But I verified that the sessions were closed and opened in order:
 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x0 zxid 0x601bd36b5 
 createSession 6000
 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 
 closeSession null
 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x0 zxid 0x601bd36f8 
 createSession 6000
 9/22/11 12:17:59 PM PDT session 0x13220b93e610551 cxid 0x74 zxid 0x601bd373a 
 closeSession null
 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x0 zxid 0x601bd373e 
 createSession 6000
 9/22/11 12:18:01 PM PDT session 0x13220b93e610552 cxid 0x6c zxid 0x601bd37a0 
 closeSession null
 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x0 zxid 0x601bd37e9 
 createSession 6000
 9/22/11 12:18:03 PM PDT session 0x13220b93e610553 cxid 0x74 zxid 0x601bd382b 
 closeSession null
 9/22/11 12:18:04 PM PDT session 0x13220b93e610554 cxid 0x0 zxid 0x601bd383c 
 createSession 6000
 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x6a zxid 0x601bd388f 
 closeSession null
 9/22/11 12:18:06 PM PDT session 0x13220b93e610555 cxid 0x0 zxid 0x601bd3895 
 createSession 6000
 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x6a zxid 0x601bd38cd 
 closeSession null
 9/22/11 12:18:10 PM PDT session 0x13220b93e610556 cxid 0x0 zxid 0x601bd38d1 
 createSession 6000
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x0 zxid 0x601bd38f2 
 createSession 6000
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x51 zxid 0x601bd396a 
 closeSession null
 Here is the log output for the sessions that tried creating the same node
 9/22/11 12:17:54 PM PDT session 0x13220b93e61054f cxid 0x42 zxid 0x601bd366b 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x42 zxid 0x601bd36ce 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x42 zxid 0x601bd3711 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x42 zxid 0x601bd3777 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x42 zxid 0x601bd3802 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x44 zxid 0x601bd385d 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x44 zxid 0x601bd38b0 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x52 zxid 0x601bd396b 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 Let me know if you need 

[jira] [Commented] (ZOOKEEPER-1208) Ephemeral node not removed after the client session is long gone

2011-11-11 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13148517#comment-13148517
 ] 

Camille Fournier commented on ZOOKEEPER-1208:
-

I like the fix, Pat.

 Ephemeral node not removed after the client session is long gone
 

 Key: ZOOKEEPER-1208
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1208
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.3.3
Reporter: kishore gopalakrishna
Assignee: Patrick Hunt
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1208_br33.patch, ZOOKEEPER-1208_br33.patch


 Copying from email thread.
 We found our ZK server in a state where an ephemeral node still exists after
 a client session is long gone. I used the cons command on each ZK host to
 list all connections and couldn't find the ephemeralOwner id. We are using
 ZK 3.3.3. Has anyone seen this problem?
 I got the following information from the logs.
 The node that still exists is 
 /kafka-tracking/consumers/UserPerformanceEvent-host/owners/UserPerformanceEvent/529-7
 I saw that the ephemeral owner is 86167322861045079 which is session id 
 0x13220b93e610550.
 After searching the transaction log of one of the ZK servers, I found that 
 the session expired: 
 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 
 closeSession null
 On digging further into the logs I found that there were multiple sessions 
 created in quick succession, and every session tried to create the same node. 
 But I verified that the sessions were closed and opened in order:
 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x0 zxid 0x601bd36b5 
 createSession 6000
 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 
 closeSession null
 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x0 zxid 0x601bd36f8 
 createSession 6000
 9/22/11 12:17:59 PM PDT session 0x13220b93e610551 cxid 0x74 zxid 0x601bd373a 
 closeSession null
 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x0 zxid 0x601bd373e 
 createSession 6000
 9/22/11 12:18:01 PM PDT session 0x13220b93e610552 cxid 0x6c zxid 0x601bd37a0 
 closeSession null
 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x0 zxid 0x601bd37e9 
 createSession 6000
 9/22/11 12:18:03 PM PDT session 0x13220b93e610553 cxid 0x74 zxid 0x601bd382b 
 closeSession null
 9/22/11 12:18:04 PM PDT session 0x13220b93e610554 cxid 0x0 zxid 0x601bd383c 
 createSession 6000
 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x6a zxid 0x601bd388f 
 closeSession null
 9/22/11 12:18:06 PM PDT session 0x13220b93e610555 cxid 0x0 zxid 0x601bd3895 
 createSession 6000
 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x6a zxid 0x601bd38cd 
 closeSession null
 9/22/11 12:18:10 PM PDT session 0x13220b93e610556 cxid 0x0 zxid 0x601bd38d1 
 createSession 6000
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x0 zxid 0x601bd38f2 
 createSession 6000
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x51 zxid 0x601bd396a 
 closeSession null
 Here is the log output for the sessions that tried creating the same node
 9/22/11 12:17:54 PM PDT session 0x13220b93e61054f cxid 0x42 zxid 0x601bd366b 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x42 zxid 0x601bd36ce 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x42 zxid 0x601bd3711 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x42 zxid 0x601bd3777 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x42 zxid 0x601bd3802 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x44 zxid 0x601bd385d 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x44 zxid 0x601bd38b0 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x52 zxid 0x601bd396b 
 create 
 '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
 Let me know if you need additional information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA 

[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.

2011-11-04 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144378#comment-13144378
 ] 

Camille Fournier commented on ZOOKEEPER-1270:
-

2 acks is expected. This threw me the first time I saw it in the code, but 
it's right as far as I could reason from looking through the follower and 
leader code: the first ack is after NEWLEADER, the second is right before we 
start the zk server.

 testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
 -

 Key: ZOOKEEPER-1270
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1270
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Reporter: Patrick Hunt
Priority: Blocker
 Fix For: 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1270tests.patch, ZOOKEEPER-1270tests2.patch, 
 testEarlyLeaderAbandonment.txt.gz, testEarlyLeaderAbandonment2.txt.gz, 
 testEarlyLeaderAbandonment3.txt.gz, testEarlyLeaderAbandonment4.txt.gz


 Looks pretty serious - quorum is formed but no clients can attach. Will 
 attach logs momentarily.
 This test was introduced in the following commit (all three jira commit at 
 once):
 ZOOKEEPER-335. zookeeper servers should commit the new leader txn to their 
 logs.
 ZOOKEEPER-1081. modify leader/follower code to correctly deal with new leader
 ZOOKEEPER-1082. modify leader election to correctly take into account current

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.

2011-11-04 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144562#comment-13144562
 ] 

Camille Fournier commented on ZOOKEEPER-1270:
-

If readyToStart becomes unused with this patch, can we please go ahead and 
remove it?

 testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
 -

 Key: ZOOKEEPER-1270
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1270
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Reporter: Patrick Hunt
Assignee: Flavio Junqueira
Priority: Blocker
 Fix For: 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1270-and-1194.patch, 
 ZOOKEEPER-1270-and-1194.patch, ZOOKEEPER-1270.patch, ZOOKEEPER-1270.patch, 
 ZOOKEEPER-1270_br34.patch, ZOOKEEPER-1270tests.patch, 
 ZOOKEEPER-1270tests2.patch, testEarlyLeaderAbandonment.txt.gz, 
 testEarlyLeaderAbandonment2.txt.gz, testEarlyLeaderAbandonment3.txt.gz, 
 testEarlyLeaderAbandonment4.txt.gz


 Looks pretty serious - quorum is formed but no clients can attach. Will 
 attach logs momentarily.
 This test was introduced in the following commit (all three jira commit at 
 once):
 ZOOKEEPER-335. zookeeper servers should commit the new leader txn to their 
 logs.
 ZOOKEEPER-1081. modify leader/follower code to correctly deal with new leader
 ZOOKEEPER-1082. modify leader election to correctly take into account current

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-11-04 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144573#comment-13144573
 ] 

Camille Fournier commented on ZOOKEEPER-1264:
-

Oh, now I see: 1192 introduced fixes into leader election that added stuff to 
the Zab1_0Test that I missed. Why in the world do we have leader election bugs 
going only into trunk instead of into 3.4 as well??? Not good.

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264-branch34.patch, 
 ZOOKEEPER-1264-merge.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, 
 ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, 
 ZOOKEEPER-1264_branch34.patch, ZOOKEEPER-1264unittest.patch, 
 ZOOKEEPER-1264unittest.patch, followerresyncfailure_log.txt.gz, logs.zip, 
 tmp.zip


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:11741 but was:14001
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-11-03 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143198#comment-13143198
 ] 

Camille Fournier commented on ZOOKEEPER-1264:
-

Ben, just two questions:
Does this logic really only apply to FollowerZooKeeperServers, or should 
observers also do this?

Why does the replaying of these txns to the log come after we start the zk 
server instead of before?

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264-merge.patch, ZOOKEEPER-1264.patch, 
 ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, 
 ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, 
 ZOOKEEPER-1264unittest.patch, ZOOKEEPER-1264unittest.patch, 
 followerresyncfailure_log.txt.gz, logs.zip, tmp.zip


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:11741 but was:14001
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.

2011-11-03 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143542#comment-13143542
 ] 

Camille Fournier commented on ZOOKEEPER-1270:
-

There's some extraneous stuff in ClientBase, but if anyone can repro this bug 
locally and run it with this stack tracing on, that would be useful.

 testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
 -

 Key: ZOOKEEPER-1270
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1270
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Reporter: Patrick Hunt
Priority: Blocker
 Fix For: 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1270tests.patch, ZOOKEEPER-1270tests2.patch, 
 testEarlyLeaderAbandonment.txt.gz, testEarlyLeaderAbandonment2.txt.gz


 Looks pretty serious - quorum is formed but no clients can attach. Will 
 attach logs momentarily.
 This test was introduced in the following commit (all three jira commit at 
 once):
 ZOOKEEPER-335. zookeeper servers should commit the new leader txn to their 
 logs.
 ZOOKEEPER-1081. modify leader/follower code to correctly deal with new leader
 ZOOKEEPER-1082. modify leader election to correctly take into account current

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.

2011-11-03 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143602#comment-13143602
 ] 

Camille Fournier commented on ZOOKEEPER-1270:
-

It seems to me that everything comes up OK: it starts the election process, 
elects a leader, and gets a snapshot from the leader. But in the logs where 
you have 2 followers very closely synched in time (never on my local box, but 
it seems to happen on the build boxes occasionally), after the followers have 
claimed to write a snapshot to disk (which means they must have gotten the 
NEWLEADER message) the whole process then stops, and you see no logs from the 
leader indicating it ran processAck for either follower. It feels to me like 
it could be a race condition in the leader somewhere, causing it to somehow 
miss that ACK, but I can't seem to find it. Nothing in the diffs from the 
checkin related to ZAB 1.0 seems to be much of a culprit... I'm a bit stumped 
but going to keep looking.

 testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
 -

 Key: ZOOKEEPER-1270
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1270
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Reporter: Patrick Hunt
Priority: Blocker
 Fix For: 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1270tests.patch, ZOOKEEPER-1270tests2.patch, 
 testEarlyLeaderAbandonment.txt.gz, testEarlyLeaderAbandonment2.txt.gz


 Looks pretty serious - quorum is formed but no clients can attach. Will 
 attach logs momentarily.
 This test was introduced in the following commit (all three jira commit at 
 once):
 ZOOKEEPER-335. zookeeper servers should commit the new leader txn to their 
 logs.
 ZOOKEEPER-1081. modify leader/follower code to correctly deal with new leader
 ZOOKEEPER-1082. modify leader election to correctly take into account current

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.

2011-11-03 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143690#comment-13143690
 ] 

Camille Fournier commented on ZOOKEEPER-1270:
-

Looking at this some more, I'm not entirely convinced it isn't a timing issue:
{quote}
I'm skeptical about it being a timing issue because we wait 10 seconds for the 
waitForAll call to complete, but I'm not sure whether this is completely 
unrealistic, assuming that the jenkins machine is overloaded.
{quote}
I actually have the startup and shutdown running in a loop on my box. The one 
time I managed to get it to fail was due to 10 seconds not being a long enough 
wait time. The servers were almost up, in fact, but election just took a 
little while, as did snapshotting, etc., and it never succeeded. 

 testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
 -

 Key: ZOOKEEPER-1270
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1270
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Reporter: Patrick Hunt
Priority: Blocker
 Fix For: 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1270tests.patch, ZOOKEEPER-1270tests2.patch, 
 testEarlyLeaderAbandonment.txt.gz, testEarlyLeaderAbandonment2.txt.gz


 Looks pretty serious - quorum is formed but no clients can attach. Will 
 attach logs momentarily.
 This test was introduced in the following commit (all three jira commit at 
 once):
 ZOOKEEPER-335. zookeeper servers should commit the new leader txn to their 
 logs.
 ZOOKEEPER-1081. modify leader/follower code to correctly deal with new leader
 ZOOKEEPER-1082. modify leader election to correctly take into account current

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-11-03 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143714#comment-13143714
 ] 

Camille Fournier commented on ZOOKEEPER-1264:
-

Ok, I think this is all fine. I will check this in.

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264-merge.patch, ZOOKEEPER-1264.patch, 
 ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, 
 ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, 
 ZOOKEEPER-1264unittest.patch, ZOOKEEPER-1264unittest.patch, 
 followerresyncfailure_log.txt.gz, logs.zip, tmp.zip


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:11741 but was:14001
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-11-02 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142211#comment-13142211
 ] 

Camille Fournier commented on ZOOKEEPER-1264:
-

Because when the follower writes a new log file without snapshotting the old 
transactions, on restart ZK thinks it has the transactions up to the zxid in 
the log file. The fact that these transactions were never written to a log or 
snapshot by the follower is not captured. We got a NEWLEADER and took a 
snapshot, then got a bunch of txns that went directly to our data tree, then 
got UPTODATE, then some other new transactions that caused the creation of a 
brand-new log file. The intermediate transactions between NEWLEADER and 
UPTODATE are never written to a persistent store on the follower unless it 
manages to stay alive long enough to take another snapshot.
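A sketch of the invariant being described, with simplified stand-ins for the 
follower sync path (not the real Learner/FileTxnLog code): anything applied to 
the in-memory tree between NEWLEADER and UPTODATE must also hit the 
transaction log before the follower starts taking new txns, or a restart 
overstates what was persisted.

{noformat}
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins; the real code lives in the follower sync path.
class PendingTxn { long zxid; }

class SyncSketch {
    private final List<PendingTxn> applied = new ArrayList<>(); // in-memory tree only
    private final List<PendingTxn> logged  = new ArrayList<>(); // durable txn log

    void betweenNewLeaderAndUpToDate(PendingTxn t) {
        applied.add(t); // goes straight to the data tree during sync
    }

    void onUpToDate() {
        // The fix's shape: persist the buffered txns before serving, so the
        // log never claims a zxid whose txns were only ever in memory.
        logged.addAll(applied);
        applied.clear();
    }
}
{noformat}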

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, 
 ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, 
 ZOOKEEPER-1264unittest.patch, ZOOKEEPER-1264unittest.patch, 
 followerresyncfailure_log.txt.gz, logs.zip, tmp.zip


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:11741 but was:14001
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1269) Multi deserialization issues

2011-11-02 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142225#comment-13142225
 ] 

Camille Fournier commented on ZOOKEEPER-1269:
-

I think it should go into both, since it is a bug with multi.

 Multi deserialization issues
 

 Key: ZOOKEEPER-1269
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1269
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.0
Reporter: Camille Fournier
Assignee: Camille Fournier
 Attachments: ZOOKEEPER-1269.patch


 From the mailing list:
 FileTxnSnapLog.restore contains a code block handling a NODEEXISTS failure 
 during deserialization. The problem is explained there in a code comment. The 
 code block however is only executed for a CREATE txn, not for a multiTxn 
 containing a CREATE.
 Even if the mentioned code block were also executed for multi transactions, 
 it would need adapting for them. What if, after the first failed transaction 
 in a multi txn during deserialization, there were subsequent transactions in 
 the same multi that would also have failed? 
 We can't know, since the first failed transaction hides the information about 
 the remaining transactions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-11-02 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142274#comment-13142274
 ] 

Camille Fournier commented on ZOOKEEPER-1264:
-

Seems to work. I want to go ahead and put in the additional changes to 
FollowerResyncConcurrencyTest along with your patch after I finish reviewing 
it. Theoretically they aren't needed but given how many times this test has 
caught issues I think it's worth it to double-test this stuff. Let me know if 
you disagree.

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, 
 ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, 
 ZOOKEEPER-1264_branch34.patch, ZOOKEEPER-1264unittest.patch, 
 ZOOKEEPER-1264unittest.patch, followerresyncfailure_log.txt.gz, logs.zip, 
 tmp.zip


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 I saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:<11741> but was:<14001>
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-11-02 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142373#comment-13142373
 ] 

Camille Fournier commented on ZOOKEEPER-1264:
-

Yup, will do ASAP (which might be early this evening, but I'll try to get it 
in a few mins).

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, 
 ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, 
 ZOOKEEPER-1264_branch34.patch, ZOOKEEPER-1264unittest.patch, 
 ZOOKEEPER-1264unittest.patch, followerresyncfailure_log.txt.gz, logs.zip, 
 tmp.zip


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 I saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:<11741> but was:<14001>
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1246) Dead code in PrepRequestProcessor catch Exception block

2011-11-02 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142392#comment-13142392
 ] 

Camille Fournier commented on ZOOKEEPER-1246:
-

Thomas, a bit of feedback. This is unnecessarily aggressive and annoying, and 
coming after I smacked you down for not writing tests for your own bugfixes, it 
makes you look incredibly petty and insecure. It is perfectly fair of you to 
point out that I added an eclipse warning (guilty as charged, but if you really 
care about these you need to make the build fail when additional warnings are 
added). And yes, the formatting is not perfect. But as to most of the rest of 
your points, you can frankly go to hell if you think I'm going to tolerate 
being condescended to in this manner. You had the opportunity to fix this bug 
yourself when you reported it. Instead, you pranced off to work on your own 
thing and left it to me to debug and provide a fix. Now that the fix is done 
and somehow not to your liking, the best you could hope for here is to request 
a fix for the warning and formatting errors, and otherwise submit a new patch 
as a refactor. 

I'm closing this back up, and you are welcome to open a new issue with 
formatting fixes/refactors on it if you so choose. But it is certainly not a 
critical bug any longer.

 Dead code in PrepRequestProcessor catch Exception block
 ---

 Key: ZOOKEEPER-1246
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1246
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Thomas Koch
Assignee: Thomas Koch
Priority: Blocker
 Fix For: 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1246.patch, ZOOKEEPER-1246.patch, 
 ZOOKEEPER-1246.patch, ZOOKEEPER-1246.patch, ZOOKEEPER-1246_trunk.patch, 
 ZOOKEEPER-1246_trunk.patch


 This is a regression introduced by ZOOKEEPER-965 (multi transactions). The 
 catch(Exception e) block in PrepRequestProcessor.pRequest contains an if 
 block with condition request.getHdr() != null. This condition will always 
 evaluate to false since the changes in ZOOKEEPER-965.
 This is caused by a change in sequence: Before ZK-965, the txnHeader was set 
 _before_ the deserialization of the request. Afterwards the deserialization 
 happens before request.setHdr is set. So the following RequestProcessors 
 won't see the request as a failed one but as a Read request, since it doesn't 
 have a hdr set.
 Notes:
 - it is very bad practice to catch Exception. The block should rather catch 
 IOException
 - The check whether the TxnHeader is set in the request is used at several 
 places to see whether the request is a read or write request. It isn't 
 obvious to a newbie what it means for a request to have a hdr set or not.
 - at the beginning of pRequest the hdr and txn of request are set to null. 
 However there is no chance that these fields could ever not be null at this 
 point. The code however suggests that this could be the case. There should 
 rather be an assertion that confirms that these fields are indeed null. The 
 practice of doing things just in case, even if there is no chance that this 
 case could happen, is a very stinky code smell and means that the code isn't 
 understandable or trustworthy.
 - The multi transaction switch case block in pRequest is very hard to read, 
 because it misuses the request.{hdr|txn} fields as local variables.
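
For illustration, a compressed paraphrase of the ordering problem described 
above; this is not the real PrepRequestProcessor, and deserialize/process and 
the hdr/txn fields are stand-ins for request.setHdr and friends.
{noformat}
import java.io.IOException;

import org.apache.jute.Record;
import org.apache.zookeeper.KeeperException.Code;
import org.apache.zookeeper.ZooDefs.OpCode;
import org.apache.zookeeper.txn.ErrorTxn;
import org.apache.zookeeper.txn.TxnHeader;

class PRequestOrderingSketch {
    TxnHeader hdr;  // stand-in for request.hdr
    Record txn;     // stand-in for request.txn

    void pRequest(Object request) {
        hdr = null;
        txn = null;
        try {
            // Since ZOOKEEPER-965 the record is deserialized *before* the
            // header is set, so a deserialization failure...
            Record record = deserialize(request);  // may throw IOException
            hdr = new TxnHeader();                 // ...never reaches this line,
            process(record);
        } catch (Exception e) {
            // ...which leaves hdr == null and makes this branch dead code;
            // downstream processors then see a request with no hdr and treat
            // it as a read instead of a failed write.
            if (hdr != null) {
                hdr.setType(OpCode.error);
                txn = new ErrorTxn(Code.MARSHALLINGERROR.intValue());
            }
        }
    }

    Record deserialize(Object request) throws IOException { return null; } // stand-in
    void process(Record record) { }                                        // stand-in
}
{noformat}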

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1246) Dead code in PrepRequestProcessor catch Exception block

2011-11-01 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141191#comment-13141191
 ] 

Camille Fournier commented on ZOOKEEPER-1246:
-

Will do.

 Dead code in PrepRequestProcessor catch Exception block
 ---

 Key: ZOOKEEPER-1246
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1246
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Thomas Koch
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1246.patch, ZOOKEEPER-1246.patch, 
 ZOOKEEPER-1246_trunk.patch, ZOOKEEPER-1246_trunk.patch


 This is a regression introduced by ZOOKEEPER-965 (multi transactions). The 
 catch(Exception e) block in PrepRequestProcessor.pRequest contains an if 
 block with condition request.getHdr() != null. This condition will always 
 evaluate to false since the changes in ZOOKEEPER-965.
 This is caused by a change in sequence: Before ZK-965, the txnHeader was set 
 _before_ the deserialization of the request. Afterwards the deserialization 
 happens before request.setHdr is set. So the following RequestProcessors 
 won't see the request as a failed one but as a Read request, since it doesn't 
 have a hdr set.
 Notes:
 - it is very bad practice to catch Exception. The block should rather catch 
 IOException
 - The check whether the TxnHeader is set in the request is used at several 
 places to see whether the request is a read or write request. It isn't 
 obvious to a newbie what it means for a request to have a hdr set or not.
 - at the beginning of pRequest the hdr and txn of request are set to null. 
 However there is no chance that these fields could ever not be null at this 
 point. The code however suggests that this could be the case. There should 
 rather be an assertion that confirms that these fields are indeed null. The 
 practice of doing things just in case, even if there is no chance that this 
 case could happen, is a very stinky code smell and means that the code isn't 
 understandable or trustworthy.
 - The multi transaction switch case block in pRequest is very hard to read, 
 because it misuses the request.{hdr|txn} fields as local variables.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1246) Dead code in PrepRequestProcessor catch Exception block

2011-11-01 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141205#comment-13141205
 ] 

Camille Fournier commented on ZOOKEEPER-1246:
-

Oh brilliant, yet another refactoring blew away the trunk patch here. 

 Dead code in PrepRequestProcessor catch Exception block
 ---

 Key: ZOOKEEPER-1246
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1246
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Thomas Koch
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1246.patch, ZOOKEEPER-1246.patch, 
 ZOOKEEPER-1246_trunk.patch, ZOOKEEPER-1246_trunk.patch


 This is a regression introduced by ZOOKEEPER-965 (multi transactions). The 
 catch(Exception e) block in PrepRequestProcessor.pRequest contains an if 
 block with condition request.getHdr() != null. This condition will always 
 evaluate to false since the changes in ZOOKEEPER-965.
 This is caused by a change in sequence: Before ZK-965, the txnHeader was set 
 _before_ the deserialization of the request. Afterwards the deserialization 
 happens before request.setHdr is set. So the following RequestProcessors 
 won't see the request as a failed one but as a Read request, since it doesn't 
 have a hdr set.
 Notes:
 - it is very bad practice to catch Exception. The block should rather catch 
 IOException
 - The check whether the TxnHeader is set in the request is used at several 
 places to see whether the request is a read or write request. It isn't 
 obvious to a newbie what it means for a request to have a hdr set or not.
 - at the beginning of pRequest the hdr and txn of request are set to null. 
 However there is no chance that these fields could ever not be null at this 
 point. The code however suggests that this could be the case. There should 
 rather be an assertion that confirms that these fields are indeed null. The 
 practice of doing things just in case, even if there is no chance that this 
 case could happen, is a very stinky code smell and means that the code isn't 
 understandable or trustworthy.
 - The multi transaction switch case block in pRequest is very hard to read, 
 because it misuses the request.{hdr|txn} fields as local variables.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1269) Multi deserialization issues

2011-11-01 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141211#comment-13141211
 ] 

Camille Fournier commented on ZOOKEEPER-1269:
-

Hey guys, someone want to review and commit this? Looks like we got the OK from 
the multi folks.

 Multi deserialization issues
 

 Key: ZOOKEEPER-1269
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1269
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.0
Reporter: Camille Fournier
Assignee: Camille Fournier
 Attachments: ZOOKEEPER-1269.patch


 From the mailing list:
 FileTxnSnapLog.restore contains a code block handling a NODEEXISTS failure 
 during deserialization. The problem is explained there in a code comment. The 
 code block however is only executed for a CREATE txn, not for a multiTxn 
 containing a CREATE.
 Even if the mentioned code block would also be executed for multi 
 transactions, it would need adaptation for multi transactions. What if, after 
 the first failed transaction in a multi txn during deserialization, there 
 were subsequent transactions in the same multi that would also have failed?
 We don't know, since the first failed transaction hides the information about 
 the remaining transactions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1136) NEW_LEADER should be queued not sent to match the Zab 1.0 protocol on the twiki

2011-11-01 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141234#comment-13141234
 ] 

Camille Fournier commented on ZOOKEEPER-1136:
-

This change causes a concurrency bug. Specifically:
1. Follower rejoins, gets snap from leader
2. Follower gets NEWLEADER message and takes a snapshot
3. Follower gets some additional transactions forwarded from leader, applies 
these directly to data tree
4. Follower gets an UPTODATE message, does not take a snapshot
5. Follower starts following, writes some new transactions to its log, and is 
killed before it takes another snapshot
6. Follower restarts and gets a DIFF from the leader

The transactions that came in between NEWLEADER and UPTODATE are lost because 
they never go anywhere but the internal data tree, and if that tree isn't 
snapshotted and the follower restarts with only a DIFF, the follower will lose 
these transactions.

I think the proper thing to do is snapshot after UPTODATE, but I'm not sure why 
we changed this to snapshot after NEWLEADER instead. The wiki doesn't seem to 
explain that clearly. If one of you could check on 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264 and let me know the 
reasoning, that would be helpful.
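
For illustration, a paraphrased sketch of the follower's sync loop with the 
snapshot moved to UPTODATE, the fix direction suggested above. This is not the 
real Learner.syncWithLeader; the packet-type constants mirror Leader.java and 
the helper methods are stand-ins for the Learner plumbing.
{noformat}
// Paraphrased sketch, not the actual Learner.syncWithLeader implementation.
abstract class SyncLoopSketch {
    // mirror Leader.{PROPOSAL,COMMIT,NEWLEADER,UPTODATE} packet types
    static final int PROPOSAL = 2, COMMIT = 4, NEWLEADER = 10, UPTODATE = 12;

    abstract int readPacketType();   // stand-in: next packet from the leader
    abstract void applyToDataTree(); // stand-in: apply a proposal/commit
    abstract void takeSnapshot();    // stand-in: ZooKeeperServer.takeSnapshot()

    void syncWithLeader() {
        boolean synced = false;
        while (!synced) {
            switch (readPacketType()) {
            case PROPOSAL:
            case COMMIT:
                // Txns arriving between NEWLEADER and UPTODATE go straight to
                // the in-memory tree, bypassing the txn log (steps 3-5 above).
                applyToDataTree();
                break;
            case NEWLEADER:
                // Snapshotting here loses those txns if the follower dies
                // before its next snapshot and later rejoins with a DIFF...
                break;
            case UPTODATE:
                // ...so snapshot only once the leader says we are caught up.
                takeSnapshot();
                synced = true;
                break;
            }
        }
    }
}
{noformat}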

 NEW_LEADER should be queued not sent to match the Zab 1.0 protocol on the 
 twiki
 ---

 Key: ZOOKEEPER-1136
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1136
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Benjamin Reed
Assignee: Benjamin Reed
Priority: Blocker
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-1136.patch, ZOOKEEPER-1136.patch, 
 ZOOKEEPER-1136.patch


 the NEW_LEADER message was sent at the beginning of the sync phase in Zab 
 pre1.0, but it must be at the end in Zab 1.0. if the protocol is 1.0 or 
 greater we need to queue rather than send the packet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-11-01 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141235#comment-13141235
 ] 

Camille Fournier commented on ZOOKEEPER-1264:
-

From a comment I added to the tracker that this change was attached to:
ZOOKEEPER-1136 causes a concurrency bug. Specifically:
1. Follower rejoins, gets snap from leader
2. Follower gets NEWLEADER message and takes a snapshot
3. Follower gets some additional transactions forwarded from leader, applies 
these directly to data tree
4. Follower gets an UPTODATE message, does not take a snapshot
5. Follower starts following, writes some new transactions to its log, and is 
killed before it takes another snapshot
6. Follower restarts and gets a DIFF from the leader

The transactions that came in between NEWLEADER and UPTODATE are lost because 
they never go anywhere but the internal data tree, and if that tree isn't 
snapshotted and the follower restarts with only a DIFF, the follower will lose 
these transactions.

I think the proper thing to do is snapshot after UPTODATE, but I'm not sure why 
we changed this to snapshot after NEWLEADER instead. The wiki doesn't seem to 
explain that clearly. 

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, 
 ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz, logs.zip, 
 tmp.zip


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 I saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:<11741> but was:<14001>
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-11-01 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141248#comment-13141248
 ] 

Camille Fournier commented on ZOOKEEPER-1264:
-

Thanks Ben. The patch I attached changes both Learner and 
FollowerResyncConcurrencyTest. You should be able to repro the failure with 
testResyncBySnapThenDiffAfterFollowerCrashes pretty reliably. You can ignore 
the changes in Learner (just move the snap to after UPTODATE instead of 
NEWLEADER).

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, 
 ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, 
 followerresyncfailure_log.txt.gz, logs.zip, tmp.zip


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 I saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:<11741> but was:<14001>
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-11-01 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141285#comment-13141285
 ] 

Camille Fournier commented on ZOOKEEPER-1264:
-

Yeah, sorry, these concurrency tests are pretty much impossible to write 
deterministically without some additional scaffolding. If you look at lines 
152-158 of the test, you want the thread that I started to have transactions 
passing through the leader when the qu.restart at 153 loads the follower. The 
follower should get a snapshot from the leader, a few more pending 
transactions, and then additional transactions that cause a log file to be 
written whose first zxid is not the zxid of the snapshot it created + 1. For 
example, from Pat's log:
2011-10-28 17:09:56,691 [myid:] - INFO  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:11221:FileTxnSnapLog@255] - Snapshotting: 
12322

(indicating the NEWLEADER)
then 

2011-10-28 17:09:59,316 [myid:] - WARN  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:11221:Follower@118] - Got zxid 0x12c3e 
expected 0x1
2011-10-28 17:09:59,330 [myid:] - INFO  [SyncThread:1:FileTxnLog@195] - 
Creating new log file: log.12c3e
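
To make the symptom concrete: the snapshot covers up to zxid 0x12322, but the 
next log file starts at 0x12c3e, so zxids 0x12323 through 0x12c3d are recorded 
nowhere on disk. A trivial hypothetical check (nothing like this exists in 
ZooKeeper itself):
{noformat}
class ZxidGapCheck {
    // A clean resync should start logging at the snapshot's zxid + 1.
    static boolean hasGap(long snapshotZxid, long firstLogZxid) {
        return firstLogZxid != snapshotZxid + 1;
    }

    public static void main(String[] args) {
        // Matches Pat's log: snapshot at 0x12322, next log file at 0x12c3e.
        System.out.println(hasGap(0x12322L, 0x12c3eL)); // prints true
    }
}
{noformat}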





 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, 
 ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, 
 followerresyncfailure_log.txt.gz, logs.zip, tmp.zip


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 I saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:<11741> but was:<14001>
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1100) Killed (or missing) SendThread will cause hanging threads

2011-11-01 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141487#comment-13141487
 ] 

Camille Fournier commented on ZOOKEEPER-1100:
-

I'm reviewing this issue. Can I get some clarity? Is the issue that you get a 
runtime exception outside of the try block, after while (state.isAlive()), so 
the thread dies and callers hang? Why put the try block there instead of 
around the entire method?
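
For illustration, a sketch of the alternative being asked about: putting the 
try around the whole loop so anything thrown anywhere flips the state instead 
of silently killing the thread. This paraphrases only the shape of the client 
loop; it is not the real ClientCnxn.SendThread, and both helper methods are 
stand-ins.
{noformat}
abstract class SendThreadSketch extends Thread {
    volatile boolean alive = true;                  // stand-in for state.isAlive()

    abstract void doTransport() throws Exception;   // stand-in: connect/read/write
    abstract void markClosedAndNotify(Throwable t); // stand-in: flip state, wake senders

    @Override
    public void run() {
        try {
            while (alive) {
                doTransport();
            }
        } catch (Throwable t) {
            // Reached even for a RuntimeException or Error thrown outside the
            // inner I/O handling, so the thread cannot die leaving blocked
            // senders waiting forever.
            markClosedAndNotify(t);
        }
    }
}
{noformat}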

 Killed (or missing) SendThread will cause hanging threads
 -

 Key: ZOOKEEPER-1100
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1100
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.3.3
 Environment: 
 http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E
Reporter: Gunnar Wagenknecht
Assignee: Rakesh R
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1100.patch


 After investigating an issue with [hanging 
 threads|http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E]
  I noticed that any java.lang.Error might silently kill the SendThread. 
 Without a SendThread any thread that wants to send something will hang 
 forever. 
 Currently nobody will recognize a SendThread that died. I think at least a 
 state should be flipped (or flag should be set) that causes all further send 
 attempts to fail or to re-spin the connection loop.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1100) Killed (or missing) SendThread will cause hanging threads

2011-11-01 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141526#comment-13141526
 ] 

Camille Fournier commented on ZOOKEEPER-1100:
-

More to the point, are you expecting just a watcher event for this? As it 
stands, if your send thread dies you will still have send requests hang even 
with a cleanup call because the state doesn't change to anything but 
CONNECTING. If just getting a watch event and notification on pending send 
requests is fine, then I think we can work with this.
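
For illustration, a sketch of that notification: fail every pending request 
with CONNECTIONLOSS and hand a Disconnected event to the event thread. Packet, 
pendingQueue and queueEvent paraphrase the client internals; only WatchedEvent 
and its arguments are real ZooKeeper API.
{noformat}
import java.util.ArrayDeque;
import java.util.Queue;

import org.apache.zookeeper.KeeperException.Code;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.Watcher.Event.KeeperState;

class DeathNotificationSketch {
    interface Packet { void finishWithError(int rc); }       // stand-in
    interface EventSink { void queueEvent(WatchedEvent e); } // stand-in

    final Queue<Packet> pendingQueue = new ArrayDeque<Packet>();
    EventSink eventThread;

    void notifySendThreadDeath() {
        Packet p;
        while ((p = pendingQueue.poll()) != null) {
            // wakes whoever is blocked waiting on this request
            p.finishWithError(Code.CONNECTIONLOSS.intValue());
        }
        eventThread.queueEvent(new WatchedEvent(
                EventType.None, KeeperState.Disconnected, null));
    }
}
{noformat}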

 Killed (or missing) SendThread will cause hanging threads
 -

 Key: ZOOKEEPER-1100
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1100
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.3.3
 Environment: 
 http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E
Reporter: Gunnar Wagenknecht
Assignee: Rakesh R
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1100.patch


 After investigating an issue with [hanging 
 threads|http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E]
  I noticed that any java.lang.Error might silently kill the SendThread. 
 Without a SendThread any thread that wants to send something will hang 
 forever. 
 Currently nobody will recognize a SendThread that died. I think at least a 
 state should be flipped (or flag should be set) that causes all further send 
 attempts to fail or to re-spin the connection loop.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-10-30 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13139821#comment-13139821
 ] 

Camille Fournier commented on ZOOKEEPER-1264:
-

Got this reproduced on my local box with yet more hacks to the test and a few 
sleeps in the source code. Should be close to figuring out the problem, 
probably tomorrow sometime. Stay tuned.

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, 
 ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz, logs.zip, 
 tmp.zip


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 I saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:<11741> but was:<14001>
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-10-30 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13139879#comment-13139879
 ] 

Camille Fournier commented on ZOOKEEPER-1264:
-

OK, I found the bug. Ben, we could use your attention here.

The problem is that we queue NEWLEADER before we queue UPTODATE, but in between 
these messages we send more sync packets to move us from SNAP to, well, 
UPTODATE. These get written directly to the data tree, bypassing the log. But 
if you immediately shut down the ZK before snapshotting again, you will lose 
any record of these transactions on the ZK in question. It seems to me that we 
should either snapshot again on UPTODATE or else wait to snapshot in the first 
place until that packet is sent. I don't understand why we moved to snapshot on 
NEWLEADER in the first place. If one of the ZAB 1.0 authors could comment, that 
would be useful.

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, 
 ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz, logs.zip, 
 tmp.zip


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 I saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:<11741> but was:<14001>
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1269) Multi deserialization issues

2011-10-29 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13139348#comment-13139348
 ] 

Camille Fournier commented on ZOOKEEPER-1269:
-

Right, ok. So I think the patch attached to this issue does exactly that, if 
someone would like to review it. What I'm not sure about is whether the test I 
put in is particularly good, so I would really appreciate one of the multi 
experts taking a gander there.

 Multi deserialization issues
 

 Key: ZOOKEEPER-1269
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1269
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.0
Reporter: Camille Fournier
 Attachments: ZOOKEEPER-1269.patch


 From the mailing list:
 FileTxnSnapLog.restore contains a code block handling a NODEEXISTS failure 
 during deserialization. The problem is explained there in a code comment. The 
 code block however is only executed for a CREATE txn, not for a multiTxn 
 containing a CREATE.
 Even if the mentioned code block would also be executed for multi 
 transactions, it would need adaptation for multi transactions. What if, after 
 the first failed transaction in a multi txn during deserialization, there 
 were subsequent transactions in the same multi that would also have failed?
 We don't know, since the first failed transaction hides the information about 
 the remaining transactions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-10-28 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138388#comment-13138388
 ] 

Camille Fournier commented on ZOOKEEPER-1264:
-

This looks like a good cleanup, thanks Patrick.

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Patrick Hunt
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, 
 ZOOKEEPER-1264_branch34.patch


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 I saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:<11741> but was:<14001>
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-10-28 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138397#comment-13138397
 ] 

Camille Fournier commented on ZOOKEEPER-1264:
-

Committed to trunk, 3.3.4 and 3.4 branches.

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Patrick Hunt
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, 
 ZOOKEEPER-1264_branch34.patch


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 I saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:<11741> but was:<14001>
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1269) Multi deserialization issues

2011-10-28 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138796#comment-13138796
 ] 

Camille Fournier commented on ZOOKEEPER-1269:
-

The test here is a little bit vague, because I don't really understand how a 
proper but broken multitxn would look. Handling the error codes in 
FileTxnSnapLog is also a bit fuzzy. But I think the general refactor should fix 
the issue. Would be great if Marshall could take a look at this to verify.

 Multi deserialization issues
 

 Key: ZOOKEEPER-1269
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1269
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.0
Reporter: Camille Fournier
 Attachments: ZOOKEEPER-1269.patch


 From the mailing list:
 FileTxnSnapLog.restore contains a code block handling a NODEEXISTS failure 
 during deserialization. The problem is explained there in a code comment. The 
 code block however is only executed for a CREATE txn, not for a multiTxn 
 containing a CREATE.
 Even if the mentioned code block would also be executed for multi 
 transactions, it would need adaptation for multi transactions. What if, after 
 the first failed transaction in a multi txn during deserialization, there 
 were subsequent transactions in the same multi that would also have failed?
 We don't know, since the first failed transaction hides the information about 
 the remaining transactions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-10-28 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138894#comment-13138894
 ] 

Camille Fournier commented on ZOOKEEPER-1264:
-

Looking.

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, 
 ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 I saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:<11741> but was:<14001>
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-10-28 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138963#comment-13138963
 ] 

Camille Fournier commented on ZOOKEEPER-1264:
-

It might also be somewhat helpful if you could send me the txn logs from the 
test servers but I realize that might be too much to ask.

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, 
 ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 I saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:<11741> but was:<14001>
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1269) Multi deserialization issues

2011-10-28 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138984#comment-13138984
 ] 

Camille Fournier commented on ZOOKEEPER-1269:
-

Are you sure about that given 
https://issues.apache.org/jira/browse/ZOOKEEPER-1046?

 Multi deserialization issues
 

 Key: ZOOKEEPER-1269
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1269
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.0
Reporter: Camille Fournier
 Attachments: ZOOKEEPER-1269.patch


 From the mailing list:
 FileTxnSnapLog.restore contains a code block handling a NODEEXISTS failure 
 during deserialization. The problem is explained there in a code comment. The 
 code block however is only executed for a CREATE txn, not for a multiTxn 
 containing a CREATE.
 Even if the mentioned code block would also be executed for multi 
 transactions, it would need adaptation for multi transactions. What if, after 
 the first failed transaction in a multi txn during deserialization, there 
 were subsequent transactions in the same multi that would also have failed?
 We don't know, since the first failed transaction hides the information about 
 the remaining transactions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-10-28 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138987#comment-13138987
 ] 

Camille Fournier commented on ZOOKEEPER-1264:
-

Yeah, I spent a bit of time looking at this. I have a few ideas but it would 
probably go a lot faster if I had logs to examine since I can't seem to repro 
it myself. If you can get me some I will look more this weekend.

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, 
 ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 I saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:<11741> but was:<14001>
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1269) Multi deserialization issues

2011-10-28 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13139004#comment-13139004
 ] 

Camille Fournier commented on ZOOKEEPER-1269:
-

Ah yes, being in the log would be enough for it to be true if snapshots were 
taken in a frozen system state. But since they are not, you can have these 
operations fail in playback due to concurrency issues. Multi isn't a special 
case above the other zk ops; they all have this potential race.

 Multi deserialization issues
 

 Key: ZOOKEEPER-1269
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1269
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.0
Reporter: Camille Fournier
 Attachments: ZOOKEEPER-1269.patch


 From the mailing list:
 FileTxnSnapLog.restore contains a code block handling a NODEEXISTS failure 
 during deserialization. The problem is explained there in a code comment. The 
 code block however is only executed for a CREATE txn, not for a multiTxn 
 containing a CREATE.
 Even if the mentioned code block would also be executed for multi 
 transactions, it would need adaptation for multi transactions. What if, after 
 the first failed transaction in a multi txn during deserialization, there 
 were subsequent transactions in the same multi that would also have failed?
 We don't know, since the first failed transaction hides the information about 
 the remaining transactions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

2011-10-28 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13139019#comment-13139019
 ] 

Camille Fournier commented on ZOOKEEPER-1264:
-

Thanks Patrick. My suspicions were true: the failing zk has a chunk missing 
from its logs that corresponds to the missing ephemeral nodes (snapshot 
snapshot.12322, log log.12c3e, but the earlier log file doesn't have 
txns between 2322 and 2c3e; they seem to just be missing). Now to figure out 
why it doesn't have those transactions...

 FollowerResyncConcurrencyTest failing intermittently
 

 Key: ZOOKEEPER-1264
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.3.3, 3.4.0, 3.5.0
Reporter: Patrick Hunt
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, 
 ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz, logs.zip, 
 tmp.zip


 The FollowerResyncConcurrencyTest test is failing intermittently. 
 I saw the following on 3.4:
 {noformat}
 junit.framework.AssertionFailedError: Should have same number of
 ephemerals in both followers expected:<11741> but was:<14001>
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
at 
 org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
at 
 org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1246) Dead code in PrepRequestProcessor catch Exception block

2011-10-26 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13136308#comment-13136308
 ] 

Camille Fournier commented on ZOOKEEPER-1246:
-

Thanks for migrating this to trunk, Patrick!

 Dead code in PrepRequestProcessor catch Exception block
 ---

 Key: ZOOKEEPER-1246
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1246
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Thomas Koch
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1246.patch, ZOOKEEPER-1246.patch, 
 ZOOKEEPER-1246_trunk.patch, ZOOKEEPER-1246_trunk.patch


 This is a regression introduced by ZOOKEEPER-965 (multi transactions). The 
 catch(Exception e) block in PrepRequestProcessor.pRequest contains an if 
 block with condition request.getHdr() != null. This condition will always 
 evaluate to false since the changes in ZOOKEEPER-965.
 This is caused by a change in sequence: Before ZK-965, the txnHeader was set 
 _before_ the deserialization of the request. Afterwards the deserialization 
 happens before request.setHdr is set. So the following RequestProcessors 
 won't see the request as a failed one but as a Read request, since it doesn't 
 have a hdr set.
 Notes:
 - it is very bad practice to catch Exception. The block should rather catch 
 IOException
 - The check whether the TxnHeader is set in the request is used at several 
 places to see whether the request is a read or write request. It isn't 
 obvious to a newbie what it means for a request to have a hdr set or not.
 - at the beginning of pRequest the hdr and txn of request are set to null. 
 However there is no chance that these fields could ever not be null at this 
 point. The code however suggests that this could be the case. There should 
 rather be an assertion that confirms that these fields are indeed null. The 
 practice of doing things just in case, even if there is no chance that this 
 case could happen, is a very stinky code smell and means that the code isn't 
 understandable or trustworthy.
 - The multi transaction switch case block in pRequest is very hard to read, 
 because it misuses the request.{hdr|txn} fields as local variables.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1248) multi transaction sets request.exception without reason

2011-10-25 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13135362#comment-13135362
 ] 

Camille Fournier commented on ZOOKEEPER-1248:
-

It's those damned read-only mode tests, which seem to be so buggy, that are 
failing. Do we think this failure is meaningful or not?

 multi transaction sets request.exception without reason
 ---

 Key: ZOOKEEPER-1248
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1248
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Thomas Koch
Assignee: Thomas Koch
 Attachments: ZOOKEEPER-1248.patch, ZOOKEEPER-1248.patch


 I'm trying to understand the purpose of the exception field in request. This 
 isn't made easier by the fact that the multi case in PrepRequestProcessor 
 sets the exception without reason.
 The only code that calls request.getException() is in FinalRequestProcessor 
 and this code only acts when the operation _is not_ a multi operation.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1246) Dead code in PrepRequestProcessor catch Exception block

2011-10-25 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13135376#comment-13135376
 ] 

Camille Fournier commented on ZOOKEEPER-1246:
-

Ok, after a bit of looking, it appears that what we need to do is catch 
IOException and appropriately raise it as a marshalling error. I am going to 
see what I can do to get a test for this.

 Dead code in PrepRequestProcessor catch Exception block
 ---

 Key: ZOOKEEPER-1246
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1246
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Thomas Koch
Priority: Blocker
 Fix For: 3.4.0, 3.5.0


 This is a regression introduced by ZOOKEEPER-965 (multi transactions). The 
 catch(Exception e) block in PrepRequestProcessor.pRequest contains an if 
 block with condition request.getHdr() != null. This condition will always 
 evaluate to false since the changes in ZOOKEEPER-965.
 This is caused by a change in sequence: Before ZK-965, the txnHeader was set 
 _before_ the deserialization of the request. Afterwards the deserialization 
 happens before request.setHdr is set. So the following RequestProcessors 
 won't see the request as a failed one but as a Read request, since it doesn't 
 have a hdr set.
 Notes:
 - it is very bad practice to catch Exception. The block should rather catch 
 IOException
 - The check whether the TxnHeader is set in the request is used at several 
 places to see whether the request is a read or write request. It isn't 
 obvious to a newbie what it means for a request to have a hdr set or not.
 - at the beginning of pRequest the hdr and txn of request are set to null. 
 However there is no chance that these fields could ever not be null at this 
 point. The code however suggests that this could be the case. There should 
 rather be an assertion that confirms that these fields are indeed null. The 
 practice of doing things just in case, even if there is no chance that this 
 case could happen, is a very stinky code smell and means that the code isn't 
 understandable or trustworthy.
 - The multi transaction switch case block in pRequest is very hard to read, 
 because it missuses the request.{hdr|txn} fields as local variables.
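As an illustration of the ordering problem described above, here is a hedged, 
simplified sketch; all names are stand-ins for the real pRequest code, not the 
actual sources:
{noformat}
// Simplified stand-in for the post-ZOOKEEPER-965 ordering: deserialization
// throws before setHdr is ever reached, so the check inside the catch
// block can never be true.
class DeadCodeSketch {
    void pRequest(Request request) {
        try {
            deserialize(request);          // may throw
            request.setHdr(new Object());  // only reached on success
        } catch (Exception e) {            // bad practice: too broad
            if (request.getHdr() != null) {
                // unreachable: hdr is still null whenever we get here
            }
        }
    }
    void deserialize(Request request) throws Exception { /* ... */ }
    static class Request {
        private Object hdr;
        void setHdr(Object h) { hdr = h; }
        Object getHdr() { return hdr; }
    }
}
{noformat}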

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1246) Dead code in PrepRequestProcessor catch Exception block

2011-10-25 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13135410#comment-13135410
 ] 

Camille Fournier commented on ZOOKEEPER-1246:
-

Formatting may be wack, and I haven't gone over it with a fine-tooth comb, but 
I think this patch takes care of it.

 Dead code in PrepRequestProcessor catch Exception block
 ---

 Key: ZOOKEEPER-1246
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1246
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Thomas Koch
Priority: Blocker
 Fix For: 3.4.0, 3.5.0

 Attachments: ZOOKEEPER-1246.patch


 This is a regression introduced by ZOOKEEPER-965 (multi transactions). The 
 catch(Exception e) block in PrepRequestProcessor.pRequest contains an if 
 block with condition request.getHdr() != null. This condition will always 
 evaluate to false since the changes in ZOOKEEPER-965.
 This is caused by a change in sequence: before ZK-965, the txnHeader was set 
 _before_ the deserialization of the request. Afterwards, the deserialization 
 happens before request.setHdr is called. So the following RequestProcessors 
 won't see the request as a failed one but as a read request, since it doesn't 
 have a hdr set.
 Notes:
 - It is very bad practice to catch Exception. The block should rather catch 
 IOException.
 - The check whether the TxnHeader is set in the request is used at several 
 places to decide whether the request is a read or a write request. It isn't 
 obvious to a newbie what it means for a request to have a hdr set or not.
 - At the beginning of pRequest, the hdr and txn of the request are set to 
 null, even though there is no chance that these fields could be non-null at 
 this point. The code nevertheless suggests that this could be the case. There 
 should rather be an assertion confirming that these fields are indeed null. 
 Doing things just in case, even when the case cannot happen, is a strong code 
 smell and means that the code isn't understandable or trustworthy.
 - The multi transaction switch/case block in pRequest is very hard to read, 
 because it misuses the request.{hdr|txn} fields as local variables.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1243) New 4lw for short simple monitoring ldck

2011-10-24 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13134493#comment-13134493
 ] 

Camille Fournier commented on ZOOKEEPER-1243:
-

Added html docs, removed println in test. Can someone please review this? We've 
been suffering heavily from ZOOKEEPER-1197, and I would really appreciate it if 
we could get this into 3.4.

 New 4lw for short simple monitoring ldck
 

 Key: ZOOKEEPER-1243
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1243
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.3.3, 3.4.0
Reporter: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0

 Attachments: ZOOKEEPER-1243-2, ZOOKEEPER-1243-4.patch, 
 ZOOKEEPER-1243.patch


 The existing monitoring fails so often due to 
 https://issues.apache.org/jira/browse/ZOOKEEPER-1197 that we need a 
 workaround. This introduces a short 4lw called ldck that just runs 
 ServerStats.toString to get information about the server's leadership status.
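For context, a 4lw is exercised by writing the four letters to the client port 
and reading the plain-text reply. The sketch below assumes localhost:2181 and 
uses the ldck word proposed here; note that ldck never shipped, since per the 
follow-up comments the existing srvr word covers the same ground.
{noformat}
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hedged sketch of probing a four-letter word; the host, port and the
// ldck word itself are assumptions from this thread, not shipped behavior.
public class FourLetterWordProbe {
    public static void main(String[] args) throws Exception {
        try (Socket sock = new Socket("localhost", 2181)) {
            OutputStream out = sock.getOutputStream();
            out.write("ldck".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            // The server writes its reply and closes the connection,
            // so reading until EOF yields the full response.
            InputStream in = sock.getInputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.print(new String(buf, 0, n, StandardCharsets.US_ASCII));
            }
        }
    }
}
{noformat}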

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1243) New 4lw for short simple monitoring ldck

2011-10-24 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13134558#comment-13134558
 ] 

Camille Fournier commented on ZOOKEEPER-1243:
-

Oh, you are right; I thought it was weird that we didn't have this. Why we 
chose to put the srvr command in the same command thread as stat, with the only 
differentiator being a guarding if statement, is another question... Ok, I will 
close this, thanks.

 New 4lw for short simple monitoring ldck
 

 Key: ZOOKEEPER-1243
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1243
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.3.3, 3.4.0
Reporter: Camille Fournier
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0

 Attachments: ZOOKEEPER-1243-2, ZOOKEEPER-1243-4.patch, 
 ZOOKEEPER-1243.patch


 The existing monitoring fails so often due to 
 https://issues.apache.org/jira/browse/ZOOKEEPER-1197 that we need a 
 workaround. This introduces a short 4lw called ldck that just runs 
 ServerStats.toString to get information about the server's leadership status.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1243) New 4lw for short simple monitoring ldck

2011-10-24 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13134564#comment-13134564
 ] 

Camille Fournier commented on ZOOKEEPER-1243:
-

Indeed... put it on the todo list. 

 New 4lw for short simple monitoring ldck
 

 Key: ZOOKEEPER-1243
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1243
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.3.3, 3.4.0
Reporter: Camille Fournier
Assignee: Camille Fournier
Priority: Blocker
 Fix For: 3.3.4, 3.4.0

 Attachments: ZOOKEEPER-1243-2, ZOOKEEPER-1243-4.patch, 
 ZOOKEEPER-1243.patch


 The existing monitoring fails so often due to 
 https://issues.apache.org/jira/browse/ZOOKEEPER-1197 that we need a 
 workaround. This introduces a short 4lw called ldck that just runs 
 ServerStats.toString to get information about the server's leadership status.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (ZOOKEEPER-1237) ERRORs being logged when queued responses are sent after socket has closed.

2011-10-20 Thread Camille Fournier (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13132146#comment-13132146
 ] 

Camille Fournier commented on ZOOKEEPER-1237:
-

Why do we ignore that exception in sendBuffer, instead of closing the 
connection at that point?
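A hedged sketch of that alternative, with stand-in names for the NIOServerCnxn 
internals rather than the actual code, would catch the exception where the 
selection key is touched and release the connection instead of logging an 
ERROR:
{noformat}
import java.nio.ByteBuffer;
import java.nio.channels.CancelledKeyException;

// Hedged sketch, not the actual NIOServerCnxn implementation: on a
// cancelled key the peer is already gone, so drop the response and
// close the connection rather than surfacing an ERROR for a benign race.
class CnxnSketch {
    void sendBuffer(ByteBuffer bb) {
        try {
            enableWriteInterest(bb);   // stand-in for the interestOps call
        } catch (CancelledKeyException e) {
            close();                   // socket already closed by the peer
        }
    }
    void enableWriteInterest(ByteBuffer bb) { /* ... */ }
    void close() { /* ... */ }
}
{noformat}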

 ERRORs being logged when queued responses are sent after socket has closed.
 ---

 Key: ZOOKEEPER-1237
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1237
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.4, 3.4.0, 3.5.0
Reporter: Patrick Hunt
 Fix For: 3.3.4, 3.4.0, 3.5.0


 After applying ZOOKEEPER-1049 to 3.3.3 (I believe the same problem exists in 
 3.4/3.5 but haven't tested this) I'm seeing the following exception more 
 frequently:
 {noformat}
 Oct 19, 1:31:53 PM ERROR
 Unexpected Exception:
 java.nio.channels.CancelledKeyException
 at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
 at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
 at 
 org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
 at 
 org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
 at 
 org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:367)
 at 
 org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
 {noformat}
 This is a long-standing problem where we try to send a response after the 
 socket has been closed. Prior to ZOOKEEPER-1049 this issue happened much 
 less frequently (2 sec linger), but I believe it was possible. The timing 
 window is just wider now.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



