[jira] [Commented] (ZOOKEEPER-1448) Node+Quota creation in transaction log can crash leader startup
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255749#comment-13255749 ]

Camille Fournier commented on ZOOKEEPER-1448:
---------------------------------------------

Good catch. Can you provide a patch for this?

Node+Quota creation in transaction log can crash leader startup
---------------------------------------------------------------

Key: ZOOKEEPER-1448
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1448
Project: ZooKeeper
Issue Type: Bug
Components: server
Affects Versions: 3.3.5
Reporter: Botond Hejj
Fix For: 3.3.6

Hi,

I've found a bug in ZooKeeper related to quota creation which can shut down the ZooKeeper leader on startup.

Steps to reproduce:
1. create /quota_bug
2. setquota -n 1 /quota_bug
3. stop the whole ensemble (the previous operations should be in the transaction log)
4. start all the servers
5. the elected leader will shut down with an exception (Missing stat node for count /zookeeper/quota/quota_bug/zookeeper_stats)

I've debugged a bit of what is happening and found the following problem: on startup, each server loads the last snapshot and replays the latest transaction log. While doing this it fills up the pTrie variable of the DataTree with the paths of the nodes that have a quota. After the leader is elected, the leader loads the snapshot and latest transaction log again, but it doesn't clean up the pTrie variable, which means it still contains the /quota_bug path. Now, when the create of /quota_bug is replayed from the transaction log, the DataTree already believes the quota nodes (/zookeeper/quota/quota_bug/zookeeper_limits and /zookeeper/quota/quota_bug/zookeeper_stats) exist, but those nodes are actually created later in the transaction log. This leads to the missing stat node exception. I think clearing the pTrie should solve this problem.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
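The reporter's proposed fix (clearing the pTrie before a leader replays its log) can be sketched as follows. This is a minimal, self-contained illustration of the idea, not ZooKeeper's actual DataTree code: the class and method names (`QuotaPathTrie`, `addPath`, `clear`) are hypothetical stand-ins.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the fix proposed above: before a newly elected
// leader replays its snapshot and transaction log, the tree of
// quota-enabled paths must be reset, so that replayed create operations
// rebuild it from scratch instead of seeing stale entries.
class QuotaPathTrie {
    private final Set<String> quotaPaths = new HashSet<>();

    void addPath(String path)     { quotaPaths.add(path); }

    boolean hasQuota(String path) { return quotaPaths.contains(path); }

    // The essence of the fix: forget all quota paths before replay.
    // A stale /quota_bug entry here is what makes the tree look up
    // stat nodes that the log has not created yet.
    void clear()                  { quotaPaths.clear(); }
}
```

With this, a second load of the transaction log starts from an empty trie, so the quota machinery only activates once the `zookeeper_limits`/`zookeeper_stats` creates have actually been replayed.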
[jira] [Commented] (ZOOKEEPER-1449) Ephemeral znode not deleted after session has expired on one follower (quorum is in an inconsistent state)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255836#comment-13255836 ]

Camille Fournier commented on ZOOKEEPER-1449:
---------------------------------------------

Can you reproduce it with a more recent release? 3.3.3 is a bit old at this point and we've fixed a few things between that and 3.3.5.

Ephemeral znode not deleted after session has expired on one follower (quorum is in an inconsistent state)
----------------------------------------------------------------------------------------------------------

Key: ZOOKEEPER-1449
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1449
Project: ZooKeeper
Issue Type: Bug
Reporter: Daniel Lord
Attachments: zk.zip

I've been running into this situation in our labs fairly regularly, where we'll get a single follower in an inconsistent state with dangling ephemeral znodes. Here is all of the information that I have right now; please ask if there is anything else that would be useful.

Here is a quick snapshot of the state of the ensemble, where you can see it is out of sync across several znodes:

-bash-3.2$ echo srvr | nc il23n04sa-zk001 2181
Zookeeper version: 3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
Latency min/avg/max: 0/7/25802
Received: 64002
Sent: 63985
Outstanding: 0
Zxid: 0x50a41
Mode: follower
Node count: 497

-bash-3.2$ echo srvr | nc il23n04sa-zk002 2181
Zookeeper version: 3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
Latency min/avg/max: 0/13/79032
Received: 74320
Sent: 74276
Outstanding: 0
Zxid: 0x50a41
Mode: leader
Node count: 493

-bash-3.2$ echo srvr | nc il23n04sa-zk003 2181
Zookeeper version: 3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
Latency min/avg/max: 0/2/25234
Received: 187310
Sent: 187320
Outstanding: 0
Zxid: 0x50a41
Mode: follower
Node count: 493

All of the zxids match up just fine, but zk001 has 4 more nodes in its node count than the other two (including the leader).

When I use a ZooKeeper client to connect directly to zk001, I can see the following znode that should no longer exist:

[zk: localhost:2181(CONNECTED) 0] stat /siri/Douroucouli/clients/il23n04sa-app004.siri.apple.com:38096
cZxid = 0x4001a
ctime = Mon Apr 16 11:00:47 PDT 2012
mZxid = 0x4001a
mtime = Mon Apr 16 11:00:47 PDT 2012
pZxid = 0x4001a
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x236bc504cb50002
dataLength = 0
numChildren = 0

This node does not exist when using the client to connect to either of the other two members of the ensemble. I searched through the logs for that session id and it looks like it was established and closed cleanly. There were several leadership/quorum problems during the course of the session, but it looks like it should have been shut down properly. Neither the session nor the znode show up in the dump on the leader, but the problem znode does show up in the dump on zk001.

2012-04-16 11:00:47,637 - INFO [CommitProcessor:2:NIOServerCnxn@1580] - Established session 0x236bc504cb50002 with negotiated timeout 15000 for client /17.202.71.201:38971
2012-04-16 11:20:59,341 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew session 0x236bc504cb50002 at /17.202.71.201:50841
2012-04-16 11:20:59,342 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1580] - Established session 0x236bc504cb50002 with negotiated timeout 15000 for client /17.202.71.201:50841
2012-04-16 11:21:09,343 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to read additional data from client sessionid 0x236bc504cb50002, likely client has closed socket
2012-04-16 11:21:09,343 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for client /17.202.71.201:50841 which had sessionid 0x236bc504cb50002
2012-04-16 11:21:20,352 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1435] - Closed socket connection for client /17.202.71.201:38971 which had sessionid 0x236bc504cb50002
2012-04-16 11:21:22,151 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew session 0x236bc504cb50002 at /17.202.71.201:38166
2012-04-16 11:21:22,152 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1580] - Established session 0x236bc504cb50002 with negotiated timeout 15000 for client /17.202.71.201:38166
2012-04-16 11:27:17,902 - INFO [ProcessThread:-1:PrepRequestProcessor@387] - Processed session termination for sessionid: 0x236bc504cb50002
2012-04-16 11:27:17,904 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for client /17.202.71.201:38166 which had
[jira] [Commented] (ZOOKEEPER-1449) Ephemeral znode not deleted after session has expired on one follower (quorum is in an inconsistent state)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255902#comment-13255902 ]

Camille Fournier commented on ZOOKEEPER-1449:
---------------------------------------------

Virtualization shouldn't be a problem. It's probably one of those bugs listed above, but if not we'll definitely want to track it down.
[jira] [Commented] (ZOOKEEPER-1442) Uncaught exception handler should exit on a java.lang.Error
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253446#comment-13253446 ]

Camille Fournier commented on ZOOKEEPER-1442:
---------------------------------------------

My biggest question mark is around exiting on ThreadDeath, and I'd like to get a bit of community feedback before committing. But if I can get some color around those concerns I'm ok with the patch.

Uncaught exception handler should exit on a java.lang.Error
-----------------------------------------------------------

Key: ZOOKEEPER-1442
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1442
Project: ZooKeeper
Issue Type: Bug
Components: java client, server
Affects Versions: 3.4.3, 3.3.5
Reporter: Jeremy Stribling
Assignee: Jeremy Stribling
Priority: Minor
Attachments: ZOOKEEPER-1442.patch

The uncaught exception handler registered in NIOServerCnxnFactory and ClientCnxn simply logs exceptions and lets the rest of ZooKeeper go on its merry way. However, errors such as OutOfMemoryErrors should really crash the program, as they represent unrecoverable conditions. If the exception that reaches the uncaught exception handler is an instance of java.lang.Error, ZK should exit with an error code (in addition to logging the error).
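The behavior requested in this ticket can be sketched with a standard `Thread.UncaughtExceptionHandler`. This is an illustration of the idea, not the patch attached to the issue; the class name and exit code are arbitrary, and it preserves the commenter's open question by treating ThreadDeath like any other Error.

```java
// Minimal sketch: log every uncaught throwable, but terminate the JVM
// when the throwable is a java.lang.Error (OutOfMemoryError, etc.),
// since those represent unrecoverable conditions.
public class ExitOnErrorHandler implements Thread.UncaughtExceptionHandler {

    // Extracted as a method so the classification is independently testable.
    // Note: ThreadDeath extends Error, so it also triggers an exit here --
    // exactly the point debated in the comment above.
    static boolean shouldExit(Throwable t) {
        return t instanceof Error;
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        System.err.println("Uncaught exception in " + t.getName() + ": " + e);
        if (shouldExit(e)) {
            // halt() rather than exit(): shutdown hooks may themselves
            // fail or hang in an out-of-memory situation.
            Runtime.getRuntime().halt(1);
        }
    }
}
```

Such a handler would typically be installed once at startup via `Thread.setDefaultUncaughtExceptionHandler(new ExitOnErrorHandler())`.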
[jira] [Commented] (ZOOKEEPER-1375) SendThread is exiting after OOMError
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235102#comment-13235102 ]

Camille Fournier commented on ZOOKEEPER-1375:
---------------------------------------------

If your client throws an OOM error, there's no guarantee that you will be able to do anything at all beyond that point. It's not clear to me what you hope to do about it. What are the users going to do when they can't act themselves due to the OOM state?

SendThread is exiting after OOMError
------------------------------------

Key: ZOOKEEPER-1375
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1375
Project: ZooKeeper
Issue Type: Bug
Affects Versions: 3.4.0
Reporter: Rakesh R

After reviewing the ClientCnxn code, there is still a chance of the SendThread exiting without notifying the users. Say the client throws an OOMError and enters the Throwable catch block. Here, while sending the Disconnected event, it creates a new WatchedEvent() object. That allocation can itself throw an OOMError, causing the SendThread to exit without any Disconnected event notification.

{noformat}
try {
    // ...
} catch (Throwable e) {
    // ...
    cleanup();
    if (state.isAlive()) {
        eventThread.queueEvent(new WatchedEvent(
                Event.EventType.None,
                Event.KeeperState.Disconnected,
                null));
    }
}
{noformat}
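One way to avoid the secondary allocation failure the reporter describes is to construct the Disconnected notification once at startup, so the error path only enqueues an existing object instead of calling `new` under memory pressure. The sketch below uses a minimal stand-in for ZooKeeper's WatchedEvent class rather than the real one, so the names here are illustrative only.

```java
// Stand-in for ZooKeeper's WatchedEvent; fields simplified to strings.
class WatchedEvent {
    final String type;
    final String state;
    WatchedEvent(String type, String state) {
        this.type = type;
        this.state = state;
    }
}

class SendThreadSketch {
    // Allocated eagerly, while memory is still available. The failure
    // path below performs no allocation at all.
    static final WatchedEvent DISCONNECTED_EVENT =
            new WatchedEvent("None", "Disconnected");

    // Called from the catch (Throwable) block: returns the event to
    // enqueue without risking a second OutOfMemoryError.
    static WatchedEvent onFatalError() {
        return DISCONNECTED_EVENT;
    }
}
```

This doesn't make delivery certain under OOM (the event thread itself may be wedged, as the comment above points out), but it removes one known allocation from the error path.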
[jira] [Commented] (ZOOKEEPER-1375) SendThread is exiting after OOMError
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235271#comment-13235271 ]

Camille Fournier commented on ZOOKEEPER-1375:
---------------------------------------------

A server ran out of memory? This ticket is for the client code, not the server code. More likely NIOServerCnxn than ClientCnxn as you mention. OOM stuff can cause VMs to behave very strangely, which is why I generally think it's best to fail big and fail fast when it happens. There's not really any sense in trying to recover because beyond that point the behavior is pretty non-deterministic. Strange that the other VMs wouldn't form a quorum though... might be interesting to dig into. Feel free to open another ticket with some more info and we can dig into it more.
[jira] [Commented] (ZOOKEEPER-1407) Support GetData and GetChildren in Multi
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233631#comment-13233631 ]

Camille Fournier commented on ZOOKEEPER-1407:
---------------------------------------------

Zhihong, it's good if you change the state to Patch Available when you've got something for us to look at. We generally look at the patch available queue to determine what we need to review, etc. It will also trigger the automated build check. I've set this one to patch available.

Support GetData and GetChildren in Multi
----------------------------------------

Key: ZOOKEEPER-1407
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1407
Project: ZooKeeper
Issue Type: Improvement
Reporter: Zhihong Yu
Fix For: 3.5.0
Attachments: 1407-v2.txt, 1407.txt

There is a use case where GetData and GetChildren would participate in Multi. We should add support for this case.
[jira] [Commented] (ZOOKEEPER-1100) Killed (or missing) SendThread will cause hanging threads
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234029#comment-13234029 ]

Camille Fournier commented on ZOOKEEPER-1100:
---------------------------------------------

3.4.X and trunk, I believe. Are you seeing it in 3.4.X? We did a big refactor between 3.3.X and 3.4... I can look for a jira if you're interested.

Killed (or missing) SendThread will cause hanging threads
---------------------------------------------------------

Key: ZOOKEEPER-1100
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1100
Project: ZooKeeper
Issue Type: Bug
Components: java client
Affects Versions: 3.3.3
Environment: http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E
Reporter: Gunnar Wagenknecht
Fix For: 3.5.0
Attachments: ZOOKEEPER-1100.patch, ZOOKEEPER-1100.patch

After investigating an issue with [hanging threads|http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E] I noticed that any java.lang.Error might silently kill the SendThread. Without a SendThread, any thread that wants to send something will hang forever. Currently nobody will recognize a SendThread that died. I think at least a state should be flipped (or a flag should be set) that causes all further send attempts to fail or to re-spin the connection loop.
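The "flip a flag" idea from this ticket can be sketched as follows: record that the send thread has died in a finally block (so even a java.lang.Error flips it), and make further send attempts fail fast instead of hanging. All names here (`SendLoop`, `send`) are illustrative, not ZooKeeper's actual client internals.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: a send loop whose death is always observable to senders.
class SendLoop {
    private final AtomicBoolean dead = new AtomicBoolean(false);

    Thread start(Runnable body) {
        Thread t = new Thread(() -> {
            try {
                body.run();
            } finally {
                // Runs even if body threw an Error, making the death
                // visible to every caller of send().
                dead.set(true);
            }
        });
        t.start();
        return t;
    }

    void send(String packet) {
        if (dead.get()) {
            // Fail fast rather than queueing a packet nobody will drain.
            throw new IllegalStateException(
                    "SendThread has exited; connection is unusable");
        }
        // ... actual socket I/O would go here ...
    }
}
```

An alternative, as the ticket suggests, would be to re-spin the connection loop instead of throwing; either way the key property is that a dead send thread is no longer silent.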
[jira] [Commented] (ZOOKEEPER-1419) Leader election never settles for a 5-node cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233054#comment-13233054 ]

Camille Fournier commented on ZOOKEEPER-1419:
---------------------------------------------

I'm gonna check this in to trunk and 3.4 tonight.

Leader election never settles for a 5-node cluster
--------------------------------------------------

Key: ZOOKEEPER-1419
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1419
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection
Affects Versions: 3.4.3, 3.5.0
Environment: 64-bit Linux, all nodes running on the same machine (different ports)
Reporter: Jeremy Stribling
Assignee: Flavio Junqueira
Priority: Blocker
Fix For: 3.4.4, 3.5.0
Attachments: ZOOKEEPER-1419-fixed2.tgz, ZOOKEEPER-1419.patch, ZOOKEEPER-1419.patch, ZOOKEEPER-1419.patch

We have a situation where it seems to my untrained eye that leader election never finishes for a 5-node cluster. In this test, all nodes are ZK 3.4.3 and running on the same server (listening on different ports, of course). The nodes have server IDs of 0, 1, 2, 3, 4. The test brings up the cluster in different configurations, adding in a new node each time. We embed ZK in our application, so when we shut a node down and restart it with a new configuration, it all happens in a single JVM process.

Here's our server startup code (for the case where there's more than one node in the cluster):

{code}
if (servers.size() > 1) {
    _log.debug("Starting Zookeeper server in quorum server mode");
    _quorum_peer = new QuorumPeer();
    synchronized (_quorum_peer) {
        _quorum_peer.setClientPortAddress(clientAddr);
        _quorum_peer.setTxnFactory(log);
        _quorum_peer.setQuorumPeers(servers);
        _quorum_peer.setElectionType(_election_alg);
        _quorum_peer.setMyid(_server_id);
        _quorum_peer.setTickTime(_tick_time);
        _quorum_peer.setInitLimit(_init_limit);
        _quorum_peer.setSyncLimit(_sync_limit);
        QuorumVerifier quorumVerifier = new QuorumMaj(servers.size());
        _quorum_peer.setQuorumVerifier(quorumVerifier);
        _quorum_peer.setCnxnFactory(_cnxn_factory);
        _quorum_peer.setZKDatabase(new ZKDatabase(log));
        _quorum_peer.start();
    }
} else {
    _log.debug("Starting Zookeeper server in single server mode");
    _zk_server = new ZooKeeperServer();
    _zk_server.setTxnLogFactory(log);
    _zk_server.setTickTime(_tick_time);
    _cnxn_factory.startup(_zk_server);
}
{code}

And here's our shutdown code:

{code}
if (_quorum_peer != null) {
    synchronized (_quorum_peer) {
        _quorum_peer.shutdown();
        FastLeaderElection fle =
                (FastLeaderElection) _quorum_peer.getElectionAlg();
        fle.shutdown();
        try {
            _quorum_peer.getTxnFactory().commit();
        } catch (java.nio.channels.ClosedChannelException e) {
            // ignore
        }
    }
} else {
    _cnxn_factory.shutdown();
    _zk_server.getTxnLogFactory().commit();
}
{code}

The test steps through the following scenarios in quick succession:

Run 1: Start a 1-node cluster, servers=[0]
Run 2: Start a 2-node cluster, servers=[0,3]
Run 3: Start a 3-node cluster, servers=[0,1,3]
Run 4: Start a 4-node cluster, servers=[0,1,2,3]
Run 5: Start a 5-node cluster, servers=[0,1,2,3,4]

It appears that run 5 never elects a leader -- the nodes just keep spewing messages like this (example from node 0):

{noformat}
2012-03-14 16:23:12,775 13308 [WorkerSender[myid=0]] DEBUG org.apache.zookeeper.server.quorum.QuorumCnxManager - There is a connection already for server 2
2012-03-14 16:23:12,776 13309 [QuorumPeer[myid=0]/127.0.0.1:2900] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - Sending Notification: 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), 3 (recipient), 0 (myid), 0x2 (n.peerEpoch)
2012-03-14 16:23:12,776 13309 [WorkerSender[myid=0]] DEBUG org.apache.zookeeper.server.quorum.QuorumCnxManager - There is a connection already for server 3
2012-03-14 16:23:12,776 13309 [QuorumPeer[myid=0]/127.0.0.1:2900] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - Sending Notification: 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), 4 (recipient), 0 (myid), 0x2 (n.peerEpoch)
2012-03-14 16:23:12,776 13309 [WorkerSender[myid=0]] DEBUG org.apache.zookeeper.server.quorum.QuorumCnxManager - There is a connection already for server 4
2012-03-14 16:23:12,776 13309 [WorkerReceiver[myid=0]] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - Receive new notification message. My id = 0
2012-03-14 16:23:12,776 13309 [WorkerReceiver[myid=0]] INFO org.apache.zookeeper.server.quorum.FastLeaderElection -
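For context on the startup code above: a strict-majority verifier like the `QuorumMaj(servers.size())` it constructs accepts a proposed leader only once more than half of all voters have acked it, which for a 5-node ensemble means at least 3 of the 5 servers must converge on the same vote. The snippet below is a standalone sketch of that predicate, not ZooKeeper's actual QuorumMaj implementation.

```java
import java.util.Set;

// Sketch of a strict-majority quorum check: a set of acknowledging
// server ids forms a quorum iff it contains more than half the ensemble.
class MajorityQuorum {
    private final int ensembleSize;

    MajorityQuorum(int ensembleSize) {
        this.ensembleSize = ensembleSize;
    }

    boolean containsQuorum(Set<Long> ackedServerIds) {
        // Integer division: for n=5, half = 2, so 3 acks are required.
        return ackedServerIds.size() > ensembleSize / 2;
    }
}
```

The symptom in this report is that the election messages keep circulating without any candidate ever accumulating such a majority, so the predicate never fires and election never settles.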
[jira] [Commented] (ZOOKEEPER-1320) Add the feature to zookeeper allow client limitations by ip.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232081#comment-13232081 ]

Camille Fournier commented on ZOOKEEPER-1320:
---------------------------------------------

It doesn't look like we agree that this feature is necessary, and it's not applying cleanly. I'm moving this out of patch available state until you get it into more review-ready shape.

Add the feature to zookeeper allow client limitations by ip.
------------------------------------------------------------

Key: ZOOKEEPER-1320
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1320
Project: ZooKeeper
Issue Type: New Feature
Components: server
Affects Versions: 3.3.3
Environment: Linux version 2.6.18-164.el5 (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)), jdk-1.6.0_17
Reporter: Leader Ni
Assignee: Leader Ni
Labels: client,server,limited,ipfilter
Attachments: UserGuide-1320-iplimited.docx, UserGuide-1320-iplimited.pdf, ZOOKEEPER-1320-iplimited.patch, zookeeper-3.3.3.jar_iplimited
Original Estimate: 168h
Remaining Estimate: 168h

Add a feature to ZooKeeper so that an administrator can set a list of IPs that limits which nodes can connect to the ZK servers and which connected clients can operate on data.
[jira] [Commented] (ZOOKEEPER-1377) add support for dumping a snapshot file content (similar to LogFormatter)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232086#comment-13232086 ]

Camille Fournier commented on ZOOKEEPER-1377:
---------------------------------------------

+1, looks nice. Should we consider adding this to 3.4? I realize it's a new feature but it is also an awfully useful utility.

add support for dumping a snapshot file content (similar to LogFormatter)
-------------------------------------------------------------------------

Key: ZOOKEEPER-1377
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1377
Project: ZooKeeper
Issue Type: Improvement
Components: server
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Labels: newbie
Fix For: 3.5.0
Attachments: ZOOKEEPER-1377.patch, ZOOKEEPER-1377.patch

We have LogFormatter but not SnapshotFormatter. I've added this, patch momentarily.
[jira] [Commented] (ZOOKEEPER-1397) Remove BookKeeper documentation links
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232091#comment-13232091 ]

Camille Fournier commented on ZOOKEEPER-1397:
---------------------------------------------

Somehow missed 2 files in the checkin, should be fixed now.

Remove BookKeeper documentation links
-------------------------------------

Key: ZOOKEEPER-1397
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1397
Project: ZooKeeper
Issue Type: Improvement
Reporter: Flavio Junqueira
Assignee: Flavio Junqueira
Fix For: 3.5.0
Attachments: ZOOKEEPER-1397.patch

BookKeeper is now a subproject and its documentation is maintained in the site of the subproject. Consequently, we should remove the links in the ZooKeeper documentation pages, or at least point to the documentation of the subproject site.
[jira] [Commented] (ZOOKEEPER-1419) Leader election never settles for a 5-node cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232097#comment-13232097 ]

Camille Fournier commented on ZOOKEEPER-1419:
---------------------------------------------

I don't see why this is marked for 3.3.5; the logic there does not seem to be faulty at a glance. Do we want to add a test with a 5-member quorum, or do we think the unit test on the predicate logic is enough?

Key: ZOOKEEPER-1419
Affects Versions: 3.4.3, 3.3.5, 3.5.0
Fix For: 3.3.6, 3.4.4, 3.5.0
[jira] [Commented] (ZOOKEEPER-1421) Support for hierarchical ACLs
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13230827#comment-13230827 ] Camille Fournier commented on ZOOKEEPER-1421: - This would be very useful, agreed. Support for hierarchical ACLs - Key: ZOOKEEPER-1421 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1421 Project: ZooKeeper Issue Type: Improvement Components: server Reporter: Thomas Weise Using ZK as a service, we need to restrict access to subtrees owned by different tenants. Currently there is no support for hierarchical ACLs, so it is necessary to configure the clients not only with their parent node, but also manage the ACL for each new node created in the subtree. With support for hierarchical ACLs, duplication could be avoided and the setup of the parent nodes with ACL and subsequent control of the same could be split into a separate administrative task. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
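To illustrate what the improvement request is asking for, here is a minimal, self-contained sketch of nearest-ancestor ACL resolution. This is purely hypothetical — ZooKeeper does not behave this way today (which is exactly the gap the issue describes); the map-of-strings representation stands in for real ACL objects:

```java
import java.util.HashMap;
import java.util.Map;

public class InheritedAclSketch {
    // Hypothetical: only some nodes carry an explicit ACL; every other node
    // would inherit from the nearest ancestor that has one.
    static final Map<String, String> explicitAcl = new HashMap<>();

    static String effectiveAcl(String path) {
        // Walk up from the node toward the root, stopping at the first
        // ancestor with an explicit ACL.
        for (String p = path; !p.isEmpty(); p = p.substring(0, p.lastIndexOf('/'))) {
            if (explicitAcl.containsKey(p)) {
                return explicitAcl.get(p);
            }
        }
        return explicitAcl.getOrDefault("/", "world:anyone");
    }

    public static void main(String[] args) {
        // A tenant admin sets one ACL at the subtree root...
        explicitAcl.put("/tenants/acme", "digest:acme:rwcd");
        // ...and nodes created below it would pick it up automatically.
        System.out.println(effectiveAcl("/tenants/acme/app1/config"));
        System.out.println(effectiveAcl("/other/node"));
    }
}
```

With this scheme the parent-node setup becomes the separate administrative task the reporter describes, and clients creating children need no ACL logic at all.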
[jira] [Commented] (ZOOKEEPER-1354) AuthTest.testBadAuthThenSendOtherCommands fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13220628#comment-13220628 ] Camille Fournier commented on ZOOKEEPER-1354: - This looks good, I'll check it in to trunk. AuthTest.testBadAuthThenSendOtherCommands fails intermittently -- Key: ZOOKEEPER-1354 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1354 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.4.0 Reporter: Patrick Hunt Assignee: Patrick Hunt Fix For: 3.4.4, 3.5.0 Attachments: ZOOKEEPER-1354.patch I'm seeing the following intermittent failure: {noformat} junit.framework.AssertionFailedError: Should have called my watcher expected:<1> but was:<0> at org.apache.zookeeper.test.AuthTest.testBadAuthThenSendOtherCommands(AuthTest.java:89) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat} The following commit introduced this test: bq. ZOOKEEPER-1152. Exceptions thrown from handleAuthentication can cause buffer corruption issues in NIOServer. (camille via breed) +Assert.assertEquals("Should have called my watcher", +1, authFailed.get()); I think it's due to either (a) the code is not waiting for the notification to be propagated, or (b) the message doesn't make it back from the server to the client prior to the socket or the clientcnxn being closed. What do you think, should I just wait for the notification to arrive? Or do you think it's (b)?
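For option (a), the usual cure for this kind of flaky test is to block on a latch flipped by the watcher callback before asserting, rather than asserting immediately. A minimal, self-contained sketch of the pattern — the background thread here merely stands in for the server's asynchronous AuthFailed notification; it is not the actual AuthTest code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class WatcherWaitSketch {
    public static void main(String[] args) throws InterruptedException {
        final AtomicInteger authFailed = new AtomicInteger(0);
        final CountDownLatch seen = new CountDownLatch(1);

        // Simulate the server delivering the AuthFailed event asynchronously,
        // the way the real test's watcher callback fires.
        new Thread(() -> {
            try { Thread.sleep(100); } catch (InterruptedException ignored) {}
            authFailed.incrementAndGet();
            seen.countDown();
        }).start();

        // Asserting here immediately would race the callback; waiting on the
        // latch (with a timeout, so a genuine failure still fails fast)
        // removes the intermittent "expected:<1> but was:<0>".
        if (!seen.await(5, TimeUnit.SECONDS)) {
            throw new AssertionError("watcher never fired");
        }
        if (authFailed.get() != 1) {
            throw new AssertionError("expected:<1> but was:<" + authFailed.get() + ">");
        }
        System.out.println("Should have called my watcher: OK");
    }
}
```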
[jira] [Commented] (ZOOKEEPER-1354) AuthTest.testBadAuthThenSendOtherCommands fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13220639#comment-13220639 ] Camille Fournier commented on ZOOKEEPER-1354: - Checked in to 3.4.4 and trunk
[jira] [Commented] (ZOOKEEPER-1309) Creating a new ZooKeeper client can leak file handles
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13216963#comment-13216963 ] Camille Fournier commented on ZOOKEEPER-1309: - Ran tests and they all passed, so I'm gonna check this in to 3.3.5. Creating a new ZooKeeper client can leak file handles - Key: ZOOKEEPER-1309 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1309 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.3.4 Reporter: Daniel Lord Assignee: Daniel Lord Priority: Critical Fix For: 3.3.5 Attachments: zk-1309-1.patch, zk-1309-1.patch, zk-1309-1.patch, zk-1309-3.patch If there is an IOException thrown by the constructor of ClientCnxn then file handles are leaked because of the initialization of the Selector which is never closed. final Selector selector = Selector.open(); If there is an abnormal exit from the constructor then the Selector is not closed and file handles are leaked. You can easily see this by setting the hosts string to garbage (qwerty, asdf, etc.) and then try to open a new ZooKeeper connection. I've observed the same behavior in production when there were DNS issues where the host names of the ensemble can no longer be resolved and the application servers quickly run out of handles attempting to (re)connect to zookeeper. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
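The fix pattern is to close the Selector on any abnormal exit from construction before rethrowing. A minimal, self-contained sketch — the `connect` method and its host-list check are illustrative stand-ins, not the actual ClientCnxn code:

```java
import java.io.IOException;
import java.nio.channels.Selector;

public class SelectorLeakSketch {
    // Illustrative stand-in for a constructor that opens a Selector and may fail.
    static Selector connect(String hostList) throws IOException {
        final Selector selector = Selector.open();
        try {
            if (!hostList.contains(":")) { // simulate an unresolvable host string
                throw new IOException("unresolvable host list: " + hostList);
            }
            return selector;
        } catch (IOException e) {
            // Without this close(), every failed construction leaks the
            // file handle backing the Selector.
            selector.close();
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        try {
            connect("qwerty"); // garbage host string, as in the bug report
        } catch (IOException expected) {
            System.out.println("constructor failed, selector closed");
        }
        Selector ok = connect("127.0.0.1:2181");
        System.out.println("open=" + ok.isOpen());
        ok.close();
    }
}
```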
[jira] [Commented] (ZOOKEEPER-1344) ZooKeeper client multi-update command is not considering the Chroot request
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13216975#comment-13216975 ] Camille Fournier commented on ZOOKEEPER-1344: - I hate to say it but this patch no longer applies. Can you please regenerate it so that it applies to latest trunk and the 3.4 branch if necessary, so we can check it in? Thanks. ZooKeeper client multi-update command is not considering the Chroot request --- Key: ZOOKEEPER-1344 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1344 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.4.0 Reporter: Rakesh R Assignee: Rakesh R Priority: Critical Fix For: 3.5.0 Attachments: ZOOKEEPER-1344-onlytestcase.patch, ZOOKEEPER-1344.1.patch, ZOOKEEPER-1344.patch For example: I have created a ZooKeeper client with the subtree 10.18.52.144:2179/apps/X. Now I generated an Op command to create the znode /myid. When the client creates the path /myid, the ZooKeeper server actually creates the path as /myid instead of as /apps/X/myid. Expected output: the znode has to be created as /apps/X/myid
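The bug boils down to the multi path missing the chroot fix-up that single operations get. A minimal sketch of that fix-up logic, reduced to plain strings (the method name `prependChroot` mirrors what the client does for single ops, but this is an illustration, not the actual patch):

```java
public class ChrootSketch {
    // Every op path in a multi() must get the session's chroot prefix
    // before being sent to the server, just like single-op paths do.
    static String prependChroot(String chroot, String clientPath) {
        if (chroot == null || chroot.isEmpty()) {
            return clientPath;
        }
        // "/" maps to the chroot itself; anything else is appended to it.
        return "/".equals(clientPath) ? chroot : chroot + clientPath;
    }

    public static void main(String[] args) {
        // Client connected with chroot /apps/X creating /myid:
        System.out.println(prependChroot("/apps/X", "/myid"));
        // Client with no chroot is unaffected:
        System.out.println(prependChroot(null, "/myid"));
    }
}
```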
[jira] [Commented] (ZOOKEEPER-1361) Leader.lead iterates over 'learners' set without proper synchronisation
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13216978#comment-13216978 ] Camille Fournier commented on ZOOKEEPER-1361: - If we're going to do all these whitespace changes, it's going to make changes to both 3.4 and trunk difficult. I am really not fond of changing all the whitespace in a file for a simple checkin. Can we get this patch generated without the whitespace changes? Leader.lead iterates over 'learners' set without proper synchronisation --- Key: ZOOKEEPER-1361 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1361 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.4.2 Reporter: Henry Robinson Assignee: Henry Robinson Fix For: 3.5.0 Attachments: ZOOKEEPER-1361.patch This block:
{code}
HashSet<Long> followerSet = new HashSet<Long>();
for (LearnerHandler f : learners)
    followerSet.add(f.getSid());
{code}
is executed without holding the lock on learners, so if there were ever a condition where a new learner was added during the initial sync phase, I'm pretty sure we'd see a concurrent modification exception. Certainly other parts of the code are very careful to lock on learners when iterating. It would be nice to use a {{ConcurrentHashMap}} to hold the learners instead, but I can't convince myself that this wouldn't introduce some correctness bugs. For example the following:
Learners contains A, B, C, D
Thread 1 iterates over learners, and gets as far as B.
Thread 2 removes A, and adds E.
Thread 1 continues iterating and sees a learner view of A, B, C, D, E
This may be a bug if Thread 1 is counting the number of synced followers for a quorum count, since at no point was A, B, C, D, E a correct view of the quorum.
In practice, I think this is actually ok, because I don't think ZK makes any strong ordering guarantees on learners joining or leaving (so we don't need a strong serialisability guarantee on learners) but I don't think I'll make that change for this patch. Instead I want to clean up the locking protocols on the follower / learner sets - to avoid another easy deadlock like the one we saw in ZOOKEEPER-1294 - and to do less with the lock held; i.e. to copy and then iterate over the copy rather than iterate over a locked set. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
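The "do less with the lock held" approach described above — copy under the lock, then iterate over the copy — can be sketched as follows. The class and method names are illustrative stand-ins for `Leader.learners` and `LearnerHandler.getSid()`, not the actual patch:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CopyThenIterateSketch {
    // Stand-in for Leader's 'learners' collection.
    static final Set<Long> learners = new HashSet<>();

    static Set<Long> syncedFollowerIds() {
        // Copy while holding the lock, then iterate the copy: other threads
        // can keep adding/removing learners without triggering a
        // ConcurrentModificationException, and the lock is held only for
        // the (cheap) copy rather than the whole iteration.
        List<Long> snapshot;
        synchronized (learners) {
            snapshot = new ArrayList<>(learners);
        }
        Set<Long> followerSet = new HashSet<>();
        for (Long sid : snapshot) {
            followerSet.add(sid);
        }
        return followerSet;
    }

    public static void main(String[] args) {
        synchronized (learners) {
            learners.add(1L);
            learners.add(2L);
        }
        System.out.println("followers=" + syncedFollowerIds().size());
    }
}
```

The snapshot is a point-in-time view taken atomically, which also avoids the "A, B, C, D, E" mixed view that a lock-free ConcurrentHashMap iteration could produce.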
[jira] [Commented] (ZOOKEEPER-1361) Leader.lead iterates over 'learners' set without proper synchronisation
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13216990#comment-13216990 ] Camille Fournier commented on ZOOKEEPER-1361: - I can't get it to apply to either 3.4 or trunk...
[jira] [Commented] (ZOOKEEPER-1361) Leader.lead iterates over 'learners' set without proper synchronisation
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13216998#comment-13216998 ] Camille Fournier commented on ZOOKEEPER-1361: - No sorry, that was my mistake. OK, this is looking good; I will check it in.
[jira] [Commented] (ZOOKEEPER-1382) Zookeeper server holds onto dead/expired session ids in the watch data structures
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208792#comment-13208792 ] Camille Fournier commented on ZOOKEEPER-1382: - This is a lot of change for a fix that seems to be really small. Can you put this into reviewboard for more careful review? I'm not sure we will want all the logging changes so you might want to go through and trim that stuff up before putting it up there. Thanks! Zookeeper server holds onto dead/expired session ids in the watch data structures - Key: ZOOKEEPER-1382 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1382 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.3.4 Reporter: Neha Narkhede Assignee: Neha Narkhede Attachments: ZOOKEEPER-1382_3.3.4.patch I've observed that zookeeper server holds onto expired session ids in the watcher data structures. The result is the wchp command reports session ids that cannot be found through cons/dump and those expired session ids sit there maybe until the server is restarted. 
Here are snippets from the client and the server logs that lead to this state, for one particular session id 0x134485fd7bcb26f. There are 4 servers in the zookeeper cluster - 223, 224, 225 (leader), 226 - and I'm using ZkClient to connect to the cluster.
From the application log -
application.log.2012-01-26-325.gz:2012/01/26 04:56:36.177 INFO [ClientCnxn] [main-SendThread(223.prod:12913)] [application] Session establishment complete on server 223.prod/172.17.135.38:12913, sessionid = 0x134485fd7bcb26f, negotiated timeout = 6000
application.log.2012-01-27.gz:2012/01/27 09:52:37.714 INFO [ClientCnxn] [main-SendThread(223.prod:12913)] [application] Client session timed out, have not heard from server in 9827ms for sessionid 0x134485fd7bcb26f, closing socket connection and attempting reconnect
application.log.2012-01-27.gz:2012/01/27 09:52:38.191 INFO [ClientCnxn] [main-SendThread(226.prod:12913)] [application] Unable to reconnect to ZooKeeper service, session 0x134485fd7bcb26f has expired, closing socket connection
On the leader zk, 225 -
zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO [SessionTracker:ZooKeeperServer@314] - Expiring session 0x134485fd7bcb26f, timeout of 6000ms exceeded
zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO [ProcessThread:-1:PrepRequestProcessor@391] - Processed session termination for sessionid: 0x134485fd7bcb26f
On the server the client was initially connected to, 223 -
zookeeper.log.2012-01-26-223.gz:2012-01-26 04:56:36,173 - INFO [CommitProcessor:1:NIOServerCnxn@1580] - Established session 0x134485fd7bcb26f with negotiated timeout 6000 for client /172.17.136.82:45020
zookeeper.log.2012-01-27-223.gz:2012-01-27 09:52:34,018 - INFO [CommitProcessor:1:NIOServerCnxn@1435] - Closed socket connection for client /172.17.136.82:45020 which had sessionid 0x134485fd7bcb26f
Here are the log snippets from 226, which is the server the client reconnected to before getting the session expired event -
2012-01-27 09:52:38,190 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client attempting to renew session 0x134485fd7bcb26f at /172.17.136.82:49367
2012-01-27 09:52:38,191 - INFO [QuorumPeer:/0.0.0.0:12913:NIOServerCnxn@1573] - Invalid session 0x134485fd7bcb26f for client /172.17.136.82:49367, probably expired
2012-01-27 09:52:38,191 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed socket connection for client /172.17.136.82:49367 which had sessionid 0x134485fd7bcb26f
wchp output from 226, taken on 01/30 -
nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f *226.*wchp* | wc -l
3
wchp output from 223, taken on 01/30 -
nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f *223.*wchp* | wc -l
0
cons output from 223 and 226, taken on 01/30 -
nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f *226.*cons* | wc -l
0
nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f *223.*cons* | wc -l
0
So, what seems to have happened is that the client was able to re-register the watches on the new server (226), after it got disconnected from 223, in spite of having an expired session id. In NIOServerCnxn, I saw that after suspecting that a session is expired, a server removes the cnxn and its watches from its internal data structures. But before that it allows more requests to be processed even if the session is expired -
// Now that the session is ready we can start receiving packets
synchronized
[jira] [Commented] (ZOOKEEPER-1390) some expensive debug code not protected by a check for debug
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13207891#comment-13207891 ] Camille Fournier commented on ZOOKEEPER-1390: - Ok, I'm pretty flexible on it. Added it also as a 3.4.X issue since it's present there as well. some expensive debug code not protected by a check for debug Key: ZOOKEEPER-1390 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1390 Project: ZooKeeper Issue Type: Improvement Components: server Reporter: Benjamin Reed Fix For: 3.5.0, 3.4.4 Attachments: ZOOKEEPER-1390.patch there is some expensive debug code in DataTree.processTxn() that formats transactions for debugging that are very expensive but are only used when errors happen and when debugging is turned on. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1390) some expensive debug code not protected by a check for debug
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205508#comment-13205508 ] Camille Fournier commented on ZOOKEEPER-1390: - Do you think we might want to leave in those more descriptive debug strings but guarded by an if (LOG.isDebugEnabled())? I don't care either way but it might be useful. Otherwise this looks good to me, good catch.
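The guard suggested in the comment works because, without it, the expensive argument expression is evaluated before the logger can decide to drop the message. A minimal, self-contained sketch of the pattern — using java.util.logging's FINE level and a counter to stand in for slf4j's isDebugEnabled() and DataTree's transaction formatting:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class DebugGuardSketch {
    static final Logger LOG = Logger.getLogger(DebugGuardSketch.class.getName());
    static int formatCalls = 0;

    // Stand-in for the expensive transaction formatting in DataTree.processTxn().
    static String expensiveFormat() {
        formatCalls++;
        return "txn{...}";
    }

    public static void main(String[] args) {
        // Unguarded: the argument is built even though debug logging is off
        // (the default java.util.logging level is INFO, so FINE is dropped).
        LOG.fine(expensiveFormat());

        // Guarded: the expensive call is skipped entirely when debug is off.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(expensiveFormat());
        }
        System.out.println("formatCalls=" + formatCalls);
    }
}
```

Only the unguarded call pays the formatting cost, so the counter ends at 1; in a hot path like processTxn() that saved work adds up.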
[jira] [Commented] (ZOOKEEPER-1321) Add number of client connections metric in JMX and srvr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205864#comment-13205864 ] Camille Fournier commented on ZOOKEEPER-1321: - Great. I'm going to check this in now. Add number of client connections metric in JMX and srvr --- Key: ZOOKEEPER-1321 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1321 Project: ZooKeeper Issue Type: Improvement Affects Versions: 3.3.4, 3.4.2 Reporter: Neha Narkhede Assignee: Neha Narkhede Labels: patch Attachments: ZK-1321-nowhitespace.patch, ZOOKEEPER-1321_3.4.patch, ZOOKEEPER-1321_trunk.patch, ZOOKEEPER-1321_trunk.patch, zk-1321-cleanup, zk-1321-trunk.patch, zk-1321.patch, zookeeper-1321-trunk-v2.patch The related conversation on the zookeeper user mailing list is here - http://apache.markmail.org/message/4jjcmooniowwugu2?q=+list:org.apache.hadoop.zookeeper-user It is useful to be able to monitor the number of disconnect operations on a client. This is generally indicative of a client going through a large number of GCs and hence disconnecting way too often from a zookeeper cluster. Today, this information is only indirectly exposed as part of the stat command, which requires counting the results. That's a lot of work for the server to do just to get the connection count. For monitoring purposes, it will be useful to have this exposed through JMX and the 4lw srvr.
[jira] [Commented] (ZOOKEEPER-1383) Create update throughput quotas and add hard quota limits
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205964#comment-13205964 ] Camille Fournier commented on ZOOKEEPER-1383: - So, in short, I'm -1 on this until it stops breaking backwards compatibility. Might consider adding the update throughput quotas separately from hard quota limits. Create update throughput quotas and add hard quota limits - Key: ZOOKEEPER-1383 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1383 Project: ZooKeeper Issue Type: New Feature Components: server Reporter: Jay Shrauner Assignee: Jay Shrauner Priority: Minor Fix For: 3.5.0 Attachments: ZOOKEEPER-1383.patch, ZOOKEEPER-1383.patch Quotas exist for size (node count and size in bytes); it would be useful to track and support quotas on update throughput (bytes per second) as well. This can be tracked on both a node/subtree level for quota support as well as on the server level for monitoring. In addition, the existing quotas log a warning when they are exceeded but allow the transaction to proceed (soft quotas). It would also be useful to support a corresponding set of hard quota limits that fail the transaction. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1383) Create update throughput quotas and add hard quota limits
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202969#comment-13202969 ] Camille Fournier commented on ZOOKEEPER-1383: - This change is definitely going to break backwards compatibility of clients in a major way. I'm not sure that it can go into a 3.X release unless we can make it not break backwards compatibility.
[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13196168#comment-13196168 ] Camille Fournier commented on ZOOKEEPER-1367: - Are we not seeing it in 3.3? It seems to me glancing at the code that we should also be vulnerable to this there. Data inconsistencies and unexpired ephemeral nodes after cluster restart Key: ZOOKEEPER-1367 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.2 Environment: Debian Squeeze, 64-bit Reporter: Jeremy Stribling Assignee: Benjamin Reed Priority: Blocker Fix For: 3.4.3 Attachments: 1367-3.3.patch, ZOOKEEPER-1367-3.4.patch, ZOOKEEPER-1367.patch, ZOOKEEPER-1367.patch, ZOOKEEPER-1367.tgz In one of our tests, we have a cluster of three ZooKeeper servers. We kill all three, and then restart just two of them. Sometimes we notice that on one of the restarted servers, ephemeral nodes from previous sessions do not get deleted, while on the other server they do. We are effectively running 3.4.2, though technically we are running 3.4.1 with the patch manually applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for ZOOKEEPER-1163. 
I noticed that when I connected using zkCli.sh to the first node (90.0.0.221, zkid 84), I saw only one znode in a particular path:
{quote}
[zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
[nominee11]
[zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee11
90.0.0.222:
cZxid = 0x40027
ctime = Thu Jan 19 08:18:24 UTC 2012
mZxid = 0x40027
mtime = Thu Jan 19 08:18:24 UTC 2012
pZxid = 0x40027
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0xa234f4f3bc220001
dataLength = 16
numChildren = 0
{quote}
However, when I connect zkCli.sh to the second server (90.0.0.222, zkid 251), I saw three znodes under that same path:
{quote}
[zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
nominee06 nominee10 nominee11
[zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee11
90.0.0.222:
cZxid = 0x40027
ctime = Thu Jan 19 08:18:24 UTC 2012
mZxid = 0x40027
mtime = Thu Jan 19 08:18:24 UTC 2012
pZxid = 0x40027
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0xa234f4f3bc220001
dataLength = 16
numChildren = 0
[zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee10
90.0.0.221:
cZxid = 0x3014c
ctime = Thu Jan 19 07:53:42 UTC 2012
mZxid = 0x3014c
mtime = Thu Jan 19 07:53:42 UTC 2012
pZxid = 0x3014c
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0xa234f4f3bc22
dataLength = 16
numChildren = 0
[zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee06
90.0.0.223:
cZxid = 0x20cab
ctime = Thu Jan 19 08:00:30 UTC 2012
mZxid = 0x20cab
mtime = Thu Jan 19 08:00:30 UTC 2012
pZxid = 0x20cab
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x5434f5074e040002
dataLength = 16
numChildren = 0
{quote}
These never went away for the lifetime of the server, for any clients connected directly to that server. Note that this cluster is configured to have all three servers still, the third one being down (90.0.0.223, zkid 162). I captured the data/snapshot directories for the two live servers.
When I start single-node servers using each directory, I can briefly see that the inconsistent data is present in those logs, though the ephemeral nodes seem to get (correctly) cleaned up pretty soon after I start the server. I will upload a tar containing the debug logs and data directories from the failure. I think we can reproduce it regularly if you need more info. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195218#comment-13195218 ] Camille Fournier commented on ZOOKEEPER-1367: - {quote} On run8.log from 90.0.0.2, we can see that it adds the session (1e3516a4bb77) to the sessions list (see FileTxnSnapLog), and it got it from its own transaction log. But, the leader (90.0.0.1) supposedly knows of that session as well, otherwise it was not committed or leader election didn't select the right server. Checking the leader election notification messages, I can't see any problem. The part about the leader being aware of that session so that it can recreate it is the one we can't verify because we don't have run8.log for 90.0.0.1. {quote} Server 1 (90.0.0.1) is not the leader at the time that session is created, server 2 is the leader. Server 1 is not even in the quorum at that point, it's just after 2 has gained leadership with 3 as follower. Data inconsistencies and unexpired ephemeral nodes after cluster restart Key: ZOOKEEPER-1367 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.2 Environment: Debian Squeeze, 64-bit Reporter: Jeremy Stribling Priority: Blocker Fix For: 3.4.3 Attachments: ZOOKEEPER-1367.tgz In one of our tests, we have a cluster of three ZooKeeper servers. We kill all three, and then restart just two of them. Sometimes we notice that on one of the restarted servers, ephemeral nodes from previous sessions do not get deleted, while on the other server they do. We are effectively running 3.4.2, though technically we are running 3.4.1 with the patch manually applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for ZOOKEEPER-1163.
[jira] [Commented] (ZOOKEEPER-1366) Zookeeper should be tolerant of clock adjustments
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191396#comment-13191396 ] Camille Fournier commented on ZOOKEEPER-1366: - @Henry: I am fine with doing it as a separate ticket. I do think it's pretty trivial to rework this and get ourselves far down the road with a non-static impl, and I'm not sure that we need to address Thread.sleep() to get a lot of mileage out of the solution. But I don't think I'll have time to rework this patch to do that, so we might as well do it in a separate ticket if Ted doesn't want to worry about that. Zookeeper should be tolerant of clock adjustments - Key: ZOOKEEPER-1366 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1366 Project: ZooKeeper Issue Type: Bug Reporter: Ted Dunning Assignee: Ted Dunning Fix For: 3.4.3 Attachments: ZOOKEEPER-1366-3.3.3.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch If you want to wreak havoc on a ZK based system, just do [date -s +1hour] and watch the mayhem as all sessions expire at once. This shouldn't happen. Zookeeper could easily handle elapsed times as elapsed times rather than as differences between absolute times. The absolute times are subject to adjustment when the clock is set, while a timer is not subject to this problem. In Java, System.currentTimeMillis() gives you absolute time while System.nanoTime() gives you time based on a timer from an arbitrary epoch. I have done this and have been running tests now for some tens of minutes with no failures. I will set up a test machine to redo the build again on Ubuntu and post a patch here for discussion.
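The approach described in the ticket can be sketched in a few lines of Java. This is only an illustration of the technique, not the actual ZOOKEEPER-1366 patch, and the class name ElapsedTimer is hypothetical:

```java
// Sketch of the idea behind the ticket: measure intervals with the monotonic
// System.nanoTime() rather than the wall-clock System.currentTimeMillis(),
// so a [date -s +1hour] style clock jump cannot make every session appear
// to expire at once. ElapsedTimer is an illustrative name, not ZooKeeper code.
public class ElapsedTimer {
    private final long startNanos = System.nanoTime();

    // Elapsed milliseconds since construction, immune to wall-clock changes.
    public long elapsedMillis() {
        return (System.nanoTime() - startNanos) / 1_000_000L;
    }

    public static void main(String[] args) throws InterruptedException {
        ElapsedTimer timer = new ElapsedTimer();
        Thread.sleep(50);
        long ms = timer.elapsedMillis();
        // Roughly 50ms of real time elapsed, regardless of clock adjustments.
        System.out.println(ms >= 45 ? "OK" : "FAIL: " + ms);
    }
}
```

Timeout bookkeeping built on such a timer compares elapsed durations, never two absolute timestamps, which is why setting the system clock forward has no effect on it.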
[jira] [Commented] (ZOOKEEPER-1366) Zookeeper should be tolerant of clock adjustments
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190832#comment-13190832 ] Camille Fournier commented on ZOOKEEPER-1366: - The test in TimerTest is missing the @Test annotation, which I presume is an oversight.
[jira] [Commented] (ZOOKEEPER-1366) Zookeeper should be tolerant of clock adjustments
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190837#comment-13190837 ] Camille Fournier commented on ZOOKEEPER-1366: - So in general, I think this is a good patch and a very good thing for us to do. But I feel like Henry's comment is most interesting: {quote} The nice thing is that this is a small step towards a properly mockable time API in ZK, which would a) make tests much faster and b) make tests much more deterministic. There's a way to go still because of Thread.sleep and other complications, but this is a good first step. {quote} We really aren't doing all that much towards that end by replacing one static method call with another. You still can't mock that in mockito. So the only question I have here is, if we're going to touch all those places anyway, should we just be creating an actual thin object that wraps time and use non-static methods on that object to make these calls, in order to allow more mocking of timing issues in the future? Or should we save that for another patch?
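The "thin object that wraps time" idea from the comment above might look like the following sketch. All names here (Clock, SystemClock, FakeClock) are hypothetical illustrations, not ZooKeeper classes:

```java
// Sketch of a non-static, mockable time wrapper: production code depends on
// the Clock interface, and tests substitute a FakeClock they can advance by
// hand, making timing-dependent tests fast and deterministic.
// Clock, SystemClock, and FakeClock are illustrative names, not ZooKeeper code.
interface Clock {
    long nanoTime();
}

class SystemClock implements Clock {
    public long nanoTime() { return System.nanoTime(); }
}

class FakeClock implements Clock {
    private long now = 0;
    public long nanoTime() { return now; }
    // Tests advance time explicitly instead of calling Thread.sleep().
    public void advanceMillis(long ms) { now += ms * 1_000_000L; }
}

public class ClockSketch {
    public static void main(String[] args) {
        FakeClock clock = new FakeClock();
        long start = clock.nanoTime();
        clock.advanceMillis(5000); // "wait" five seconds with no real sleeping
        long elapsedMs = (clock.nanoTime() - start) / 1_000_000L;
        System.out.println(elapsedMs);
    }
}
```

Because the clock is an instance passed in rather than a static call, a mocking framework (or a hand-rolled fake like the one above) can stand in for it, which is exactly what a static Time.nanoTime() helper would not allow.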
[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190518#comment-13190518 ] Camille Fournier commented on ZOOKEEPER-1367: - Jeremy pretty much always brings us good bugs, Ted, I don't think he's wasting our time. Jeremy, these logs are from the point at which the cluster is running with two members and 221 doesn't have the nodes, but 222 does, correct? I'm noticing that in the log files I don't see a close session transaction for the session that created /election/zkrsm/nominee10. Just verifying, the cluster is accepting write requests and client connections successfully at the point you captured these logs, right?
[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190528#comment-13190528 ] Camille Fournier commented on ZOOKEEPER-1367: - So I pulled up a cluster on my local machine using these logs, and the two machines in my cluster correctly expired all the ephemeral nodes you show in the errors. I'm going to assume that when you bring up a 2-node cluster with your setup and these data directories, you see the bad ephemeral nodes, correct? If so, can you try doing it with the latest 3.4.2 jar and see if it still happens?
[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190566#comment-13190566 ] Camille Fournier commented on ZOOKEEPER-1367: - Hmmm, I must be confused. I thought that the test you were running left the cluster in this setup, with the two nodes running and a third down, with these data directories. But if I start the cluster with two nodes and these data directories, the sessions immediately expire and delete those nodes. On the other hand, in the logs I don't see any evidence of session expiration for the sessions holding the ephemerals on either machine. When you get into this situation, if you bounce the cluster again with the two nodes, does it fix the problem? I don't know if there's anything in 3.4.2 without checking, but it seems like a worthwhile sanity check to do.
[jira] [Commented] (ZOOKEEPER-1367) Data inconsistencies and unexpired ephemeral nodes after cluster restart
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190066#comment-13190066 ] Camille Fournier commented on ZOOKEEPER-1367: - I'll take a look this weekend unless someone's on it now.
[jira] [Commented] (ZOOKEEPER-1321) Add number of client connections metric in JMX and srvr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186975#comment-13186975 ] Camille Fournier commented on ZOOKEEPER-1321: - If one of the other committers wants to take a quick glance at the cleanup patch that would be great; I can then check it in with your ok. Add number of client connections metric in JMX and srvr --- Key: ZOOKEEPER-1321 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1321 Project: ZooKeeper Issue Type: Improvement Affects Versions: 3.3.4, 3.4.2 Reporter: Neha Narkhede Assignee: Neha Narkhede Labels: patch Attachments: ZOOKEEPER-1321_3.4.patch, ZOOKEEPER-1321_trunk.patch, ZOOKEEPER-1321_trunk.patch, zk-1321-cleanup, zookeeper-1321-trunk-v2.patch The related conversation on the zookeeper user mailing list is here - http://apache.markmail.org/message/4jjcmooniowwugu2?q=+list:org.apache.hadoop.zookeeper-user It is useful to be able to monitor the number of disconnect operations on a client. This is generally indicative of a client going through a large number of GCs and hence disconnecting way too often from a zookeeper cluster. Today, this information is only indirectly exposed as part of the stat command, which requires counting the results. That's a lot of work for the server to do just to get a connection count. For monitoring purposes, it will be useful to have this exposed through JMX and the 4lw srvr.
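Exposing a counter like this over JMX follows the standard MBean pattern from javax.management. The sketch below is illustrative only, using hypothetical names (ConnectionStats, the org.example ObjectName) rather than the actual ZOOKEEPER-1321 patch:

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Standard-MBean convention: the interface name is the implementation class
// name plus "MBean", and each getter becomes a readable JMX attribute.
interface ConnectionStatsMBean {
    int getNumAliveConnections();
}

class ConnectionStats implements ConnectionStatsMBean {
    private final AtomicInteger connections = new AtomicInteger();
    public void connectionOpened()  { connections.incrementAndGet(); }
    public void connectionClosed()  { connections.decrementAndGet(); }
    public int getNumAliveConnections() { return connections.get(); }
}

public class JmxSketch {
    public static void main(String[] args) throws Exception {
        ConnectionStats stats = new ConnectionStats();
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Hypothetical ObjectName for illustration, not ZooKeeper's.
        ObjectName name = new ObjectName("org.example:type=ConnectionStats");
        server.registerMBean(stats, name);

        stats.connectionOpened();
        stats.connectionOpened();
        stats.connectionClosed();

        // Read the attribute back through JMX, as a monitoring tool would.
        System.out.println(server.getAttribute(name, "NumAliveConnections"));
    }
}
```

A monitoring agent reads the attribute by ObjectName and attribute name, so the server never has to enumerate connections the way the stat command does.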
[jira] [Commented] (ZOOKEEPER-1358) In StaticHostProviderTest.java, testNextDoesNotSleepForZero tests that hostProvider.next(0) doesn't sleep by checking that the latency of this call is less than 10s
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186650#comment-13186650 ] Camille Fournier commented on ZOOKEEPER-1358: - This looks good to me. I will check it in. In StaticHostProviderTest.java, testNextDoesNotSleepForZero tests that hostProvider.next(0) doesn't sleep by checking that the latency of this call is less than 10sec -- Key: ZOOKEEPER-1358 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1358 Project: ZooKeeper Issue Type: Bug Reporter: Alexander Shraer Assignee: Alexander Shraer Priority: Trivial Fix For: 3.2.3 Attachments: ZOOKEEPER-1358.patch, ZOOKEEPER-1358.patch should check for something smaller, perhaps 1ms or 5ms
[jira] [Commented] (ZOOKEEPER-1351) invalid test verification in MultiTransactionTest
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186661#comment-13186661 ] Camille Fournier commented on ZOOKEEPER-1351: - Looks good to me. Will check this in. invalid test verification in MultiTransactionTest - Key: ZOOKEEPER-1351 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1351 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.4.0 Reporter: Patrick Hunt Assignee: Patrick Hunt Fix For: 3.4.3, 3.5.0 Attachments: ZOOKEEPER-1351.patch, ZOOKEEPER-1351_br34.patch Tests such as org.apache.zookeeper.test.MultiTransactionTest.testWatchesTriggered() are incorrect. Two issues I see: 1) zk.sync is async; there is no guarantee that the watcher will be called subsequent to sync returning:
{noformat}
zk.sync("/", null, null);
assertTrue(watcher.triggered); // incorrect assumption
{noformat}
The callback needs to be implemented; only once the callback is called can we verify the trigger. 2) triggered is not declared as volatile, even though it will be set in the context of a different thread (the event thread). See https://builds.apache.org/view/S-Z/view/ZooKeeper/job/ZooKeeper-trunk-solaris/91/testReport/junit/org.apache.zookeeper.test/MultiTransactionTest/testWatchesTriggered/ for an example of a false positive failure:
{noformat}
junit.framework.AssertionFailedError
at org.apache.zookeeper.test.MultiTransactionTest.testWatchesTriggered(MultiTransactionTest.java:236)
at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
{noformat}
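The two fixes the ticket calls for can be sketched without ZooKeeper itself: wait for the async completion callback before asserting, and declare the cross-thread flag volatile. The class and variable names below are illustrative, not the actual MultiTransactionTest code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of the corrected pattern: (1) block on the completion callback
// instead of asserting immediately after an async call returns, and
// (2) mark a flag written by another thread as volatile so the reading
// thread is guaranteed to see the write.
public class AsyncFlagSketch {
    // (2) volatile: the event thread writes it, the main thread reads it.
    static volatile boolean triggered = false;

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);

        // Stand-in for ZooKeeper's event thread delivering a watch event
        // some time after an async sync() call returns.
        Thread eventThread = new Thread(() -> {
            triggered = true;   // the watcher fires...
            done.countDown();   // ...then the completion callback runs
        });
        eventThread.start();

        // (1) wait for the callback before checking the flag.
        if (!done.await(5, TimeUnit.SECONDS)) {
            throw new AssertionError("callback never fired");
        }
        System.out.println(triggered ? "triggered" : "not triggered");
    }
}
```

Without the latch, the main thread can reach the check before the event thread runs; without volatile, it may never observe the write at all, which is how the false positive in the linked build arises.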
[jira] [Commented] (ZOOKEEPER-1183) Enhance LogFormatter to output additional detail from transaction log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13182344#comment-13182344 ] Camille Fournier commented on ZOOKEEPER-1183: - Honestly, I think you're getting a bit ambitious for this ticket. I think you should simply enhance the logformatter to a degree that makes sense, and any additional tooling either make a new ticket or perhaps a github project for the work. Enhance LogFormatter to output additional detail from transaction log - Key: ZOOKEEPER-1183 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1183 Project: ZooKeeper Issue Type: Improvement Affects Versions: 3.4.0 Reporter: kishore gopalakrishna Assignee: kishore gopalakrishna Priority: Minor Attachments: ZOOKEEPER-1183.patch Current LogFormatter prints the following information ZooKeeper Transactional Log File with dbid 0 txnlog format version 2 8/15/11 1:55:36 PM PDT session 0x131cf1a236f0014 cxid 0x0 zxid 0xf01 createSession 8/15/11 1:55:57 PM PDT session 0x131cf1a236f cxid 0x55f zxid 0xf02 setData 8/15/11 1:56:00 PM PDT session 0x131cf1a236f0015 cxid 0x0 zxid 0xf03 createSession ... .. 8/15/11 2:00:33 PM PDT session 0x131cf1a236f001c cxid 0x36 zxid 0xf6b setData 8/15/11 2:00:33 PM PDT session 0x131cf1a236f0021 cxid 0xa1 zxid 0xf6c create 8/15/11 2:00:33 PM PDT session 0x131cf1a236f001b cxid 0x3e zxid 0xf6d setData 8/15/11 2:00:33 PM PDT session 0x131cf1a236f001e cxid 0x3e zxid 0xf6e setData 8/15/11 2:00:33 PM PDT session 0x131cf1a236f001d cxid 0x41 zxid 0xf6f setData Though this is good information, it does not provide additional information like createSession: which ip created the session and its time out set|get|delete: the path and data create: path created and createmode along with data We can add additional parameter -detail and provide detailed output of the transaction. Outputting data is slightly tricky since we cant print data without understanding the format. We need not print this for now. 
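The proposed -detail switch could work along these lines; TxnSummary and the field layout are illustrative assumptions, not the actual LogFormatter internals.

```java
// Hedged sketch of the "-detail" flag discussed above: the formatter keeps
// its current one-line summary and, when asked, appends per-op fields such
// as the path. The real tool is org.apache.zookeeper.server.LogFormatter.
public class LogFormatterSketch {

    // Minimal stand-in for a deserialized transaction record.
    public static final class TxnSummary {
        final String session, op, path;
        public TxnSummary(String session, String op, String path) {
            this.session = session;
            this.op = op;
            this.path = path;
        }
    }

    // Default output stays unchanged; -detail adds the path when known.
    public static String format(TxnSummary txn, boolean detail) {
        String line = "session " + txn.session + " " + txn.op;
        if (detail && txn.path != null) {
            line += " path:" + txn.path;
        }
        return line;
    }

    public static void main(String[] args) {
        TxnSummary txn = new TxnSummary("0x131cf1a236f0014", "create", "/foo");
        System.out.println(format(txn, true));
    }
}
```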
[jira] [Commented] (ZOOKEEPER-1354) AuthTest.testBadAuthThenSendOtherCommands fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13182112#comment-13182112 ] Camille Fournier commented on ZOOKEEPER-1354: - Hmmm. Let me take a look. AuthTest.testBadAuthThenSendOtherCommands fails intermittently -- Key: ZOOKEEPER-1354 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1354 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.4.0 Reporter: Patrick Hunt Fix For: 3.4.3, 3.5.0 I'm seeing the following intermittent failure: {noformat} junit.framework.AssertionFailedError: Should have called my watcher expected:1 but was:0 at org.apache.zookeeper.test.AuthTest.testBadAuthThenSendOtherCommands(AuthTest.java:89) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat} The following commit introduced this test: bq. ZOOKEEPER-1152. Exceptions thrown from handleAuthentication can cause buffer corruption issues in NIOServer. (camille via breed) +Assert.assertEquals(Should have called my watcher, +1, authFailed.get()); I think it's due to either a) the code is not waiting for the notification to be propagated, or 2) the message doesn't make it back from the server to the client prior to the socket or the clientcnxn being closed. What do you think, should I just wait for the notification to arrive? or do you think it's 2). ? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1354) AuthTest.testBadAuthThenSendOtherCommands fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13182118#comment-13182118 ] Camille Fournier commented on ZOOKEEPER-1354: - You're getting the AuthFailed exception, the watcher code just didn't execute fast enough, so I think it's 1. AuthTest.testBadAuthThenSendOtherCommands fails intermittently -- Key: ZOOKEEPER-1354 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1354 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.4.0 Reporter: Patrick Hunt Fix For: 3.4.3, 3.5.0 I'm seeing the following intermittent failure: {noformat} junit.framework.AssertionFailedError: Should have called my watcher expected:1 but was:0 at org.apache.zookeeper.test.AuthTest.testBadAuthThenSendOtherCommands(AuthTest.java:89) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat} The following commit introduced this test: bq. ZOOKEEPER-1152. Exceptions thrown from handleAuthentication can cause buffer corruption issues in NIOServer. (camille via breed) +Assert.assertEquals(Should have called my watcher, +1, authFailed.get()); I think it's due to either a) the code is not waiting for the notification to be propagated, or 2) the message doesn't make it back from the server to the client prior to the socket or the clientcnxn being closed. What do you think, should I just wait for the notification to arrive? or do you think it's 2). ? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1294) One of the zookeeper server is not accepting any requests
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13182125#comment-13182125 ] Camille Fournier commented on ZOOKEEPER-1294: - Glancing at the code, I think you might be right. Are you planning on writing a test and a fix for this or should I? One of the zookeeper server is not accepting any requests - Key: ZOOKEEPER-1294 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1294 Project: ZooKeeper Issue Type: Bug Components: server Environment: 3 Zookeeper + 3 Observer with SuSe-11 Reporter: amith Assignee: kavita sharma In zoo.cfg I have configured server.1 = XX.XX.XX.XX:65175:65173 server.2 = XX.XX.XX.XX:65185:65183 server.3 = XX.XX.XX.XX:65195:65193 server.4 = XX.XX.XX.XX:65205:65203:observer server.5 = XX.XX.XX.XX:65215:65213:observer server.6 = XX.XX.XX.XX:65225:65223:observer Like above I have configured 3 PARTICIPANTS and 3 OBSERVERS in the cluster of 6 zookeepers Steps to reproduce the defect 1. Start all the 3 participant zookeepers 2. Stop all the participant zookeepers 3. Start zookeeper 1 (Participant) 4. Start zookeeper 2 (Participant) 5. Start zookeeper 4 (Observer) 6. Create a persistent node with an external client and close it 7. Stop zookeeper 1 (Participant; now the quorum is unstable) 8. Create a new client and try to find the node created before using the exists api (will fail since quorum is not satisfied) 9. 
Start the Zookeeper 1 (Participant stabilise the quorum) Now check the observer using 4 letter word (Server.4) linux-216:/home/amith/CI/source/install/zookeeper/zookeeper2/bin # echo stat | netcat localhost 65200 Zookeeper version: 3.3.2-1031432, built on 11/05/2010 05:32 GMT Clients: /127.0.0.1:46370[0](queued=0,recved=1,sent=0) Latency min/avg/max: 0/0/0 Received: 1 Sent: 0 Outstanding: 0 Zxid: 0x10003 Mode: observer Node count: 5 check the participant 2 with 4 letter word Latency min/avg/max: 22/48/83 Received: 39 Sent: 3 Outstanding: 35 Zxid: 0x10003 Mode: leader Node count: 5 linux-216:/home/amith/CI/source/install/zookeeper/zookeeper2/bin # check the participant 1 with 4 letter word linux-216:/home/amith/CI/source/install/zookeeper/zookeeper2/bin # echo stat | netcat localhost 65170 This ZooKeeper instance is not currently serving requests We can see the participant1 logs filled with 2011-11-08 15:49:51,360 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:65170:NIOServerCnxn@642] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running Problem here is participent1 is not responding / accepting any requests -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1183) Enhance LogFormatter to output additional detail from transaction log
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13182133#comment-13182133 ] Camille Fournier commented on ZOOKEEPER-1183: - Kishore, are you still interested in working on this? I'm thinking of enhancing the LogFormatter a bit more cleanly, debating whether to work on your patch or start from scratch. Enhance LogFormatter to output additional detail from transaction log - Key: ZOOKEEPER-1183 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1183 Project: ZooKeeper Issue Type: Improvement Affects Versions: 3.4.0 Reporter: kishore gopalakrishna Assignee: kishore gopalakrishna Priority: Minor Attachments: ZOOKEEPER-1183.patch Current LogFormatter prints the following information ZooKeeper Transactional Log File with dbid 0 txnlog format version 2 8/15/11 1:55:36 PM PDT session 0x131cf1a236f0014 cxid 0x0 zxid 0xf01 createSession 8/15/11 1:55:57 PM PDT session 0x131cf1a236f cxid 0x55f zxid 0xf02 setData 8/15/11 1:56:00 PM PDT session 0x131cf1a236f0015 cxid 0x0 zxid 0xf03 createSession ... .. 8/15/11 2:00:33 PM PDT session 0x131cf1a236f001c cxid 0x36 zxid 0xf6b setData 8/15/11 2:00:33 PM PDT session 0x131cf1a236f0021 cxid 0xa1 zxid 0xf6c create 8/15/11 2:00:33 PM PDT session 0x131cf1a236f001b cxid 0x3e zxid 0xf6d setData 8/15/11 2:00:33 PM PDT session 0x131cf1a236f001e cxid 0x3e zxid 0xf6e setData 8/15/11 2:00:33 PM PDT session 0x131cf1a236f001d cxid 0x41 zxid 0xf6f setData Though this is good information, it does not provide additional information like createSession: which ip created the session and its time out set|get|delete: the path and data create: path created and createmode along with data We can add additional parameter -detail and provide detailed output of the transaction. Outputting data is slightly tricky since we cant print data without understanding the format. We need not print this for now. -- This message is automatically generated by JIRA. 
[jira] [Commented] (ZOOKEEPER-1321) Add number of client connections metric in JMX and srvr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176673#comment-13176673 ] Camille Fournier commented on ZOOKEEPER-1321: - Looks good modulo an unneeded import in ServerCnxnFactory. I will check this in. Add number of client connections metric in JMX and srvr --- Key: ZOOKEEPER-1321 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1321 Project: ZooKeeper Issue Type: Improvement Affects Versions: 3.3.4, 3.4.2 Reporter: Neha Narkhede Assignee: Neha Narkhede Labels: patch Attachments: ZOOKEEPER-1321_trunk.patch The related conversation on the zookeeper user mailing list is here - http://apache.markmail.org/message/4jjcmooniowwugu2?q=+list:org.apache.hadoop.zookeeper-user It is useful to be able to monitor the number of disconnect operations on a client. This is generally indicative of a client going through large number of GC and hence disconnecting way too often from a zookeeper cluster. Today, this information is only indirectly exposed as part of the stat command which requires counting the results. That's alot of work for the server to do just to get connection count. For monitoring purposes, it will be useful to have this exposed through JMX and 4lw srvr. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1321) Add number of client connections metric in JMX and srvr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176676#comment-13176676 ] Camille Fournier commented on ZOOKEEPER-1321: - Neha, if you want this in 3.4 will you make me a patch that applies to that branch? It's failing to apply for ZooKeeperServer. Thanks. Add number of client connections metric in JMX and srvr --- Key: ZOOKEEPER-1321 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1321 Project: ZooKeeper Issue Type: Improvement Affects Versions: 3.3.4, 3.4.2 Reporter: Neha Narkhede Assignee: Neha Narkhede Labels: patch Attachments: ZOOKEEPER-1321_trunk.patch The related conversation on the zookeeper user mailing list is here - http://apache.markmail.org/message/4jjcmooniowwugu2?q=+list:org.apache.hadoop.zookeeper-user It is useful to be able to monitor the number of disconnect operations on a client. This is generally indicative of a client going through large number of GC and hence disconnecting way too often from a zookeeper cluster. Today, this information is only indirectly exposed as part of the stat command which requires counting the results. That's alot of work for the server to do just to get connection count. For monitoring purposes, it will be useful to have this exposed through JMX and 4lw srvr. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1321) Add number of client connections metric in JMX and srvr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176834#comment-13176834 ] Camille Fournier commented on ZOOKEEPER-1321: - Sounds good. Remove the TODO added in the Zab1_0Test too please! Add number of client connections metric in JMX and srvr --- Key: ZOOKEEPER-1321 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1321 Project: ZooKeeper Issue Type: Improvement Affects Versions: 3.3.4, 3.4.2 Reporter: Neha Narkhede Assignee: Neha Narkhede Labels: patch Attachments: ZOOKEEPER-1321_trunk.patch The related conversation on the zookeeper user mailing list is here - http://apache.markmail.org/message/4jjcmooniowwugu2?q=+list:org.apache.hadoop.zookeeper-user It is useful to be able to monitor the number of disconnect operations on a client. This is generally indicative of a client going through large number of GC and hence disconnecting way too often from a zookeeper cluster. Today, this information is only indirectly exposed as part of the stat command which requires counting the results. That's alot of work for the server to do just to get connection count. For monitoring purposes, it will be useful to have this exposed through JMX and 4lw srvr. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
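The metric discussed in this thread could be surfaced through JMX roughly as follows; the MBean name sketch:type=ConnStats and the interface here are illustrative, not ZooKeeper's actual ServerCnxnFactory MBeans.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hedged sketch of exposing a live connection count through JMX. Following
// the standard-MBean convention, the management interface is the
// implementation's name plus "MBean".
public class ConnStatsSketch {

    public interface ConnStatsMBean {
        int getNumAliveConnections();
    }

    public static class ConnStats implements ConnStatsMBean {
        private final AtomicInteger connections = new AtomicInteger();

        public void connectionOpened() { connections.incrementAndGet(); }
        public void connectionClosed() { connections.decrementAndGet(); }

        @Override
        public int getNumAliveConnections() { return connections.get(); }
    }

    // Registers the bean so jconsole (or a 4lw like "srvr") can read the
    // count without walking every connection as "stat" does.
    public static ConnStats register() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ConnStats stats = new ConnStats();
            server.registerMBean(stats, new ObjectName("sketch:type=ConnStats"));
            return stats;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        ConnStats stats = register();
        stats.connectionOpened();
        System.out.println("numAliveConnections=" + stats.getNumAliveConnections());
    }
}
```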
[jira] [Commented] (ZOOKEEPER-1100) Killed (or missing) SendThread will cause hanging threads
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175951#comment-13175951 ] Camille Fournier commented on ZOOKEEPER-1100: - Seems like you all think this is a non-issue, so I will mark it as resolved. Please do feel free to re-open if you see the issue again. Killed (or missing) SendThread will cause hanging threads - Key: ZOOKEEPER-1100 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1100 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.3.3 Environment: http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E Reporter: Gunnar Wagenknecht Fix For: 3.5.0 Attachments: ZOOKEEPER-1100.patch, ZOOKEEPER-1100.patch After investigating an issues with [hanging threads|http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E] I noticed that any java.lang.Error might silently kill the SendThread. Without a SendThread any thread that wants to send something will hang forever. Currently nobody will recognize a SendThread that died. I think at least a state should be flipped (or flag should be set) that causes all further send attempts to fail or to re-spin the connection loop. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1333) NPE in FileTxnSnapLog when restarting a cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174244#comment-13174244 ] Camille Fournier commented on ZOOKEEPER-1333: - The logic in FileTxnSnapLog changed quite a bit, and I'm not sure if the create check makes sense with the new logic or not. The create check logic was moved into DataTree, so what I made the check in FileTxnSnapLog for I'm not entirely sure. NPE in FileTxnSnapLog when restarting a cluster --- Key: ZOOKEEPER-1333 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1333 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.0 Reporter: Andrew McNair Assignee: Patrick Hunt Priority: Blocker Fix For: 3.4.2 Attachments: ZOOKEEPER-1333.patch, ZOOKEEPER-1333.patch, test_case.diff, test_case.diff I think a NPE was created in the fix for https://issues.apache.org/jira/browse/ZOOKEEPER-1269 Looking in DataTree.processTxn(TxnHeader header, Record txn) it seems likely that if rc.err != Code.OK then rc.path will be null. I'm currently working on a minimal test case for the bug, I'll attach it to this issue when it's ready. java.lang.NullPointerException at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:203) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:150) at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223) at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:418) at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:410) at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78) -- This message is automatically generated by JIRA. 
[jira] [Commented] (ZOOKEEPER-1333) NPE in FileTxnSnapLog when restarting a cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174248#comment-13174248 ] Camille Fournier commented on ZOOKEEPER-1333: - Ah ok. Yeah, so if we put the create check in, we won't get that nonode exception if the multi fails on that, would be the only potential issue with this fix that I can see. NPE in FileTxnSnapLog when restarting a cluster --- Key: ZOOKEEPER-1333 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1333 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.0 Reporter: Andrew McNair Assignee: Patrick Hunt Priority: Blocker Fix For: 3.4.2 Attachments: ZOOKEEPER-1333.patch, ZOOKEEPER-1333.patch, test_case.diff, test_case.diff I think a NPE was created in the fix for https://issues.apache.org/jira/browse/ZOOKEEPER-1269 Looking in DataTree.processTxn(TxnHeader header, Record txn) it seems likely that if rc.err != Code.OK then rc.path will be null. I'm currently working on a minimal test case for the bug, I'll attach it to this issue when it's ready. java.lang.NullPointerException at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:203) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:150) at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223) at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:418) at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:410) at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78) -- This message is automatically generated by JIRA. 
[jira] [Commented] (ZOOKEEPER-1333) NPE in FileTxnSnapLog when restarting a cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174255#comment-13174255 ] Camille Fournier commented on ZOOKEEPER-1333: - But I'm pretty sure that nonode exception itself was kind of a crazy sanity check of the we should never reach this sort. To get there you would have to be creating a child node that already exists, but with a parent that doesn't exist. So it's no surprise that we don't have a test for that case. NPE in FileTxnSnapLog when restarting a cluster --- Key: ZOOKEEPER-1333 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1333 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.0 Reporter: Andrew McNair Assignee: Patrick Hunt Priority: Blocker Fix For: 3.4.2 Attachments: ZOOKEEPER-1333.patch, ZOOKEEPER-1333.patch, test_case.diff, test_case.diff I think a NPE was created in the fix for https://issues.apache.org/jira/browse/ZOOKEEPER-1269 Looking in DataTree.processTxn(TxnHeader header, Record txn) it seems likely that if rc.err != Code.OK then rc.path will be null. I'm currently working on a minimal test case for the bug, I'll attach it to this issue when it's ready. java.lang.NullPointerException at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:203) at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:150) at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223) at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:418) at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:410) at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78) -- This message is automatically generated by JIRA. 
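The null guard implied by this thread can be sketched as below; ProcessResult and the error-code convention are simplified stand-ins for DataTree's processTxn result, not the actual ZOOKEEPER-1333 fix.

```java
// Hedged sketch of the NPE scenario described above: when a replayed
// transaction carries a non-OK error code, its result path can be null, so
// the restore path must check the code before dereferencing the path.
public class TxnReplaySketch {

    // Minimal stand-in for the transaction processing result.
    public static final class ProcessResult {
        public final int err;      // 0 == OK, per the KeeperException codes
        public final String path;  // may be null when err != 0
        public ProcessResult(int err, String path) {
            this.err = err;
            this.path = path;
        }
    }

    // Returns the path to post-process, or null when the txn failed and
    // there is nothing safe to dereference.
    public static String pathToProcess(ProcessResult rc) {
        if (rc.err != 0) {
            return null; // failed txn: rc.path may be null, skip it
        }
        return rc.path;
    }

    public static void main(String[] args) {
        System.out.println(pathToProcess(new ProcessResult(0, "/a")));
    }
}
```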
[jira] [Commented] (ZOOKEEPER-1202) Prevent certain state transitions in Java client on close(); improve exception handling and enhance client testability
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172577#comment-13172577 ] Camille Fournier commented on ZOOKEEPER-1202: - I think you might just need a longer TIMEOUT for that awaitTermination... the thread can sleep for up to 1s in the sendThread run loop before trying to reconnect, so on those slow build machines you might just need a bit more wiggle room. We don't see it even trying to connect until 2s after the session was closed. Prevent certain state transitions in Java client on close(); improve exception handling and enhance client testability -- Key: ZOOKEEPER-1202 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1202 Project: ZooKeeper Issue Type: Improvement Components: java client Affects Versions: 3.4.0 Reporter: Matthias Spycher Assignee: Matthias Spycher Attachments: ZOOKEEPER-1202.patch ZooKeeper.close() doesn't force the client into a CLOSED state. While the closing flag ensures that the client will close, its state may end up in CLOSED, CONNECTING or CONNECTED. I developed a patch and in the process cleaned up a few other things primarily to enable testing of state transitions. - ClientCnxnState is new and enforces certain state transitions - ZooKeeper.isExpired() is new - ClientCnxn no longer refers to ZooKeeper, WatchManager is externalized, and ClientWatchManager includes 3 new methods - The SendThread terminates the EventThread on a call to close() via the event-of-death - Polymorphism is used to handle internal exceptions (SendIOExceptions) - The patch incorporates ZOOKEEPER-126.patch and prevents close() from blocking -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
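The ClientCnxnState idea described above amounts to a small state machine in which close() can only end in CLOSED. This sketch uses assumed state names and transitions, not the patch's actual enum.

```java
import java.util.EnumSet;

// Hedged sketch of an enum that enforces legal client state transitions:
// CLOSED is terminal, so a closing client cannot drift back to CONNECTING
// or CONNECTED.
public enum ClientStateSketch {
    CONNECTING, CONNECTED, CLOSED;

    // Which states may legally follow this one.
    public EnumSet<ClientStateSketch> nextStates() {
        switch (this) {
            case CONNECTING: return EnumSet.of(CONNECTED, CLOSED);
            case CONNECTED:  return EnumSet.of(CONNECTING, CLOSED);
            default:         return EnumSet.noneOf(ClientStateSketch.class); // CLOSED is terminal
        }
    }

    public boolean canTransitionTo(ClientStateSketch next) {
        return nextStates().contains(next);
    }
}
```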
[jira] [Commented] (ZOOKEEPER-1269) Multi deserialization issues
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13162436#comment-13162436 ] Camille Fournier commented on ZOOKEEPER-1269: - This is a reasonably big bug to just leave outstanding for this long, can someone please review this and check it in? Multi deserialization issues Key: ZOOKEEPER-1269 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1269 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.0 Reporter: Camille Fournier Assignee: Camille Fournier Fix For: 3.5.0, 3.4.1 Attachments: ZOOKEEPER-1269.patch From the mailing list: FileTxnSnapLog.restore contains a code block handling a NODEEXISTS failure during deserialization. The problem is explained there in a code comment. The code block however is only executed for a CREATE txn, not for a multiTxn containing a CREATE. Even if the mentioned code block would also be executed for multi transactions, it needs adaption for multi transactions. What, if after the first failed transaction in a multi txn during deserialization, there would be subsequent transactions in the same multi that would also have failed? We don't know, since the first failed transaction hides the information about the remaining transactions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1239) add logging/stats to identify fsync stalls
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150653#comment-13150653 ] Camille Fournier commented on ZOOKEEPER-1239: - Are you sure we should be doing this timing using system.nanotime? add logging/stats to identify fsync stalls -- Key: ZOOKEEPER-1239 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1239 Project: ZooKeeper Issue Type: Improvement Components: server Reporter: Patrick Hunt Assignee: Patrick Hunt Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1239_br33.patch, ZOOKEEPER-1239_br34.patch We don't have any logging to identify fsync stalls. It's a somewhat common occurrence (after gc/swap issues) when trying to diagnose pipeline stalls - where outstanding requests start piling up and operational latency increases. We should have some sort of logging around this. e.g. if the fsync time exceeds some limit then log a warning, something like that. It would also be useful to publish stat information related to this. min/avg/max latency for fsync. This should also be exposed through JMX. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1239) add logging/stats to identify fsync stalls
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150657#comment-13150657 ] Camille Fournier commented on ZOOKEEPER-1239: - Eh, I guess the popular consensus has changed on using nanotime for this sort of thing, so disregard my question. I'll put this in shortly. add logging/stats to identify fsync stalls -- Key: ZOOKEEPER-1239 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1239 Project: ZooKeeper Issue Type: Improvement Components: server Reporter: Patrick Hunt Assignee: Patrick Hunt Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1239_br33.patch, ZOOKEEPER-1239_br34.patch We don't have any logging to identify fsync stalls. It's a somewhat common occurrence (after gc/swap issues) when trying to diagnose pipeline stalls - where outstanding requests start piling up and operational latency increases. We should have some sort of logging around this. e.g. if the fsync time exceeds some limit then log a warning, something like that. It would also be useful to publish stat information related to this. min/avg/max latency for fsync. This should also be exposed through JMX. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
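The nanoTime-based timing settled on in this thread can be sketched as below; the method and threshold names are illustrative, not the ZOOKEEPER-1239 patch itself.

```java
import java.util.concurrent.TimeUnit;

// Hedged sketch of the fsync-stall warning: time the sync with
// System.nanoTime (monotonic, unlike currentTimeMillis, which can jump with
// wall-clock adjustments) and warn when a threshold is exceeded.
public class FsyncTimingSketch {

    // Returns the elapsed millis, logging a warning when the action was slow.
    public static long timeAndWarn(Runnable fsync, long warnThresholdMs) {
        long start = System.nanoTime();
        fsync.run();
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        if (elapsedMs > warnThresholdMs) {
            System.err.println("fsync took " + elapsedMs
                    + " ms, exceeding the warn threshold of "
                    + warnThresholdMs + " ms");
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        // A sleep stands in for a slow fsync here.
        timeAndWarn(() -> {
            try { Thread.sleep(20); } catch (InterruptedException e) { }
        }, 10);
    }
}
```

Beyond the warning, keeping min/avg/max of these elapsed values would feed the JMX stats the ticket asks for.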
[jira] [Commented] (ZOOKEEPER-1208) Ephemeral node not removed after the client session is long gone
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149714#comment-13149714 ] Camille Fournier commented on ZOOKEEPER-1208: - Actually, I'm not sure... are these useful at all? I'd rather not see printlns in test output unless it's really useful, but in the case of this test I'm not sure I can tell... Ephemeral node not removed after the client session is long gone Key: ZOOKEEPER-1208 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1208 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.3.3 Reporter: kishore gopalakrishna Assignee: Patrick Hunt Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1208_br33.patch, ZOOKEEPER-1208_br33.patch, ZOOKEEPER-1208_br34.patch, ZOOKEEPER-1208_trunk.patch Copying from email thread. We found our ZK server in a state where an ephemeral node still exists after a client session is long gone. I used the cons command on each ZK host to list all connections and couldn't find the ephemeralOwner id. We are using ZK 3.3.3. Has anyone seen this problem? I got the following information from the logs. The node that still exists is /kafka-tracking/consumers/UserPerformanceEvent-host/owners/UserPerformanceEvent/529-7 I saw that the ephemeral owner is 86167322861045079 which is session id 0x13220b93e610550. After searching in the transaction log of one of the ZK servers found that session expired 9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 closeSession null On digging further into the logs I found that there were multiple sessions created in quick succession and every session tried to create the same node. 
But I verified that the sessions were closed and opened in order:
9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x0 zxid 0x601bd36b5 createSession 6000
9/22/11 12:17:57 PM PDT session 0x13220b93e610550 cxid 0x74 zxid 0x601bd36f7 closeSession null
9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x0 zxid 0x601bd36f8 createSession 6000
9/22/11 12:17:59 PM PDT session 0x13220b93e610551 cxid 0x74 zxid 0x601bd373a closeSession null
9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x0 zxid 0x601bd373e createSession 6000
9/22/11 12:18:01 PM PDT session 0x13220b93e610552 cxid 0x6c zxid 0x601bd37a0 closeSession null
9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x0 zxid 0x601bd37e9 createSession 6000
9/22/11 12:18:03 PM PDT session 0x13220b93e610553 cxid 0x74 zxid 0x601bd382b closeSession null
9/22/11 12:18:04 PM PDT session 0x13220b93e610554 cxid 0x0 zxid 0x601bd383c createSession 6000
9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x6a zxid 0x601bd388f closeSession null
9/22/11 12:18:06 PM PDT session 0x13220b93e610555 cxid 0x0 zxid 0x601bd3895 createSession 6000
9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x6a zxid 0x601bd38cd closeSession null
9/22/11 12:18:10 PM PDT session 0x13220b93e610556 cxid 0x0 zxid 0x601bd38d1 createSession 6000
9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x0 zxid 0x601bd38f2 createSession 6000
9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x51 zxid 0x601bd396a closeSession null
Here is the log output for the sessions that tried creating the same node:
9/22/11 12:17:54 PM PDT session 0x13220b93e61054f cxid 0x42 zxid 0x601bd366b create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
9/22/11 12:17:56 PM PDT session 0x13220b93e610550 cxid 0x42 zxid 0x601bd36ce create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
9/22/11 12:17:58 PM PDT session 0x13220b93e610551 cxid 0x42 zxid 0x601bd3711 create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
9/22/11 12:18:00 PM PDT session 0x13220b93e610552 cxid 0x42 zxid 0x601bd3777 create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
9/22/11 12:18:02 PM PDT session 0x13220b93e610553 cxid 0x42 zxid 0x601bd3802 create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
9/22/11 12:18:05 PM PDT session 0x13220b93e610554 cxid 0x44 zxid 0x601bd385d create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
9/22/11 12:18:07 PM PDT session 0x13220b93e610555 cxid 0x44 zxid 0x601bd38b0 create '/kafka-tracking/consumers/UserPerformanceEvent-hostname/owners/UserPerformanceEvent/529-7
9/22/11 12:18:11 PM PDT session 0x13220b93e610557 cxid 0x52 zxid 0x601bd396b create
[jira] [Commented] (ZOOKEEPER-1208) Ephemeral node not removed after the client session is long gone
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149829#comment-13149829 ] Camille Fournier commented on ZOOKEEPER-1208: - Committed to 3.4 and trunk, will get 3.3.4 in a second. Mahadev, feel free to cut another 3.4 RC whenever.
[jira] [Commented] (ZOOKEEPER-1208) Ephemeral node not removed after the client session is long gone
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13148517#comment-13148517 ] Camille Fournier commented on ZOOKEEPER-1208: - I like the fix, Pat.
[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144378#comment-13144378 ] Camille Fournier commented on ZOOKEEPER-1270: - 2 acks is expected. This threw me the first time I saw it in the code, but it's right as far as I could reason looking through follower and leader, the first ack is after NEWLEADER, the second is right before we start the zk server. testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving. - Key: ZOOKEEPER-1270 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1270 Project: ZooKeeper Issue Type: Bug Components: server Reporter: Patrick Hunt Priority: Blocker Fix For: 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1270tests.patch, ZOOKEEPER-1270tests2.patch, testEarlyLeaderAbandonment.txt.gz, testEarlyLeaderAbandonment2.txt.gz, testEarlyLeaderAbandonment3.txt.gz, testEarlyLeaderAbandonment4.txt.gz Looks pretty serious - quorum is formed but no clients can attach. Will attach logs momentarily. This test was introduced in the following commit (all three jira commit at once): ZOOKEEPER-335. zookeeper servers should commit the new leader txn to their logs. ZOOKEEPER-1081. modify leader/follower code to correctly deal with new leader ZOOKEEPER-1082. modify leader election to correctly take into account current -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144562#comment-13144562 ] Camille Fournier commented on ZOOKEEPER-1270: - If readyToStart becomes unused with this patch can we please go ahead and remove it?
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144573#comment-13144573 ] Camille Fournier commented on ZOOKEEPER-1264: - Oh now I see. Because 1192 introduced fixes into leader election that added stuff to the Zab1_0Test that I missed. Why in the world do we have leader election bugs going only into trunk instead of into 3.4 as well??? Not good. FollowerResyncConcurrencyTest failing intermittently Key: ZOOKEEPER-1264 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.3.3, 3.4.0, 3.5.0 Reporter: Patrick Hunt Assignee: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1264-branch34.patch, ZOOKEEPER-1264-merge.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, ZOOKEEPER-1264unittest.patch, ZOOKEEPER-1264unittest.patch, followerresyncfailure_log.txt.gz, logs.zip, tmp.zip The FollowerResyncConcurrencyTest test is failing intermittently. saw the following on 3.4: {noformat} junit.framework.AssertionFailedError: Should have same number of ephemerals in both followers expected:11741 but was:14001 at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400) at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143198#comment-13143198 ] Camille Fournier commented on ZOOKEEPER-1264: - Ben, just two questions: Does this logic really only apply to FollowerZookeeperServers or should observers also do this? Why does the playing of these txns to the log come after we start the zk server instead of before?
[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143542#comment-13143542 ] Camille Fournier commented on ZOOKEEPER-1270: - There's some extraneous stuff in ClientBase, but if anyone can repro this bug locally and run it with this stack tracing on that would be useful.
[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143602#comment-13143602 ] Camille Fournier commented on ZOOKEEPER-1270: - It seems to me that everything comes up ok, and starts the election process, elects a leader, and gets a snapshot from the leader. But in the logs where you have 2 followers very closely synched in time (never my local box but seems to happen on the build boxes occasionally), after the followers have claimed to write a snapshot to disk (which means they must have gotten the NEWLEADER message) the whole process then stops, and you see no logs from the leader indicating it ran processAck for either follower. It feels to me like it could be a race condition in the leader somewhere, causing it to somehow miss that ACK but I can't seem to find it. Nothing in the diffs from the checkin related to ZAB1.0 seem to be much of a culprit... I'm a bit stumped but going to keep looking.
[jira] [Commented] (ZOOKEEPER-1270) testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143690#comment-13143690 ] Camille Fournier commented on ZOOKEEPER-1270: - Looking at this some more I'm not entirely convinced it isn't a timing issue: {quote} I'm skeptical about it being a time issue because we wait 10 seconds for the waitForAll call to complete, but I'm not sure if this completely unrealistic or not assuming that the jenkins machine is overloaded. {quote} I actually have the startup and shutdown running in a loop on my box. The one time I managed to get it to fail was due to 10 seconds not being a long enough wait time. The servers were almost up, in fact, but election just took a little while as did snapshotting etc and it never succeeded.
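The 10-second wait under discussion boils down to a poll-until-deadline loop. A minimal sketch (hypothetical names, not the actual ClientBase code) shows why a fixed deadline fails when election plus snapshotting simply runs a little long:

```java
import java.util.function.BooleanSupplier;

public class DeadlineWait {
    // Poll a readiness check until it passes or the deadline expires.
    // If the ensemble needs 11s to come up and timeoutMs is 10_000,
    // this returns false even though the servers were almost ready.
    static boolean waitFor(BooleanSupplier ready, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (ready.getAsBoolean()) {
                return true;
            }
            Thread.sleep(100); // poll interval
        }
        return ready.getAsBoolean(); // one last check at the deadline
    }
}
```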
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143714#comment-13143714 ] Camille Fournier commented on ZOOKEEPER-1264: - Ok, I think this is all fine. I will check this in.
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142211#comment-13142211 ] Camille Fournier commented on ZOOKEEPER-1264: - Because when the follower writes a new log file without writing a snapshot with the old transactions, on restart the ZK thinks it has the transactions up to the zxid in the log file. The fact that these transactions were never written to a log or snapshot by the follower is not captured. We got a NEWLEADER and took a snapshot, then got a bunch of txns that went directly to our data tree, then got UPTODATE, then some other new transactions that caused the creation of a brand new log file. The intermediate transactions between NEWLEADER and UPTODATE are never written to a persistent store on the follower unless it manages to stay alive long enough to do another snapshot.
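The persistence gap described in that comment can be illustrated with a toy sketch (hypothetical names, not the real Learner/FollowerZooKeeperServer code): transactions applied only to the in-memory tree between NEWLEADER and UPTODATE vanish on restart, while transactions also appended to the txn log survive replay.

```java
import java.util.ArrayList;
import java.util.List;

public class SyncGapSketch {
    final List<String> dataTree = new ArrayList<>(); // in-memory state
    final List<String> txnLog = new ArrayList<>();   // persistent log

    // Gap behavior: during sync, txns go straight to the in-memory tree.
    void applyDuringSyncInMemoryOnly(String txn) {
        dataTree.add(txn);
    }

    // Safe behavior: also append to the transaction log so a restart replays it.
    void applyDuringSyncLogged(String txn) {
        txnLog.add(txn);
        dataTree.add(txn);
    }

    // Simulate crash + restart: state is rebuilt from the persistent log only.
    List<String> restartFromLog() {
        return new ArrayList<>(txnLog);
    }
}
```

Running the two paths side by side makes the difference visible: a txn applied in-memory only is absent after the simulated restart, a logged one is present.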
[jira] [Commented] (ZOOKEEPER-1269) Multi deserialization issues
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142225#comment-13142225 ] Camille Fournier commented on ZOOKEEPER-1269: - I think it should go into both, since it is a bug with multi. Multi deserialization issues Key: ZOOKEEPER-1269 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1269 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.0 Reporter: Camille Fournier Assignee: Camille Fournier Attachments: ZOOKEEPER-1269.patch From the mailing list: FileTxnSnapLog.restore contains a code block handling a NODEEXISTS failure during deserialization. The problem is explained there in a code comment. The code block however is only executed for a CREATE txn, not for a multiTxn containing a CREATE. Even if the mentioned code block would also be executed for multi transactions, it needs adaption for multi transactions. What, if after the first failed transaction in a multi txn during deserialization, there would be subsequent transactions in the same multi that would also have failed? We don't know, since the first failed transaction hides the information about the remaining transactions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
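The fix direction implied by the ZOOKEEPER-1269 description can be sketched as follows (a toy illustration with hypothetical types, not the real FileTxnSnapLog code): during replay, walk each sub-op of a multi individually and tolerate a NODEEXISTS-style conflict per sub-op, so the first failed CREATE no longer hides the outcome of the remaining sub-transactions.

```java
import java.util.List;

public class MultiReplaySketch {
    static class SubTxn {
        final boolean isCreate;
        final String path;
        SubTxn(boolean isCreate, String path) {
            this.isCreate = isCreate;
            this.path = path;
        }
    }

    // Replay every sub-op of a multi, skipping a CREATE whose node already
    // exists (the NODEEXISTS case) instead of aborting the whole multi.
    static List<String> replayMulti(List<SubTxn> multi, List<String> tree) {
        for (SubTxn t : multi) {
            if (t.isCreate) {
                if (!tree.contains(t.path)) {
                    tree.add(t.path);
                } // else: NODEEXISTS during replay -- tolerate and continue
            } else {
                tree.remove(t.path);
            }
        }
        return tree;
    }
}
```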
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142274#comment-13142274 ] Camille Fournier commented on ZOOKEEPER-1264: - Seems to work. I want to go ahead and put in the additional changes to FollowerResyncConcurrencyTest along with your patch after I finish reviewing it. Theoretically they aren't needed but given how many times this test has caught issues I think it's worth it to double-test this stuff. Let me know if you disagree.
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142373#comment-13142373 ] Camille Fournier commented on ZOOKEEPER-1264: - Yup will do asap (which might be early this evening but I'll try to get it in a few mins). FollowerResyncConcurrencyTest failing intermittently Key: ZOOKEEPER-1264 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.3.3, 3.4.0, 3.5.0 Reporter: Patrick Hunt Assignee: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, ZOOKEEPER-1264unittest.patch, ZOOKEEPER-1264unittest.patch, followerresyncfailure_log.txt.gz, logs.zip, tmp.zip The FollowerResyncConcurrencyTest test is failing intermittently. saw the following on 3.4: {noformat} junit.framework.AssertionFailedError: Should have same number of ephemerals in both followers expected:11741 but was:14001 at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400) at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1246) Dead code in PrepRequestProcessor catch Exception block
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142392#comment-13142392 ] Camille Fournier commented on ZOOKEEPER-1246: - Thomas, a bit of feedback. This is unnecessarily aggressive and annoying, and coming after I smacked you down for not writing tests for your own bugfixes it makes you look incredibly petty and insecure. It is perfectly fair of you to point out that I added an eclipse warning (guilty as charged, but if you really care about these you need to make the build fail when additional warnings are added). And yes, the formatting is not perfect. But as to most of the rest of your points, you can frankly go to hell if you think I'm going to tolerate being condescended to in this manner. You had the opportunity to fix this bug yourself when you reported it. Instead, you pranced off to work on your own thing and left it to me to debug and provide a fix. Now that the fix is done and somehow not to your liking, the best you could hope for here is to request a fix for the warning and formatting errors, and otherwise submit a new patch as a refactor. I'm closing this back up, and you are welcome to open a new issue with formatting fixes/refactors on it if you so choose. But it is certainly not a critical bug any longer. Dead code in PrepRequestProcessor catch Exception block --- Key: ZOOKEEPER-1246 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1246 Project: ZooKeeper Issue Type: Sub-task Reporter: Thomas Koch Assignee: Thomas Koch Priority: Blocker Fix For: 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1246.patch, ZOOKEEPER-1246.patch, ZOOKEEPER-1246.patch, ZOOKEEPER-1246.patch, ZOOKEEPER-1246_trunk.patch, ZOOKEEPER-1246_trunk.patch This is a regression introduced by ZOOKEEPER-965 (multi transactions). The catch(Exception e) block in PrepRequestProcessor.pRequest contains an if block with condition request.getHdr() != null. 
This condition will always evaluate to false since the changes in ZOOKEEPER-965. This is caused by a change in sequence: Before ZK-965, the txnHeader was set _before_ the deserialization of the request. Afterwards the deserialization happens before request.setHdr is set. So the following RequestProcessors won't see the request as a failed one but as a Read request, since it doesn't have a hdr set. Notes: - it is very bad practice to catch Exception. The block should rather catch IOException - The check whether the TxnHeader is set in the request is used at several places to see whether the request is a read or write request. It isn't obvious to a newbie what it means whether a request has a hdr set or not. - at the beginning of pRequest the hdr and txn of request are set to null. However there is no chance that these fields could ever not be null at this point. The code however suggests that this could be the case. There should rather be an assertion that confirms that these fields are indeed null. The practice of doing things just in case, even if there is no chance that this case could happen, is a very stinky code smell and means that the code isn't understandable or trustworthy. - The multi transaction switch case block in pRequest is very hard to read, because it misuses the request.{hdr|txn} fields as local variables.
[jira] [Commented] (ZOOKEEPER-1246) Dead code in PrepRequestProcessor catch Exception block
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141191#comment-13141191 ] Camille Fournier commented on ZOOKEEPER-1246: - Will do. Dead code in PrepRequestProcessor catch Exception block --- Key: ZOOKEEPER-1246 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1246 Project: ZooKeeper Issue Type: Sub-task Reporter: Thomas Koch Assignee: Camille Fournier Priority: Blocker Fix For: 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1246.patch, ZOOKEEPER-1246.patch, ZOOKEEPER-1246_trunk.patch, ZOOKEEPER-1246_trunk.patch This is a regression introduced by ZOOKEEPER-965 (multi transactions). The catch(Exception e) block in PrepRequestProcessor.pRequest contains an if block with condition request.getHdr() != null. This condition will always evaluate to false since the changes in ZOOKEEPER-965. This is caused by a change in sequence: Before ZK-965, the txnHeader was set _before_ the deserialization of the request. Afterwards the deserialization happens before request.setHdr is set. So the following RequestProcessors won't see the request as a failed one but as a Read request, since it doesn't have a hdr set. Notes: - it is very bad practice to catch Exception. The block should rather catch IOException - The check whether the TxnHeader is set in the request is used at several places to see whether the request is a read or write request. It isn't obvious to a newbie what it means whether a request has a hdr set or not. - at the beginning of pRequest the hdr and txn of request are set to null. However there is no chance that these fields could ever not be null at this point. The code however suggests that this could be the case. There should rather be an assertion that confirms that these fields are indeed null.
The practice of doing things just in case, even if there is no chance that this case could happen, is a very stinky code smell and means that the code isn't understandable or trustworthy. - The multi transaction switch case block in pRequest is very hard to read, because it misuses the request.{hdr|txn} fields as local variables.
[jira] [Commented] (ZOOKEEPER-1246) Dead code in PrepRequestProcessor catch Exception block
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141205#comment-13141205 ] Camille Fournier commented on ZOOKEEPER-1246: - Oh brilliant, yet another refactoring blew away the trunk patch here. Dead code in PrepRequestProcessor catch Exception block --- Key: ZOOKEEPER-1246 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1246 Project: ZooKeeper Issue Type: Sub-task Reporter: Thomas Koch Assignee: Camille Fournier Priority: Blocker Fix For: 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1246.patch, ZOOKEEPER-1246.patch, ZOOKEEPER-1246_trunk.patch, ZOOKEEPER-1246_trunk.patch This is a regression introduced by ZOOKEEPER-965 (multi transactions). The catch(Exception e) block in PrepRequestProcessor.pRequest contains an if block with condition request.getHdr() != null. This condition will always evaluate to false since the changes in ZOOKEEPER-965. This is caused by a change in sequence: Before ZK-965, the txnHeader was set _before_ the deserialization of the request. Afterwards the deserialization happens before request.setHdr is set. So the following RequestProcessors won't see the request as a failed one but as a Read request, since it doesn't have a hdr set. Notes: - it is very bad practice to catch Exception. The block should rather catch IOException - The check whether the TxnHeader is set in the request is used at several places to see whether the request is a read or write request. It isn't obvious to a newbie what it means whether a request has a hdr set or not. - at the beginning of pRequest the hdr and txn of request are set to null. However there is no chance that these fields could ever not be null at this point. The code however suggests that this could be the case. There should rather be an assertion that confirms that these fields are indeed null.
The practice of doing things just in case, even if there is no chance that this case could happen, is a very stinky code smell and means that the code isn't understandable or trustworthy. - The multi transaction switch case block in pRequest is very hard to read, because it misuses the request.{hdr|txn} fields as local variables.
[jira] [Commented] (ZOOKEEPER-1269) Multi deserialization issues
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141211#comment-13141211 ] Camille Fournier commented on ZOOKEEPER-1269: - Hey guys, someone want to review and commit this? Looks like we got the OK from the multi folks. Multi deserialization issues Key: ZOOKEEPER-1269 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1269 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.0 Reporter: Camille Fournier Assignee: Camille Fournier Attachments: ZOOKEEPER-1269.patch From the mailing list: FileTxnSnapLog.restore contains a code block handling a NODEEXISTS failure during deserialization. The problem is explained there in a code comment. The code block however is only executed for a CREATE txn, not for a multiTxn containing a CREATE. Even if the mentioned code block would also be executed for multi transactions, it needs adaptation for multi transactions. What if, after the first failed transaction in a multi txn during deserialization, there were subsequent transactions in the same multi that also failed? We don't know, since the first failed transaction hides the information about the remaining transactions.
[jira] [Commented] (ZOOKEEPER-1136) NEW_LEADER should be queued not sent to match the Zab 1.0 protocol on the twiki
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141234#comment-13141234 ] Camille Fournier commented on ZOOKEEPER-1136: - This change causes a concurrency bug. Specifically: 1. Follower rejoins, gets snap from leader 2. Follower gets NEWLEADER message and takes a snapshot 3. Follower gets some additional transactions forwarded from leader, applies these directly to data tree 4. Follower gets an UPTODATE message, does not take a snapshot 5. Follower starts following, writes some new transactions to its log, and is killed before it takes another snapshot 6. Follower restarts and gets a DIFF from the leader The transactions that came in between NEWLEADER and UPTODATE are lost because they never go anywhere but the internal data tree, and if that tree isn't snapshotted and the follower restarts with only a DIFF, the follower will lose these transactions. I think the proper thing to do is snapshot after UPTODATE, but I'm not sure why we changed this to snapshot after NEWLEADER instead. The wiki doesn't seem to explain that clearly. If one of you could check on https://issues.apache.org/jira/browse/ZOOKEEPER-1264 and let me know the reasoning, that would be helpful. NEW_LEADER should be queued not sent to match the Zab 1.0 protocol on the twiki --- Key: ZOOKEEPER-1136 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1136 Project: ZooKeeper Issue Type: Bug Reporter: Benjamin Reed Assignee: Benjamin Reed Priority: Blocker Fix For: 3.4.0 Attachments: ZOOKEEPER-1136.patch, ZOOKEEPER-1136.patch, ZOOKEEPER-1136.patch the NEW_LEADER message was sent at the beginning of the sync phase in Zab pre1.0, but it must be at the end in Zab 1.0. if the protocol is 1.0 or greater we need to queue rather than send the packet.
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141235#comment-13141235 ] Camille Fournier commented on ZOOKEEPER-1264: - From a comment I added to the tracker that this change was attached to: ZOOKEEPER-1136 causes a concurrency bug. Specifically: 1. Follower rejoins, gets snap from leader 2. Follower gets NEWLEADER message and takes a snapshot 3. Follower gets some additional transactions forwarded from leader, applies these directly to data tree 4. Follower gets an UPTODATE message, does not take a snapshot 5. Follower starts following, writes some new transactions to its log, and is killed before it takes another snapshot 6. Follower restarts and gets a DIFF from the leader The transactions that came in between NEWLEADER and UPTODATE are lost because they never go anywhere but the internal data tree, and if that tree isn't snapshotted and the follower restarts with only a DIFF, the follower will lose these transactions. I think the proper thing to do is snapshot after UPTODATE, but I'm not sure why we changed this to snapshot after NEWLEADER instead. The wiki doesn't seem to explain that clearly. FollowerResyncConcurrencyTest failing intermittently Key: ZOOKEEPER-1264 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.3.3, 3.4.0, 3.5.0 Reporter: Patrick Hunt Assignee: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz, logs.zip, tmp.zip The FollowerResyncConcurrencyTest test is failing intermittently.
saw the following on 3.4: {noformat} junit.framework.AssertionFailedError: Should have same number of ephemerals in both followers expected:11741 but was:14001 at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400) at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
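The lost-transaction sequence described in the comment above can be illustrated with a small, self-contained simulation (plain Python with invented names; this is a sketch of the failure mode, not ZooKeeper's actual Learner code):

```python
# Toy model of the resync bug: transactions applied between NEWLEADER and
# UPTODATE live only in the in-memory data tree, so a crash before the next
# snapshot loses them if the follower then resyncs with only a DIFF.

class Follower:
    def __init__(self):
        self.data_tree = {}   # in-memory state
        self.snapshot = {}    # last on-disk snapshot
        self.log = []         # txn log entries written after the snapshot

    def take_snapshot(self):
        self.snapshot = dict(self.data_tree)

    def apply_sync_txn(self, zxid, key, val):
        # during sync, txns go straight to the data tree -- no log entry
        self.data_tree[key] = val

    def crash_and_restart(self):
        # restart = snapshot + replayed log (the in-memory tree is lost)
        self.data_tree = dict(self.snapshot)
        for _, key, val in self.log:
            self.data_tree[key] = val

def resync(snapshot_at_uptodate):
    f = Follower()
    f.data_tree = {"a": 1}       # SNAP received from the leader
    if not snapshot_at_uptodate:
        f.take_snapshot()        # current behavior: snapshot on NEWLEADER
    f.apply_sync_txn(2, "b", 2)  # txn forwarded between NEWLEADER and UPTODATE
    if snapshot_at_uptodate:
        f.take_snapshot()        # proposed fix: snapshot on UPTODATE
    f.crash_and_restart()        # killed before any later snapshot
    return f.data_tree           # a DIFF-based resync would never resend "b"

print(resync(False))  # {'a': 1} -- the in-between txn is lost
print(resync(True))   # {'a': 1, 'b': 2}
```

The simulation is deliberately minimal (no quorum, no real zxids); it only shows why moving the snapshot from NEWLEADER to UPTODATE closes the window.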
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141248#comment-13141248 ] Camille Fournier commented on ZOOKEEPER-1264: - Thanks Ben. The patch I attached changes both Learner and FollowerResyncConcurrencyTest. You should be able to repro the failure with testResyncBySnapThenDiffAfterFollowerCrashes pretty reliably. You can ignore the changes in Learner (just move the snap to after UPTODATE instead of NEWLEADER). FollowerResyncConcurrencyTest failing intermittently Key: ZOOKEEPER-1264 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.3.3, 3.4.0, 3.5.0 Reporter: Patrick Hunt Assignee: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz, logs.zip, tmp.zip The FollowerResyncConcurrencyTest test is failing intermittently. saw the following on 3.4: {noformat} junit.framework.AssertionFailedError: Should have same number of ephemerals in both followers expected:11741 but was:14001 at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400) at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat}
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141285#comment-13141285 ] Camille Fournier commented on ZOOKEEPER-1264: - Yeah, sorry, these concurrency tests are pretty much impossible to write deterministically without some additional scaffolding. If you look at lines 152-158 of the test, you want the thread that I started to have transactions passing through the leader when the qu.restart at 153 loads the follower. The follower should get a snapshot from the leader, a few more pending transactions, and then additional transactions that cause a log file to be written that will have a zxid that is not the zxid of the snapshot it created + 1. For example from Pat's log: 2011-10-28 17:09:56,691 [myid:] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:11221:FileTxnSnapLog@255] - Snapshotting: 12322 (indicating the NEWLEADER) then 2011-10-28 17:09:59,316 [myid:] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:11221:Follower@118] - Got zxid 0x12c3e expected 0x1 2011-10-28 17:09:59,330 [myid:] - INFO [SyncThread:1:FileTxnLog@195] - Creating new log file: log.12c3e FollowerResyncConcurrencyTest failing intermittently Key: ZOOKEEPER-1264 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.3.3, 3.4.0, 3.5.0 Reporter: Patrick Hunt Assignee: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz, logs.zip, tmp.zip The FollowerResyncConcurrencyTest test is failing intermittently. 
saw the following on 3.4: {noformat} junit.framework.AssertionFailedError: Should have same number of ephemerals in both followers expected:11741 but was:14001 at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400) at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1100) Killed (or missing) SendThread will cause hanging threads
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141487#comment-13141487 ] Camille Fournier commented on ZOOKEEPER-1100: - I'm reviewing this issue. Can I get some clarity? Is the issue that you get a runtime exception outside of the try block after while (state.isAlive()) so the thread dies and hangs? Why put the try block there instead of around the entire method? Killed (or missing) SendThread will cause hanging threads - Key: ZOOKEEPER-1100 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1100 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.3.3 Environment: http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E Reporter: Gunnar Wagenknecht Assignee: Rakesh R Fix For: 3.5.0 Attachments: ZOOKEEPER-1100.patch After investigating an issue with [hanging threads|http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E] I noticed that any java.lang.Error might silently kill the SendThread. Without a SendThread any thread that wants to send something will hang forever. Currently nobody will recognize a SendThread that died. I think at least a state should be flipped (or flag should be set) that causes all further send attempts to fail or to re-spin the connection loop.
[jira] [Commented] (ZOOKEEPER-1100) Killed (or missing) SendThread will cause hanging threads
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13141526#comment-13141526 ] Camille Fournier commented on ZOOKEEPER-1100: - More to the point, are you expecting just a watcher event for this? As it stands, if your send thread dies you will still have send requests hang even with a cleanup call because the state doesn't change to anything but CONNECTING. If just getting a watch event and notification on pending send requests is fine, then I think we can work with this. Killed (or missing) SendThread will cause hanging threads - Key: ZOOKEEPER-1100 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1100 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.3.3 Environment: http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E Reporter: Gunnar Wagenknecht Assignee: Rakesh R Fix For: 3.5.0 Attachments: ZOOKEEPER-1100.patch After investigating an issue with [hanging threads|http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3Citpgb6$2mi$1...@dough.gmane.org%3E] I noticed that any java.lang.Error might silently kill the SendThread. Without a SendThread any thread that wants to send something will hang forever. Currently nobody will recognize a SendThread that died. I think at least a state should be flipped (or flag should be set) that causes all further send attempts to fail or to re-spin the connection loop.
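The failure mode and the state-flip fix being discussed can be sketched like this (a toy Python simulation; `Client`, `send_loop`, and the state names are invented for illustration and are not the real client's API):

```python
# Toy model of the report above: if the thread draining the send queue dies
# from an unexpected Error without flipping the client's state, every later
# submit() would just queue up and wait forever. Flipping the state and
# failing pending requests on the way out lets callers fail fast instead.
import threading

class Client:
    def __init__(self):
        self.state = "CONNECTING"
        self.pending = []

    def send_loop(self):
        try:
            raise MemoryError("simulated java.lang.Error in the send thread")
        except BaseException:
            # the proposed fix: mark the client dead and fail anything queued
            self.state = "CLOSED"
            for req in self.pending:
                req["error"] = "connection lost"

    def submit(self, req):
        if self.state == "CLOSED":
            raise ConnectionError("send thread is dead")
        self.pending.append(req)

c = Client()
first = {"op": "create"}
c.submit(first)
t = threading.Thread(target=c.send_loop)
t.start()
t.join()               # the "send thread" died, but cleanup ran
print(c.state)         # CLOSED
print(first["error"])  # connection lost
try:
    c.submit({"op": "exists"})
except ConnectionError as e:
    print(e)           # send thread is dead
```

Without the `except BaseException` cleanup, the state would stay CONNECTING and both requests would simply sit in `pending` with no one to notice, which is the hang described in the issue.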
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13139821#comment-13139821 ] Camille Fournier commented on ZOOKEEPER-1264: - Got this reproduced on my local box with yet more hacks to the test and a few sleeps in the source code. Should be close to figuring out the problem, probably tomorrow sometime. Stay tuned. FollowerResyncConcurrencyTest failing intermittently Key: ZOOKEEPER-1264 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.3.3, 3.4.0, 3.5.0 Reporter: Patrick Hunt Assignee: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz, logs.zip, tmp.zip The FollowerResyncConcurrencyTest test is failing intermittently. saw the following on 3.4: {noformat} junit.framework.AssertionFailedError: Should have same number of ephemerals in both followers expected:11741 but was:14001 at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400) at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13139879#comment-13139879 ] Camille Fournier commented on ZOOKEEPER-1264: - OK, I found the bug. Ben, we could use your attention here. The problem is that we queue NEWLEADER before we queue UPTODATE, but in between these messages we send more sync packets to move us from SNAP to, well, UPTODATE. These get written directly to the data tree, bypassing the log. But if you immediately shut down the ZK before snapshotting again, you will lose any record of these transactions on the ZK in question. It seems to me that we should either snapshot again on UPTODATE or else wait to snapshot in the first place until that packet is sent. I don't understand why we moved to snapshot on NEWLEADER in the first place. If one of the ZAB 1.0 authors could comment, that would be useful. FollowerResyncConcurrencyTest failing intermittently Key: ZOOKEEPER-1264 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.3.3, 3.4.0, 3.5.0 Reporter: Patrick Hunt Assignee: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz, logs.zip, tmp.zip The FollowerResyncConcurrencyTest test is failing intermittently.
saw the following on 3.4: {noformat} junit.framework.AssertionFailedError: Should have same number of ephemerals in both followers expected:11741 but was:14001 at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400) at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1269) Multi deserialization issues
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13139348#comment-13139348 ] Camille Fournier commented on ZOOKEEPER-1269: - Right, ok. So I think the patch attached to this issue does exactly that, if someone would like to review it. What I'm not sure is whether the test I put in is particularly good, so would really appreciate one of the multi experts taking a gander there. Multi deserialization issues Key: ZOOKEEPER-1269 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1269 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.0 Reporter: Camille Fournier Attachments: ZOOKEEPER-1269.patch From the mailing list: FileTxnSnapLog.restore contains a code block handling a NODEEXISTS failure during deserialization. The problem is explained there in a code comment. The code block however is only executed for a CREATE txn, not for a multiTxn containing a CREATE. Even if the mentioned code block would also be executed for multi transactions, it needs adaptation for multi transactions. What if, after the first failed transaction in a multi txn during deserialization, there were subsequent transactions in the same multi that also failed? We don't know, since the first failed transaction hides the information about the remaining transactions.
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138388#comment-13138388 ] Camille Fournier commented on ZOOKEEPER-1264: - This looks like a good cleanup, thanks Patrick. FollowerResyncConcurrencyTest failing intermittently Key: ZOOKEEPER-1264 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.3.3, 3.4.0, 3.5.0 Reporter: Patrick Hunt Assignee: Patrick Hunt Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch The FollowerResyncConcurrencyTest test is failing intermittently. saw the following on 3.4: {noformat} junit.framework.AssertionFailedError: Should have same number of ephemerals in both followers expected:11741 but was:14001 at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400) at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138397#comment-13138397 ] Camille Fournier commented on ZOOKEEPER-1264: - Committed to trunk, 3.3.4 and 3.4 branches. FollowerResyncConcurrencyTest failing intermittently Key: ZOOKEEPER-1264 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.3.3, 3.4.0, 3.5.0 Reporter: Patrick Hunt Assignee: Patrick Hunt Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch The FollowerResyncConcurrencyTest test is failing intermittently. saw the following on 3.4: {noformat} junit.framework.AssertionFailedError: Should have same number of ephemerals in both followers expected:11741 but was:14001 at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400) at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1269) Multi deserialization issues
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138796#comment-13138796 ] Camille Fournier commented on ZOOKEEPER-1269: - The test here is a little vague, because I don't really understand what a proper but broken multi txn would look like. The handling of the error codes in FileTxnSnapLog is also a bit fuzzy, but I think the general refactor should fix the issue. It would be great if Marshall could take a look at this to verify. Multi deserialization issues Key: ZOOKEEPER-1269 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1269 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.0 Reporter: Camille Fournier Attachments: ZOOKEEPER-1269.patch From the mailing list: FileTxnSnapLog.restore contains a code block handling a NODEEXISTS failure during deserialization. The problem is explained there in a code comment. The code block, however, is only executed for a CREATE txn, not for a multiTxn containing a CREATE. Even if the mentioned code block were also executed for multi transactions, it would need adaptation for multi transactions. What if, after the first failed transaction in a multi txn during deserialization, there were subsequent transactions in the same multi that would also have failed? We don't know, since the first failed transaction hides the information about the remaining transactions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138894#comment-13138894 ] Camille Fournier commented on ZOOKEEPER-1264: - Looking. FollowerResyncConcurrencyTest failing intermittently Key: ZOOKEEPER-1264 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.3.3, 3.4.0, 3.5.0 Reporter: Patrick Hunt Assignee: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz The FollowerResyncConcurrencyTest test is failing intermittently. saw the following on 3.4: {noformat} junit.framework.AssertionFailedError: Should have same number of ephemerals in both followers expected:11741 but was:14001 at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400) at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138963#comment-13138963 ] Camille Fournier commented on ZOOKEEPER-1264: - It might also be somewhat helpful if you could send me the txn logs from the test servers but I realize that might be too much to ask. FollowerResyncConcurrencyTest failing intermittently Key: ZOOKEEPER-1264 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.3.3, 3.4.0, 3.5.0 Reporter: Patrick Hunt Assignee: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz The FollowerResyncConcurrencyTest test is failing intermittently. saw the following on 3.4: {noformat} junit.framework.AssertionFailedError: Should have same number of ephemerals in both followers expected:11741 but was:14001 at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400) at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1269) Multi deserialization issues
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138984#comment-13138984 ] Camille Fournier commented on ZOOKEEPER-1269: - Are you sure about that, given https://issues.apache.org/jira/browse/ZOOKEEPER-1046? Multi deserialization issues Key: ZOOKEEPER-1269 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1269 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.0 Reporter: Camille Fournier Attachments: ZOOKEEPER-1269.patch From the mailing list: FileTxnSnapLog.restore contains a code block handling a NODEEXISTS failure during deserialization. The problem is explained there in a code comment. The code block, however, is only executed for a CREATE txn, not for a multiTxn containing a CREATE. Even if the mentioned code block were also executed for multi transactions, it would need adaptation for multi transactions. What if, after the first failed transaction in a multi txn during deserialization, there were subsequent transactions in the same multi that would also have failed? We don't know, since the first failed transaction hides the information about the remaining transactions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138987#comment-13138987 ] Camille Fournier commented on ZOOKEEPER-1264: - Yeah, I spent a bit of time looking at this. I have a few ideas but it would probably go a lot faster if I had logs to examine since I can't seem to repro it myself. If you can get me some I will look more this weekend. FollowerResyncConcurrencyTest failing intermittently Key: ZOOKEEPER-1264 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.3.3, 3.4.0, 3.5.0 Reporter: Patrick Hunt Assignee: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz The FollowerResyncConcurrencyTest test is failing intermittently. saw the following on 3.4: {noformat} junit.framework.AssertionFailedError: Should have same number of ephemerals in both followers expected:11741 but was:14001 at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400) at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1269) Multi deserialization issues
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13139004#comment-13139004 ] Camille Fournier commented on ZOOKEEPER-1269: - Ah yes, being in the log would be enough for it to be true if snapshots were taken in a frozen system state. But since they are not, these operations can fail during playback due to concurrency issues. Multi isn't a special case above the other zk ops; they all have this potential race. Multi deserialization issues Key: ZOOKEEPER-1269 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1269 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.0 Reporter: Camille Fournier Attachments: ZOOKEEPER-1269.patch From the mailing list: FileTxnSnapLog.restore contains a code block handling a NODEEXISTS failure during deserialization. The problem is explained there in a code comment. The code block, however, is only executed for a CREATE txn, not for a multiTxn containing a CREATE. Even if the mentioned code block were also executed for multi transactions, it would need adaptation for multi transactions. What if, after the first failed transaction in a multi txn during deserialization, there were subsequent transactions in the same multi that would also have failed? We don't know, since the first failed transaction hides the information about the remaining transactions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
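The fuzzy-snapshot race described above can be illustrated with a small, self-contained sketch. All names here are hypothetical (this is not ZooKeeper's actual FileTxnSnapLog or DataTree API): a create replayed from the txn log may hit a node the snapshot already contains, so a NODEEXISTS during replay is benign, and the tolerance has to be applied to every sub-operation of a multi rather than only to standalone creates.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: replaying a txn log against a data tree restored
// from a fuzzy snapshot. The snapshot may already contain the effect of a
// create that appears later in the log.
public class ReplaySketch {
    enum Op { CREATE, DELETE }

    static class Txn {
        final Op op; final String path;
        Txn(Op op, String path) { this.op = op; this.path = path; }
    }

    final Set<String> tree = new HashSet<>();

    // Apply one txn; returns false when a create hits an existing node
    // (the moral equivalent of NODEEXISTS).
    boolean applyOne(Txn t) {
        return t.op == Op.CREATE ? tree.add(t.path) : tree.remove(t.path);
    }

    // Tolerate NODEEXISTS per sub-op, so one failed create inside a multi
    // does not hide the outcome of the remaining sub-ops.
    void replayMulti(List<Txn> multi) {
        for (Txn t : multi) {
            if (!applyOne(t) && t.op == Op.CREATE) {
                // benign: the fuzzy snapshot already contained this node
            }
        }
    }

    public static void main(String[] args) {
        ReplaySketch r = new ReplaySketch();
        r.tree.add("/a");   // already present in the snapshot
        r.replayMulti(Arrays.asList(new Txn(Op.CREATE, "/a"),
                                    new Txn(Op.CREATE, "/b")));
        System.out.println(r.tree.contains("/b"));  // the multi still applied /b
    }
}
```

The point of the per-sub-op loop is exactly the concern raised in the issue description: handling the failure at the granularity of the whole multi would lose information about which later sub-ops also needed the tolerance.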
[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13139019#comment-13139019 ] Camille Fournier commented on ZOOKEEPER-1264: - Thanks Patrick. My suspicions were true, the failing zk has a chunk missing out of its logs that corresponds to the missing ephemeral nodes (snapshot snapshot.12322, log log.12c3e, but the earlier log file doesn't have txns between 2322 and 2c3e, they seem to just be missing). Now to figure out why it doesn't have those log files... FollowerResyncConcurrencyTest failing intermittently Key: ZOOKEEPER-1264 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.3.3, 3.4.0, 3.5.0 Reporter: Patrick Hunt Assignee: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz, logs.zip, tmp.zip The FollowerResyncConcurrencyTest test is failing intermittently. saw the following on 3.4: {noformat} junit.framework.AssertionFailedError: Should have same number of ephemerals in both followers expected:11741 but was:14001 at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400) at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1246) Dead code in PrepRequestProcessor catch Exception block
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13136308#comment-13136308 ] Camille Fournier commented on ZOOKEEPER-1246: - Thanks for migrating this to trunk, Patrick! Dead code in PrepRequestProcessor catch Exception block --- Key: ZOOKEEPER-1246 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1246 Project: ZooKeeper Issue Type: Sub-task Reporter: Thomas Koch Assignee: Camille Fournier Priority: Blocker Fix For: 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1246.patch, ZOOKEEPER-1246.patch, ZOOKEEPER-1246_trunk.patch, ZOOKEEPER-1246_trunk.patch This is a regression introduced by ZOOKEEPER-965 (multi transactions). The catch(Exception e) block in PrepRequestProcessor.pRequest contains an if block with the condition request.getHdr() != null. This condition always evaluates to false since the changes in ZOOKEEPER-965. This is caused by a change in sequence: before ZK-965, the txnHeader was set _before_ the deserialization of the request; afterwards, the deserialization happens before request.setHdr is called. So the following RequestProcessors won't see the request as a failed one but as a read request, since it doesn't have a hdr set. Notes: - it is very bad practice to catch Exception. The block should rather catch IOException. - The check whether the TxnHeader is set in the request is used in several places to see whether the request is a read or a write request. It isn't obvious to a newcomer what it means for a request to have a hdr set or not. - At the beginning of pRequest, the hdr and txn of the request are set to null. However, there is no chance that these fields could ever be non-null at this point. The code nevertheless suggests that this could be the case; there should rather be an assertion confirming that these fields are indeed null. 
The practice of doing things just in case, even when there is no chance that the case could happen, is a very stinky code smell and means that the code isn't understandable or trustworthy. - The multi transaction switch/case block in pRequest is very hard to read, because it misuses the request.{hdr|txn} fields as local variables. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
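The sequencing regression described in this issue can be shown in a minimal sketch. The names here are illustrative, not the real PrepRequestProcessor: once deserialization runs before the header is attached, a request that fails to deserialize never gets a hdr, so the hdr != null branch inside the catch block is dead code and downstream processors misread the failed write as a read.

```java
// Hypothetical sketch of the ordering bug: deserialize first, set hdr second.
public class HdrOrderingSketch {
    static class Request {
        Object hdr;          // txn header; null is read downstream as "read request"
        boolean failed;
    }

    static void deserialize(Request r, boolean corrupt) throws Exception {
        if (corrupt) throw new Exception("record truncated");
    }

    // post-ZOOKEEPER-965 order: deserialization happens before setHdr
    static Request process(boolean corrupt) {
        Request r = new Request();
        try {
            deserialize(r, corrupt);   // throws before hdr is ever assigned...
            r.hdr = new Object();
        } catch (Exception e) {
            if (r.hdr != null) {       // ...so this branch can never run
                r.failed = true;
            }
        }
        return r;
    }

    public static void main(String[] args) {
        Request r = process(true);
        // the failed write is indistinguishable from a read: no hdr, no flag
        System.out.println(r.hdr == null && !r.failed);
    }
}
```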
[jira] [Commented] (ZOOKEEPER-1248) multi transaction sets request.exception without reason
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13135362#comment-13135362 ] Camille Fournier commented on ZOOKEEPER-1248: - It's those damned read-only mode tests that seem to be so buggy that are failing. Do we think this failure is meaningful or not? multi transaction sets request.exception without reason --- Key: ZOOKEEPER-1248 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1248 Project: ZooKeeper Issue Type: Sub-task Reporter: Thomas Koch Assignee: Thomas Koch Attachments: ZOOKEEPER-1248.patch, ZOOKEEPER-1248.patch I'm trying to understand the purpose of the exception field in request. This isn't made easier by the fact that the multi case in PrepRequestProcessor sets the exception without reason. The only code that calls request.getException() is in FinalRequestProcessor and this code only acts when the operation _is not_ a multi operation. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1246) Dead code in PrepRequestProcessor catch Exception block
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13135376#comment-13135376 ] Camille Fournier commented on ZOOKEEPER-1246: - Ok, after a bit of looking, it seems that what we need to do is catch IOException and appropriately raise it as a marshalling error. I am going to see what I can do to get a test for this. Dead code in PrepRequestProcessor catch Exception block --- Key: ZOOKEEPER-1246 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1246 Project: ZooKeeper Issue Type: Sub-task Reporter: Thomas Koch Priority: Blocker Fix For: 3.4.0, 3.5.0 This is a regression introduced by ZOOKEEPER-965 (multi transactions). The catch(Exception e) block in PrepRequestProcessor.pRequest contains an if block with the condition request.getHdr() != null. This condition always evaluates to false since the changes in ZOOKEEPER-965. This is caused by a change in sequence: before ZK-965, the txnHeader was set _before_ the deserialization of the request; afterwards, the deserialization happens before request.setHdr is called. So the following RequestProcessors won't see the request as a failed one but as a read request, since it doesn't have a hdr set. Notes: - it is very bad practice to catch Exception. The block should rather catch IOException. - The check whether the TxnHeader is set in the request is used in several places to see whether the request is a read or a write request. It isn't obvious to a newcomer what it means for a request to have a hdr set or not. - At the beginning of pRequest, the hdr and txn of the request are set to null. However, there is no chance that these fields could ever be non-null at this point. The code nevertheless suggests that this could be the case; there should rather be an assertion confirming that these fields are indeed null. 
The practice of doing things just in case, even when there is no chance that the case could happen, is a very stinky code smell and means that the code isn't understandable or trustworthy. - The multi transaction switch/case block in pRequest is very hard to read, because it misuses the request.{hdr|txn} fields as local variables. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
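The fix direction in the comment above ("catch IOException and appropriately raise that as a marshalling error") can be sketched as follows. The names and the error-code constant are illustrative, not the real PrepRequestProcessor or KeeperException API: the idea is that the deserialization failure becomes explicit instead of hiding behind a blanket catch(Exception) whose guard can never fire.

```java
import java.io.IOException;

// Hypothetical sketch: surface a deserialization failure as a marshalling
// error code that later request processors can act on.
public class MarshallingSketch {
    static final int OK = 0;
    static final int MARSHALLING_ERROR = -5;   // illustrative error code

    static void deserialize(byte[] record) throws IOException {
        if (record.length == 0) throw new IOException("empty record");
    }

    static int pRequest(byte[] record) {
        try {
            deserialize(record);
            return OK;
        } catch (IOException e) {
            // the failure is now explicit and visible to later processors
            return MARSHALLING_ERROR;
        }
    }

    public static void main(String[] args) {
        System.out.println(pRequest(new byte[0]) == MARSHALLING_ERROR);
    }
}
```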
[jira] [Commented] (ZOOKEEPER-1246) Dead code in PrepRequestProcessor catch Exception block
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13135410#comment-13135410 ] Camille Fournier commented on ZOOKEEPER-1246: - The formatting may be wack and I haven't gone over it with a fine-tooth comb, but I think this patch takes care of it. Dead code in PrepRequestProcessor catch Exception block --- Key: ZOOKEEPER-1246 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1246 Project: ZooKeeper Issue Type: Sub-task Reporter: Thomas Koch Priority: Blocker Fix For: 3.4.0, 3.5.0 Attachments: ZOOKEEPER-1246.patch This is a regression introduced by ZOOKEEPER-965 (multi transactions). The catch(Exception e) block in PrepRequestProcessor.pRequest contains an if block with the condition request.getHdr() != null. This condition always evaluates to false since the changes in ZOOKEEPER-965. This is caused by a change in sequence: before ZK-965, the txnHeader was set _before_ the deserialization of the request; afterwards, the deserialization happens before request.setHdr is called. So the following RequestProcessors won't see the request as a failed one but as a read request, since it doesn't have a hdr set. Notes: - it is very bad practice to catch Exception. The block should rather catch IOException. - The check whether the TxnHeader is set in the request is used in several places to see whether the request is a read or a write request. It isn't obvious to a newcomer what it means for a request to have a hdr set or not. - At the beginning of pRequest, the hdr and txn of the request are set to null. However, there is no chance that these fields could ever be non-null at this point. The code nevertheless suggests that this could be the case; there should rather be an assertion confirming that these fields are indeed null. 
The practice of doing things just in case, even when there is no chance that the case could happen, is a very stinky code smell and means that the code isn't understandable or trustworthy. - The multi transaction switch/case block in pRequest is very hard to read, because it misuses the request.{hdr|txn} fields as local variables. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1243) New 4lw for short simple monitoring ldck
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13134493#comment-13134493 ] Camille Fournier commented on ZOOKEEPER-1243: - Added html docs, removed println in test. Can someone please review this? We've been suffering heavily from ZOOKEEPER-1197 and I would really appreciate it if we could get this into 3.4 New 4lw for short simple monitoring ldck Key: ZOOKEEPER-1243 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1243 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.3.3, 3.4.0 Reporter: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0 Attachments: ZOOKEEPER-1243-2, ZOOKEEPER-1243-4.patch, ZOOKEEPER-1243.patch The existing monitoring fails so often due to https://issues.apache.org/jira/browse/ZOOKEEPER-1197 that we need a workaround. This introduces a short 4lw called ldck that just runs ServerStats.toString to get information about the server's leadership status. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
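The shape of the proposed command can be sketched with a small dispatch table. This is illustrative code, not ZooKeeper's actual four-letter-word handling: the hypothetical "ldck" entry simply returns the server-stats string so monitoring can read leadership status cheaply.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of a four-letter-word dispatch table.
public class FourLetterSketch {
    // stand-in for ServerStats.toString()
    static String serverStats() {
        return "Mode: leader";
    }

    static final Map<String, Supplier<String>> CMDS =
            Map.of("ldck", FourLetterSketch::serverStats);

    static String dispatch(String cmd) {
        Supplier<String> handler = CMDS.get(cmd);
        return handler == null ? "unknown command: " + cmd : handler.get();
    }

    public static void main(String[] args) {
        System.out.println(dispatch("ldck"));
    }
}
```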
[jira] [Commented] (ZOOKEEPER-1243) New 4lw for short simple monitoring ldck
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13134558#comment-13134558 ] Camille Fournier commented on ZOOKEEPER-1243: - Oh, you are right; I thought it was weird that we didn't have this. Why we chose to put the srvr command in the same command thread as stat, with the only differentiator being a guarding if statement... Ok, I will close this, thanks. New 4lw for short simple monitoring ldck Key: ZOOKEEPER-1243 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1243 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.3.3, 3.4.0 Reporter: Camille Fournier Assignee: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0 Attachments: ZOOKEEPER-1243-2, ZOOKEEPER-1243-4.patch, ZOOKEEPER-1243.patch The existing monitoring fails so often due to https://issues.apache.org/jira/browse/ZOOKEEPER-1197 that we need a workaround. This introduces a short 4lw called ldck that just runs ServerStats.toString to get information about the server's leadership status. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1243) New 4lw for short simple monitoring ldck
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13134564#comment-13134564 ] Camille Fournier commented on ZOOKEEPER-1243: - Indeed... put it on the todo list. New 4lw for short simple monitoring ldck Key: ZOOKEEPER-1243 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1243 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.3.3, 3.4.0 Reporter: Camille Fournier Assignee: Camille Fournier Priority: Blocker Fix For: 3.3.4, 3.4.0 Attachments: ZOOKEEPER-1243-2, ZOOKEEPER-1243-4.patch, ZOOKEEPER-1243.patch The existing monitoring fails so often due to https://issues.apache.org/jira/browse/ZOOKEEPER-1197 that we need a workaround. This introduces a short 4lw called ldck that just runs ServerStats.toString to get information about the server's leadership status. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1237) ERRORs being logged when queued responses are sent after socket has closed.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13132146#comment-13132146 ] Camille Fournier commented on ZOOKEEPER-1237: - Why do we ignore that exception in sendBuffer, instead of closing the connection at that point? ERRORs being logged when queued responses are sent after socket has closed. --- Key: ZOOKEEPER-1237 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1237 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.3.4, 3.4.0, 3.5.0 Reporter: Patrick Hunt Fix For: 3.3.4, 3.4.0, 3.5.0 After applying ZOOKEEPER-1049 to 3.3.3 (I believe the same problem exists in 3.4/3.5 but haven't tested this) I'm seeing the following exception more frequently: {noformat} Oct 19, 1:31:53 PM ERROR Unexpected Exception: java.nio.channels.CancelledKeyException at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55) at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59) at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418) at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509) at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:367) at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73) {noformat} This is a long standing problem where we try to send a response after the socket has been closed. Prior to ZOOKEEPER-1049 this issues happened much less frequently (2 sec linger), but I believe it was possible. The timing window is just wider now. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
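The race behind the CancelledKeyException, and the cleanup direction suggested in the comment, can be sketched with illustrative code (not the actual NIOServerCnxn): a response is queued, the client socket closes and its selection key is cancelled, and a later key.interestOps() call would throw. Instead of logging and ignoring, the connection can be closed when the key is found invalid.

```java
// Hypothetical sketch of guarding a late sendBuffer against a cancelled key.
public class SendBufferSketch {
    static class Cnxn {
        boolean keyValid = true;   // stand-in for SelectionKey.isValid()
        boolean closed = false;

        void sendBuffer(String response) {
            if (!keyValid) {
                // the key was cancelled by a racing close: clean up the
                // connection rather than ignore the exception
                close();
                return;
            }
            // ... a real implementation would call key.interestOps(OP_WRITE) here ...
        }

        void close() { closed = true; }
    }

    public static void main(String[] args) {
        Cnxn c = new Cnxn();
        c.keyValid = false;        // simulate the socket closing first
        c.sendBuffer("late response");
        System.out.println(c.closed);
    }
}
```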