[jira] Created: (ZOOKEEPER-890) C client invokes watcher callbacks multiple times
C client invokes watcher callbacks multiple times - Key: ZOOKEEPER-890 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-890 Project: Zookeeper Issue Type: Bug Components: c client Affects Versions: 3.3.1 Environment: Mac OS X 10.6.5 Reporter: Austin Shoemaker Priority: Critical Attachments: watcher_twice.c The collect_session_watchers function in zk_hashtable.c gathers watchers from active_node_watchers, active_exist_watchers, and active_child_watchers without removing the watchers from the table. Please see attached repro case and patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-890) C client invokes watcher callbacks multiple times
[ https://issues.apache.org/jira/browse/ZOOKEEPER-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Austin Shoemaker updated ZOOKEEPER-890: --- Attachment: watcher_twice.c C client invokes watcher callbacks multiple times - Key: ZOOKEEPER-890 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-890 Project: Zookeeper Issue Type: Bug Components: c client Affects Versions: 3.3.1 Environment: Mac OS X 10.6.5 Reporter: Austin Shoemaker Priority: Critical Attachments: watcher_twice.c The collect_session_watchers function in zk_hashtable.c gathers watchers from active_node_watchers, active_exist_watchers, and active_child_watchers without removing the watchers from the table. Please see attached repro case and patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-890) C client invokes watcher callbacks multiple times
[ https://issues.apache.org/jira/browse/ZOOKEEPER-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Austin Shoemaker updated ZOOKEEPER-890: --- Attachment: ZOOKEEPER-890.patch Patch that clears active watcher sets when broadcasting a session event to all watchers. C client invokes watcher callbacks multiple times - Key: ZOOKEEPER-890 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-890 Project: Zookeeper Issue Type: Bug Components: c client Affects Versions: 3.3.1 Environment: Mac OS X 10.6.5 Reporter: Austin Shoemaker Priority: Critical Attachments: watcher_twice.c, ZOOKEEPER-890.patch The collect_session_watchers function in zk_hashtable.c gathers watchers from active_node_watchers, active_exist_watchers, and active_child_watchers without removing the watchers from the table. Please see attached repro case and patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-890) C client invokes watcher callbacks multiple times
[ https://issues.apache.org/jira/browse/ZOOKEEPER-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Austin Shoemaker updated ZOOKEEPER-890: --- Description: Code using the C client assumes that watcher callbacks are called exactly once. If the watcher is called more than once, the process will likely overwrite freed memory and/or crash. collect_session_watchers (zk_hashtable.c) gathers watchers from active_node_watchers, active_exist_watchers, and active_child_watchers without removing them. This results in watchers being invoked more than once. Test code is attached that reproduces the bug, along with a proposed patch. was: The collect_session_watchers function in zk_hashtable.c gathers watchers from active_node_watchers, active_exist_watchers, and active_child_watchers without removing the watchers from the table. Please see attached repro case and patch. C client invokes watcher callbacks multiple times - Key: ZOOKEEPER-890 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-890 Project: Zookeeper Issue Type: Bug Components: c client Affects Versions: 3.3.1 Environment: Mac OS X 10.6.5 Reporter: Austin Shoemaker Priority: Critical Attachments: watcher_twice.c, ZOOKEEPER-890.patch Code using the C client assumes that watcher callbacks are called exactly once. If the watcher is called more than once, the process will likely overwrite freed memory and/or crash. collect_session_watchers (zk_hashtable.c) gathers watchers from active_node_watchers, active_exist_watchers, and active_child_watchers without removing them. This results in watchers being invoked more than once. Test code is attached that reproduces the bug, along with a proposed patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-822) Leader election taking a long time to complete
[ https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12918878#action_12918878 ] Hudson commented on ZOOKEEPER-822: -- Integrated in ZooKeeper-trunk #959 (See [https://hudson.apache.org/hudson/job/ZooKeeper-trunk/959/]) ZOOKEEPER-822. Leader election taking a long time to complete Leader election taking a long time to complete --- Key: ZOOKEEPER-822 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-822 Project: Zookeeper Issue Type: Bug Components: quorum Affects Versions: 3.3.0 Reporter: Vishal K Assignee: Vishal K Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: 822.tar.gz, rhel.tar.gz, test_zookeeper_1.log, test_zookeeper_2.log, zk_leader_election.tar.gz, zookeeper-3.4.0.tar.gz, ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, ZOOKEEPER-822.patch_v1 Created a 3 node cluster. 1. Fail the ZK leader. 2. Let leader election finish. Restart the leader and let it join the quorum. 3. Repeat. After a few rounds, leader election takes anywhere from 25 to 60 seconds to finish. Note: we didn't have any ZK clients and no new znodes were created. zoo.cfg is shown below:
{noformat}
#Mon Jul 19 12:15:10 UTC 2010
server.1=192.168.4.12\:2888\:3888
server.0=192.168.4.11\:2888\:3888
clientPort=2181
dataDir=/var/zookeeper
syncLimit=2
server.2=192.168.4.13\:2888\:3888
initLimit=5
tickTime=2000
{noformat}
I have attached logs from two nodes that took a long time to form the cluster after failing the leader. The leader was down anyway, so logs from that node shouldn't matter. Look for START HERE; logs after that point are the ones of interest. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-844) handle auth failure in java client
[ https://issues.apache.org/jira/browse/ZOOKEEPER-844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12918879#action_12918879 ] Hudson commented on ZOOKEEPER-844: -- Integrated in ZooKeeper-trunk #959 (See [https://hudson.apache.org/hudson/job/ZooKeeper-trunk/959/]) ZOOKEEPER-844. handle auth failure in java client handle auth failure in java client -- Key: ZOOKEEPER-844 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-844 Project: Zookeeper Issue Type: Bug Components: java client Affects Versions: 3.3.1 Reporter: Camille Fournier Assignee: Camille Fournier Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-844.patch, ZOOKEEPER332-844 ClientCnxn.java currently has the following code:
{noformat}
if (replyHdr.getXid() == -4) {
    // -4 is the xid for AuthPacket
    // TODO: process AuthPacket here
    if (LOG.isDebugEnabled()) {
        LOG.debug("Got auth sessionid:0x" + Long.toHexString(sessionId));
    }
    return;
}
{noformat}
Auth failures appear to cause the server to disconnect, but the client never gets a proper state change or notification that auth has failed. This makes the scenario very difficult to handle, as the client goes into a loop of sending bad auth, getting disconnected, trying to reconnect, and sending bad auth again, over and over. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
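For context, what the fix enables is a client-visible auth-failure state. A minimal sketch of how application code could react to it, assuming the client delivers KeeperState.AuthFailed to the default watcher (which is what this issue's patch is meant to enable); the connect string and credentials below are illustrative:
{noformat}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;

// Sketch: stop retrying once auth has failed, instead of looping on
// bad auth -> disconnect -> reconnect. Assumes the client surfaces
// KeeperState.AuthFailed, which is what this issue's patch enables.
public class AuthAwareClient {
    public static void main(String[] args) throws Exception {
        final CountDownLatch authFailed = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == KeeperState.AuthFailed) {
                    authFailed.countDown(); // give up, don't re-send credentials
                }
            }
        });
        zk.addAuthInfo("digest", "baduser:badpass".getBytes());
        if (authFailed.await(30, TimeUnit.SECONDS)) {
            System.err.println("authentication failed, closing session");
        }
        zk.close();
    }
}
{noformat}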
[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load
[ https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12918940#action_12918940 ] Alexandre Hardy commented on ZOOKEEPER-885: --- {{maxClientCnxns}} is set to 30, so 45 clients spread across 3 servers should not be unreasonable, and I do have confirmation that a session is established for every client (all 45 of them) before beginning the disk load with {{dd}}. I'm aiming for 0 disconnects with this simple example. Zookeeper drops connections under moderate IO load -- Key: ZOOKEEPER-885 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885 Project: Zookeeper Issue Type: Bug Components: server Affects Versions: 3.2.2 Environment: Debian (Lenny) 1Gb RAM swap disabled 100Mb heap for zookeeper Reporter: Alexandre Hardy Priority: Critical Attachments: WatcherTest.java A zookeeper server under minimal load, with a number of clients watching exactly one node, will fail to maintain its connections when the machine is subjected to moderate IO load. In a specific test example we had three zookeeper servers running on dedicated machines with 45 clients connected, watching exactly one node. The clients would disconnect after moderate load was added to each of the zookeeper servers with the command: {noformat} dd if=/dev/urandom of=/dev/mapper/nimbula-test {noformat} The {{dd}} command transferred data at a rate of about 4Mb/s. The same thing happens with {noformat} dd if=/dev/zero of=/dev/mapper/nimbula-test {noformat} It seems strange that such a moderate load should cause instability in the connection. Very few other processes were running; the machines were set up to test the connection instability we have experienced. Clients performed no other read or mutation operations. Although the documentation states that there should be minimal competing IO load on the zookeeper server, it seems reasonable that moderate IO should not cause problems in this case. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
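For reference, each of the 45 clients amounts to something like the following sketch (the host zk1:2181 and path /watched-node are illustrative; the actual test is the attached WatcherTest.java):
{noformat}
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;

// Minimal client of the kind described: connect, watch exactly one
// node, and log every disconnect. Host and path are illustrative;
// the real test is the attached WatcherTest.java.
public class WatchOneNode {
    public static void main(String[] args) throws Exception {
        final CountDownLatch expired = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == KeeperState.Disconnected) {
                    System.out.println("disconnected at " + System.currentTimeMillis());
                } else if (event.getState() == KeeperState.Expired) {
                    expired.countDown();
                }
            }
        });
        zk.exists("/watched-node", true); // default watcher, one node
        expired.await();                  // run until the session dies
        zk.close();
    }
}
{noformat}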
[jira] Updated: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load
[ https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandre Hardy updated ZOOKEEPER-885: -- Attachment: zklogs.tar.gz Attached are logs from the two sessions with disconnects. I have not filtered the logs in any way. The logs for 3.3.1 are the clearest, and show exactly one failure roughly 3 minutes after the logs start. Unfortunately the logs don't offer much information (as far as I can make out). Should I enable more verbose logging? Zookeeper drops connections under moderate IO load -- Key: ZOOKEEPER-885 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885 Project: Zookeeper Issue Type: Bug Components: server Affects Versions: 3.2.2 Environment: Debian (Lenny) 1Gb RAM swap disabled 100Mb heap for zookeeper Reporter: Alexandre Hardy Priority: Critical Attachments: WatcherTest.java, zklogs.tar.gz A zookeeper server under minimal load, with a number of clients watching exactly one node, will fail to maintain its connections when the machine is subjected to moderate IO load. In a specific test example we had three zookeeper servers running on dedicated machines with 45 clients connected, watching exactly one node. The clients would disconnect after moderate load was added to each of the zookeeper servers with the command: {noformat} dd if=/dev/urandom of=/dev/mapper/nimbula-test {noformat} The {{dd}} command transferred data at a rate of about 4Mb/s. The same thing happens with {noformat} dd if=/dev/zero of=/dev/mapper/nimbula-test {noformat} It seems strange that such a moderate load should cause instability in the connection. Very few other processes were running; the machines were set up to test the connection instability we have experienced. Clients performed no other read or mutation operations. Although the documentation states that there should be minimal competing IO load on the zookeeper server, it seems reasonable that moderate IO should not cause problems in this case. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: znode inconsistencies across ZooKeeper servers
Vishal, this sounds like a bug in ZK to me. Can you create a JIRA with this description, your configuration files from all servers, and the log files from all servers during the time of the incident? If you could run the servers with DEBUG level logging during the time you reproduce the issue, that would probably help: https://issues.apache.org/jira/browse/ZOOKEEPER Thanks! Patrick

On Wed, Oct 6, 2010 at 2:57 PM, Vishal K vishalm...@gmail.com wrote:
Hi Patrick, You are correct, the test restarts both the ZooKeeper server and the client. The client opens a new connection after restarting, so we would expect the ephemeral znode (/foo) to expire after the session timeout. However, the client with the new session creates the ephemeral znode (/foo) again after it reboots (it sets a watch for /foo and recreates /foo if it is deleted or doesn't exist). The client is not reusing the session ID. What I expect to see is that the older /foo should expire, after which a new /foo should get created. Is my expectation correct? What confuses me is the following output of 3 successive getstat /foo requests on A (note the zxid, time, and owner fields). Notice that the older znode reappeared. At the same time, when I do getstat at B and C, I see the newer /foo.

log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
cZxid = 0x105ef
ctime = Tue Oct 05 15:00:50 UTC 2010
mZxid = 0x105ef
mtime = Tue Oct 05 15:00:50 UTC 2010
pZxid = 0x105ef
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x2b7ce57ce4
dataLength = 54
numChildren = 0

log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
cZxid = 0x10607
ctime = Tue Oct 05 15:01:07 UTC 2010
mZxid = 0x10607
mtime = Tue Oct 05 15:01:07 UTC 2010
pZxid = 0x10607
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x2b7ce5bda4
dataLength = 54
numChildren = 0

log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
cZxid = 0x105ef
ctime = Tue Oct 05 15:00:50 UTC 2010
mZxid = 0x105ef
mtime = Tue Oct 05 15:00:50 UTC 2010
pZxid = 0x105ef
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x2b7ce57ce4
dataLength = 54
numChildren = 0

Thanks for your help. -Vishal

On Wed, Oct 6, 2010 at 4:45 PM, Patrick Hunt ph...@apache.org wrote:
Vishal, the attachment seems to be getting removed by the list daemon (I don't have it), can you create a JIRA and attach? Also, this is a good question for the ppl on zookeeper-user. (ccing) You are aware that ephemeral znodes are tied to the session, and that sessions only expire after the session timeout period? At that time, any znodes created during that session are then deleted. The fact that you are killing your client process leads me to believe that you are not closing the session cleanly (meaning that it will eventually expire after the session timeout period), in which case the ephemeral znodes _should_ reappear when A is restarted and successfully rejoins the cluster (at least until the session timeout is exceeded). Patrick

On Tue, Oct 5, 2010 at 11:04 AM, Vishal K vishalm...@gmail.com wrote:
Hi, I have a 3 node ZK cluster (A, B, C). On one of the nodes (node A), I have a ZK client running that connects to the local server and creates an ephemeral znode to indicate to clients on other nodes that it is online. I have a test script that reboots the ZooKeeper server as well as the client on A. The test does a getstat on the ephemeral znode created by the client on A. I am seeing that the view of znodes on A is different from the other 2 nodes. I can tell this from the session ID that the client gets after reconnecting to the local ZK server. So the test is simple:
- kill the zookeeper server and client process
- wait for a few seconds
- do zkCli.sh stat ... test.out
What I am seeing is that the ephemeral znode with the old zxid, time, and session ID is reappearing on node A. I have attached the output of 3 consecutive getstat requests of the test (see client_getstat.out). Notice that the third output is the same as the first one. That is, the old ephemeral znode reappeared at A. However, both B and C are showing the latest znode with the correct time, zxid, and session ID (output not attached). After this point, all following getstat requests on A show the old znode, whereas B and C show the correct znode every time the client on A comes online. This is something very perplexing. Earlier I thought this was a bug in my client implementation. But the test shows that the ZK server on A after reboot is out of sync with the rest of the servers.
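For reference, the presence pattern described above — create an ephemeral znode and re-create it whenever it disappears — looks roughly like this in the Java client (a sketch; the /foo path comes from the thread, while the node data is illustrative):
{noformat}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

// Sketch of the presence pattern described in the thread: keep an
// ephemeral /foo alive, re-creating it whenever it disappears. The
// node data ("online") is illustrative.
public class PresenceNode implements Watcher {
    private static final String PATH = "/foo";
    private final ZooKeeper zk;

    PresenceNode(ZooKeeper zk) { this.zk = zk; }

    void ensurePresent() throws KeeperException, InterruptedException {
        if (zk.exists(PATH, this) == null) { // leaves a watch on /foo
            try {
                zk.create(PATH, "online".getBytes(),
                        Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            } catch (KeeperException.NodeExistsException e) {
                // a previous session's node is still around; the watch
                // set above will fire when it finally goes away
            }
        }
    }

    public void process(WatchedEvent event) {
        if (PATH.equals(event.getPath())) {
            try {
                ensurePresent(); // re-arm the watch, re-create if gone
            } catch (Exception e) {
                e.printStackTrace(); // real code would retry/log
            }
        }
    }
}
{noformat}
Note the corner case Patrick points out: if the old session was never closed cleanly, the old ephemeral /foo legitimately survives until the session timeout, so hitting NodeExistsException here is expected for a while after a hard kill.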
Re: znode inconsistencies across ZooKeeper servers
Sure, I will reproduce it with debug enabled and create a JIRA. Thanks.

On Thu, Oct 7, 2010 at 12:23 PM, Patrick Hunt ph...@apache.org wrote:
Vishal, this sounds like a bug in ZK to me. Can you create a JIRA with this description, your configuration files from all servers, and the log files from all servers during the time of the incident? If you could run the servers with DEBUG level logging during the time you reproduce the issue, that would probably help: https://issues.apache.org/jira/browse/ZOOKEEPER Thanks! Patrick

On Wed, Oct 6, 2010 at 2:57 PM, Vishal K vishalm...@gmail.com wrote:
Hi Patrick, You are correct, the test restarts both the ZooKeeper server and the client. The client opens a new connection after restarting, so we would expect the ephemeral znode (/foo) to expire after the session timeout. However, the client with the new session creates the ephemeral znode (/foo) again after it reboots (it sets a watch for /foo and recreates /foo if it is deleted or doesn't exist). The client is not reusing the session ID. What I expect to see is that the older /foo should expire, after which a new /foo should get created. Is my expectation correct? What confuses me is the following output of 3 successive getstat /foo requests on A (note the zxid, time, and owner fields). Notice that the older znode reappeared. At the same time, when I do getstat at B and C, I see the newer /foo.

log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
cZxid = 0x105ef
ctime = Tue Oct 05 15:00:50 UTC 2010
mZxid = 0x105ef
mtime = Tue Oct 05 15:00:50 UTC 2010
pZxid = 0x105ef
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x2b7ce57ce4
dataLength = 54
numChildren = 0

log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
cZxid = 0x10607
ctime = Tue Oct 05 15:01:07 UTC 2010
mZxid = 0x10607
mtime = Tue Oct 05 15:01:07 UTC 2010
pZxid = 0x10607
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x2b7ce5bda4
dataLength = 54
numChildren = 0

log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
cZxid = 0x105ef
ctime = Tue Oct 05 15:00:50 UTC 2010
mZxid = 0x105ef
mtime = Tue Oct 05 15:00:50 UTC 2010
pZxid = 0x105ef
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x2b7ce57ce4
dataLength = 54
numChildren = 0

Thanks for your help. -Vishal

On Wed, Oct 6, 2010 at 4:45 PM, Patrick Hunt ph...@apache.org wrote:
Vishal, the attachment seems to be getting removed by the list daemon (I don't have it), can you create a JIRA and attach? Also, this is a good question for the ppl on zookeeper-user. (ccing) You are aware that ephemeral znodes are tied to the session, and that sessions only expire after the session timeout period? At that time, any znodes created during that session are then deleted. The fact that you are killing your client process leads me to believe that you are not closing the session cleanly (meaning that it will eventually expire after the session timeout period), in which case the ephemeral znodes _should_ reappear when A is restarted and successfully rejoins the cluster (at least until the session timeout is exceeded). Patrick

On Tue, Oct 5, 2010 at 11:04 AM, Vishal K vishalm...@gmail.com wrote:
Hi, I have a 3 node ZK cluster (A, B, C). On one of the nodes (node A), I have a ZK client running that connects to the local server and creates an ephemeral znode to indicate to clients on other nodes that it is online. I have a test script that reboots the ZooKeeper server as well as the client on A. The test does a getstat on the ephemeral znode created by the client on A. I am seeing that the view of znodes on A is different from the other 2 nodes. I can tell this from the session ID that the client gets after reconnecting to the local ZK server. So the test is simple:
- kill the zookeeper server and client process
- wait for a few seconds
- do zkCli.sh stat ... test.out
What I am seeing is that the ephemeral znode with the old zxid, time, and session ID is reappearing on node A. I have attached the output of 3 consecutive getstat requests of the test (see client_getstat.out). Notice that the third output is the same as the first one. That is, the old ephemeral znode reappeared at A. However, both B and C are showing the latest znode with the correct time, zxid, and session ID (output not attached). After this point, all following getstat requests on A show the old znode, whereas B and C show the correct znode every time the client on A comes online. This is something very perplexing. Earlier I thought this was a bug in my client implementation.
[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load
[ https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12918984#action_12918984 ] Patrick Hunt commented on ZOOKEEPER-885: bq. I do have confirmation that a session is established for every client (all 45 of them) before beginning the disk load with dd.
I see, I was just trying to reduce variables. That should be fine then. I see this in the logs:
2010-10-07 14:49:13,956 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:java.version=1.6.0_0
2010-10-07 14:49:13,960 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:java.vendor=Sun Microsystems Inc.
2010-10-07 14:49:13,960 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:java.home=/usr/lib/jvm/java-6-openjdk/jre
I'm not sure many users are running openjdk; also, 1.6.0_0 is very old (I have 1.6.0_18 openjdk on my system). You should upgrade to a recent version of openjdk at the least, although I'd highly suggest running with the official (and recent) sun jdk. (again, this is to reduce variables) Also, I noticed this in the server log for 1 server; it seems to be misconfigured, perhaps you can fix that? (normal_3.3.1/192.168.131.12.log)
2010-10-07 14:49:13,979 - FATAL [main:quorumpeerm...@83] - Invalid config, exiting abnormally
bq. Should I enable more verbose logging?
Yes, give that a try, perhaps run with TRACE logging turned on. If you can upload one of those logs I'll take a look. Right now we have this in the server log:
2010-10-07 14:51:32,961 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioserverc...@633] - EndOfStreamException: Unable to read additional data from client sessionid 0x22b872ad9ff000c, likely client has closed socket
2010-10-07 14:51:32,962 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for client /10.23.4.95:59738 which had sessionid 0x22b872ad9ff000c
This indicates that the client is closing the connection (EOS). Please capture the logs on your client and upload one of them; perhaps run that at DEBUG level as well. That will give us more insight into why the client is closing its side of the connection (at least from the server's perspective). Thanks for the help on this! Zookeeper drops connections under moderate IO load -- Key: ZOOKEEPER-885 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885 Project: Zookeeper Issue Type: Bug Components: server Affects Versions: 3.2.2 Environment: Debian (Lenny) 1Gb RAM swap disabled 100Mb heap for zookeeper Reporter: Alexandre Hardy Priority: Critical Attachments: WatcherTest.java, zklogs.tar.gz A zookeeper server under minimal load, with a number of clients watching exactly one node, will fail to maintain its connections when the machine is subjected to moderate IO load. In a specific test example we had three zookeeper servers running on dedicated machines with 45 clients connected, watching exactly one node. The clients would disconnect after moderate load was added to each of the zookeeper servers with the command: {noformat} dd if=/dev/urandom of=/dev/mapper/nimbula-test {noformat} The {{dd}} command transferred data at a rate of about 4Mb/s. The same thing happens with {noformat} dd if=/dev/zero of=/dev/mapper/nimbula-test {noformat} It seems strange that such a moderate load should cause instability in the connection. Very few other processes were running; the machines were set up to test the connection instability we have experienced. Clients performed no other read or mutation operations. Although the documentation states that there should be minimal competing IO load on the zookeeper server, it seems reasonable that moderate IO should not cause problems in this case. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load
[ https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12918984#action_12918984 ] Patrick Hunt edited comment on ZOOKEEPER-885 at 10/7/10 1:34 PM: - bq. I do have confirmation that a session is established for every client (all 45 of them) before beginning the disk load with dd.
I see, I was just trying to reduce variables. That should be fine then. I see this in the logs:
2010-10-07 14:49:13,956 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:java.version=1.6.0_0
2010-10-07 14:49:13,960 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:java.vendor=Sun Microsystems Inc.
2010-10-07 14:49:13,960 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:java.home=/usr/lib/jvm/java-6-openjdk/jre
I'm not sure many users are running openjdk; also, 1.6.0_0 is very old (I have 1.6.0_18 openjdk on my system). You should upgrade to a recent version of openjdk at the least, although I'd highly suggest running with the official (and recent) sun jdk. (again, this is to reduce variables) Also, I noticed this in the server log for 1 server; it seems to be misconfigured, perhaps you can fix that? (normal_3.3.1/192.168.131.12.log)
2010-10-07 14:49:13,979 - FATAL [main:quorumpeerm...@83] - Invalid config, exiting abnormally
bq. Should I enable more verbose logging?
Yes, give that a try, perhaps run with TRACE logging turned on. If you can upload one of those logs I'll take a look. Right now we have this in the server log:
2010-10-07 14:51:32,961 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioserverc...@633] - EndOfStreamException: Unable to read additional data from client sessionid 0x22b872ad9ff000c, likely client has closed socket
2010-10-07 14:51:32,962 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for client /10.23.4.95:59738 which had sessionid 0x22b872ad9ff000c
This indicates that the client is closing the connection (EOS). Please capture the logs on your client and upload one of them; perhaps run that at TRACE level as well. That will give us more insight into why the client is closing its side of the connection (at least from the server's perspective). Thanks for the help on this!

was (Author: phunt): bq. I do have confirmation that a session is established for every client (all 45 of them) before beginning the disk load with dd.
I see, I was just trying to reduce variables. That should be fine then. I see this in the logs:
2010-10-07 14:49:13,956 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:java.version=1.6.0_0
2010-10-07 14:49:13,960 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:java.vendor=Sun Microsystems Inc.
2010-10-07 14:49:13,960 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:java.home=/usr/lib/jvm/java-6-openjdk/jre
I'm not sure many users are running openjdk; also, 1.6.0_0 is very old (I have 1.6.0_18 openjdk on my system). You should upgrade to a recent version of openjdk at the least, although I'd highly suggest running with the official (and recent) sun jdk. (again, this is to reduce variables) Also, I noticed this in the server log for 1 server; it seems to be misconfigured, perhaps you can fix that? (normal_3.3.1/192.168.131.12.log)
2010-10-07 14:49:13,979 - FATAL [main:quorumpeerm...@83] - Invalid config, exiting abnormally
bq. Should I enable more verbose logging?
Yes, give that a try, perhaps run with TRACE logging turned on. If you can upload one of those logs I'll take a look. Right now we have this in the server log:
2010-10-07 14:51:32,961 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioserverc...@633] - EndOfStreamException: Unable to read additional data from client sessionid 0x22b872ad9ff000c, likely client has closed socket
2010-10-07 14:51:32,962 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for client /10.23.4.95:59738 which had sessionid 0x22b872ad9ff000c
This indicates that the client is closing the connection (EOS). Please capture the logs on your client and upload one of them; perhaps run that at DEBUG level as well. That will give us more insight into why the client is closing its side of the connection (at least from the server's perspective). Thanks for the help on this! Zookeeper drops connections under moderate IO load -- Key: ZOOKEEPER-885 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885 Project: Zookeeper Issue Type: Bug Components: server Affects Versions: 3.2.2 Environment: Debian (Lenny) 1Gb RAM swap disabled
[jira] Updated: (ZOOKEEPER-823) update ZooKeeper java client to optionally use Netty for connections
[ https://issues.apache.org/jira/browse/ZOOKEEPER-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Koch updated ZOOKEEPER-823: -- Attachment: ZOOKEEPER-823.patch I may have fixed another issue: I wrapped sendThread.readResponse(incomingBuffer) in a synchronization on the outgoingQueue, because otherwise it might happen that a packet is sent over netty and processed by the server, but not yet added to the pendingQueue. This fix solved all the Heisenbugs I saw. However, there's still a bug with AsyncHammer where the wait to join threads times out. I added more debugging information. The thread that times out hangs in ClientCnxnSocketNetty.wakeupCnxn, where it waits on synchronized(outgoingQueue). It seems that the outgoingQueue lock is already held and blocked in the doWrites method, hanging on write.awaitUninterruptibly(). doWrites is called by doTransport, where the synchronized(outgoingQueue) happens. update ZooKeeper java client to optionally use Netty for connections Key: ZOOKEEPER-823 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-823 Project: Zookeeper Issue Type: New Feature Components: java client Reporter: Patrick Hunt Assignee: Patrick Hunt Fix For: 3.4.0 Attachments: NettyNettySuiteTest.rtf, TEST-org.apache.zookeeper.test.NettyNettySuiteTest.txt.gz, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch This jira will port the client side connection code to use netty rather than direct nio. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
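The hang described is the classic blocking-call-while-holding-a-lock deadlock. A minimal sketch of the pattern under assumed structure (this is not the actual ClientCnxnSocketNetty code; names simply mirror the comment):
{noformat}
import org.jboss.netty.channel.ChannelFuture;

// Illustrative deadlock pattern, not the actual ClientCnxnSocketNetty
// code: one thread blocks on a Netty write future while holding the
// outgoingQueue lock, so the thread calling wakeupCnxn blocks forever
// waiting for the same lock.
class DeadlockSketch {
    private final Object outgoingQueue = new Object();

    void doWrites(ChannelFuture write) {
        synchronized (outgoingQueue) {
            write.awaitUninterruptibly(); // blocks while holding the lock
        }
    }

    void wakeupCnxn() {
        synchronized (outgoingQueue) {    // never entered while doWrites blocks
            outgoingQueue.notifyAll();
        }
    }
}
{noformat}
Shrinking the synchronized blocks so the blocking await happens outside the lock, as the later version of the patch does, breaks this cycle.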
[jira] Updated: (ZOOKEEPER-823) update ZooKeeper java client to optionally use Netty for connections
[ https://issues.apache.org/jira/browse/ZOOKEEPER-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Koch updated ZOOKEEPER-823: -- Status: Patch Available (was: Open) update ZooKeeper java client to optionally use Netty for connections Key: ZOOKEEPER-823 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-823 Project: Zookeeper Issue Type: New Feature Components: java client Reporter: Patrick Hunt Assignee: Patrick Hunt Fix For: 3.4.0 Attachments: NettyNettySuiteTest.rtf, TEST-org.apache.zookeeper.test.NettyNettySuiteTest.txt.gz, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch This jira will port the client side connection code to use netty rather than direct nio. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-823) update ZooKeeper java client to optionally use Netty for connections
[ https://issues.apache.org/jira/browse/ZOOKEEPER-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Koch updated ZOOKEEPER-823: -- Status: Open (was: Patch Available) update ZooKeeper java client to optionally use Netty for connections Key: ZOOKEEPER-823 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-823 Project: Zookeeper Issue Type: New Feature Components: java client Reporter: Patrick Hunt Assignee: Patrick Hunt Fix For: 3.4.0 Attachments: NettyNettySuiteTest.rtf, TEST-org.apache.zookeeper.test.NettyNettySuiteTest.txt.gz, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch This jira will port the client side connection code to use netty rather than direct nio. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-823) update ZooKeeper java client to optionally use Netty for connections
[ https://issues.apache.org/jira/browse/ZOOKEEPER-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Koch updated ZOOKEEPER-823: -- Attachment: ZOOKEEPER-823.patch I did another version of the patch with an example of how I'd solve the deadlock mentioned in my last comment: I made the synchronized blocks in doTransport and doWrites smaller. update ZooKeeper java client to optionally use Netty for connections Key: ZOOKEEPER-823 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-823 Project: Zookeeper Issue Type: New Feature Components: java client Reporter: Patrick Hunt Assignee: Patrick Hunt Fix For: 3.4.0 Attachments: NettyNettySuiteTest.rtf, TEST-org.apache.zookeeper.test.NettyNettySuiteTest.txt.gz, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch This jira will port the client side connection code to use netty rather than direct nio. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-823) update ZooKeeper java client to optionally use Netty for connections
[ https://issues.apache.org/jira/browse/ZOOKEEPER-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Koch updated ZOOKEEPER-823: -- Attachment: (was: ZOOKEEPER-823.patch) update ZooKeeper java client to optionally use Netty for connections Key: ZOOKEEPER-823 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-823 Project: Zookeeper Issue Type: New Feature Components: java client Reporter: Patrick Hunt Assignee: Patrick Hunt Fix For: 3.4.0 Attachments: NettyNettySuiteTest.rtf, TEST-org.apache.zookeeper.test.NettyNettySuiteTest.txt.gz, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch This jira will port the client side connection code to use netty rather than direct nio. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-823) update ZooKeeper java client to optionally use Netty for connections
[ https://issues.apache.org/jira/browse/ZOOKEEPER-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Koch updated ZOOKEEPER-823: -- Attachment: ZOOKEEPER-823.patch update ZooKeeper java client to optionally use Netty for connections Key: ZOOKEEPER-823 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-823 Project: Zookeeper Issue Type: New Feature Components: java client Reporter: Patrick Hunt Assignee: Patrick Hunt Fix For: 3.4.0 Attachments: NettyNettySuiteTest.rtf, TEST-org.apache.zookeeper.test.NettyNettySuiteTest.txt.gz, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch This jira will port the client side connection code to use netty rather than direct nio. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-890) C client invokes watcher callbacks multiple times
[ https://issues.apache.org/jira/browse/ZOOKEEPER-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12919087#action_12919087 ] Jared Cantwell commented on ZOOKEEPER-890: -- I don't believe the C-client makes the guarantee that watcher callbacks are called exactly once. Callbacks are called for different reasons, including:
- connection lost event
- connection reestablished event
- session lost event
- data changed event
Only the last two events make the guarantee about being called exactly once, but the first two connection events can be called numerous times until either one of the last two events happens. I may be missing some events, but that's the general idea. Bottom line is the callback can receive events of type ZOO_SESSION_EVENT multiple times. I believe this was by design. C client invokes watcher callbacks multiple times - Key: ZOOKEEPER-890 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-890 Project: Zookeeper Issue Type: Bug Components: c client Affects Versions: 3.3.1 Environment: Mac OS X 10.6.5 Reporter: Austin Shoemaker Priority: Critical Attachments: watcher_twice.c, ZOOKEEPER-890.patch Code using the C client assumes that watcher callbacks are called exactly once. If the watcher is called more than once, the process will likely overwrite freed memory and/or crash. collect_session_watchers (zk_hashtable.c) gathers watchers from active_node_watchers, active_exist_watchers, and active_child_watchers without removing them. This results in watchers being invoked more than once. Test code is attached that reproduces the bug, along with a proposed patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
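The same contract holds for the Java client, where session and connection events arrive with event type None. A sketch (not code from this issue) of a watcher that respects it:
{noformat}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.EventType;

// Sketch of the semantics described above, in Java-client terms
// (session/connection events carry type None there): session events
// may be delivered many times per handle, so per-watch cleanup must
// only happen on node events, which fire once per registration.
public class OnceOnlyAwareWatcher implements Watcher {
    public void process(WatchedEvent event) {
        if (event.getType() == EventType.None) {
            // Connection/session state change: can repeat. Do NOT
            // free or unregister one-shot watch state here.
            System.out.println("session state: " + event.getState());
        } else {
            // Node event: delivered once per registered watch, so
            // it is safe to release per-watch resources now.
            System.out.println(event.getType() + " on " + event.getPath());
        }
    }
}
{noformat}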
[jira] Commented: (ZOOKEEPER-890) C client invokes watcher callbacks multiple times
[ https://issues.apache.org/jira/browse/ZOOKEEPER-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12919094#action_12919094 ] Austin Shoemaker commented on ZOOKEEPER-890: That sounds like a good design. Perhaps it could be clarified in the documentation? http://hadoop.apache.org/zookeeper/docs/r3.3.1/zookeeperProgrammers.html#ch_zkWatches If this is correct behavior then the Python client needs to be fixed to not delete the watcher on session events. Will file a separate bug on that. C client invokes watcher callbacks multiple times - Key: ZOOKEEPER-890 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-890 Project: Zookeeper Issue Type: Bug Components: c client Affects Versions: 3.3.1 Environment: Mac OS X 10.6.5 Reporter: Austin Shoemaker Priority: Critical Attachments: watcher_twice.c, ZOOKEEPER-890.patch Code using the C client assumes that watcher callbacks are called exactly once. If the watcher is called more than once, the process will likely overwrite freed memory and/or crash. collect_session_watchers (zk_hashtable.c) gathers watchers from active_node_watchers, active_exist_watchers, and active_child_watchers without removing them. This results in watchers being invoked more than once. Test code is attached that reproduces the bug, along with a proposed patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-888) c-client / zkpython: Double free corruption on node watcher
[ https://issues.apache.org/jira/browse/ZOOKEEPER-888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Austin Shoemaker updated ZOOKEEPER-888: --- Attachment: ZOOKEEPER-888.patch Patch that prevents freeing a watcher in response to a session event, per the feedback in ZOOKEEPER-890. c-client / zkpython: Double free corruption on node watcher --- Key: ZOOKEEPER-888 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-888 Project: Zookeeper Issue Type: Bug Components: c client, contrib-bindings Affects Versions: 3.3.1 Reporter: Lukas Priority: Critical Fix For: 3.3.2, 3.4.0 Attachments: resume-segfault.py, ZOOKEEPER-888.patch The c-client / zkpython wrapper invokes an already-freed watcher callback. Steps to reproduce:
0. start a zookeeper server on your machine
1. run the attached python script
2. suspend the zookeeper server process (e.g. using `pkill -STOP -f org.apache.zookeeper.server.quorum.QuorumPeerMain`)
3. wait until the connection observer and the node observer have fired with a session event
4. resume the zookeeper server process (e.g. using `pkill -CONT -f org.apache.zookeeper.server.quorum.QuorumPeerMain`)
The client then tries to dispatch the node observer function again, but it was already freed: double free corruption. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (ZOOKEEPER-891) Allow non-numeric version strings
Allow non-numeric version strings - Key: ZOOKEEPER-891 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-891 Project: Zookeeper Issue Type: Improvement Components: build Reporter: Eli Collins Priority: Minor Fix For: 3.4.0, 4.0.0 Non-numeric version strings (e.g. a -dev or +1 suffix) are not currently accepted: you either get an error (Invalid version number format, must be x.y.z) or, if you pass x.y.z-dev or x.y.z+1, a NumberFormatException. It would be useful to allow non-numeric versions.
{noformat}
version-info:
[java] All version-related parameters must be valid integers!
[java] Exception in thread "main" java.lang.NumberFormatException: For input string: "3-dev"
[java] at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
[java] at java.lang.Integer.parseInt(Integer.java:458)
[java] at java.lang.Integer.parseInt(Integer.java:499)
[java] at org.apache.zookeeper.version.util.VerGen.main(VerGen.java:131)
[java] Java Result: 1
{noformat}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
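An illustrative sketch (not the actual VerGen code) of the lenient parsing being requested: split off a non-numeric qualifier such as -dev before parsing the numeric components:
{noformat}
// Illustrative sketch, not the actual VerGen implementation: accept a
// version like "3.4.3-dev" by separating the qualifier before parsing
// the numeric x.y.z components.
public class LenientVersion {
    public static void main(String[] args) {
        String version = args.length > 0 ? args[0] : "3.4.3-dev";
        String qualifier = null;
        int dash = version.indexOf('-');
        if (dash >= 0) {
            qualifier = version.substring(dash + 1); // e.g. "dev"
            version = version.substring(0, dash);    // e.g. "3.4.3"
        }
        String[] parts = version.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        int patch = Integer.parseInt(parts[2]);
        System.out.printf("%d.%d.%d qualifier=%s%n",
                major, minor, patch, qualifier);
    }
}
{noformat}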