[jira] Commented: (ZOOKEEPER-912) ZooKeeper client logs trace and debug messages at level INFO
[ https://issues.apache.org/jira/browse/ZOOKEEPER-912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925719#action_12925719 ] Patrick Hunt commented on ZOOKEEPER-912:

bq. Zookeeper logs nearly everything at least at level info, regardless of severity.

That's incorrect. I did a quick grep for logging in the main src and see the following:
{noformat}
$ egrep -R LOG\.error src/java/main/. | wc -l
78
$ egrep -R LOG\.warn src/java/main/. | wc -l
175
$ egrep -R LOG\.info src/java/main/. | wc -l
127
$ egrep -R LOG\.debug src/java/main/. | wc -l
114
$ egrep -R LOG\.trace src/java/main/. | wc -l
28
{noformat}
So actually we log mostly at WARN severity. Perhaps you think this because you mainly see INFO messages, but that's to be expected (typically things work; we only log WARN/ERROR when bad things happen). I didn't say anything about all-or-nothing. Check the code: we have a number of messages at various levels, incl. trace/debug. If you don't want to see the informational messages for a particular class you can configure that. As I pointed out earlier: http://hadoop.apache.org/zookeeper/docs/current/zookeeperInternals.html#sc_logging We consider both the messages you listed to be informational, given that we expect/recover from the second.

ZooKeeper client logs trace and debug messages at level INFO
Key: ZOOKEEPER-912
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-912
Project: Zookeeper
Issue Type: Improvement
Components: java client
Affects Versions: 3.3.1
Reporter: Anthony Urso
Assignee: Anthony Urso
Priority: Minor
Fix For: 3.4.0
Attachments: zk-loglevel.patch

ZK logs a lot of uninformative trace and debug messages to level INFO. This fuzzes up everything and makes it easy to miss useful log info.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
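The per-class configuration mentioned in the comment above would be done through log4j. A minimal log4j.properties fragment along those lines (the logger names are assumptions based on the 3.3-era class names, not taken from this thread):

```properties
# Quiet the informational connection messages from the client class
# while keeping WARN/ERROR visible (class name is an assumption):
log4j.logger.org.apache.zookeeper.ClientCnxn=WARN

# Or raise verbosity for a single component while debugging:
log4j.logger.org.apache.zookeeper.server.quorum.QuorumCnxManager=TRACE
```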
[jira] Updated: (ZOOKEEPER-914) QuorumCnxManager blocks forever
[ https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Flavio Junqueira updated ZOOKEEPER-914: --- Component/s: (was: server) (was: quorum) leaderElection

QuorumCnxManager blocks forever
Key: ZOOKEEPER-914
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
Project: Zookeeper
Issue Type: Bug
Components: leaderElection
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
Fix For: 3.3.3, 3.4.0

This was a disaster. While testing our application we ran into a scenario where a rebooted follower could not join the cluster. Further debugging showed that the follower could not join because the QuorumCnxManager on the leader was blocked for an indefinite amount of time in receiveConnection():
{noformat}
Thread-3 prio=10 tid=0x7fa920005800 nid=0x11bb runnable [0x7fa9275ed000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
        at sun.nio.ch.IOUtil.read(IOUtil.java:206)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
        - locked 0x7fa93315f988 (a java.lang.Object)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
{noformat}
I had pointed out this bug along with several other problems in QuorumCnxManager earlier in https://issues.apache.org/jira/browse/ZOOKEEPER-900 and https://issues.apache.org/jira/browse/ZOOKEEPER-822. I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix and a patch will be out soon. The problem is that QuorumCnxManager is using SocketChannel in blocking mode. It does a read() in receiveConnection() and a write() in initiateConnection(). Sorry, but this is really bad programming. It also points to a lack of failure tests for QuorumCnxManager.
[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever
[ https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925746#action_12925746 ] Flavio Junqueira commented on ZOOKEEPER-914: Like Pat, I would also appreciate some more constructive comments (and behavior). From the Clover reports, we exercise a significant part of the QCM code, but it is true that we don't test the cases you have been exposing. Here is a way I believe we can reproduce this problem (I haven't implemented it, but it seems to make sense). The high-level idea is to make sure that if some server stops responding before it completes the handshake protocol, then no instance of QCM across all servers will block and prevent other servers from joining the ensemble. Suppose we configure an ensemble with 5 servers using QuorumBase. One of the servers will be a simple mock server, as we do in the CnxManagerTest tests. Now here is the sequence of steps to follow:
# Start three of the servers and confirm that they accept and execute operations;
# Start the mock server and execute the protocol partially. For the read case you mention, you can simply not send the server identifier. That will cause the read on the other end to block and to not accept more connections;
# Start a 5th server and check if it is able to join the ensemble.
A simple fix to have it working for you soon, along the lines of what we have done to make the connection timeout configurable, seems to be to set SO_TIMEOUT. But if you have other ideas, please lay them out. Please bear in mind that we should leave the major modifications for ZOOKEEPER-901, because those will take more time to develop and get into shape.
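The SO_TIMEOUT suggestion in the comment above can be illustrated in plain Java: a blocking read on a socket whose peer goes silent mid-handshake never returns unless a read timeout is set. This is only a sketch of the mechanism (class and method names are invented), not the actual QuorumCnxManager code:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    // Returns true if a read on a connection whose peer stays silent
    // times out instead of blocking forever.
    static boolean readTimesOut() throws IOException {
        try (ServerSocket listener = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", listener.getLocalPort());
             Socket peer = listener.accept()) {   // peer connects but never writes
            client.setSoTimeout(500);             // bound the blocking read to 500 ms
            try {
                client.getInputStream().read();   // would block indefinitely without the timeout
                return false;
            } catch (SocketTimeoutException expected) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readTimesOut() ? "read timed out" : "read returned");
    }
}
```

With a timeout set, a peer that dies before completing the handshake costs at most the timeout interval instead of wedging the listener thread.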
[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load
[ https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925754#action_12925754 ] Flavio Junqueira commented on ZOOKEEPER-885: Sure, let's discuss over e-mail and we can post our findings here later.

Zookeeper drops connections under moderate IO load
Key: ZOOKEEPER-885
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885
Project: Zookeeper
Issue Type: Bug
Components: server
Affects Versions: 3.2.2, 3.3.1
Environment: Debian (Lenny), 1Gb RAM, swap disabled, 100Mb heap for zookeeper
Reporter: Alexandre Hardy
Priority: Critical
Attachments: benchmark.csv, tracezklogs.tar.gz, tracezklogs.tar.gz, WatcherTest.java, zklogs.tar.gz

A zookeeper server under minimum load, with a number of clients watching exactly one node, will fail to maintain the connection when the machine is subjected to moderate IO load. In a specific test example we had three zookeeper servers running on dedicated machines with 45 clients connected, watching exactly one node. The clients would disconnect after moderate load was added to each of the zookeeper servers with the command:
{noformat}
dd if=/dev/urandom of=/dev/mapper/nimbula-test
{noformat}
The {{dd}} command transferred data at a rate of about 4Mb/s. The same thing happens with
{noformat}
dd if=/dev/zero of=/dev/mapper/nimbula-test
{noformat}
It seems strange that such a moderate load should cause instability in the connection. Very few other processes were running; the machines were set up to test the connection instability we have experienced. Clients performed no other read or mutation operations. Although the documents state that minimal competing IO load should be present on the zookeeper server, it seems reasonable that moderate IO should not cause problems in this case.
[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever
[ https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925843#action_12925843 ] Patrick Hunt commented on ZOOKEEPER-914: Flavio, in item 2 you mention a mock; consider using Mockito. I've had a lot of luck with that personally, and Hadoop itself has moved to using this in its tests. http://code.google.com/p/mockito/
[jira] Commented: (ZOOKEEPER-912) ZooKeeper client logs trace and debug messages at level INFO
[ https://issues.apache.org/jira/browse/ZOOKEEPER-912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925845#action_12925845 ] Patrick Hunt commented on ZOOKEEPER-912: Hi Anthony, I realized in the shower this morning: by Zookeeper did you mean ZooKeeper.java? My bad. I looked at this class again and it does have logging at levels other than just info. Really it should have trace-level logs for each of the api entry points. I'm concerned about pushing down the info-level logs you highlighted, though, due to a couple of factors: 1) in our experience those msgs are very useful to understand the runtime state of the client, 2) many users don't run in production at trace (and some don't want to run in debug). What's your rule of thumb for what should be logged at the various levels?
[jira] Commented: (ZOOKEEPER-897) C Client seg faults during close
[ https://issues.apache.org/jira/browse/ZOOKEEPER-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925846#action_12925846 ] Patrick Hunt commented on ZOOKEEPER-897: perhaps we should rely on existing testing for this one, but enter a new jira to refactor the client, specifically to allow testing? (ie a way to inject the helper code w/o needing to edit zookeeper.c directly)

C Client seg faults during close
Key: ZOOKEEPER-897
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-897
Project: Zookeeper
Issue Type: Bug
Components: c client
Reporter: Jared Cantwell
Assignee: Jared Cantwell
Fix For: 3.3.2, 3.4.0
Attachments: ZOOKEEEPER-897.diff, ZOOKEEPER-897.patch

We observed a crash while closing our c client. It was in the do_io() thread that was processing during the close() call.
{noformat}
#0 queue_buffer (list=0x6bd4f8, b=0x0, add_to_front=0) at src/zookeeper.c:969
#1 0x0046234e in check_events (zh=0x6bd480, events=<value optimized out>) at src/zookeeper.c:1687
#2 0x00462d74 in zookeeper_process (zh=0x6bd480, events=2) at src/zookeeper.c:1971
#3 0x00469c34 in do_io (v=0x6bd480) at src/mt_adaptor.c:311
#4 0x77bc59ca in start_thread () from /lib/libpthread.so.0
#5 0x76f706fd in clone () from /lib/libc.so.6
#6 0x in ?? ()
{noformat}
We tracked down the sequence of events, and the cause is that input_buffer is being freed from a thread other than the do_io thread that relies on it:
1. do_io() calls check_events()
2. the if (events & ZOOKEEPER_READ) branch executes
3. the if (rc > 0) branch executes
4. the if (zh->input_buffer != zh->primer_buffer) branch executes
...in the meantime...
5. zookeeper_close() is called
6. the if (inc_ref_counter(zh,0)!=0) branch executes
7. cleanup_bufs() is called
8. input_buffer is freed at the end
...back in check_events()...
9. queue_events() is called on a NULL buffer.
I believe the patch is to only call free_completions() in zookeeper_close() and not cleanup_bufs(). The original reason cleanup_bufs() was added was to call any outstanding synchronous completions, so only free_completions (which is guarded) is needed. I will submit a patch for review with this change.
[jira] Commented: (ZOOKEEPER-897) C Client seg faults during close
[ https://issues.apache.org/jira/browse/ZOOKEEPER-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925858#action_12925858 ] Mahadev konar commented on ZOOKEEPER-897: - jared, pat, I am ok without a test case for this one, because it's quite hard to create one. I just wanted someone else to run the tests on their machines just to verify (since I rarely see any problems in c tests on my machine). I will go ahead and commit this patch for now.
[jira] Updated: (ZOOKEEPER-897) C Client seg faults during close
[ https://issues.apache.org/jira/browse/ZOOKEEPER-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-897: Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) I just committed this. thanks jared!
[jira] Updated: (ZOOKEEPER-702) GSoC 2010: Failure Detector Model
[ https://issues.apache.org/jira/browse/ZOOKEEPER-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Flavio Junqueira updated ZOOKEEPER-702: --- Status: Open (was: Patch Available) Hi Abmar, Thanks for the addition to the patch. I was wondering if it is really a good idea to have both options, normal and exponential, implemented. Since your experiments have shown that exponential performs better, why not use only it? Also, I was wondering if you have posted experimental numbers showing that exponential performs better. In the case we go with exponential only, we don't need the modification to ivy.xml, right? And one last comment: it doesn't look like the classes implementing PhiTimeoutEvaluator need to be public. Is this right?

GSoC 2010: Failure Detector Model
Key: ZOOKEEPER-702
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-702
Project: Zookeeper
Issue Type: Wish
Reporter: Henry Robinson
Assignee: Abmar Barros
Fix For: 3.4.0
Attachments: bertier-pseudo.txt, bertier-pseudo.txt, chen-pseudo.txt, chen-pseudo.txt, phiaccrual-pseudo.txt, phiaccrual-pseudo.txt, ZOOKEEPER-702-code.patch, ZOOKEEPER-702-doc.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch

Failure Detector Module
Possible Mentor: Henry Robinson (henry at apache dot org)
Requirements: Java, some distributed systems knowledge, comfort implementing distributed systems protocols
Description: ZooKeeper servers detect the failure of other servers and clients by counting the number of 'ticks' for which they don't get a heartbeat from other machines.
This is the 'timeout' method of failure detection and works very well; however it is possible that it is too aggressive and not easily tuned for some more unusual ZooKeeper installations (such as in a wide-area network, or even in a mobile ad-hoc network). This project would abstract the notion of failure detection to a dedicated Java module, and implement several failure detectors to compare and contrast their appropriateness for ZooKeeper. For example, Apache Cassandra uses a phi-accrual failure detector (http://ddsg.jaist.ac.jp/pub/HDY+04.pdf) which is much more tunable and has some very interesting properties. This is a great project if you are interested in distributed algorithms, or want to help re-factor some of ZooKeeper's internal code.
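The phi-accrual idea referenced above can be sketched numerically. Under an exponential model of heartbeat inter-arrival times (one of the two options discussed in the update comment), the suspicion level is phi(t) = -log10 P(T > t) with P(T > t) = exp(-t/mean), so suspicion grows smoothly with silence rather than flipping at a fixed tick count. A minimal Java sketch (names invented, not the patch's actual API):

```java
public class PhiAccrualSketch {
    // Suspicion level after `sinceLastMs` of silence, given the observed
    // mean heartbeat interval. phi = -log10(exp(-t/mean)) = t / (mean * ln 10),
    // i.e. linear in t under the exponential model.
    static double phi(double sinceLastMs, double meanIntervalMs) {
        double pStillAlive = Math.exp(-sinceLastMs / meanIntervalMs);
        return -Math.log10(pStillAlive);
    }

    public static void main(String[] args) {
        // With a 1 s mean interval, suspicion rises as silence lengthens:
        System.out.println(phi(1000, 1000)); // ~0.434
        System.out.println(phi(5000, 1000)); // ~2.17 (P(alive) below 1%)
    }
}
```

A caller would declare failure once phi crosses a chosen threshold, which is the tunability the description contrasts with fixed tick counting.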
[jira] Commented: (ZOOKEEPER-898) C Client might not cleanup correctly during close
[ https://issues.apache.org/jira/browse/ZOOKEEPER-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925909#action_12925909 ] Mahadev konar commented on ZOOKEEPER-898: - +1, good catch jared. I just committed this to 3.3 and trunk. thanks!

C Client might not cleanup correctly during close
Key: ZOOKEEPER-898
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-898
Project: Zookeeper
Issue Type: Bug
Components: c client
Reporter: Jared Cantwell
Assignee: Jared Cantwell
Priority: Trivial
Fix For: 3.3.2, 3.4.0
Attachments: ZOOKEEEPER-898.diff, ZOOKEEPER-898.patch

I was looking through the c-client code and noticed a situation where a counter can be incorrectly incremented and a small memory leak can occur. In zookeeper.c : add_completion(), if close_requested is true, then the completion will not be queued. But at the end, outstanding_sync is still incremented and free() never called on the newly allocated completion_list_t. I will submit for review a diff that I believe corrects this issue.
[jira] Updated: (ZOOKEEPER-898) C Client might not cleanup correctly during close
[ https://issues.apache.org/jira/browse/ZOOKEEPER-898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-898: Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available)
[jira] Updated: (ZOOKEEPER-517) NIO factory fails to close connections when the number of file handles run out.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-517: Fix Version/s: (was: 3.3.2) 3.3.3 moving this out to 3.3.3 and 3.4 for investigation.

NIO factory fails to close connections when the number of file handles run out.
Key: ZOOKEEPER-517
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-517
Project: Zookeeper
Issue Type: Bug
Components: server
Reporter: Mahadev konar
Assignee: Benjamin Reed
Priority: Critical
Fix For: 3.3.3, 3.4.0

The code in the NIO factory is such that if we fail to accept a connection for some reason (too many open file handles may be one of them), we do not close the connections that are in CLOSE_WAIT. We need to call an explicit close on these sockets. One of the solutions might be to move doIO before accept so that we can still close connections even if we cannot accept new ones.
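The "doIO before accept" idea could be sketched as a select-loop pass that services existing connections first, so sockets at EOF get closed even when accept() would fail. This is an invented illustration, not the actual NIOServerCnxn factory code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class SelectLoopSketch {
    // One pass of a hypothetical select loop: service existing connections
    // before accepting, so a peer-closed socket (read() returns -1) is closed
    // and its descriptor freed even if accept() would fail with EMFILE.
    static void selectOnce(Selector selector) throws IOException {
        selector.select(1000);

        // 1. doIO on existing connections first: close anything at EOF,
        //    which would otherwise linger in CLOSE_WAIT.
        for (SelectionKey key : selector.selectedKeys()) {
            if (key.isValid() && (key.readyOps() & SelectionKey.OP_READ) != 0) {
                SocketChannel ch = (SocketChannel) key.channel();
                if (ch.read(ByteBuffer.allocate(64)) < 0) {
                    ch.close();
                }
            }
        }

        // 2. Only then try to accept; running out of file handles here no
        //    longer prevents the cleanup above.
        for (SelectionKey key : selector.selectedKeys()) {
            if (key.isValid() && (key.readyOps() & SelectionKey.OP_ACCEPT) != 0) {
                try {
                    SocketChannel conn = ((ServerSocketChannel) key.channel()).accept();
                    if (conn != null) {
                        conn.configureBlocking(false);
                        conn.register(selector, SelectionKey.OP_READ);
                    }
                } catch (IOException e) {
                    // descriptor exhaustion: skip the accept, keep serving
                }
            }
        }
        selector.selectedKeys().clear();
    }
}
```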
[jira] Commented: (ZOOKEEPER-667) java client doesn't allow ipv6 numeric connect string
[ https://issues.apache.org/jira/browse/ZOOKEEPER-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925923#action_12925923 ] Gunnar Wagenknecht commented on ZOOKEEPER-667: -- I'm wondering if there is something else to do for proper connecting on IPv6 systems. I'm on Windows 7 with Java 6 and I have the following exception in my logs. When connecting I specified {{localhost:2181}}.
{noformat}
21:20:04.157 [Worker-0-SendThread()] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181
21:20:04.188 [Worker-0-SendThread(localhost:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.SocketException: Address family not supported by protocol family: connect
        at sun.nio.ch.Net.connect(Native Method) ~[na:1.6.0_21]
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507) ~[na:1.6.0_21]
        at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1009) ~[na:na]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1036) ~[na:na]
{noformat}

java client doesn't allow ipv6 numeric connect string
Key: ZOOKEEPER-667
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-667
Project: Zookeeper
Issue Type: Bug
Components: java client
Affects Versions: 3.2.2
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Critical
Fix For: 3.3.0

The java client doesn't handle ipv6 numeric addresses as they are colon (:) delimited. After splitting the host/port on ':' we look for the port as the second entry in the array rather than the last entry in the array.
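The fix described in the issue (take the port from the last colon, not the first) can be sketched in a few lines. The class name is invented, and a real client would also need to handle bracketed forms like [::1]:2181, which this sketch ignores:

```java
public class HostPortParser {
    // Split "host:port" on the LAST colon so a numeric IPv6 address such as
    // "0:0:0:0:0:0:0:1:2181" yields host "0:0:0:0:0:0:0:1" and port "2181".
    // Splitting on the first colon (the reported bug) would truncate the host.
    static String[] parse(String hostPort) {
        int idx = hostPort.lastIndexOf(':');
        return new String[] { hostPort.substring(0, idx), hostPort.substring(idx + 1) };
    }

    public static void main(String[] args) {
        String[] hp = parse("0:0:0:0:0:0:0:1:2181");
        System.out.println(hp[0] + " -> port " + hp[1]); // 0:0:0:0:0:0:0:1 -> port 2181
    }
}
```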
[jira] Commented: (ZOOKEEPER-851) ZK lets any node to become an observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925927#action_12925927 ] Henry Robinson commented on ZOOKEEPER-851: -- Hi Vishal - Sorry for the slow turnaround on this one. It doesn't surprise me that this is the behaviour, although it's slightly unexpected that the node becomes an observer rather than a follower. What evidence do you have for that? (Given that it reports "Mode: follower" - I haven't checked the code in a while, but I would have thought it would print "Mode: observer".) Henry

ZK lets any node to become an observer
Key: ZOOKEEPER-851
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-851
Project: Zookeeper
Issue Type: Bug
Components: quorum, server
Affects Versions: 3.3.1
Reporter: Vishal K
Priority: Critical
Fix For: 3.4.0

I had a 3 node cluster running. The zoo.cfg on each contained 3 entries as shown below:
{noformat}
tickTime=2000
dataDir=/var/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.0=10.150.27.61:2888:3888
server.1=10.150.27.62:2888:3888
server.2=10.150.27.63:2888:3888
{noformat}
I wanted to add another node to the cluster. In the fourth node's zoo.cfg, I created another entry for that node and started the zk server. The zoo.cfg on the first 3 nodes was left unchanged. The fourth node was able to join the cluster even though the 3 nodes had no idea about the fourth node. zoo.cfg on the fourth node:
{noformat}
tickTime=2000
dataDir=/var/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.0=10.150.27.61:2888:3888
server.1=10.150.27.62:2888:3888
server.2=10.150.27.63:2888:3888
server.3=10.17.117.71:2888:3888
{noformat}
It looks like 10.17.117.71 is becoming an observer in this case. I was expecting that the leader would reject 10.17.117.71.
{noformat}
# telnet 10.17.117.71 2181
Trying 10.17.117.71...
Connected to 10.17.117.71.
Escape character is '^]'.
stat
Zookeeper version: 3.3.0--1, built on 04/02/2010 22:40 GMT
Clients: /10.17.117.71:37297[1](queued=0,recved=1,sent=0)
Latency min/avg/max: 0/0/0
Received: 3
Sent: 2
Outstanding: 0
Zxid: 0x20065
Mode: follower
Node count: 288
{noformat}
[jira] Commented: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages
[ https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925976#action_12925976 ] Benjamin Reed commented on ZOOKEEPER-907: - Ah, I see the problem. There are actually two problems: 1) when sync() gets an error it is not propagated back to the caller, and 2) this problem. The trouble is that 1) is preventing us from writing a test case. We need to fix 1) and then we can write the test for 2).

Spurious KeeperErrorCode = Session moved messages
Key: ZOOKEEPER-907
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-907
Project: Zookeeper
Issue Type: Bug
Affects Versions: 3.3.1
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
Fix For: 3.3.2, 3.4.0
Attachments: ZOOKEEPER-907.patch, ZOOKEEPER-907.patch_v2

The sync request does not set the session owner in Request. As a result, the leader keeps printing:
{noformat}
2010-07-01 10:55:36,733 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x6 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved
{noformat}
[jira] Created: (ZOOKEEPER-915) Errors that happen during sync() processing at the leader do not get propagated back to the client.
Errors that happen during sync() processing at the leader do not get propagated back to the client. --- Key: ZOOKEEPER-915 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-915 Project: Zookeeper Issue Type: Bug Reporter: Benjamin Reed If an error in sync() processing happens at the leader (SESSION_MOVED, for example), it is not propagated back to the client. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-851) ZK lets any node to become an observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12925997#action_12925997 ] Vishal K commented on ZOOKEEPER-851: I have not used observers yet. Looks like I then need to verify whether this follower can become a leader. I think I had tried this. If I remember correctly, the 4th node did not become a leader. But this was a while back. I will try it again and update the jira. ZK lets any node to become an observer -- Key: ZOOKEEPER-851 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-851 Project: Zookeeper Issue Type: Bug Components: quorum, server Affects Versions: 3.3.1 Reporter: Vishal K Priority: Critical Fix For: 3.4.0 -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-805) four letter words fail with latest ubuntu nc.openbsd
[ https://issues.apache.org/jira/browse/ZOOKEEPER-805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Hunt updated ZOOKEEPER-805: --- Fix Version/s: (was: 3.3.2) 3.3.3 Not a blocker, pushing to 3.3.3/3.4.0 four letter words fail with latest ubuntu nc.openbsd Key: ZOOKEEPER-805 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-805 Project: Zookeeper Issue Type: Bug Components: documentation, server Affects Versions: 3.3.1, 3.4.0 Reporter: Patrick Hunt Priority: Critical Fix For: 3.3.3, 3.4.0 In both 3.3 branch and trunk echo stat|nc localhost 2181 fails against the ZK server on Ubuntu Lucid Lynx. I noticed this after upgrading to lucid lynx - which is now shipping openbsd nc as the default: OpenBSD netcat (Debian patchlevel 1.89-3ubuntu2) vs nc traditional [v1.10-38] which works fine. Not sure if this is a bug in us or nc.openbsd, but it's currently not working for me. Ugh. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
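One way to sidestep the nc.openbsd vs. nc.traditional difference when poking at four-letter words is a small stdlib-only client that does explicitly what traditional nc does implicitly: send the command, half-close the write side, then read until the server closes. This is an illustrative sketch, not part of the ZooKeeper codebase; the function name is hypothetical.

```python
import socket

def four_letter_word(host, port, cmd, timeout=5.0):
    """Send a ZooKeeper four-letter word (e.g. "stat", "ruok") and
    return the server's full reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(cmd.encode("ascii"))
        # Half-close the write side, as traditional nc does: this tells
        # the server the request is complete while we keep reading.
        s.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:  # server closed the connection: reply complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", "replace")
```

Against a running server, `four_letter_word("localhost", 2181, "stat")` should produce the same output as `echo stat | nc localhost 2181` does with a traditional nc.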
[jira] Updated: (ZOOKEEPER-815) fill in TBDs in overview doc
[ https://issues.apache.org/jira/browse/ZOOKEEPER-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Hunt updated ZOOKEEPER-815: --- Fix Version/s: (was: 3.3.2) 3.3.3 Not a blocker, pushing to 3.3.3/3.4.0. fill in TBDs in overview doc -- Key: ZOOKEEPER-815 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-815 Project: Zookeeper Issue Type: Bug Components: documentation Affects Versions: 3.3.1 Reporter: Patrick Hunt Priority: Minor Fix For: 3.3.3, 3.4.0 Funny: Ephemeral nodes are useful when you want to implement [tbd]. There are a few others in that doc that should really be fixed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages
[ https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12926004#action_12926004 ] Mahadev konar commented on ZOOKEEPER-907: - ben, can ZOOKEEPER-915 also be marked for 3.3.2 then? Spurious KeeperErrorCode = Session moved messages --- Key: ZOOKEEPER-907 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-907 Project: Zookeeper Issue Type: Bug Affects Versions: 3.3.1 Reporter: Vishal K Assignee: Vishal K Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-907.patch, ZOOKEEPER-907.patch_v2 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-898) C Client might not cleanup correctly during close
[ https://issues.apache.org/jira/browse/ZOOKEEPER-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12926008#action_12926008 ] Hudson commented on ZOOKEEPER-898: -- Integrated in ZooKeeper-trunk #983 (See [https://hudson.apache.org/hudson/job/ZooKeeper-trunk/983/]) ZOOKEEPER-898. C Client might not cleanup correctly during close (jared cantwell via mahadev) C Client might not cleanup correctly during close - Key: ZOOKEEPER-898 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-898 Project: Zookeeper Issue Type: Bug Components: c client Reporter: Jared Cantwell Assignee: Jared Cantwell Priority: Trivial Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEEPER-898.diff, ZOOKEEPER-898.patch I was looking through the c-client code and noticed a situation where a counter can be incorrectly incremented and a small memory leak can occur. In zookeeper.c : add_completion(), if close_requested is true, then the completion will not be queued. But at the end, outstanding_sync is still incremented and free() never called on the newly allocated completion_list_t. I will submit for review a diff that I believe corrects this issue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-851) ZK lets any node to become an observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12926016#action_12926016 ] Henry Robinson commented on ZOOKEEPER-851: -- I think what happens is that the leader happily lets the new follower connect, but that it won't be part of any voting procedure. It shouldn't become leader because no other nodes know about it to propose or support a vote for it. To add a new node, you'll need to incrementally restart every node in your cluster with the new config. ZK lets any node to become an observer -- Key: ZOOKEEPER-851 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-851 Project: Zookeeper Issue Type: Bug Components: quorum, server Affects Versions: 3.3.1 Reporter: Vishal K Priority: Critical Fix For: 3.4.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
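The incremental restart Henry describes amounts to rolling out one updated zoo.cfg to every server, old and new, before the new server is expected to count as a voter. A sketch, using the addresses from the report (timing values assumed unchanged from the original config):

```
# zoo.cfg deployed to servers 0-2 one at a time (restart each server
# and let it rejoin the quorum before moving to the next), and used
# as-is on the new server.3:
tickTime=2000
dataDir=/var/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.0=10.150.27.61:2888:3888
server.1=10.150.27.62:2888:3888
server.2=10.150.27.63:2888:3888
server.3=10.17.117.71:2888:3888
```

Once all three original voters have restarted with the four-server list, server.3 is known to a quorum and can propose and win elections like any other participant.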