[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever
[ https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929345#action_12929345 ]

Flavio Junqueira commented on ZOOKEEPER-914:
--------------------------------------------

Hi Vishal, I appreciate your contributions and your comments. I also understand your frustration when you find issues with the code, but it is possibly equally frustrating for the developer who thought that at least the basic issues were covered, so please keep in mind that we don't introduce bugs on purpose (at least I don't) and that our review process is not perfect.

Regarding clover reports, we have already agreed that code coverage is not bulletproof, and in fact several other metrics have been proposed in the scientific literature, but coverage does indicate that some call path including a given piece of code was exercised. It certainly doesn't measure more complex cases, like race conditions, crashes, and so on. In fact, if you have a better way of measuring test coverage, I'd be happy to hear about it.

I'm not sure if you agree, but it seems to me that we should close this jira, because the technical discussion here seems to be similar to the one in ZOOKEEPER-900. I'll try to address the concerns you raised regardless of what happens to this jira:
# My point about SO_TIMEOUT comes from here: http://download.oracle.com/javase/6/docs/api/java/net/Socket.html#setSoTimeout%28int%29
# I obviously prefer real fixes over hacks, but we need to get release 3.3.2 out, and it sounded like introducing a configurable timeout would fix your problem until the next release;
# About testing beyond the handshake, I'm not sure what you're proposing. If the blocking calls are part of the handshake and this is what is failing for you, then this is what we should target now, no?

> QuorumCnxManager blocks forever
> -------------------------------
>
>                 Key: ZOOKEEPER-914
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection
>            Reporter: Vishal K
>            Assignee: Vishal K
>            Priority: Blocker
>             Fix For: 3.3.3, 3.4.0
>
> This was a disaster. While testing our application we ran into a scenario where a rebooted follower could not join the cluster. Further debugging showed that the follower could not join because the QuorumCnxManager on the leader was blocked for an indefinite amount of time in receiveConnection():
>
> Thread-3 prio=10 tid=0x7fa920005800 nid=0x11bb runnable [0x7fa9275ed000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:206)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>         - locked <0x7fa93315f988> (a java.lang.Object)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
>
> I had pointed out this bug, along with several other problems in QuorumCnxManager, earlier in https://issues.apache.org/jira/browse/ZOOKEEPER-900 and https://issues.apache.org/jira/browse/ZOOKEEPER-822. I forgot to patch this one as part of ZOOKEEPER-822. I am working on a fix and a patch will be out soon. The problem is that QuorumCnxManager is using SocketChannel in blocking mode. It does a read() in receiveConnection() and a write() in initiateConnection(). Sorry, but this is really bad programming. This also points to a lack of failure tests for QuorumCnxManager.
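As a minimal sketch of the configurable-timeout approach Flavio mentions (not the actual ZOOKEEPER-914 patch; the class and field names here are invented for illustration): bound each blocking read on an accepted connection with SO_TIMEOUT so a peer that connects but never completes the handshake cannot stall the listener. Note that SO_TIMEOUT bounds reads on a plain Socket's input stream; it has no effect on SocketChannel.read() itself, which is why this sketch reads from a stream rather than the channel.

{code:java}
import java.io.DataInputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical sketch, not the actual ZOOKEEPER-914 fix: a listener whose
// per-connection reads are bounded by a configurable SO_TIMEOUT.
public class TimedListener {

    private final int cnxTimeoutMillis; // assumed to come from configuration

    public TimedListener(int cnxTimeoutMillis) {
        this.cnxTimeoutMillis = cnxTimeoutMillis;
    }

    public void listen(int port) throws IOException {
        try (ServerSocket server = new ServerSocket(port)) {
            while (!Thread.currentThread().isInterrupted()) {
                Socket client = server.accept();
                try {
                    // SO_TIMEOUT bounds each blocking read on this socket's stream.
                    client.setSoTimeout(cnxTimeoutMillis);
                    DataInputStream din =
                            new DataInputStream(client.getInputStream());
                    long sid = din.readLong(); // handshake: peer announces its server id
                    handleConnection(sid, client);
                } catch (SocketTimeoutException e) {
                    // Handshake never completed: drop this peer instead of
                    // blocking the accept loop forever.
                    client.close();
                } catch (IOException e) {
                    client.close(); // keep the listener alive on per-peer errors
                }
            }
        }
    }

    private void handleConnection(long sid, Socket client) {
        // hand the identified connection to the connection manager (elided)
    }
}
{code}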
[jira] Commented: (ZOOKEEPER-917) Leader election selected incorrect leader
[ https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929354#action_12929354 ]

Flavio Junqueira commented on ZOOKEEPER-917:
--------------------------------------------

Hi Vishal, I certainly understand that not having dedicated development time is an issue. I actually didn't know you were interested in cluster membership... I'm glad to hear it, though. On your questions:
# Suppose we have an ensemble comprising 3 servers: A, B, and C. Now suppose that C is the leader, and both A and B follow C. If A disconnects from C for whatever reason (e.g., a network partition) and tries to elect a leader, it won't find any other process in the LOOKING state. It will actually receive a notification from C saying that it is leading and one from B saying that it is following C, both with an earlier leader election epoch. To avoid having A locked out (unable to elect C as leader), we implemented this exception: a process accepts going back to an earlier leader election epoch only if it receives a notification from the leader saying that it is leading and notifications from a quorum saying that they are following; see the sketch after this message.
# I'm not sure if you are referring to the specific problem of this jira or asking about my hypothetical example. Assuming it is the former, the follower (Follower.followLeader()) checks whether the leader is proposing an earlier epoch, and if not, it accepts the leader's snapshot. Because the epoch is the same, all followers will accept the leader's snapshot and follow it.

> Leader election selected incorrect leader
> ------------------------------------------
>
>                 Key: ZOOKEEPER-917
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-917
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>    Affects Versions: 3.2.2
>         Environment: Cloudera distribution of zookeeper (patched to never cache DNS entries), Debian lenny
>            Reporter: Alexandre Hardy
>            Priority: Critical
>             Fix For: 3.3.3, 3.4.0
>         Attachments: zklogs-20101102144159SAST.tar.gz
>
> We had three nodes running zookeeper:
> * 192.168.130.10
> * 192.168.130.11
> * 192.168.130.14
> 192.168.130.11 failed, and was replaced by a new node 192.168.130.13 (automated startup). The new node had not participated in any zookeeper quorum previously. The node 192.168.130.11 was permanently removed from service and could not contribute to the quorum any further (powered off). DNS entries were updated for the new node to allow all the zookeeper servers to find the new node.
> The new node 192.168.130.13 was selected as the LEADER, despite the fact that it had not seen the latest zxid.
> This particular problem has not been verified with later versions of zookeeper, and no attempt has been made to reproduce this problem as yet.
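The "earlier epoch" exception in item 1 can be written as a small predicate. The sketch below is loosely modeled on FastLeaderElection but uses simplified stand-in types and names, not ZooKeeper's actual API.

{code:java}
import java.util.Collection;

// Illustrative sketch of the rule: accept an earlier election epoch only if
// the proposed leader reports LEADING and, together with the processes that
// report FOLLOWING it, forms a quorum. Types are simplified stand-ins.
public class EarlierEpochRule {

    enum ServerState { LOOKING, FOLLOWING, LEADING }

    static final class Notification {
        final long sid;           // sender's server id
        final long leader;        // leader the sender claims to lead/follow
        final ServerState state;
        Notification(long sid, long leader, ServerState state) {
            this.sid = sid; this.leader = leader; this.state = state;
        }
    }

    static boolean acceptEarlierEpoch(long proposedLeader,
                                      Collection<Notification> received,
                                      int ensembleSize) {
        boolean leaderIsLeading = false;
        int followers = 0;
        for (Notification n : received) {
            if (n.sid == proposedLeader && n.state == ServerState.LEADING) {
                leaderIsLeading = true;
            } else if (n.state == ServerState.FOLLOWING && n.leader == proposedLeader) {
                followers++;
            }
        }
        // In the 3-server example: C LEADING plus B FOLLOWING gives 2 > 3/2,
        // so A may rejoin C's established quorum despite the older epoch.
        return leaderIsLeading && (followers + 1) > ensembleSize / 2;
    }
}
{code}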
[jira] Commented: (ZOOKEEPER-909) Extract NIO specific code from ClientCnxn
[ https://issues.apache.org/jira/browse/ZOOKEEPER-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929395#action_12929395 ]

Thomas Koch commented on ZOOKEEPER-909:
---------------------------------------

I've added some javadoc and renamed socket to clientCnxnSocket everywhere. I'll upload the patch once the tests have completed. Moving ZooKeeper.state to ClientCnxn.state came in handy at some point to avoid unnecessary indirection and to clean up the object dependency graph (ZOOKEEPER-837). The state variable is accessed only once from ZooKeeper but several times from ClientCnxn, so it makes a lot of sense to move it to where it is needed. SessionExpiredException can actually remain private, but EndOfStreamException is used in ClientCnxnSocketNIO.

> Extract NIO specific code from ClientCnxn
> -----------------------------------------
>
>                 Key: ZOOKEEPER-909
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-909
>             Project: Zookeeper
>          Issue Type: Sub-task
>          Components: java client
>            Reporter: Thomas Koch
>            Assignee: Thomas Koch
>             Fix For: 3.4.0
>         Attachments: ZOOKEEPER-909.patch, ZOOKEEPER-909.patch, ZOOKEEPER-909.patch
>
> This patch is mostly the same as my last one for ZOOKEEPER-823, minus everything Netty related. This means this patch only extracts all NIO-specific code into the class ClientCnxnSocketNIO, which extends ClientCnxnSocket. I've redone this patch from current trunk step by step now and couldn't find any logical error. I've already done a couple of successful test runs and will continue to do so tonight.
> It would be nice if we could apply this patch to trunk as soon as possible. This allows us to continue to work on the Netty integration without blocking the ClientCnxn class. Adding Netty after this patch should be only a matter of adding the ClientCnxnSocketNetty class with the appropriate test cases.
> You could help me by reviewing the patch and by running it on whatever test server you have available. Please send me any complete failure log you should encounter to thomas at koch point ro. Thx!
> Update: Until now, I've collected 8 successful builds in a row!
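For readers unfamiliar with the refactoring pattern, here is the rough shape of such an extraction: a transport-agnostic base class that ClientCnxn talks to, plus an NIO subclass hiding the channel details. The method names below are assumptions for this sketch, not the actual ZOOKEEPER-909 API.

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SocketChannel;

// Illustrative shape of the extraction; names are invented, not the patch API.
abstract class ClientCnxnSocketSketch {
    abstract void connect(InetSocketAddress addr) throws IOException;
    abstract void doTransport(int waitTimeMillis) throws IOException; // pump I/O once
    abstract void cleanup();
}

class ClientCnxnSocketNIOSketch extends ClientCnxnSocketSketch {
    private SocketChannel sockChannel;

    @Override
    void connect(InetSocketAddress addr) throws IOException {
        sockChannel = SocketChannel.open();
        sockChannel.configureBlocking(false); // NIO detail hidden from the caller
        sockChannel.connect(addr);
    }

    @Override
    void doTransport(int waitTimeMillis) throws IOException {
        // selector-based readiness handling would live here (elided)
    }

    @Override
    void cleanup() {
        try {
            if (sockChannel != null) sockChannel.close();
        } catch (IOException ignored) { }
    }
}

// A future ClientCnxnSocketNetty would subclass the same base class,
// leaving the ClientCnxn logic itself untouched.
{code}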
[jira] Updated: (ZOOKEEPER-909) Extract NIO specific code from ClientCnxn
[ https://issues.apache.org/jira/browse/ZOOKEEPER-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Koch updated ZOOKEEPER-909:
----------------------------------

    Attachment: ZOOKEEPER-909.patch

I couldn't wait for the tests to finish, but it compiles, and since the last patch I have only done renames, comments, a visibility change, and removal of dead code.

> Extract NIO specific code from ClientCnxn
> -----------------------------------------
>
>                 Key: ZOOKEEPER-909
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-909
>             Project: Zookeeper
>          Issue Type: Sub-task
>          Components: java client
>            Reporter: Thomas Koch
>            Assignee: Thomas Koch
>             Fix For: 3.4.0
>         Attachments: ZOOKEEPER-909.patch, ZOOKEEPER-909.patch, ZOOKEEPER-909.patch, ZOOKEEPER-909.patch
>
[jira] Updated: (ZOOKEEPER-909) Extract NIO specific code from ClientCnxn
[ https://issues.apache.org/jira/browse/ZOOKEEPER-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Koch updated ZOOKEEPER-909:
----------------------------------

    Status: Patch Available  (was: Open)

> Extract NIO specific code from ClientCnxn
> -----------------------------------------
>
>                 Key: ZOOKEEPER-909
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-909
>             Project: Zookeeper
>          Issue Type: Sub-task
>          Components: java client
>            Reporter: Thomas Koch
>            Assignee: Thomas Koch
>             Fix For: 3.4.0
>         Attachments: ZOOKEEPER-909.patch, ZOOKEEPER-909.patch, ZOOKEEPER-909.patch, ZOOKEEPER-909.patch
>
[jira] Created: (ZOOKEEPER-920) L7 (application layer) ping support
L7 (application layer) ping support
-----------------------------------

                 Key: ZOOKEEPER-920
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-920
             Project: Zookeeper
          Issue Type: New Feature
          Components: c client
    Affects Versions: 3.3.1
            Reporter: Chang Song
            Priority: Minor

ZooKeeper is used in applications where fault tolerance is important. The client I/O thread sends and receives heartbeats to and from the ZooKeeper ensemble to stay connected. However, a healthy heartbeat does not always mean that the application using the ZooKeeper client is in good health; it only means that the ZK client thread is in good health. Thus I needed something that could be tagged onto the ZooKeeper ping to represent L7 (application) health as well.

I have modified the C client source to support this in a minimal way. I am new to ZooKeeper, so please review this code. I am actually using this code in our in-house solution.

https://github.com/tru64ufs/zookeeper/commit/2196d6d5114a2fd2c0a3bc9a55f4494d47d2aece

Thank you very much.
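The linked commit patches the C client; as a language-neutral illustration of the idea only (not the actual change), here is a hypothetical Java sketch in which the client consults an application-supplied health callback before each heartbeat and tags the ping payload with the result. All names below are invented for this sketch.

{code:java}
import java.util.function.BooleanSupplier;

// Hypothetical illustration of the L7-ping idea: each heartbeat carries one
// extra byte reporting application-level health from a registered callback.
public class L7PingSketch {

    private final BooleanSupplier appHealthProbe; // e.g. () -> requestQueueIsDraining()

    public L7PingSketch(BooleanSupplier appHealthProbe) {
        this.appHealthProbe = appHealthProbe;
    }

    /** Build the heartbeat payload: a single flag byte carrying L7 health. */
    public byte[] buildPingPayload() {
        byte healthy = appHealthProbe.getAsBoolean() ? (byte) 1 : (byte) 0;
        return new byte[] { healthy };
    }
}
{code}

A server receiving the flag set to 0 could then treat the session as unhealthy even though the transport-level heartbeat still arrives, which is the distinction the proposal draws between client-thread health and application health.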