[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever

2010-11-07 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929345#action_12929345
 ] 

Flavio Junqueira commented on ZOOKEEPER-914:


Hi Vishal, I also appreciate your contributions and your comments. I also 
understand your frustration when you find issues with the code, but think that 
it is possibly equally frustrating for the developer who thought that at least 
basic issues were covered, so please try to think that we don't introduce bugs 
on purpose (at least I don't) and our review process is not perfect. 

Regarding clover reports, we have agreed already that code coverage is not 
bulletproof, and in fact there has been several other metrics proposed in the 
scientific literature, but it does indicate that some call path including a 
give piece of code was exercised. It certainly doesn't measure more complex 
cases, like race conditions, crashes and so on. In fact, if you have a better 
way of measuring test coverage, I'd happy to hear about it.

I'm not sure if you agree, but it seems to me that we should close this jira 
because the technical discussion here seems to be similar to the one of 
ZOOKEEPER-900. I'll try to address the concerns you raised regardless of what 
will happen to this jira:

# My point about SO_TIMEOUT comes from here: 
http://download.oracle.com/javase/6/docs/api/java/net/Socket.html#setSoTimeout%28int%29
# I obviously prefer to go with real fixes instead of hacking, but we need to 
have release 3.3.2 out, and it sounded like introducing a configurable timeout 
would fix your problem until the next release;
# About testing beyond the handshake, I'm not sure what you're proposing. If 
the blocking calls are part of the handshake and this is what is failing for 
you, then this is what we should target now, no?   

 QuorumCnxManager blocks forever 
 

 Key: ZOOKEEPER-914
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
 Fix For: 3.3.3, 3.4.0


 This was a disaster. While testing our application we ran into a scenario 
 where a rebooted follower could not join the cluster. Further debugging 
 showed that the follower could not join because the QuorumCnxManager on the 
 leader was blocked for indefinite amount of time in receiveConnect()
 Thread-3 prio=10 tid=0x7fa920005800 nid=0x11bb runnable 
 [0x7fa9275ed000]
java.lang.Thread.State: RUNNABLE
 at sun.nio.ch.FileDispatcher.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
 at sun.nio.ch.IOUtil.read(IOUtil.java:206)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
 - locked 0x7fa93315f988 (a java.lang.Object)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
 I had pointed out this bug along with several other problems in 
 QuorumCnxManager earlier in 
 https://issues.apache.org/jira/browse/ZOOKEEPER-900 and 
 https://issues.apache.org/jira/browse/ZOOKEEPER-822.
 I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix 
 and a patch will be out soon. 
 The problem is that QuorumCnxManager is using SocketChannel in blocking mode. 
 It does a read() in receiveConnection() and a write() in initiateConnection().
 Sorry, but this is really bad programming. Also, points out to lack of 
 failure tests for QuorumCnxManager.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-917) Leader election selected incorrect leader

2010-11-07 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929354#action_12929354
 ] 

Flavio Junqueira commented on ZOOKEEPER-917:


Hi Vishal, It is certainly understand not having dedicated development time 
being an issue. I actually didn't know you're interested in the cluster 
membership... I'm glad to hear, though.

On your questions:
# Suppose we have an ensemble comprising 3 servers: A, B, and C. Now suppose 
that C is the leader, and both A and B follow C. If A disconnects from C for 
whatever reason (e.g., network partition) and it tries to elect a leader, it 
won't get any other process in the LOOKING state. It will actually receive a 
notification from C saying that it is leading and one from B saying that it is 
following C, both with an earlier leader election epoch. To avoid having A 
locked out (not able to elect C as leader), we implemented this exception: a 
process accepts going back to an earlier leader election only if it receives a 
notification from the leader saying that it is leading and from a quorum saying 
that it is following;
# I'm not sure if you referring to specific problem of this jira or if you are 
asking about my hypothetical example. Assuming it is the former, the follower 
(Follower:followLeader()) checks if the leader is proposing an earlier epoch, 
and if not, it accepts the leader snapshot. Because the epoch is the same, all 
followers will accept the leader snapshot follow it. 

 Leader election selected incorrect leader
 -

 Key: ZOOKEEPER-917
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-917
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection, server
Affects Versions: 3.2.2
 Environment: Cloudera distribution of zookeeper (patched to never 
 cache DNS entries)
 Debian lenny
Reporter: Alexandre Hardy
Priority: Critical
 Fix For: 3.3.3, 3.4.0

 Attachments: zklogs-20101102144159SAST.tar.gz


 We had three nodes running zookeeper:
   * 192.168.130.10
   * 192.168.130.11
   * 192.168.130.14
 192.168.130.11 failed, and was replaced by a new node 192.168.130.13 
 (automated startup). The new node had not participated in any zookeeper 
 quorum previously. The node 192.148.130.11 was permanently removed from 
 service and could not contribute to the quorum any further (powered off).
 DNS entries were updated for the new node to allow all the zookeeper servers 
 to find the new node.
 The new node 192.168.130.13 was selected as the LEADER, despite the fact that 
 it had not seen the latest zxid.
 This particular problem has not been verified with later versions of 
 zookeeper, and no attempt has been made to reproduce this problem as yet.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-909) Extract NIO specific code from ClientCnxn

2010-11-07 Thread Thomas Koch (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929395#action_12929395
 ] 

Thomas Koch commented on ZOOKEEPER-909:
---

I've added some javadoc and renamed socket to clientCnxnSocket everywhere. I'll 
upload the patch when the tests completed.

Moving ZooKeeper.state to ClientCnxn.state came in handy at some point to avoid 
unnecessary redirection and to clean the object dependency graph ( 
ZOOKEEPER-837 ). The state variable is only once accessed from ZooKeeper but 
several times from ClientCnxn so it makes a lot of sense to move it where it is 
needed.

SessionExpiredException can actually remain private, but EndOfStreamException 
is used in ClientCnxnSocketNIO.

 Extract NIO specific code from ClientCnxn
 -

 Key: ZOOKEEPER-909
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-909
 Project: Zookeeper
  Issue Type: Sub-task
  Components: java client
Reporter: Thomas Koch
Assignee: Thomas Koch
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-909.patch, ZOOKEEPER-909.patch, 
 ZOOKEEPER-909.patch


 This patch is mostly the same patch as my last one for ZOOKEEPER-823 minus 
 everything Netty related. This means this patch only extract all NIO specific 
 code in the class ClientCnxnSocketNIO which extends ClientCnxnSocket.
 I've redone this patch from current trunk step by step now and couldn't find 
 any logical error. I've already done a couple of successful test runs and 
 will continue to do so this night.
 It would be nice, if we could apply this patch as soon as possible to trunk. 
 This allows us to continue to work on the netty integration without blocking 
 the ClientCnxn class. Adding Netty after this patch should be only a matter 
 of adding the ClientCnxnSocketNetty class with the appropriate test cases.
 You could help me by reviewing the patch and by running it on whatever test 
 server you have available. Please send me any complete failure log you should 
 encounter to thomas at koch point ro. Thx!
 Update: Until now, I've collected 8 successful builds in a row!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-909) Extract NIO specific code from ClientCnxn

2010-11-07 Thread Thomas Koch (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Koch updated ZOOKEEPER-909:
--

Attachment: ZOOKEEPER-909.patch

I couldn't wait for the tests to finish now, but it compiles and I did only 
renames, comments, visibility change and removal of dead code since the last 
patch.

 Extract NIO specific code from ClientCnxn
 -

 Key: ZOOKEEPER-909
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-909
 Project: Zookeeper
  Issue Type: Sub-task
  Components: java client
Reporter: Thomas Koch
Assignee: Thomas Koch
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-909.patch, ZOOKEEPER-909.patch, 
 ZOOKEEPER-909.patch, ZOOKEEPER-909.patch


 This patch is mostly the same patch as my last one for ZOOKEEPER-823 minus 
 everything Netty related. This means this patch only extract all NIO specific 
 code in the class ClientCnxnSocketNIO which extends ClientCnxnSocket.
 I've redone this patch from current trunk step by step now and couldn't find 
 any logical error. I've already done a couple of successful test runs and 
 will continue to do so this night.
 It would be nice, if we could apply this patch as soon as possible to trunk. 
 This allows us to continue to work on the netty integration without blocking 
 the ClientCnxn class. Adding Netty after this patch should be only a matter 
 of adding the ClientCnxnSocketNetty class with the appropriate test cases.
 You could help me by reviewing the patch and by running it on whatever test 
 server you have available. Please send me any complete failure log you should 
 encounter to thomas at koch point ro. Thx!
 Update: Until now, I've collected 8 successful builds in a row!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-909) Extract NIO specific code from ClientCnxn

2010-11-07 Thread Thomas Koch (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Koch updated ZOOKEEPER-909:
--

Status: Patch Available  (was: Open)

 Extract NIO specific code from ClientCnxn
 -

 Key: ZOOKEEPER-909
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-909
 Project: Zookeeper
  Issue Type: Sub-task
  Components: java client
Reporter: Thomas Koch
Assignee: Thomas Koch
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-909.patch, ZOOKEEPER-909.patch, 
 ZOOKEEPER-909.patch, ZOOKEEPER-909.patch


 This patch is mostly the same patch as my last one for ZOOKEEPER-823 minus 
 everything Netty related. This means this patch only extract all NIO specific 
 code in the class ClientCnxnSocketNIO which extends ClientCnxnSocket.
 I've redone this patch from current trunk step by step now and couldn't find 
 any logical error. I've already done a couple of successful test runs and 
 will continue to do so this night.
 It would be nice, if we could apply this patch as soon as possible to trunk. 
 This allows us to continue to work on the netty integration without blocking 
 the ClientCnxn class. Adding Netty after this patch should be only a matter 
 of adding the ClientCnxnSocketNetty class with the appropriate test cases.
 You could help me by reviewing the patch and by running it on whatever test 
 server you have available. Please send me any complete failure log you should 
 encounter to thomas at koch point ro. Thx!
 Update: Until now, I've collected 8 successful builds in a row!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (ZOOKEEPER-920) L7 (application layer) ping support

2010-11-07 Thread Chang Song (JIRA)
L7 (application layer) ping support
---

 Key: ZOOKEEPER-920
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-920
 Project: Zookeeper
  Issue Type: New Feature
  Components: c client
Affects Versions: 3.3.1
Reporter: Chang Song
Priority: Minor


Zookeeper is used in applications where fault tolerance is important. Its 
client i/o thread send/recv heartbeats to/fro Zookeeper ensemble to stay 
connected. However healthy heartbeat does not always means that the application 
that uses Zookeeper client is in good health, it only means that ZK client 
thread is in good health.

This I needed something that can tagged onto Zookeeper ping that represents L7 
(application) health as well.
I have modified C client source to support this in minimal way. I am new to 
Zookeeper, so please code review this code.  I am actually using this code in 
our in-house solution.

https://github.com/tru64ufs/zookeeper/commit/2196d6d5114a2fd2c0a3bc9a55f4494d47d2aece

Thank you very much.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.