[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever

2010-11-08 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929589#action_12929589
 ] 

Vishal K commented on ZOOKEEPER-914:


Hi Flavio,

You are right. Sorry, my comment was not fair.

Regarding SO_TIMEOUT: Per my understanding, SO_TIMEOUT works only when a 
channel is set in non-blocking mode using isConfigureBlocking(). If the channel 
is not configured to work in non-blocking mode, setting SO_TIMEOUT has no 
effect. Please let me know if you think there is a way to set timeout on the 
socket after accepting the connection (without configuring the channel in 
non-blocking mode). The only way I know to use SO_TIMEOUT is by using 
channel.isConfigureBlocking(false). The current code in QuorumCnxManager 
assumes use of blocking IO. We will have to handle partial reads/writes. Please 
refer to my earlier question regarding SO_TIMEOUT for implementing non-blocking 
IO.

I thought this fix was supposed to go in for 3.3.3. As I suggested earlier, one 
quick fix to the problem is to use TimerTask(). Before doing blocking IO we can 
start a timer for that channel (in receiveConnect() before read). Once the 
timer expires, check if the read() has finished. If not, interrupt and close 
the channel. I think having such a fix (or some other fix that will get around 
the problem) until the real fix is in is a better approach. Let me what you 
think?

If we decide to go one of the quick fixes, then we can use this JIRA for that 
and use ZOOKEEPER-900 for the real fix.. Otherwise, as you suggested, we can 
close this JIRA and use ZOOKEEPER-900.

-Vishal

 QuorumCnxManager blocks forever 
 

 Key: ZOOKEEPER-914
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
 Fix For: 3.3.3, 3.4.0


 This was a disaster. While testing our application we ran into a scenario 
 where a rebooted follower could not join the cluster. Further debugging 
 showed that the follower could not join because the QuorumCnxManager on the 
 leader was blocked for indefinite amount of time in receiveConnect()
 Thread-3 prio=10 tid=0x7fa920005800 nid=0x11bb runnable 
 [0x7fa9275ed000]
java.lang.Thread.State: RUNNABLE
 at sun.nio.ch.FileDispatcher.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
 at sun.nio.ch.IOUtil.read(IOUtil.java:206)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
 - locked 0x7fa93315f988 (a java.lang.Object)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
 I had pointed out this bug along with several other problems in 
 QuorumCnxManager earlier in 
 https://issues.apache.org/jira/browse/ZOOKEEPER-900 and 
 https://issues.apache.org/jira/browse/ZOOKEEPER-822.
 I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix 
 and a patch will be out soon. 
 The problem is that QuorumCnxManager is using SocketChannel in blocking mode. 
 It does a read() in receiveConnection() and a write() in initiateConnection().
 Sorry, but this is really bad programming. Also, points out to lack of 
 failure tests for QuorumCnxManager.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever

2010-11-08 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929675#action_12929675
 ] 

Flavio Junqueira commented on ZOOKEEPER-914:


Hi Vishal, The Socket documentation does sound ambiguous, but my understanding 
is that SO_TIMEOUT is for blocking mode, not non-blocking mode. Non-blocking 
calls return immediately, so they shouldn't need a timeout value, no? 
Independent of using it or not, I would be curious to learn if my understanding 
is incorrect.

About the release to include the fix, I think Mahdev later came and changed it 
to 3.3.3. It is fine with me, and we just need to check what the schedule for 
3.3.3 is. My preference is to work directly on ZOOKEEPER-900 (or 901, which I 
think might be a more significant change), if you think we can produce a patch 
in time for 3.3.3. 

 QuorumCnxManager blocks forever 
 

 Key: ZOOKEEPER-914
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
 Fix For: 3.3.3, 3.4.0


 This was a disaster. While testing our application we ran into a scenario 
 where a rebooted follower could not join the cluster. Further debugging 
 showed that the follower could not join because the QuorumCnxManager on the 
 leader was blocked for indefinite amount of time in receiveConnect()
 Thread-3 prio=10 tid=0x7fa920005800 nid=0x11bb runnable 
 [0x7fa9275ed000]
java.lang.Thread.State: RUNNABLE
 at sun.nio.ch.FileDispatcher.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
 at sun.nio.ch.IOUtil.read(IOUtil.java:206)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
 - locked 0x7fa93315f988 (a java.lang.Object)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
 I had pointed out this bug along with several other problems in 
 QuorumCnxManager earlier in 
 https://issues.apache.org/jira/browse/ZOOKEEPER-900 and 
 https://issues.apache.org/jira/browse/ZOOKEEPER-822.
 I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix 
 and a patch will be out soon. 
 The problem is that QuorumCnxManager is using SocketChannel in blocking mode. 
 It does a read() in receiveConnection() and a write() in initiateConnection().
 Sorry, but this is really bad programming. Also, points out to lack of 
 failure tests for QuorumCnxManager.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever

2010-11-08 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929681#action_12929681
 ] 

Vishal K commented on ZOOKEEPER-914:


Hi Flavio,

The documentation is not clear.
SO_TIMEOUT  has not effect on blocking channels. Non-blocking channels, wait 
for the specified timeout if nothing is available in the buffer. Otherwise, it 
returns whatever bytes are currently available in the buffer. I wrote a test 
the following test to verify this. Let me know if you know about way to make 
SO_TIMEOUT to work.
 
QuorumPeer peerLeader = new QuorumPeer(peers, tmpdir[1], tmpdir[1], 
port[1], 3, 0, 2, 2, 2);
QuorumCnxManager cnxManager = new QuorumCnxManager(peerLeader);
QuorumCnxManager.Listener listener = cnxManager.listener;
SocketChannel channel = SocketChannel.open();
channel.socket().connect(peers.get(new Long(0)).electionAddr, 5000);
channel.configureBlocking(false);
channel.socket().setSoTimeout(1000);
byte[] msgBytes = new byte[8];
ByteBuffer msgBuffer = ByteBuffer.wrap(msgBytes);

/**
 * Don't send any data and call read() and see how long it waits.
 */
long begin = System.currentTimeMillis();
channel.read(msgBuffer);
   long end = System.currentTimeMillis();

Feel to free close duplicate bugs.

 QuorumCnxManager blocks forever 
 

 Key: ZOOKEEPER-914
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
 Fix For: 3.3.3, 3.4.0


 This was a disaster. While testing our application we ran into a scenario 
 where a rebooted follower could not join the cluster. Further debugging 
 showed that the follower could not join because the QuorumCnxManager on the 
 leader was blocked for indefinite amount of time in receiveConnect()
 Thread-3 prio=10 tid=0x7fa920005800 nid=0x11bb runnable 
 [0x7fa9275ed000]
java.lang.Thread.State: RUNNABLE
 at sun.nio.ch.FileDispatcher.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
 at sun.nio.ch.IOUtil.read(IOUtil.java:206)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
 - locked 0x7fa93315f988 (a java.lang.Object)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
 I had pointed out this bug along with several other problems in 
 QuorumCnxManager earlier in 
 https://issues.apache.org/jira/browse/ZOOKEEPER-900 and 
 https://issues.apache.org/jira/browse/ZOOKEEPER-822.
 I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix 
 and a patch will be out soon. 
 The problem is that QuorumCnxManager is using SocketChannel in blocking mode. 
 It does a read() in receiveConnection() and a write() in initiateConnection().
 Sorry, but this is really bad programming. Also, points out to lack of 
 failure tests for QuorumCnxManager.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever

2010-11-08 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929724#action_12929724
 ] 

Vishal K commented on ZOOKEEPER-914:


OK, that didn't make sense. My comment about non-blocking channels was 
incorrect. They are non-blocking :-) Doesn't wait till SO_TIMEOUT.

 QuorumCnxManager blocks forever 
 

 Key: ZOOKEEPER-914
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
 Fix For: 3.3.3, 3.4.0


 This was a disaster. While testing our application we ran into a scenario 
 where a rebooted follower could not join the cluster. Further debugging 
 showed that the follower could not join because the QuorumCnxManager on the 
 leader was blocked for indefinite amount of time in receiveConnect()
 Thread-3 prio=10 tid=0x7fa920005800 nid=0x11bb runnable 
 [0x7fa9275ed000]
java.lang.Thread.State: RUNNABLE
 at sun.nio.ch.FileDispatcher.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
 at sun.nio.ch.IOUtil.read(IOUtil.java:206)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
 - locked 0x7fa93315f988 (a java.lang.Object)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
 I had pointed out this bug along with several other problems in 
 QuorumCnxManager earlier in 
 https://issues.apache.org/jira/browse/ZOOKEEPER-900 and 
 https://issues.apache.org/jira/browse/ZOOKEEPER-822.
 I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix 
 and a patch will be out soon. 
 The problem is that QuorumCnxManager is using SocketChannel in blocking mode. 
 It does a read() in receiveConnection() and a write() in initiateConnection().
 Sorry, but this is really bad programming. Also, points out to lack of 
 failure tests for QuorumCnxManager.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever

2010-11-07 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929345#action_12929345
 ] 

Flavio Junqueira commented on ZOOKEEPER-914:


Hi Vishal, I also appreciate your contributions and your comments. I also 
understand your frustration when you find issues with the code, but think that 
it is possibly equally frustrating for the developer who thought that at least 
basic issues were covered, so please try to think that we don't introduce bugs 
on purpose (at least I don't) and our review process is not perfect. 

Regarding clover reports, we have agreed already that code coverage is not 
bulletproof, and in fact there has been several other metrics proposed in the 
scientific literature, but it does indicate that some call path including a 
give piece of code was exercised. It certainly doesn't measure more complex 
cases, like race conditions, crashes and so on. In fact, if you have a better 
way of measuring test coverage, I'd happy to hear about it.

I'm not sure if you agree, but it seems to me that we should close this jira 
because the technical discussion here seems to be similar to the one of 
ZOOKEEPER-900. I'll try to address the concerns you raised regardless of what 
will happen to this jira:

# My point about SO_TIMEOUT comes from here: 
http://download.oracle.com/javase/6/docs/api/java/net/Socket.html#setSoTimeout%28int%29
# I obviously prefer to go with real fixes instead of hacking, but we need to 
have release 3.3.2 out, and it sounded like introducing a configurable timeout 
would fix your problem until the next release;
# About testing beyond the handshake, I'm not sure what you're proposing. If 
the blocking calls are part of the handshake and this is what is failing for 
you, then this is what we should target now, no?   

 QuorumCnxManager blocks forever 
 

 Key: ZOOKEEPER-914
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
 Fix For: 3.3.3, 3.4.0


 This was a disaster. While testing our application we ran into a scenario 
 where a rebooted follower could not join the cluster. Further debugging 
 showed that the follower could not join because the QuorumCnxManager on the 
 leader was blocked for indefinite amount of time in receiveConnect()
 Thread-3 prio=10 tid=0x7fa920005800 nid=0x11bb runnable 
 [0x7fa9275ed000]
java.lang.Thread.State: RUNNABLE
 at sun.nio.ch.FileDispatcher.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
 at sun.nio.ch.IOUtil.read(IOUtil.java:206)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
 - locked 0x7fa93315f988 (a java.lang.Object)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
 I had pointed out this bug along with several other problems in 
 QuorumCnxManager earlier in 
 https://issues.apache.org/jira/browse/ZOOKEEPER-900 and 
 https://issues.apache.org/jira/browse/ZOOKEEPER-822.
 I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix 
 and a patch will be out soon. 
 The problem is that QuorumCnxManager is using SocketChannel in blocking mode. 
 It does a read() in receiveConnection() and a write() in initiateConnection().
 Sorry, but this is really bad programming. Also, points out to lack of 
 failure tests for QuorumCnxManager.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever

2010-11-04 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12928441#action_12928441
 ] 

Patrick Hunt commented on ZOOKEEPER-914:


Hi Vishal we do appreciate your feedback and interest. You've been doing a 
great job highlighting issues and working to resolve them. Again, thanks. 

We also feel your frustrations. We wish we had unlimited time and resources to 
develop and test ZK, unfortunately that's not the case. This is one of the many 
reasons why we brought the project to Apache, to build community and gain 
insights of developers and users such as yourself. Is everything done, is it 
all perfect code? No. However the source is open, the process is open, and we 
hope that more contributors will sign on to working together and making 
significant contributions. This doesn't have to be just new features, it very 
much could be testing (code and QA), documentation and all the other bits that 
go into useful software.

I encourage you to bring your QA related concerns to the larger group. That's 
something that should be discussed on the dev list rather than here in a jira 
for a specific issue. As you can see the primary committers work hard to 
address all the issues found. However there's just not enough of us (and we 
ourselves work on this in our spare time to varying degrees). Perhaps others 
will feel similarly and you can work to address some of the deficiencies. I'd 
*love* to see more unit test and more system testing. If you want to make that 
happen I'd do my best to support you.

Regards. (I'll let Flavio comment on the further specifics of this particular 
issue)


 QuorumCnxManager blocks forever 
 

 Key: ZOOKEEPER-914
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
 Fix For: 3.3.3, 3.4.0


 This was a disaster. While testing our application we ran into a scenario 
 where a rebooted follower could not join the cluster. Further debugging 
 showed that the follower could not join because the QuorumCnxManager on the 
 leader was blocked for indefinite amount of time in receiveConnect()
 Thread-3 prio=10 tid=0x7fa920005800 nid=0x11bb runnable 
 [0x7fa9275ed000]
java.lang.Thread.State: RUNNABLE
 at sun.nio.ch.FileDispatcher.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
 at sun.nio.ch.IOUtil.read(IOUtil.java:206)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
 - locked 0x7fa93315f988 (a java.lang.Object)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
 I had pointed out this bug along with several other problems in 
 QuorumCnxManager earlier in 
 https://issues.apache.org/jira/browse/ZOOKEEPER-900 and 
 https://issues.apache.org/jira/browse/ZOOKEEPER-822.
 I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix 
 and a patch will be out soon. 
 The problem is that QuorumCnxManager is using SocketChannel in blocking mode. 
 It does a read() in receiveConnection() and a write() in initiateConnection().
 Sorry, but this is really bad programming. Also, points out to lack of 
 failure tests for QuorumCnxManager.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever

2010-10-29 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12926343#action_12926343
 ] 

Vishal K commented on ZOOKEEPER-914:


Hi Pat, Flavio,

I will begin with admitting that my comment about bad programming
was not a constructive comment and unwarranted. One can argue that such
a comment can be viewed as constructive since it raises a red alert on
the quality. But I understand that this is highly subjective, and hence,
should be avoided. However, I stand corrected for my comment about
lack of testing/tests.

I would like to take a moment here to explain our frustrations and
then I will get back to this bug and my suggestions to improve
QuorumCnxManager and testing. While reading the first part of my
comments please ask yourself why were these issues not uncovered
prior to checkins? rather than why is this guy complaining?.
You may find these to be constructive as well.
Finally, I would like to point out that for most of the
issues listed below I have tried to help by debugging and/or providing
patches. Also, we are interested in and will continue to contribute to
ZooKeeper.

We wrote an application on  top of ZooKeeper. We started testing
our application to see how it handles failures. We rebooted the
follower and we immediately  ran into ZOOKEEPER-335 (zookeeper
servers should commit the new leader txn to their logs.). We then
tried to reboot the leader and we ran into several bugs reported in
ZOOKEEPER-822 (Leader election taking a long time to complete). Once,
we misconfigured one of our ZooKeeper servers and we ran into bug
ZOOKEEPER-851 (ZK lets any node to become an observer). We made a
minor change to our client code and we ran into bug ZOOKEEPER-907
(KeeperErrorCode = Session moved messages), which also happens to
identify ZOOKEEPER-915 (Errors that happen during sync() processing at
the leader do not get propagated back to the client). A few days back
we rebooted our follower and ran into ZOOKEEPER-914 (QuorumCnxManager
blocks forever). There are a few other issues that I haven't reported
yet (still debugging).

Looking at the reported bugs, I believe that
almost all of them fall under sanity/basic failure testing
category. Therefore, if you look at it from our view, clover reports
and arguments about the number of tests that cover the code path in
question are great, but are not convincing. Anyways, now I will
conclude my end of the argument and move forward to look at the real issues
at hand. Hopefully, you will find the comments below to be constructive.

Moving on...

1. AFAIK, SO_TIMEOUT does not work for blocking channels. Is there a
way to set timeout on blocking channels? If not, we will have to use
non-blocking channels and then make sure that we handle read/write
correctly, because a read/write can return partial results or
non-blocking channels. I noticed that Learner.java uses
BinaryInputArchive from Jute in non-blocking mode. Should we use that?
Also note that QorumCnxManager after accepting connection reads the
first 8 bytes from the channel buffer and assumes that it is a server
ID. It does not have a tag to indicate packet/request type.

2. We could put a hack to timeout calls in receiveConnection and
InitiateConnection using TimerTask (start a timer and interrupt of
read hasn't returned after the timer expires) or Threads. But I would
rather go for the real fix.

3. Testing failures - Flavio, in addition to handshake protocol, we
will need to test failures post handshake (see initiateConnnction) to
ensure that a server does not block while writing if the receiver is
down. We need a way to introduce faults in the code. At my earlier
job, when we implemented a clustered system, we had a way to write
some form of assert statements in our code. While writing the code we
would put asserts and critical places. We could then enable these
asserts (using the assert name) in our tests and trigger
faults. Asserts could be used only in debug mode. In addition, we had
assert actions, which could essentially execute a specified method
(action). We introduced faults usin these these methods. This was done
using propriety library written in C. I am fairly new to the Java
world, but I am guessing there is a tool to do something similar
(maybe mockito?).

Also, in addition, to the  failure tests, we should periodically do real
failure testing. For example, rebootingnodes. In our experience, such
testing introduces unexpected latencies (e.g., exposes code to TCP
timeouts).

In our application, we have a RMI server that does management of
ZooKeeper (start/stop/etc) in addition to other management tasks for our
application. We are planning to extend this RMI service for debugging
(e.g., add calls to reboot/hang the machine). If such a service seems
useful to you as well, then when time permits, I will cleanup the code
and submit it to ZK.

4. I have a few suggestions to 

[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever

2010-10-28 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12925746#action_12925746
 ] 

Flavio Junqueira commented on ZOOKEEPER-914:


As Pat, I would also appreciate some more constructive comments (and behavior). 

From the Clover reports, we exercise a significant part of the QCM code, but 
it is true, though, that we don't test the cases you have been exposing. Here 
is a way I believe we can reproduce this problem (I haven't implemented it, 
but seems to make sense). The high-level idea is to make sure that if some 
server stops responding before it completes the handshake protocol, then no 
instance of QCM across all servers will block and prevent other servers from 
joining the ensemble.

Suppose we configure an ensemble with 5 servers using QuorumBase. One of the 
servers will be a simple mock server, as we do in the CnxManagerTest tests. Now 
here is the sequence of steps to follow:

# Start three of the servers and confirm that they accept and execute 
operations;
# Start mock server and execute the protocol partially. For the read case you 
mention, you can simply not send the server identifier. That will cause the 
read on the other end to block and to not accept more connections;
# Start a 5th server and check if it is able to join the ensemble.

A simple fix to have it working for you soon along the lines of what we have 
done to make the connection timeout configurable seems to be to set SO_TIMEOUT. 
But, if you have other ideas, please lay them out. Please bear in mind that the 
major modifications we should leave for ZOOKEEPER-901 because those will take 
more time to develop and get into shape.

 QuorumCnxManager blocks forever 
 

 Key: ZOOKEEPER-914
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
 Fix For: 3.3.3, 3.4.0


 This was a disaster. While testing our application we ran into a scenario 
 where a rebooted follower could not join the cluster. Further debugging 
 showed that the follower could not join because the QuorumCnxManager on the 
 leader was blocked for indefinite amount of time in receiveConnect()
 Thread-3 prio=10 tid=0x7fa920005800 nid=0x11bb runnable 
 [0x7fa9275ed000]
java.lang.Thread.State: RUNNABLE
 at sun.nio.ch.FileDispatcher.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
 at sun.nio.ch.IOUtil.read(IOUtil.java:206)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
 - locked 0x7fa93315f988 (a java.lang.Object)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
 I had pointed out this bug along with several other problems in 
 QuorumCnxManager earlier in 
 https://issues.apache.org/jira/browse/ZOOKEEPER-900 and 
 https://issues.apache.org/jira/browse/ZOOKEEPER-822.
 I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix 
 and a patch will be out soon. 
 The problem is that QuorumCnxManager is using SocketChannel in blocking mode. 
 It does a read() in receiveConnection() and a write() in initiateConnection().
 Sorry, but this is really bad programming. Also, points out to lack of 
 failure tests for QuorumCnxManager.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever

2010-10-28 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12925843#action_12925843
 ] 

Patrick Hunt commented on ZOOKEEPER-914:


Flavio in item 2 you mention mock, consider using mockito, I've had alot of 
luck with that personally, also Hadoop itself has moved to using this in it's 
tests. http://code.google.com/p/mockito/

 QuorumCnxManager blocks forever 
 

 Key: ZOOKEEPER-914
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
 Fix For: 3.3.3, 3.4.0


 This was a disaster. While testing our application we ran into a scenario 
 where a rebooted follower could not join the cluster. Further debugging 
 showed that the follower could not join because the QuorumCnxManager on the 
 leader was blocked for indefinite amount of time in receiveConnect()
 Thread-3 prio=10 tid=0x7fa920005800 nid=0x11bb runnable 
 [0x7fa9275ed000]
java.lang.Thread.State: RUNNABLE
 at sun.nio.ch.FileDispatcher.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
 at sun.nio.ch.IOUtil.read(IOUtil.java:206)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
 - locked 0x7fa93315f988 (a java.lang.Object)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
 I had pointed out this bug along with several other problems in 
 QuorumCnxManager earlier in 
 https://issues.apache.org/jira/browse/ZOOKEEPER-900 and 
 https://issues.apache.org/jira/browse/ZOOKEEPER-822.
 I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix 
 and a patch will be out soon. 
 The problem is that QuorumCnxManager is using SocketChannel in blocking mode. 
 It does a read() in receiveConnection() and a write() in initiateConnection().
 Sorry, but this is really bad programming. Also, points out to lack of 
 failure tests for QuorumCnxManager.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever

2010-10-27 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12925546#action_12925546
 ] 

Patrick Hunt commented on ZOOKEEPER-914:


Thanks for the bug report. I've yet to find a codebase where I couldn't find 
what I consider bad programming, so I don't find that a constructive comment. 
We're happy you've joined community, let's all work together to address these 
issues. Thanks.

bq. points out to lack of failure tests for QuorumCnxManager

We can always use more testing. If you want to contribute additional patches 
just for testing please do so (I'm sure if you talk with Flavio he could give 
you some good ideas). Notice that there are a number of tests exercising this 
code already (around 85% coverage), we'd need to figure out some way to 
simulate network failures and such, which is difficult in my experience:
https://hudson.apache.org/hudson/view/S-Z/view/ZooKeeper/job/ZooKeeper-trunk/clover/org/apache/zookeeper/server/quorum/QuorumCnxManager.html

 QuorumCnxManager blocks forever 
 

 Key: ZOOKEEPER-914
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
 Project: Zookeeper
  Issue Type: Bug
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker

 This was a disaster. While testing our application we ran into a scenario 
 where a rebooted follower could not join the cluster. Further debugging 
 showed that the follower could not join because the QuorumCnxManager on the 
 leader was blocked for indefinite amount of time in receiveConnect()
 Thread-3 prio=10 tid=0x7fa920005800 nid=0x11bb runnable 
 [0x7fa9275ed000]
java.lang.Thread.State: RUNNABLE
 at sun.nio.ch.FileDispatcher.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
 at sun.nio.ch.IOUtil.read(IOUtil.java:206)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
 - locked 0x7fa93315f988 (a java.lang.Object)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
 I had pointed out this bug along with several other problems in 
 QuorumCnxManager earlier in 
 https://issues.apache.org/jira/browse/ZOOKEEPER-900 and 
 https://issues.apache.org/jira/browse/ZOOKEEPER-822.
 I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix 
 and a patch will be out soon. 
 The problem is that QuorumCnxManager is using SocketChannel in blocking mode. 
 It does a read() in receiveConnection() and a write() in initiateConnection().
 Sorry, but this is really bad programming. Also, points out to lack of 
 failure tests for QuorumCnxManager.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.