[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever

2010-10-29 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12926343#action_12926343
 ] 

Vishal K commented on ZOOKEEPER-914:


Hi Pat, Flavio,

I will begin by admitting that my comment about bad programming
was not constructive and was unwarranted. One can argue that such
a comment can be viewed as constructive since it raises a red alert on
the quality. But I understand that this is highly subjective, and hence,
should be avoided. However, I stand corrected for my comment about
lack of testing/tests.

I would like to take a moment here to explain our frustrations and
then I will get back to this bug and my suggestions to improve
QuorumCnxManager and testing. While reading the first part of my
comments, please ask yourself "why were these issues not uncovered
prior to check-ins?" rather than "why is this guy complaining?".
You may find these to be constructive as well.
Finally, I would like to point out that for most of the
issues listed below I have tried to help by debugging and/or providing
patches. Also, we are interested in and will continue to contribute to
ZooKeeper.

We wrote an application on top of ZooKeeper. We started testing
our application to see how it handles failures. We rebooted the
follower and we immediately ran into ZOOKEEPER-335 (zookeeper
servers should commit the new leader txn to their logs.). We then
tried to reboot the leader and we ran into several bugs reported in
ZOOKEEPER-822 (Leader election taking a long time to complete). Once,
we misconfigured one of our ZooKeeper servers and we ran into bug
ZOOKEEPER-851 (ZK lets any node to become an observer). We made a
minor change to our client code and we ran into bug ZOOKEEPER-907
(KeeperErrorCode = Session moved messages), which also happens to
identify ZOOKEEPER-915 (Errors that happen during sync() processing at
the leader do not get propagated back to the client). A few days back
we rebooted our follower and ran into ZOOKEEPER-914 (QuorumCnxManager
blocks forever). There are a few other issues that I haven't reported
yet (still debugging).

Looking at the reported bugs, I believe that
almost all of them fall under sanity/basic failure testing
category. Therefore, if you look at it from our view, clover reports
and arguments about the number of tests that cover the code path in
question are great, but are not convincing. Anyways, now I will
conclude my end of the argument and move forward to look at the real issues
at hand. Hopefully, you will find the comments below to be constructive.

Moving on...

1. AFAIK, SO_TIMEOUT does not work for blocking channels. Is there a
way to set a timeout on blocking channels? If not, we will have to use
non-blocking channels and then make sure that we handle read/write
correctly, because a read/write can return partial results on
non-blocking channels. I noticed that Learner.java uses
BinaryInputArchive from Jute in non-blocking mode. Should we use that?
Also note that QuorumCnxManager, after accepting a connection, reads the
first 8 bytes from the channel buffer and assumes that it is a server
ID. It does not have a tag to indicate packet/request type.
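
As an illustration of point 1, here is a minimal sketch of reading a fixed
number of bytes from a non-blocking channel with a deadline, looping until
the buffer is full because each read may return only part of the data.
DeadlineReader, readFully and readServerId are hypothetical names, not
existing ZooKeeper code, and treating the 8 bytes as a big-endian long is
an assumption made for the example.

```java
import java.io.EOFException;
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical helper for illustration only; not part of ZooKeeper.
final class DeadlineReader {
    private DeadlineReader() {}

    // Fills buf completely, or throws once timeoutMs has elapsed.
    // Switches the channel to non-blocking mode and leaves it that way.
    static <C extends SelectableChannel & ReadableByteChannel> void readFully(
            C ch, ByteBuffer buf, long timeoutMs) throws IOException {
        ch.configureBlocking(false);
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        try (Selector sel = Selector.open()) {
            ch.register(sel, SelectionKey.OP_READ);
            while (buf.hasRemaining()) {
                long leftMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (leftMs <= 0) {
                    throw new SocketTimeoutException(
                            "timed out with " + buf.remaining() + " bytes missing");
                }
                sel.select(leftMs);          // wait until readable or until time runs out
                sel.selectedKeys().clear();
                int n = ch.read(buf);        // may return a partial result, 0, or -1
                if (n < 0) {
                    throw new EOFException("peer closed before sending all bytes");
                }
            }
        }
    }

    // Reads the first 8 bytes and interprets them as a server ID (assumption).
    static <C extends SelectableChannel & ReadableByteChannel> long readServerId(
            C ch, long timeoutMs) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(8);
        readFully(ch, buf, timeoutMs);
        buf.flip();
        return buf.getLong();
    }
}
```

A caller would invoke something like DeadlineReader.readServerId(channel, 5000)
right after accept and close the connection on SocketTimeoutException or
EOFException instead of blocking forever.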

2. We could put in a hack to time out calls in receiveConnection and
initiateConnection using TimerTask (start a timer and interrupt if the
read hasn't returned after the timer expires) or Threads. But I would
rather go for the real fix.
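
For reference, a rough sketch of what such a hack might look like. Note that
it swaps the interrupt for closing the channel from a timer thread, which
also unblocks a pending read on an interruptible channel and avoids leaving
a stray interrupt flag on the reader. WatchdogRead is a made-up name and
nothing here is taken from the ZooKeeper code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channel;
import java.nio.channels.ReadableByteChannel;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not part of ZooKeeper.
final class WatchdogRead {
    private WatchdogRead() {}

    // One daemon timer thread shared by all watchdogs.
    private static final ScheduledExecutorService TIMER =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "read-watchdog");
                t.setDaemon(true);
                return t;
            });

    // Performs one blocking read; if it has not returned after timeoutMs,
    // the timer closes the channel and the read fails with a
    // ClosedChannelException (typically AsynchronousCloseException).
    static int read(ReadableByteChannel ch, ByteBuffer buf, long timeoutMs)
            throws IOException {
        ScheduledFuture<?> watchdog =
                TIMER.schedule(() -> closeQuietly(ch), timeoutMs, TimeUnit.MILLISECONDS);
        try {
            return ch.read(buf);
        } finally {
            watchdog.cancel(false); // read returned or failed; stop the timer
        }
    }

    private static void closeQuietly(Channel ch) {
        try {
            ch.close();
        } catch (IOException ignored) {
            // nothing useful to do if close itself fails
        }
    }
}
```

The drawback is visible in the sketch: a timeout costs the connection, and
there is still a small window in which the timer can fire just after the
read has returned.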

3. Testing failures - Flavio, in addition to the handshake protocol, we
will need to test failures post handshake (see initiateConnection) to
ensure that a server does not block while writing if the receiver is
down. We need a way to introduce faults in the code. At my earlier
job, when we implemented a clustered system, we had a way to write
some form of assert statements in our code. While writing the code we
would put asserts at critical places. We could then enable these
asserts (using the assert name) in our tests and trigger
faults. Asserts could be used only in debug mode. In addition, we had
assert actions, which could essentially execute a specified method
(action). We introduced faults using these methods. This was done
using a proprietary library written in C. I am fairly new to the Java
world, but I am guessing there is a tool to do something similar
(maybe mockito?).
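
To give an idea of the named asserts with actions described above, here is a
minimal sketch in Java. FaultPoints and its methods are invented for this
example and do not correspond to any existing library or to the C library
mentioned; a real version would presumably also be switched off outside
debug builds.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical fault-injection registry for illustration only.
final class FaultPoints {
    private FaultPoints() {}

    // Maps a fault-point name to the action a test has armed for it.
    private static final Map<String, Runnable> ACTIONS = new ConcurrentHashMap<>();

    // Called by tests: arm the named point with an action (throw, sleep, ...).
    static void enable(String name, Runnable action) {
        ACTIONS.put(name, action);
    }

    // Called by tests: disarm the named point again.
    static void disable(String name) {
        ACTIONS.remove(name);
    }

    // Called from production code at critical places.
    // Does nothing unless a test has armed this name.
    static void hit(String name) {
        Runnable action = ACTIONS.get(name);
        if (action != null) {
            action.run();
        }
    }
}
```

Production code would call, say, FaultPoints.hit("before-sid-write") just
before a write (the name is made up), and a test would arm it with
FaultPoints.enable("before-sid-write", () -> { throw new IllegalStateException("injected"); })
to simulate a failure at exactly that place.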

In addition to the failure tests, we should periodically do real
failure testing, for example, rebooting nodes. In our experience, such
testing introduces unexpected latencies (e.g., exposes code to TCP
timeouts).

In our application, we have an RMI server that does management of
ZooKeeper (start/stop/etc) in addition to other management tasks for our
application. We are planning to extend this RMI service for debugging
(e.g., add calls to reboot/hang the machine). If such a service seems
useful to you as well, then when time permits, I will clean up the code
and submit it to ZK.

4. I have a few suggestions to 

[jira] Commented: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages

2010-10-29 Thread Benjamin Reed (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12926404#action_12926404
 ] 

Benjamin Reed commented on ZOOKEEPER-907:
-

may i propose accepting this patch without a test case? (we can see that it 
fixes the problem.) that way we can get 3.3.2 out. once ZOOKEEPER-915 goes in, 
the tests should cover this issue.

 Spurious KeeperErrorCode = Session moved messages
 ---

 Key: ZOOKEEPER-907
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-907
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.3.1
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-907.patch, ZOOKEEPER-907.patch_v2


 The sync request does not set the session owner in Request.
 As a result, the leader keeps printing:
 2010-07-01 10:55:36,733 - INFO  [ProcessThread:-1:preprequestproces...@405] - 
 Got user-level KeeperException when processing sessionid:0x298d3b1fa9 
 type:sync: cxid:0x6 zxid:0xfffe txntype:unknown reqpath:/ Error 
 Path:null Error:KeeperErrorCode = Session moved

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



server logs an 'Unexpected Exception' error on any client close?

2010-10-29 Thread Nicholas Harteau
For a while I thought this was the perl client not calling zookeeper_close() 
properly, but I can reproduce this using the example c_client code from the 
twiki as well.  
(http://hadoop.apache.org/zookeeper/docs/r3.3.1/zookeeperProgrammers.html#ZooKeeper+C+client+API)

I thought I'd check here before filing a JIRA; am I missing something obvious?

Anytime a client disconnects from my quorum servers, even when 
zookeeper_close() has been called, I get this:

2010-10-29 18:04:44,454 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:18121:nioserverc...@633] - EndOfStreamException: Unable to read additional data from client sessionid 0x12bf93ca6a70041, likely client has closed socket
2010-10-29 18:04:44,454 - INFO  [ProcessThread:-1:preprequestproces...@385] - Processed session termination for sessionid: 0x12bf93ca6a70041
2010-10-29 18:04:44,454 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:18121:nioserverc...@1434] - Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:62472 which had sessionid 0x12bf93ca6a70041
2010-10-29 18:04:44,457 - ERROR [SyncThread:0:nioserverc...@444] - Unexpected Exception: 
java.nio.channels.CancelledKeyException
    at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
    at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
    at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:417)
    at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1508)
    at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:367)
    at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:161)
    at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:98)

I looked through list archives and such, most other occurrences of this have to 
do with some other error (bad timeouts, etc).  In this case I'm closing the 
connection properly and the cluster is working just fine.

(3.3.1 on darwin and rhel4, btw)

--
nicholas harteau
n...@ikami.com