[jira] Created: (ZOOKEEPER-901) Redesign of QuorumCnxManager

2010-10-17 Thread Flavio Junqueira (JIRA)
Redesign of QuorumCnxManager


 Key: ZOOKEEPER-901
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-901
 Project: Zookeeper
  Issue Type: Improvement
  Components: leaderElection
Affects Versions: 3.3.1
Reporter: Flavio Junqueira
Assignee: Flavio Junqueira
 Fix For: 3.4.0


QuorumCnxManager manages TCP connections between ZooKeeper servers for leader 
election in replicated mode. We have identified over time a couple of 
deficiencies that we would like to fix. Unfortunately, fixing these issues 
requires a little more than just generating a couple of small patches. More 
specifically, I propose, based on previous discussions with the community, that 
we reimplement QuorumCnxManager so that we achieve the following:

# Establishing connections should not be a blocking operation, and perhaps even 
more important, it shouldn't prevent the establishment of connections with 
other servers;
# Using a pair of threads per connection is a little messy, and we have seen 
issues over time due to the creation and destruction of such threads. A more 
reasonable approach is to have a single thread and a selector.  
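
The two goals above can be illustrated with a minimal Java NIO sketch. It is not the actual QuorumCnxManager code or the proposed patch; the class name SelectorConnectSketch, the method connectAll and the timeout parameter are assumptions made for illustration only. It starts a non-blocking connect to every peer and lets one thread with one Selector complete them, so a slow or unreachable peer does not hold up the others.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.nio.channels.UnresolvedAddressException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch; not the actual QuorumCnxManager code.
class SelectorConnectSketch {

    /**
     * Starts a non-blocking connect to every peer, then lets one thread and one
     * Selector finish them. Returns how many connections were established
     * before timeoutMs elapsed.
     */
    static int connectAll(List<InetSocketAddress> peers, long timeoutMs) throws IOException {
        int established = 0;
        int pending = 0;
        try (Selector selector = Selector.open()) {
            for (InetSocketAddress peer : peers) {
                SocketChannel ch = SocketChannel.open();
                try {
                    ch.configureBlocking(false);
                    if (ch.connect(peer)) {
                        established++;      // completed immediately (can happen on loopback)
                        ch.close();
                    } else {
                        ch.register(selector, SelectionKey.OP_CONNECT);
                        pending++;
                    }
                } catch (IOException | UnresolvedAddressException e) {
                    ch.close();             // this peer failed at once; carry on with the rest
                }
            }
            long deadline = System.currentTimeMillis() + timeoutMs;
            while (pending > 0) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    break;
                }
                selector.select(remaining);
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    SocketChannel ch = (SocketChannel) key.channel();
                    boolean finished;
                    try {
                        finished = ch.finishConnect();
                        if (finished) {
                            established++;
                        }
                    } catch (IOException e) {
                        finished = true;    // refused or unreachable: only this peer is affected
                    }
                    if (finished) {
                        ch.close();         // a real manager would keep the channel for messages
                        pending--;
                    }
                }
            }
            for (SelectionKey key : selector.keys()) {
                key.channel().close();      // give up on connects still pending at the deadline
            }
        }
        return established;
    }

    public static void main(String[] args) throws IOException {
        List<InetSocketAddress> peers = new ArrayList<>();
        for (String arg : args) {           // each argument is host:port
            int colon = arg.lastIndexOf(':');
            int port = Integer.parseInt(arg.substring(colon + 1));
            peers.add(new InetSocketAddress(arg.substring(0, colon), port));
        }
        System.out.println("established " + connectAll(peers, 2000) + " of " + peers.size());
    }
}
```

The sketch closes each channel once the connect completes; a real connection manager would instead keep it registered for reads and writes on the same selector, which is what would remove the pair of threads per connection.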

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Testing for Failure in the Cloud: FATE and DESTINI

2010-10-17 Thread Patrick Hunt
research that produces real tools, which help developers find (and
then fix) real failure-handling bugs, including 16 new bug reports to
HDFS (7 design bugs and 9 implementation bugs). Pretty nice, given the
intricacies of failure-recovery protocols.

Has anyone heard of this? It's the first time for me. He mentions some
preliminary results with ZK, but I've yet to hear anything:

http://databeta.wordpress.com/2010/10/15/testing-for-failure-in-the-cloud-fate-and-destini/

Patrick


Re: Restarting discussion on ZooKeeper as a TLP

2010-10-17 Thread Patrick Hunt
Good to see we are in agreement on this. Thanks to everyone who voted. Looks
like this is unanimous at this point. I will start the proceedings in the
Hadoop PMC to make ZooKeeper a TLP.

Patrick

On Thu, Oct 14, 2010 at 5:37 PM, Flavio Junqueira f...@yahoo-inc.com wrote:

 +1. Frankly, I don't see concrete benefits for the community with
 ZooKeeper becoming a TLP, but perhaps it will become clear over time. Now it
 is certainly cool to have our own top-level domain:
 http://zookeeper.apache.org/ rocks!

 -Flavio

 On Oct 14, 2010, at 1:00 PM, Benjamin Reed wrote:

  +1

 ben

 On 10/14/2010 11:47 AM, Henry Robinson wrote:

 +1,

 I agree that we've addressed most outstanding concerns, we're ready for
 TLP.

 Henry

 On 14 October 2010 13:29, Mahadev Konar maha...@yahoo-inc.com wrote:

 +1 for moving to TLP.

 Thanks for starting the vote Pat.

 mahadev

 On 10/13/10 2:10 PM, Patrick Hunt ph...@apache.org wrote:

 In March of this year we discussed a request from the Apache Board, and
 Hadoop PMC, that we become a TLP rather than a subproject of Hadoop:

 Original discussion
 http://markmail.org/thread/42cobkpzlgotcbin

 I originally voted against this move, my primary concern being that we were
 not ready to move to TLP status given our small contributor base and
 limited contributor diversity. However, I'd now like to revisit that
 discussion/decision. Since that time the team has been working hard to
 attract new contributors, and we've seen significant new contributions come
 in. There has also been feedback from board/pmc addressing many of these
 concerns (both on the list and in private). I am now less concerned about
 this issue and don't see it as a blocker for us to move to TLP status.

 A second concern was that by becoming a TLP the project would lose its
 connection with Hadoop, a big source of new users for us. I've been assured
 (and you can see with the other projects that have moved to TLP status;
 pig/hive/hbase/etc...) that this connection will be maintained. The Hadoop
 ZooKeeper tab for example will redirect to our new homepage.

 Other Apache members also pointed out to me that we are essentially
 operating as a TLP within the Hadoop PMC. Most of the other PMC members have
 little or no experience with ZooKeeper and this makes it difficult for them
 to monitor and advise us. By moving to TLP status we'll be able to govern
 ourselves and better set our direction.

 I believe we are ready to become a TLP. Please respond to this email with
 your thoughts and any issues. I will call a vote in a few days, once
 discussion settles.

 Regards,

 Patrick











[jira] Commented: (ZOOKEEPER-901) Redesign of QuorumCnxManager

2010-10-17 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921921#action_12921921
 ] 

Patrick Hunt commented on ZOOKEEPER-901:


Thoughts regarding netty support? We've been adding netty support to the 
client-server connection mechanisms. My intent was to eventually modify the 
server-server connections (quorum/election) similarly. You might want to 
consider this when refactoring -- either adding directly or just making sure it 
will be easy(ier) to add netty eventually.




[jira] Issue Comment Edited: (ZOOKEEPER-901) Redesign of QuorumCnxManager

2010-10-17 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921921#action_12921921
 ] 

Patrick Hunt edited comment on ZOOKEEPER-901 at 10/17/10 7:21 PM:
--

Thoughts regarding netty support? We've been adding netty support to the
client - server connection mechanisms. My intent was to eventually modify the
server - server connections (quorum/election) similarly. You might want to
consider this when refactoring -- either adding directly or just making sure it
will be easy(ier) to add netty eventually.

  was (Author: phunt):
Thoughts regarding netty support? We've been adding netty support to the 
client-server connection mechanisms. My intent was to eventually modify the 
server-server connections (quorum/election) similarly. You might want to 
consider this when refactoring -- either adding directly or just making sure it 
will be easy(ier) to add netty eventually.
  



[jira] Commented: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed

2010-10-17 Thread Michi Mutsuzaki (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921923#action_12921923
 ] 

Michi Mutsuzaki commented on ZOOKEEPER-804:
---

+1.

 I can open a new bug and submit a patch that way if it's preferred.

No worry, it's not a big deal since this is a one line change. 

Thanks again, Jared!
--Michi

 c unit tests failing due to assertion cptr failed
 ---

 Key: ZOOKEEPER-804
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-804
 Project: Zookeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.4.0
 Environment: gcc 4.4.3, ubuntu lucid lynx, dual core laptop (intel)
Reporter: Patrick Hunt
Assignee: Michi Mutsuzaki
Priority: Critical
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-804-1.patch, ZOOKEEPER-804.patch


 I'm seeing this frequently:
  [exec] Zookeeper_simpleSystem::testPing : elapsed 18006 : OK
  [exec] Zookeeper_simpleSystem::testAcl : elapsed 1022 : OK
  [exec] Zookeeper_simpleSystem::testChroot : elapsed 3145 : OK
  [exec] Zookeeper_simpleSystem::testAuth ZooKeeper server started : elapsed 25687 : OK
  [exec] zktest-mt: /home/phunt/dev/workspace/gitzk/src/c/src/zookeeper.c:1952: zookeeper_process: Assertion `cptr' failed.
  [exec] make: *** [run-check] Aborted
  [exec] Zookeeper_simpleSystem::testHangingClient
 Mahadev, can you take a look?




Re: What's the QA strategy of ZooKeeper?

2010-10-17 Thread Patrick Hunt
Hi Vishal, thanks for the list. As you can see, when we do find issues we do
our best to address them and increase testing in that area. Unfortunately
our testing regime, while extensive, is not exhaustive. You can see the
clover coverage reports here btw:
https://hudson.apache.org/hudson/view/ZooKeeper/job/ZooKeeper-trunk/clover/

We'd love to see further contributions around testing. Thomas has opened
some discussion around code refactoring, and I'm hopeful that will increase
the coverage and enable design for test which we lack in some cases.

Patrick

On Fri, Oct 15, 2010 at 12:24 PM, Vishal K vishalm...@gmail.com wrote:

 Hi Patrick,

 On Fri, Oct 15, 2010 at 2:22 PM, Patrick Hunt ph...@apache.org wrote:

   Recently, we have run into issues in ZK that I believe should have been
   caught by some basic testing before the release

  Vishal, can you be more specific? Pointing out specific JIRAs that you
  entered would be very valuable. Don't worry about hurting our feelings or
  anything; without this type of feedback we can't address the specific
  issues and their underlying problems.

 Here's a list of a few issues:
 Leader election taking a long time to complete -
 https://issues.apache.org/jira/browse/ZOOKEEPER-822
 Last processed zxid set prematurely while establishing leadership -
 https://issues.apache.org/jira/browse/ZOOKEEPER-790
 FLE implementation should be improved to use non-blocking sockets -
 ZOOKEEPER-900
 ZK lets any node to become an observer -
 https://issues.apache.org/jira/browse/ZOOKEEPER-851

  Regards,

  Patrick

  On Fri, Oct 15, 2010 at 11:14 AM, Mahadev Konar maha...@yahoo-inc.com
  wrote:

   Well said Vishal.

   I really like the points you put forth!!!

   Agree on all the points, but again, all the points you mention require
   commitment from folks like you. It's a pretty hard task to test all the
   corner cases of ZooKeeper. I'd expect everyone to pitch in for testing a
   release. We should definitely work towards a plan. You should go ahead
   and create a jira for the QA plan. We should all pitch in with what
   should be tested.

   Thanks
   mahadev
  
   On 10/15/10 7:32 AM, Vishal K vishalm...@gmail.com wrote:

    Hi,

    I would like to add my few cents here.

    I would suggest staying away from code cleanup unless it is absolutely
    necessary.

    I would also like to extend this discussion to understand the amount of
    testing/QA to be performed before a release. How do we currently qualify
    a release?

    Recently, we have run into issues in ZK that I believe should have been
    caught by some basic testing before the release. I will be honest in
    saying that, unfortunately, these bugs have resulted in questions being
    raised by several people in our organization about our choice of using
    ZooKeeper. Nevertheless, our product group really thinks that ZK is a
    cool technology, but we need to focus on making it robust before adding
    major new features to it.

    I would suggest that we:
    1. Look at current bugs and see why existing tests did not uncover these
    bugs and improve those tests.
    2. Look at places that need more tests and broadcast it to the
    community. Follow up with test development.
    3. Have a crisp release QA strategy for each release.
    4. Improve API documentation as well as code documentation so that the
    API usage is clear and debugging is made easier.

    Comments?

    Thanks.
    -Vishal
   
    On Fri, Oct 15, 2010 at 9:44 AM, Thomas Koch tho...@koch.ro wrote:

     Hi Benjamin,

     thank you for your response. Please find some comments inline.

     Benjamin Reed:
      code quality is important, and there are things we should keep in
      mind, but in general i really don't like the idea of risking code
      breakage because of a gratuitous code cleanup. we should be watching
      out for these things when patches get submitted or when new things go
      in.

     I didn't want to say it that clearly, but especially the new Netty code,
     both on client and server side, is IMHO an example of new code in very
     bad shape. The client code patch even changes the FindBugs configuration
     to exclude the new code from the FindBugs checks.

      i think this is in line with what pat was saying. just to expand a bit.
      in my opinion clean up refactorings have the following problems:

      1) you risk breaking things in production for a potential future
      maintenance advantage.

     If your code is already in such bad shape that every change includes
     considerable risk to break something, then you already are in trouble.
     With every new feature (or bugfix!) you also risk breaking something.
     If you don't have the attitude of permanent refactoring to improve the
     code quality, you will inevitably lower the maintainability of your code
     with every new feature. New

[jira] Updated: (ZOOKEEPER-820) update c unit tests to ensure zombie java server processes don't cause failure

2010-10-17 Thread Michi Mutsuzaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michi Mutsuzaki updated ZOOKEEPER-820:
--

Attachment: ZOOKEEPER-820.patch

Uses 'which' to check whether the lsof command is present. If it is, the patch
uses lsof to see whether a process is listening on port 22181 and kills it.

--Michi

 update c unit tests to ensure zombie java server processes don't cause 
 failure
 

 Key: ZOOKEEPER-820
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-820
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.3.1
Reporter: Patrick Hunt
Assignee: Michi Mutsuzaki
Priority: Critical
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-820-1.patch, ZOOKEEPER-820.patch, 
 ZOOKEEPER-820.patch, ZOOKEEPER-820.patch


 When the c unit tests are run, sometimes the server doesn't shut down at the
 end of the test. This causes subsequent tests (hudson esp) to fail.
 1) we should try harder to make the server shut down at the end of the test;
 I suspect this is related to test failing/cleanup
 2) before the tests are run we should see if the old server is still running
 and try to shut it down




[jira] Updated: (ZOOKEEPER-820) update c unit tests to ensure zombie java server processes don't cause failure

2010-10-17 Thread Michi Mutsuzaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michi Mutsuzaki updated ZOOKEEPER-820:
--

Status: Patch Available  (was: Open)




Running a single unit test

2010-10-17 Thread Michi Mutsuzaki
Hello,

How do I run a single unit test? I tried this:

$ ant test -Dtest=SessionTest

but it still runs all the tests.

Thanks!
--Michi



[jira] Commented: (ZOOKEEPER-794) Callbacks are not invoked when the client is closed

2010-10-17 Thread Michi Mutsuzaki (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921939#action_12921939
 ] 

Michi Mutsuzaki commented on ZOOKEEPER-794:
---

ZOOKEEPER-794_5.patch.txt doesn't compile. I'm getting these errors:

[javac] branch-3.3/src/java/test/org/apache/zookeeper/test/SessionTest.java:201: cannot find symbol
[javac] symbol : variable Assert
[javac] location: class org.apache.zookeeper.test.SessionTest
[javac] Assert.fail("Should have received a SessionExpiredException");
[javac] ^
[javac] branch-3.3/src/java/test/org/apache/zookeeper/test/SessionTest.java:217: cannot find symbol
[javac] symbol : variable Assert
[javac] location: class org.apache.zookeeper.test.SessionTest
[javac] Assert.assertEquals(KeeperException.Code.SESSIONEXPIRED.toString(), cb.toString());
[javac] ^

We need to either:

a. import org.junit.Assert, or
b. Use fail/assertEquals instead of Assert.fail/Assert.assertEquals.

--Michi

 Callbacks are not invoked when the client is closed
 ---

 Key: ZOOKEEPER-794
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-794
 Project: Zookeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.3.1
Reporter: Alexis Midon
Assignee: Alexis Midon
Priority: Blocker
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-794.patch.txt, ZOOKEEPER-794.txt, 
 ZOOKEEPER-794_2.patch, ZOOKEEPER-794_3.patch, ZOOKEEPER-794_4.patch.txt, 
 ZOOKEEPER-794_5.patch.txt


 I noticed that ZooKeeper behaves differently when synchronous and
 asynchronous actions are called on a closed ZooKeeper client.
 A synchronous call will throw a session expired exception while an
 asynchronous call will do nothing: no exception, no callback invocation.
 Even if the EventThread receives the Packet with the session
 expired error code, the packet is never processed since the thread has been
 killed by the eventOfDeath. So the callback is not invoked.
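
The behaviour described above can be modelled with a small self-contained Java sketch. It is a simplified illustration, not ZooKeeper's client code: the class EventOfDeathSketch, the method runEventLoop and the string events are hypothetical stand-ins. An event loop that stops at a death marker never looks at anything queued after the marker, which is why a late packet carrying the session expired code would not reach its callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified model of the reported behaviour; names are hypothetical and
// this is not ZooKeeper's actual client code.
class EventOfDeathSketch {

    // Marker object that tells the event loop to stop, standing in for eventOfDeath.
    static final Object EVENT_OF_DEATH = new Object();

    /** Processes queued events in order and stops at the marker; returns what was processed. */
    static List<Object> runEventLoop(BlockingQueue<Object> queue) throws InterruptedException {
        List<Object> processed = new ArrayList<>();
        while (true) {
            Object event = queue.take();
            if (event == EVENT_OF_DEATH) {
                break;                  // loop exits; anything queued later is stranded
            }
            processed.add(event);       // a real loop would invoke the callback here
        }
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
        queue.add("callback for request 1");
        queue.add(EVENT_OF_DEATH);                      // client is closed
        queue.add("packet with session expired code");  // arrives after close
        System.out.println("processed: " + runEventLoop(queue));
        System.out.println("never processed: " + queue);
    }
}
```

One possible remedy, offered only as an assumption about the design space rather than a description of the attached patches, is to drain whatever remains in the queue after the marker and complete those callbacks with the session expired code.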




Re: Running a single unit test

2010-10-17 Thread Henry Robinson
You need to use -Dtestcase, not -Dtest, as per below:

ant test -Dtestcase=YourTestHere

HTH,

Henry

On 17 October 2010 17:34, Michi Mutsuzaki mic...@yahoo-inc.com wrote:

 Hello,

 How do I run a single unit test? I tried this:

 $ ant test -Dtest=SessionTest

 but it still runs all the tests.

 Thanks!
 --Michi




-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679