[jira] Commented: (ZOOKEEPER-912) ZooKeeper client logs trace and debug messages at level INFO

2010-10-28 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925719#action_12925719
 ] 

Patrick Hunt commented on ZOOKEEPER-912:


bq. Zookeeper logs nearly everything at least at level info, regardless of 
severity.

that's incorrect. I did a quick grep for logging in the main src and see the 
following:

egrep -R LOG\.error src/java/main/. |wc -l
78
egrep -R LOG\.warn src/java/main/. |wc -l
175
egrep -R LOG\.info src/java/main/. |wc -l
127
egrep -R LOG\.debug src/java/main/. |wc -l
114
egrep -R LOG\.trace src/java/main/. |wc -l
28

So actually we log mostly at WARN severity. Perhaps you think this because you 
mainly see INFO messages, but that's to be expected (typically things work, we 
only log WARN/ERROR when bad things happen).

I didn't say anything about all/nothing. Check the code, we have a number of 
messages at various levels, incl trace/debug. If you don't want to see the 
informational messages for a particular class you can configure that.
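For example, per-class levels can be set in log4j.properties (a hedged sketch; the exact logger names and appender setup depend on your installation, and ClientCnxn is used here purely as an illustration):

```properties
# Keep the root logger at INFO for everything else...
log4j.rootLogger=INFO, CONSOLE
# ...but suppress informational messages from one specific class.
log4j.logger.org.apache.zookeeper.ClientCnxn=WARN
```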

As I pointed out earlier:
http://hadoop.apache.org/zookeeper/docs/current/zookeeperInternals.html#sc_logging

We consider both the messages you listed to be informational given that we 
expect/recover from the second.


 ZooKeeper client logs trace and debug messages at level INFO
 

 Key: ZOOKEEPER-912
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-912
 Project: Zookeeper
  Issue Type: Improvement
  Components: java client
Affects Versions: 3.3.1
Reporter: Anthony Urso
Assignee: Anthony Urso
Priority: Minor
 Fix For: 3.4.0

 Attachments: zk-loglevel.patch


 ZK logs a lot of uninformative trace and debug messages to level INFO.  This 
 fuzzes up everything and makes it easy to miss useful log info. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-914) QuorumCnxManager blocks forever

2010-10-28 Thread Flavio Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Junqueira updated ZOOKEEPER-914:
---

Component/s: (was: server)
 (was: quorum)
 leaderElection

 QuorumCnxManager blocks forever 
 

 Key: ZOOKEEPER-914
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
 Fix For: 3.3.3, 3.4.0


 This was a disaster. While testing our application we ran into a scenario 
 where a rebooted follower could not join the cluster. Further debugging 
 showed that the follower could not join because the QuorumCnxManager on the 
 leader was blocked for an indefinite amount of time in receiveConnection():
 Thread-3 prio=10 tid=0x7fa920005800 nid=0x11bb runnable 
 [0x7fa9275ed000]
java.lang.Thread.State: RUNNABLE
 at sun.nio.ch.FileDispatcher.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
 at sun.nio.ch.IOUtil.read(IOUtil.java:206)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
 - locked 0x7fa93315f988 (a java.lang.Object)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
 at 
 org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
 I had pointed out this bug along with several other problems in 
 QuorumCnxManager earlier in 
 https://issues.apache.org/jira/browse/ZOOKEEPER-900 and 
 https://issues.apache.org/jira/browse/ZOOKEEPER-822.
 I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix 
 and a patch will be out soon. 
 The problem is that QuorumCnxManager is using SocketChannel in blocking mode. 
 It does a read() in receiveConnection() and a write() in initiateConnection().
 Sorry, but this is really bad programming. It also points to a lack of 
 failure tests for QuorumCnxManager.




[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever

2010-10-28 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925746#action_12925746
 ] 

Flavio Junqueira commented on ZOOKEEPER-914:


Like Pat, I would also appreciate some more constructive comments (and behavior). 

From the Clover reports, we exercise a significant part of the QCM code, but 
it is true that we don't test the cases you have been exposing. Here 
is a way I believe we can reproduce this problem (I haven't implemented it, 
but it seems to make sense). The high-level idea is to make sure that if some 
server stops responding before it completes the handshake protocol, then no 
instance of QCM across all servers will block and prevent other servers from 
joining the ensemble.

Suppose we configure an ensemble with 5 servers using QuorumBase. One of the 
servers will be a simple mock server, as we do in the CnxManagerTest tests. Now 
here is the sequence of steps to follow:

# Start three of the servers and confirm that they accept and execute 
operations;
# Start mock server and execute the protocol partially. For the read case you 
mention, you can simply not send the server identifier. That will cause the 
read on the other end to block and stop accepting further connections;
# Start a 5th server and check if it is able to join the ensemble.

A simple fix to get it working for you soon, along the lines of what we have 
done to make the connection timeout configurable, seems to be to set SO_TIMEOUT. 
But if you have other ideas, please lay them out. Please bear in mind that we 
should leave the major modifications for ZOOKEEPER-901, because those will take 
more time to develop and get into shape.
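The SO_TIMEOUT suggestion can be sketched with plain JDK sockets. This is an illustration only: the class and method names below are hypothetical, not the real QuorumCnxManager API. It simulates a peer that connects but never sends its server id, and shows how a read timeout bounds the wait that would otherwise hang forever.

```java
// Standalone sketch (plain sockets, not the real QCM classes) of the failure
// mode and the suggested mitigation: a peer completes the TCP connect but
// never writes its server id, so an unbounded blocking read would hang;
// setting SO_TIMEOUT bounds the wait instead.
import java.io.DataInputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {
    public static String receiveConnection() throws Exception {
        try (ServerSocket listener = new ServerSocket(0);
             // A "peer" that connects but never sends anything.
             Socket silentPeer = new Socket("127.0.0.1", listener.getLocalPort());
             Socket accepted = listener.accept()) {
            accepted.setSoTimeout(500); // without this, readLong() blocks indefinitely
            try {
                long sid = new DataInputStream(accepted.getInputStream()).readLong();
                return "got sid " + sid;
            } catch (SocketTimeoutException e) {
                return "timed out";
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(receiveConnection());
    }
}
```

Without the setSoTimeout() call, the blocking read never returns and the accepting thread can serve no other peers, which is the hang shown in the stack trace in the issue description.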





[jira] Commented: (ZOOKEEPER-885) Zookeeper drops connections under moderate IO load

2010-10-28 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925754#action_12925754
 ] 

Flavio Junqueira commented on ZOOKEEPER-885:


Sure, let's discuss over e-mail and we can post here later our findings. 

 Zookeeper drops connections under moderate IO load
 --

 Key: ZOOKEEPER-885
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.2.2, 3.3.1
 Environment: Debian (Lenny)
 1Gb RAM
 swap disabled
 100Mb heap for zookeeper
Reporter: Alexandre Hardy
Priority: Critical
 Attachments: benchmark.csv, tracezklogs.tar.gz, tracezklogs.tar.gz, 
 WatcherTest.java, zklogs.tar.gz


 A zookeeper server under minimum load, with a number of clients watching 
 exactly one node will fail to maintain the connection when the machine is 
 subjected to moderate IO load.
 In a specific test example we had three zookeeper servers running on 
 dedicated machines with 45 clients connected, watching exactly one node. The 
 clients would disconnect after moderate load was added to each of the 
 zookeeper servers with the command:
 {noformat}
 dd if=/dev/urandom of=/dev/mapper/nimbula-test
 {noformat}
 The {{dd}} command transferred data at a rate of about 4Mb/s.
 The same thing happens with
 {noformat}
 dd if=/dev/zero of=/dev/mapper/nimbula-test
 {noformat}
 It seems strange that such a moderate load should cause instability in the 
 connection.
 Very few other processes were running, the machines were setup to test the 
 connection instability we have experienced. Clients performed no other read 
 or mutation operations.
 Although the documents state that minimal competing IO load should present on 
 the zookeeper server, it seems reasonable that moderate IO should not cause 
 problems in this case.




[jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever

2010-10-28 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925843#action_12925843
 ] 

Patrick Hunt commented on ZOOKEEPER-914:


Flavio, in item 2 you mention a mock; consider using Mockito. I've had a lot of 
luck with it personally, and Hadoop itself has moved to using it in its 
tests. http://code.google.com/p/mockito/





[jira] Commented: (ZOOKEEPER-912) ZooKeeper client logs trace and debug messages at level INFO

2010-10-28 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925845#action_12925845
 ] 

Patrick Hunt commented on ZOOKEEPER-912:


Hi Anthony, I realized in the shower this morning: by Zookeeper did you mean 
ZooKeeper.java? My bad.

I looked at this class again and it does have logging at levels other than 
info. Really it should have trace-level logs for each of the API entry points. 
I'm concerned about pushing down the info-level logs you highlighted, though, due 
to a couple of factors: 1) in our experience those msgs are very useful to 
understand the runtime state of the client, 2) many users don't run in 
production at trace (and some don't want to run in debug).

What's your rule of thumb for what should be logged at the various levels?





[jira] Commented: (ZOOKEEPER-897) C Client seg faults during close

2010-10-28 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925846#action_12925846
 ] 

Patrick Hunt commented on ZOOKEEPER-897:


Perhaps we should rely on existing testing for this one, but enter a new jira 
to refactor the client, specifically to allow testing? (i.e. a way to inject the 
helper code without needing to edit zookeeper.c directly)

 C Client seg faults during close
 

 Key: ZOOKEEPER-897
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-897
 Project: Zookeeper
  Issue Type: Bug
  Components: c client
Reporter: Jared Cantwell
Assignee: Jared Cantwell
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEEPER-897.diff, ZOOKEEPER-897.patch


 We observed a crash while closing our c client.  It was in the do_io() thread 
 that was still processing during the close() call.
 #0  queue_buffer (list=0x6bd4f8, b=0x0, add_to_front=0) at src/zookeeper.c:969
 #1  0x0046234e in check_events (zh=0x6bd480, events=<value optimized 
 out>) at src/zookeeper.c:1687
 #2  0x00462d74 in zookeeper_process (zh=0x6bd480, events=2) at 
 src/zookeeper.c:1971
 #3  0x00469c34 in do_io (v=0x6bd480) at src/mt_adaptor.c:311
 #4  0x77bc59ca in start_thread () from /lib/libpthread.so.0
 #5  0x76f706fd in clone () from /lib/libc.so.6
 #6  0x in ?? ()
 We tracked down the sequence of events, and the cause is that input_buffer is 
 being freed from a thread other than the do_io thread that relies on it:
 1. do_io() calls check_events()
 2. if (events & ZOOKEEPER_READ) branch executes
 3. if (rc > 0) branch executes
 4. if (zh->input_buffer != zh->primer_buffer) branch executes
 .in the meantime..
  5. zookeeper_close() called
  6. if (inc_ref_counter(zh,0)!=0) branch executes
  7. cleanup_bufs() is called
  8. input_buffer is freed at the end
 . back to check_events().
 9. queue_buffer() is called on a NULL buffer.
 I believe the patch is to only call free_completions() in zookeeper_close() 
 and not cleanup_bufs().  The original reason cleanup_bufs() was added was to 
 call any outstanding synchronous completions, so only free_completions() (which 
 is guarded) is needed.  I will submit a patch for review with this change.




[jira] Commented: (ZOOKEEPER-897) C Client seg faults during close

2010-10-28 Thread Mahadev konar (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925858#action_12925858
 ] 

Mahadev konar commented on ZOOKEEPER-897:
-

jared, pat,
 I am ok without a test case for this one, because it's quite hard to create 
one. I just wanted someone else to run the tests on their machines just to 
verify (since I rarely see any problems in c tests on my machine). I will go 
ahead and commit this patch for now.






[jira] Updated: (ZOOKEEPER-897) C Client seg faults during close

2010-10-28 Thread Mahadev konar (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar updated ZOOKEEPER-897:


  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

I just committed this. thanks jared!





[jira] Updated: (ZOOKEEPER-702) GSoC 2010: Failure Detector Model

2010-10-28 Thread Flavio Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Junqueira updated ZOOKEEPER-702:
---

Status: Open  (was: Patch Available)

Hi Abmar, thanks for the addition to the patch. I was wondering if it is really 
a good idea to have both options, normal and exponential, implemented. Since 
your experiments have shown that exponential performs better, why not use it 
alone? Also, I was wondering if you have posted experimental numbers showing 
that exponential performs better. 

In case we go with exponential only, we don't need the modification to 
ivy.xml, right?

And one last comment: it doesn't look like the classes implementing 
PhiTimeoutEvaluator need to be public. Is that right?

 GSoC 2010: Failure Detector Model
 -

 Key: ZOOKEEPER-702
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-702
 Project: Zookeeper
  Issue Type: Wish
Reporter: Henry Robinson
Assignee: Abmar Barros
 Fix For: 3.4.0

 Attachments: bertier-pseudo.txt, bertier-pseudo.txt, chen-pseudo.txt, 
 chen-pseudo.txt, phiaccrual-pseudo.txt, phiaccrual-pseudo.txt, 
 ZOOKEEPER-702-code.patch, ZOOKEEPER-702-doc.patch, ZOOKEEPER-702.patch, 
 ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
 ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
 ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
 ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
 ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch


 Failure Detector Module
 Possible Mentor
 Henry Robinson (henry at apache dot org)
 Requirements
 Java, some distributed systems knowledge, comfort implementing distributed 
 systems protocols
 Description
 ZooKeeper servers detect the failure of other servers and clients by 
 counting the number of 'ticks' for which they don't get a heartbeat from 
 other machines. This is the 'timeout' method of failure detection and works 
 very well; however it is possible that it is too aggressive and not easily 
 tuned for some more unusual ZooKeeper installations (such as in a wide-area 
 network, or even in a mobile ad-hoc network).
 This project would abstract the notion of failure detection to a dedicated 
 Java module, and implement several failure detectors to compare and contrast 
 their appropriateness for ZooKeeper. For example, Apache Cassandra uses a 
 phi-accrual failure detector (http://ddsg.jaist.ac.jp/pub/HDY+04.pdf) which 
 is much more tunable and has some very interesting properties. This is a 
 great project if you are interested in distributed algorithms, or want to 
 help re-factor some of ZooKeeper's internal code.
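The phi-accrual idea from the linked paper can be sketched as follows. This is a deliberately simplified model that assumes exponentially distributed inter-heartbeat times with a known mean; the detector in the paper estimates the distribution from a sliding window of observed intervals, and the class and method names here are illustrative only.

```java
// Simplified phi-accrual suspicion level. Under an exponential model with
// mean inter-heartbeat interval m, the probability that a live process has
// not yet sent a heartbeat after t ms is exp(-t/m), so
//   phi = -log10(exp(-t/m)) = (t/m) * log10(e),
// and suspicion grows smoothly with silence instead of flipping at a fixed
// timeout threshold.
public class PhiAccrualSketch {
    public static double phi(double msSinceLastHeartbeat, double meanIntervalMs) {
        return (msSinceLastHeartbeat / meanIntervalMs) * Math.log10(Math.E);
    }

    public static void main(String[] args) {
        // After two mean intervals of silence phi is about 0.87; after ten,
        // about 4.3, i.e. roughly one chance in 10^4.3 the process is alive.
        System.out.println(phi(2_000, 1_000));
        System.out.println(phi(10_000, 1_000));
    }
}
```

An application then picks a phi threshold (rather than a tick count) to trade false positives against detection time, which is what makes the detector tunable for unusual deployments.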




[jira] Commented: (ZOOKEEPER-898) C Client might not cleanup correctly during close

2010-10-28 Thread Mahadev konar (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925909#action_12925909
 ] 

Mahadev konar commented on ZOOKEEPER-898:
-

+1, good catch jared. I just committed this to 3.3 and trunk.

thanks!

 C Client might not cleanup correctly during close
 -

 Key: ZOOKEEPER-898
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-898
 Project: Zookeeper
  Issue Type: Bug
  Components: c client
Reporter: Jared Cantwell
Assignee: Jared Cantwell
Priority: Trivial
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEEPER-898.diff, ZOOKEEPER-898.patch


 I was looking through the c-client code and noticed a situation where a 
 counter can be incorrectly incremented and a small memory leak can occur.
 In zookeeper.c : add_completion(), if close_requested is true, then the 
 completion will not be queued.  But at the end, outstanding_sync is still 
 incremented and free() is never called on the newly allocated completion_list_t. 
  
 I will submit for review a diff that I believe corrects this issue.




[jira] Updated: (ZOOKEEPER-898) C Client might not cleanup correctly during close

2010-10-28 Thread Mahadev konar (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar updated ZOOKEEPER-898:


  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)





[jira] Updated: (ZOOKEEPER-517) NIO factory fails to close connections when the number of file handles run out.

2010-10-28 Thread Mahadev konar (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar updated ZOOKEEPER-517:


Fix Version/s: (was: 3.3.2)
   3.3.3

moving this out to 3.3.3 and 3.4 for investigation.

 NIO factory fails to close connections when the number of file handles run 
 out.
 ---

 Key: ZOOKEEPER-517
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-517
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Reporter: Mahadev konar
Assignee: Benjamin Reed
Priority: Critical
 Fix For: 3.3.3, 3.4.0


 The code in NIO factory is such that if we fail to accept a connection for 
 some reason (too many file handles may be one of them) we do not close the 
 connections that are in CLOSE_WAIT. We need to call an explicit close on 
 these sockets. One of the solutions might be to move doIO 
 before accept so that we can still close connections even if we cannot accept 
 new ones.




[jira] Commented: (ZOOKEEPER-667) java client doesn't allow ipv6 numeric connect string

2010-10-28 Thread Gunnar Wagenknecht (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925923#action_12925923
 ] 

Gunnar Wagenknecht commented on ZOOKEEPER-667:
--

I'm wondering if there is something else to do for proper connecting on IPv6 
systems. I'm on Windows 7 with Java 6 and I do have the following exception in 
my logs.

When connecting I specified {{localhost:2181}}.
{noformat}
21:20:04.157 [Worker-0-SendThread()] INFO  org.apache.zookeeper.ClientCnxn - 
Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181
21:20:04.188 [Worker-0-SendThread(localhost:2181)] WARN  
org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected 
error, closing socket connection and attempting reconnect
java.net.SocketException: Address family not supported by protocol family: 
connect
at sun.nio.ch.Net.connect(Native Method) ~[na:1.6.0_21]
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507) 
~[na:1.6.0_21]
at 
org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1009) 
~[na:na]
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1036) 
~[na:na]
{noformat}


 java client doesn't allow ipv6 numeric connect string
 -

 Key: ZOOKEEPER-667
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-667
 Project: Zookeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.2.2
Reporter: Patrick Hunt
Assignee: Patrick Hunt
Priority: Critical
 Fix For: 3.3.0


 The java client doesn't handle ipv6 numeric addresses as they are colon (:) 
 delmited. After splitting the host/port on : we look for the port as the 
 second entry in the array rather than the last entry in the array.
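The fix described above amounts to splitting on the last colon rather than taking the second field. A minimal sketch under that assumption (illustrative names, unbracketed IPv6 literals; the actual client code may differ):

```java
// Split "host:port" on the LAST colon so bare IPv6 numeric addresses,
// whose host part itself contains colons, still yield the right port.
public class HostPortSketch {
    public static String[] split(String hostPort) {
        int i = hostPort.lastIndexOf(':');
        if (i < 0) throw new IllegalArgumentException("no port in: " + hostPort);
        return new String[] { hostPort.substring(0, i), hostPort.substring(i + 1) };
    }

    public static void main(String[] args) {
        String[] v4 = split("127.0.0.1:2181");       // host "127.0.0.1", port "2181"
        String[] v6 = split("0:0:0:0:0:0:0:1:2181"); // host "0:0:0:0:0:0:0:1", port "2181"
        System.out.println(v4[0] + " " + v4[1] + " | " + v6[0] + " " + v6[1]);
    }
}
```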




[jira] Commented: (ZOOKEEPER-851) ZK lets any node to become an observer

2010-10-28 Thread Henry Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925927#action_12925927
 ] 

Henry Robinson commented on ZOOKEEPER-851:
--

Hi Vishal - 

Sorry for the slow turnaround on this one. It doesn't surprise me that this is 
the behaviour, although it's slightly unexpected that the node becomes an 
observer rather than a follower. What evidence do you have for that? (Given 
the "Mode: follower" output - I haven't checked the code in a while, but I 
would have thought it would print "Mode: observer".)

Henry

 ZK lets any node to become an observer
 --

 Key: ZOOKEEPER-851
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-851
 Project: Zookeeper
  Issue Type: Bug
  Components: quorum, server
Affects Versions: 3.3.1
Reporter: Vishal K
Priority: Critical
 Fix For: 3.4.0


 I had a 3 node cluster running. The zoo.cfg on each contained 3 entries as 
 show below:
 tickTime=2000
 dataDir=/var/zookeeper
 clientPort=2181
 initLimit=5
 syncLimit=2
 server.0=10.150.27.61:2888:3888
 server.1=10.150.27.62:2888:3888
 server.2=10.150.27.63:2888:3888
 I wanted to add another node to the cluster. In fourth node's zoo.cfg, I 
 created another entry for that node and started zk server. The zoo.cfg on the 
 first 3 nodes was left unchanged. The fourth node was able to join the 
 cluster even though the 3 nodes had no idea about the fourth node.
 zoo.cfg on fourth node:
 tickTime=2000
 dataDir=/var/zookeeper
 clientPort=2181
 initLimit=5
 syncLimit=2
 server.0=10.150.27.61:2888:3888
 server.1=10.150.27.62:2888:3888
 server.2=10.150.27.63:2888:3888
 server.3=10.17.117.71:2888:3888
 It looks like 10.17.117.71 is becoming an observer in this case. I was 
 expecting that the leader would reject 10.17.117.71.
 # telnet 10.17.117.71 2181
 Trying 10.17.117.71...
 Connected to 10.17.117.71.
 Escape character is '^]'.
 stat
 Zookeeper version: 3.3.0--1, built on 04/02/2010 22:40 GMT
 Clients:
  /10.17.117.71:37297[1](queued=0,recved=1,sent=0)
 Latency min/avg/max: 0/0/0
 Received: 3
 Sent: 2
 Outstanding: 0
 Zxid: 0x20065
 Mode: follower
 Node count: 288




[jira] Commented: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages

2010-10-28 Thread Benjamin Reed (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12925976#action_12925976
 ] 

Benjamin Reed commented on ZOOKEEPER-907:
-

Ah, I see the problem. There are actually two problems: 1) when sync() gets an 
error, it is not propagated back to the caller; 2) this problem.

The trouble is that 1) is preventing us from writing a test case. We need to 
fix 1), and then we can write the test for 2).

 Spurious KeeperErrorCode = Session moved messages
 ---

 Key: ZOOKEEPER-907
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-907
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.3.1
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-907.patch, ZOOKEEPER-907.patch_v2


 The sync request does not set the session owner in Request.
 As a result, the leader keeps printing:
 2010-07-01 10:55:36,733 - INFO  [ProcessThread:-1:preprequestproces...@405] - 
 Got user-level KeeperException when processing sessionid:0x298d3b1fa9 
 type:sync: cxid:0x6 zxid:0xfffe txntype:unknown reqpath:/ Error 
 Path:null Error:KeeperErrorCode = Session moved




[jira] Created: (ZOOKEEPER-915) Errors that happen during sync() processing at the leader do not get propagated back to the client.

2010-10-28 Thread Benjamin Reed (JIRA)
Errors that happen during sync() processing at the leader do not get propagated 
back to the client.
---

 Key: ZOOKEEPER-915
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-915
 Project: Zookeeper
  Issue Type: Bug
Reporter: Benjamin Reed


If an error in sync() processing happens at the leader (SESSION_MOVED for 
example), they are not propagated back to the client.




[jira] Commented: (ZOOKEEPER-851) ZK lets any node to become an observer

2010-10-28 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12925997#action_12925997
 ] 

Vishal K commented on ZOOKEEPER-851:


I have not used observers yet. Looks like I then need to verify whether this 
follower can become a leader. I think I tried this before; if I remember 
correctly, the 4th node did not become the leader. But that was a while back. 
I will try it again and update the jira.

 ZK lets any node to become an observer
 --

 Key: ZOOKEEPER-851
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-851
 Project: Zookeeper
  Issue Type: Bug
  Components: quorum, server
Affects Versions: 3.3.1
Reporter: Vishal K
Priority: Critical
 Fix For: 3.4.0


 I had a 3 node cluster running. The zoo.cfg on each contained 3 server 
 entries, as shown below:
 tickTime=2000
 dataDir=/var/zookeeper
 clientPort=2181
 initLimit=5
 syncLimit=2
 server.0=10.150.27.61:2888:3888
 server.1=10.150.27.62:2888:3888
 server.2=10.150.27.63:2888:3888
 I wanted to add a fourth node to the cluster. In the fourth node's zoo.cfg, I 
 created an entry for that node and started the zk server. The zoo.cfg on the 
 first 3 nodes was left unchanged. The fourth node was able to join the 
 cluster even though the first 3 nodes had no idea about it.
 zoo.cfg on fourth node:
 tickTime=2000
 dataDir=/var/zookeeper
 clientPort=2181
 initLimit=5
 syncLimit=2
 server.0=10.150.27.61:2888:3888
 server.1=10.150.27.62:2888:3888
 server.2=10.150.27.63:2888:3888
 server.3=10.17.117.71:2888:3888
 It looks like 10.17.117.71 is becoming an observer in this case. I was 
 expecting that the leader would reject 10.17.117.71.
 # telnet 10.17.117.71 2181
 Trying 10.17.117.71...
 Connected to 10.17.117.71.
 Escape character is '^]'.
 stat
 Zookeeper version: 3.3.0--1, built on 04/02/2010 22:40 GMT
 Clients:
  /10.17.117.71:37297[1](queued=0,recved=1,sent=0)
 Latency min/avg/max: 0/0/0
 Received: 3
 Sent: 2
 Outstanding: 0
 Zxid: 0x20065
 Mode: follower
 Node count: 288




[jira] Updated: (ZOOKEEPER-805) four letter words fail with latest ubuntu nc.openbsd

2010-10-28 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-805:
---

Fix Version/s: (was: 3.3.2)
   3.3.3

Not a blocker, pushing to 3.3.3/3.4.0

 four letter words fail with latest ubuntu nc.openbsd
 

 Key: ZOOKEEPER-805
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-805
 Project: Zookeeper
  Issue Type: Bug
  Components: documentation, server
Affects Versions: 3.3.1, 3.4.0
Reporter: Patrick Hunt
Priority: Critical
 Fix For: 3.3.3, 3.4.0


 In both the 3.3 branch and trunk, echo stat | nc localhost 2181 fails against 
 the ZK server on Ubuntu Lucid Lynx.
 I noticed this after upgrading to lucid lynx - which is now shipping openbsd 
 nc as the default:
 OpenBSD netcat (Debian patchlevel 1.89-3ubuntu2)
 vs nc traditional
 [v1.10-38]
 which works fine. Not sure if this is a bug in us or nc.openbsd, but it's 
 currently not working for me. Ugh.




[jira] Updated: (ZOOKEEPER-815) fill in TBDs in overview doc

2010-10-28 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-815:
---

Fix Version/s: (was: 3.3.2)
   3.3.3

Not a blocker, pushing to 3.3.3/3.4.0.

 fill in TBDs in overview doc
 --

 Key: ZOOKEEPER-815
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-815
 Project: Zookeeper
  Issue Type: Bug
  Components: documentation
Affects Versions: 3.3.1
Reporter: Patrick Hunt
Priority: Minor
 Fix For: 3.3.3, 3.4.0


 Funny: Ephemeral nodes are useful when you want to implement [tbd]. There 
 are a few others in that doc that should really be fixed.




[jira] Commented: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages

2010-10-28 Thread Mahadev konar (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12926004#action_12926004
 ] 

Mahadev konar commented on ZOOKEEPER-907:
-

ben, should ZOOKEEPER-915 also be marked for 3.3.2 then? 

 Spurious KeeperErrorCode = Session moved messages
 ---

 Key: ZOOKEEPER-907
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-907
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.3.1
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-907.patch, ZOOKEEPER-907.patch_v2


 The sync request does not set the session owner in Request.
 As a result, the leader keeps printing:
 2010-07-01 10:55:36,733 - INFO  [ProcessThread:-1:preprequestproces...@405] - 
 Got user-level KeeperException when processing sessionid:0x298d3b1fa9 
 type:sync: cxid:0x6 zxid:0xfffe txntype:unknown reqpath:/ Error 
 Path:null Error:KeeperErrorCode = Session moved




[jira] Commented: (ZOOKEEPER-898) C Client might not cleanup correctly during close

2010-10-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12926008#action_12926008
 ] 

Hudson commented on ZOOKEEPER-898:
--

Integrated in ZooKeeper-trunk #983 (See 
[https://hudson.apache.org/hudson/job/ZooKeeper-trunk/983/])
ZOOKEEPER-898. C Client might not cleanup correctly during close (jared 
cantwell via mahadev)


 C Client might not cleanup correctly during close
 -

 Key: ZOOKEEPER-898
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-898
 Project: Zookeeper
  Issue Type: Bug
  Components: c client
Reporter: Jared Cantwell
Assignee: Jared Cantwell
Priority: Trivial
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEEPER-898.diff, ZOOKEEPER-898.patch


 I was looking through the c-client code and noticed a situation where a 
 counter can be incorrectly incremented and a small memory leak can occur.
 In zookeeper.c : add_completion(), if close_requested is true, then the 
 completion will not be queued.  But at the end, outstanding_sync is still 
 incremented and free() is never called on the newly allocated 
 completion_list_t.
 I will submit for review a diff that I believe corrects this issue.
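
 The leak described above can be sketched in isolation. A minimal model, with 
 reduced stand-in types (the names and fields below are illustrative, not the 
 actual structures in zookeeper.c):

```c
#include <stdlib.h>

/* Reduced stand-ins for the real completion_list_t / zhandle_t in
 * the C client; names and fields here are illustrative only. */
typedef struct completion_list {
    int op;                        /* stand-in payload                    */
    struct completion_list *next;
} completion_list_t;

typedef struct zh_model {
    int close_requested;           /* set once the handle starts closing  */
    int outstanding_sync;          /* count of completions actually queued */
    completion_list_t *head;       /* pending completion queue            */
} zh_model;

/* Returns 0 if the completion was queued, -1 otherwise.  The reported
 * bug: the original code bumped outstanding_sync and leaked the node
 * even when close_requested was set; the fix is to free the node and
 * skip the counter increment on that path. */
int add_completion(zh_model *zh, int op)
{
    completion_list_t *c = calloc(1, sizeof(*c));
    if (c == NULL)
        return -1;
    c->op = op;
    if (zh->close_requested) {
        free(c);                   /* fix: don't leak the unqueued node   */
        return -1;                 /* fix: don't touch outstanding_sync   */
    }
    c->next = zh->head;
    zh->head = c;
    zh->outstanding_sync++;
    return 0;
}
```

 With the fix, a completion submitted after close is requested leaves the 
 counter unchanged and frees its allocation instead of leaking it.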




[jira] Commented: (ZOOKEEPER-851) ZK lets any node to become an observer

2010-10-28 Thread Henry Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12926016#action_12926016
 ] 

Henry Robinson commented on ZOOKEEPER-851:
--

I think what happens is that the leader happily lets the new follower connect, 
but it won't be part of any voting procedure. It shouldn't become the leader 
because no other nodes know about it to propose or support a vote for it. 

To add a new node, you'll need to incrementally restart every node in your 
cluster with the new config.
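
A minimal sketch of that incremental reconfiguration, assuming shell access to 
each node; conf_files, the config paths, and the restart command are 
deployment-specific assumptions, not fixed locations:

```shell
# For each node in turn: append the new server line to its zoo.cfg
# (idempotently), then restart that node before touching the next one.
# conf_files lists each node's zoo.cfg path; set it for your deployment.
new_line="server.3=10.17.117.71:2888:3888"
for cfg in $conf_files; do
    grep -qx "$new_line" "$cfg" || echo "$new_line" >> "$cfg"
    # restart this node now and wait for it to rejoin the quorum, e.g.:
    # bin/zkServer.sh restart   (path depends on your installation)
done
```

Restarting one node at a time keeps a quorum available throughout, and once 
every node has the new entry the fourth server is a real voting member rather 
than an unknown party the leader merely tolerates.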

 ZK lets any node to become an observer
 --

 Key: ZOOKEEPER-851
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-851
 Project: Zookeeper
  Issue Type: Bug
  Components: quorum, server
Affects Versions: 3.3.1
Reporter: Vishal K
Priority: Critical
 Fix For: 3.4.0


 I had a 3 node cluster running. The zoo.cfg on each contained 3 server 
 entries, as shown below:
 tickTime=2000
 dataDir=/var/zookeeper
 clientPort=2181
 initLimit=5
 syncLimit=2
 server.0=10.150.27.61:2888:3888
 server.1=10.150.27.62:2888:3888
 server.2=10.150.27.63:2888:3888
 I wanted to add a fourth node to the cluster. In the fourth node's zoo.cfg, I 
 created an entry for that node and started the zk server. The zoo.cfg on the 
 first 3 nodes was left unchanged. The fourth node was able to join the 
 cluster even though the first 3 nodes had no idea about it.
 zoo.cfg on fourth node:
 tickTime=2000
 dataDir=/var/zookeeper
 clientPort=2181
 initLimit=5
 syncLimit=2
 server.0=10.150.27.61:2888:3888
 server.1=10.150.27.62:2888:3888
 server.2=10.150.27.63:2888:3888
 server.3=10.17.117.71:2888:3888
 It looks like 10.17.117.71 is becoming an observer in this case. I was 
 expecting that the leader would reject 10.17.117.71.
 # telnet 10.17.117.71 2181
 Trying 10.17.117.71...
 Connected to 10.17.117.71.
 Escape character is '^]'.
 stat
 Zookeeper version: 3.3.0--1, built on 04/02/2010 22:40 GMT
 Clients:
  /10.17.117.71:37297[1](queued=0,recved=1,sent=0)
 Latency min/avg/max: 0/0/0
 Received: 3
 Sent: 2
 Outstanding: 0
 Zxid: 0x20065
 Mode: follower
 Node count: 288
