[jira] Updated: (ZOOKEEPER-877) zkpython does not work with python3.1

2010-09-28 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-877:
---

Status: Patch Available  (was: Open)

> zkpython does not work with python3.1
> -
>
> Key: ZOOKEEPER-877
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-877
> Project: Zookeeper
>  Issue Type: Bug
>  Components: contrib-bindings
>Affects Versions: 3.3.1
> Environment: linux+python3.1
>Reporter: TuxRacer
>Assignee: TuxRacer
> Fix For: 3.4.0
>
> Attachments: tests_py3k.tgz, zookeeper.c, zookeeper.c.patch.v1
>
>
> as written in the contrib/zkpython/README file:
> "Python >= 2.6 is required. We have tested against 2.6. We have not tested 
> against 3.x."
> this is probably more a 'new feature' request than a bug; anyway compiling 
> the Python module and importing it returns an error at load time:
> python3.1
> Python 3.1.2 (r312:79147, May  8 2010, 16:36:46) 
> [GCC 4.4.4] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import zookeeper
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: /usr/local/lib/python3.1/dist-packages/zookeeper.so: undefined symbol: PyString_AsString
> are there any plans to support Python 3.x?
> I also tried to write a 3.1 ctypes wrapper but the C API seems in fact to be 
> written in C++, so Python ctypes cannot be used.
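
The missing symbol in the ImportError can be checked without building the extension. A minimal sketch, assuming a CPython 3 interpreter whose C API symbols are reachable through `ctypes.pythonapi`; the helper name `has_c_api_symbol` is made up for illustration. The `PyString_*` family was removed from the Python 3 C API, where `PyBytes_*` and `PyUnicode_*` functions take its place.

```python
import ctypes


def has_c_api_symbol(name):
    """Return True if the running interpreter exports the C API symbol `name`."""
    # ctypes.pythonapi resolves symbols lazily; a missing symbol raises
    # AttributeError, which hasattr() turns into False.
    return hasattr(ctypes.pythonapi, name)


if __name__ == "__main__":
    for symbol in ("PyString_AsString", "PyBytes_AsString", "PyUnicode_AsUTF8String"):
        print(symbol, has_c_api_symbol(symbol))
```

Any call in zookeeper.c that the first check reports as missing is a candidate for porting or for a version-conditional branch.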

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-877) zkpython does not work with python3.1

2010-09-28 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-877:
---

Assignee: TuxRacer

> zkpython does not work with python3.1
> -
>
> Key: ZOOKEEPER-877
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-877
> Project: Zookeeper
>  Issue Type: Bug
>  Components: contrib-bindings
>Affects Versions: 3.3.1
> Environment: linux+python3.1
>Reporter: TuxRacer
>Assignee: TuxRacer
> Fix For: 3.4.0
>
> Attachments: tests_py3k.tgz, zookeeper.c, zookeeper.c.patch.v1
>
>



Re: Snapshot on startup

2010-09-28 Thread Diogo Becker
On Tue, 2010-09-21 at 13:36 -0400, Jared Cantwell wrote:
> Hey all,
> 
> I was looking at the code that loads the snapshots/logged transactions
> into the database on startup, and noticed that the FileTxnSnapLog
> initializes the log iterator to the last transaction committed to the
> snapshot (restore()).  This causes the last transaction to be
> processed, even though it's already in the snapshot.  I'm not sure if
> this is a big problem in reality, or if it was intentional.  Does
> anyone know anything about this?

Hi Jared,
there are other problems on startup (ZOOKEEPER-874 and 876), e.g., the
listener is never called inside restore().

-Diogo

> Also, it seems like loadDataBase is called twice in
> ZooKeeperServer.loadData(), is that intentional for some reason?
> 
> Thanks,
> Jared
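
The off-by-one Jared describes can be sketched in a few lines. This is an illustration only, written in Python rather than ZooKeeper's Java; the function name and data layout are assumptions, not the FileTxnSnapLog code. The log is a list of `(zxid, operation)` pairs, and the only difference between the two calls is whether replay starts at the snapshot's last zxid or just after it.

```python
def txns_to_replay(log, snapshot_zxid, inclusive):
    """Return the logged transactions that would be replayed on top of a snapshot.

    log           -- list of (zxid, operation) tuples in increasing zxid order
    snapshot_zxid -- zxid of the last transaction already contained in the snapshot
    inclusive     -- True starts the iterator at snapshot_zxid itself (the behaviour
                     described in the thread); False starts at the next transaction
    """
    if inclusive:
        return [(zxid, op) for zxid, op in log if zxid >= snapshot_zxid]
    return [(zxid, op) for zxid, op in log if zxid > snapshot_zxid]


if __name__ == "__main__":
    log = [(4, "create /a"), (5, "set /a"), (6, "create /b"), (7, "delete /a")]
    print(txns_to_replay(log, snapshot_zxid=5, inclusive=True))
    print(txns_to_replay(log, snapshot_zxid=5, inclusive=False))
```

Whether the extra replay matters depends on whether applying that one transaction a second time changes the tree, which the thread leaves open.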



[jira] Commented: (ZOOKEEPER-822) Leader election taking a long time to complete

2010-09-28 Thread Benjamin Reed (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915796#action_12915796
 ] 

Benjamin Reed commented on ZOOKEEPER-822:
-

looks good overall flavio. just a quick question: i notice that operations on 
senderWorkerMap in initiateConnection are not synchronized. senderWorkerMap is 
concurrent, but there could be a race between the get, put, and vsw.finish if 
initiateConnection is called concurrently for the same sid. right?

also you need to add a blurb to the config doc for the timeout system variable, 
which should be "zookeeper.cnxtimeout" so that it can be set from the 
configuration file.
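
The race in question is a check-then-act sequence: each map operation is safe on its own, but get, put and finish taken together are not atomic. A minimal sketch of the pattern, in Python instead of Java and with hypothetical class and method names (this is not the QuorumCnxManager code), holding one lock across the whole sequence:

```python
import threading


class Worker:
    """Stand-in for a per-connection sender; only tracks whether it was stopped."""

    def __init__(self, sid):
        self.sid = sid
        self.finished = False

    def finish(self):
        self.finished = True


class WorkerRegistry:
    """Keeps at most one live worker per server id (sid)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._workers = {}

    def replace(self, sid, new_worker):
        # get + put + finish run under one lock, so two concurrent callers for
        # the same sid cannot both pick up the same old worker or orphan one.
        with self._lock:
            old = self._workers.get(sid)
            self._workers[sid] = new_worker
            if old is not None:
                old.finish()
        return old

    def get(self, sid):
        with self._lock:
            return self._workers.get(sid)
```

Without the lock, two callers could both read the same old worker, both insert a new one, and leave one of the new workers running but unreachable from the map.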

> Leader election taking a long time  to complete
> ---
>
> Key: ZOOKEEPER-822
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-822
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.3.0
>Reporter: Vishal K
>Assignee: Vishal K
>Priority: Blocker
> Fix For: 3.3.2, 3.4.0
>
> Attachments: 822.tar.gz, rhel.tar.gz, test_zookeeper_1.log, 
> test_zookeeper_2.log, zk_leader_election.tar.gz, zookeeper-3.4.0.tar.gz, 
> ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822-3.3.2.patch, 
> ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, 
> ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, 
> ZOOKEEPER-822.patch_v1
>
>
> Created a 3 node cluster.
> 1. Fail the ZK leader
> 2. Let leader election finish. Restart the leader and let it join the 
> 3. Repeat 
> After a few rounds leader election takes anywhere from 25 to 60 seconds to finish. 
> Note: we didn't have any ZK clients and no new znodes were created.
> zoo.cfg is shown below:
> #Mon Jul 19 12:15:10 UTC 2010
> server.1=192.168.4.12\:2888\:3888
> server.0=192.168.4.11\:2888\:3888
> clientPort=2181
> dataDir=/var/zookeeper
> syncLimit=2
> server.2=192.168.4.13\:2888\:3888
> initLimit=5
> tickTime=2000
> I have attached logs from two nodes that took a long time to form the cluster 
> after failing the leader. The leader was down anyway, so logs from that node 
> shouldn't matter.
> Look for "START HERE". Logs after that point should be of interest.




[jira] Commented: (ZOOKEEPER-820) update c unit tests to ensure "zombie" java server processes don't cause failure

2010-09-28 Thread Benjamin Reed (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915799#action_12915799
 ] 

Benjamin Reed commented on ZOOKEEPER-820:
-

+1 this looks good to me. did you try it on cygwin?

> update c unit tests to ensure "zombie" java server processes don't cause 
> failure
> 
>
> Key: ZOOKEEPER-820
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-820
> Project: Zookeeper
>  Issue Type: Bug
>Affects Versions: 3.3.1
>Reporter: Patrick Hunt
>Assignee: Michi Mutsuzaki
>Priority: Critical
> Fix For: 3.3.2, 3.4.0
>
> Attachments: ZOOKEEPER-820-1.patch, ZOOKEEPER-820.patch
>
>
> When the C unit tests are run, the server sometimes doesn't shut down at the 
> end of the test; this causes subsequent tests (especially on Hudson) to fail.
> 1) we should try harder to make the server shut down at the end of the test; 
> I suspect this is related to test failing/cleanup
> 2) before the tests are run we should see if the old server is still running 
> and try to shut it down
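
The second point could start from a check like the following. This is a sketch in Python, not part of the C test harness; the function name is hypothetical, and probing the server's TCP port is only an assumption about how a leftover server might be detected.

```python
import socket


def server_is_listening(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

A test runner could call this for the test server's port before starting, and stop whatever process it finds there before launching a fresh server.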




[jira] Updated: (ZOOKEEPER-877) zkpython does not work with python3.1

2010-09-28 Thread TuxRacer (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TuxRacer updated ZOOKEEPER-877:
---

Attachment: zookeeper.c.v2
zookeeper.c.patch.v2

a new version of the C file to also support file stream logs in py3k

> zkpython does not work with python3.1
> -
>
> Key: ZOOKEEPER-877
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-877
> Project: Zookeeper
>  Issue Type: Bug
>  Components: contrib-bindings
>Affects Versions: 3.3.1
> Environment: linux+python3.1
>Reporter: TuxRacer
>Assignee: TuxRacer
> Fix For: 3.4.0
>
> Attachments: tests_py3k.tgz, zookeeper.c, zookeeper.c.patch.v1, 
> zookeeper.c.patch.v2, zookeeper.c.v2
>
>



[jira] Commented: (ZOOKEEPER-877) zkpython does not work with python3.1

2010-09-28 Thread TuxRacer (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915817#action_12915817
 ] 

TuxRacer commented on ZOOKEEPER-877:


the new file streams handling introduced with the above patch should work with 
both python2.6 and python3.1, so the code could probably be simplified by removing 
the 2.6 conditional compilation preprocessor directive.

> zkpython does not work with python3.1
> -
>
> Key: ZOOKEEPER-877
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-877
> Project: Zookeeper
>  Issue Type: Bug
>  Components: contrib-bindings
>Affects Versions: 3.3.1
> Environment: linux+python3.1
>Reporter: TuxRacer
>Assignee: TuxRacer
> Fix For: 3.4.0
>
> Attachments: tests_py3k.tgz, zookeeper.c, zookeeper.c.patch.v1, 
> zookeeper.c.patch.v2, zookeeper.c.v2
>
>



[jira] Commented: (ZOOKEEPER-702) GSoC 2010: Failure Detector Model

2010-09-28 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915825#action_12915825
 ] 

Flavio Junqueira commented on ZOOKEEPER-702:


+1, I'm pretty happy with the patch.

> GSoC 2010: Failure Detector Model
> -
>
> Key: ZOOKEEPER-702
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-702
> Project: Zookeeper
>  Issue Type: Wish
>Reporter: Henry Robinson
>Assignee: Abmar Barros
> Attachments: bertier-pseudo.txt, bertier-pseudo.txt, chen-pseudo.txt, 
> chen-pseudo.txt, phiaccrual-pseudo.txt, phiaccrual-pseudo.txt, 
> ZOOKEEPER-702-code.patch, ZOOKEEPER-702-doc.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch
>
>
> Failure Detector Module
> Possible Mentor
> Henry Robinson (henry at apache dot org)
> Requirements
> Java, some distributed systems knowledge, comfort implementing distributed 
> systems protocols
> Description
> ZooKeeper servers detect the failure of other servers and clients by 
> counting the number of 'ticks' for which they don't get a heartbeat from 
> other machines. This is the 'timeout' method of failure detection and works 
> very well; however it is possible that it is too aggressive and not easily 
> tuned for some more unusual ZooKeeper installations (such as in a wide-area 
> network, or even in a mobile ad-hoc network).
> This project would abstract the notion of failure detection to a dedicated 
> Java module, and implement several failure detectors to compare and contrast 
> their appropriateness for ZooKeeper. For example, Apache Cassandra uses a 
> phi-accrual failure detector (http://ddsg.jaist.ac.jp/pub/HDY+04.pdf) which 
> is much more tunable and has some very interesting properties. This is a 
> great project if you are interested in distributed algorithms, or want to 
> help re-factor some of ZooKeeper's internal code.
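
To make the contrast with a fixed timeout concrete, here is a toy accrual detector. It is a sketch under stated assumptions, not the proposed Java module: the class and method names are hypothetical, and heartbeat gaps are modelled as exponentially distributed to keep the arithmetic short (a simplification made for this example only). The suspicion level is phi = -log10(P(gap > elapsed)), which under that model reduces to elapsed / (mean * ln 10).

```python
import math


class PhiAccrualDetector:
    """Toy phi-accrual detector using an exponential model of heartbeat gaps."""

    def __init__(self):
        self._last = None      # arrival time of the most recent heartbeat
        self._intervals = []   # observed gaps between consecutive heartbeats

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self._last is not None:
            self._intervals.append(now - self._last)
        self._last = now

    def phi(self, now):
        """Suspicion level at time `now`; higher means more likely failed."""
        if not self._intervals:
            return 0.0
        mean = sum(self._intervals) / len(self._intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self._last
        # Exponential model: P(gap > elapsed) = exp(-elapsed / mean),
        # so -log10 of that probability is elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))


if __name__ == "__main__":
    detector = PhiAccrualDetector()
    for t in (0.0, 1.0, 2.0, 3.0):
        detector.heartbeat(t)
    for now in (3.5, 5.0, 10.0):
        print(now, detector.phi(now))
```

A tick-counting timeout suspects a peer once the elapsed time passes a constant; here the caller picks a phi threshold instead, and the elapsed time needed to reach it scales with the observed mean gap.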




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-09-28 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915832#action_12915832
 ] 

Jean-Daniel Cryans commented on ZOOKEEPER-880:
--

bq. to be overly clear - this is happening on just 1 server, the other servers 
on the cluster are not seeing this, is that right?

Yes, sv4borg9.

bq. any insight on GC and JVM activity. Are there significant pauses on the GC, 
or perhaps swapping of that jvm? How active is the JVM? How active (cpu) are 
the other processes on this host? You mentioned they are using 50% disk, what 
about cpu?

No swapping, GC activity is normal as far as I can tell by the GC log, 1 active 
CPU for that process according to top (the rest of the cpus are idle most of 
the time).  

bq. If I understood correctly the JVM hosting the ZK server is hosting other 
code as well, is that right? You mentioned something about hbase managing the 
ZK server, could you elaborate on that as well?

That machine is also the Namenode, JobTracker and HBase master (all in their 
own JVMs). The only thing special is that the quorum peers are started by HBase.

bq. Is there a way you could move the ZK datadir on that host to an unused 
spindle and see if that helps at all?

I'll look into that.

> QuorumCnxManager$SendWorker grows without bounds
> 
>
> Key: ZOOKEEPER-880
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
> Project: Zookeeper
>  Issue Type: Bug
>Affects Versions: 3.2.2
>Reporter: Jean-Daniel Cryans
> Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, 
> hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack
>
>
> We're seeing an issue where one server in the ensemble has a steadily growing 
> number of QuorumCnxManager$SendWorker threads up to a point where the OS runs 
> out of native threads, and at the same time we see a lot of exceptions in the 
> logs.  This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach thread dumps and logs 
> in a moment.
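
For reference, the config above can be read mechanically. A small sketch (Python, with a hypothetical helper name) that parses it and derives the two limits in milliseconds, relying on initLimit and syncLimit being expressed in ticks of tickTime milliseconds. The two ports on each server line are labelled quorum and election ports here.

```python
CONFIG = """\
tickTime=3000
dataDir=/somewhere_thats_not_tmp
clientPort=2181
initLimit=10
syncLimit=5
server.0=sv4borg9:2888:3888
server.1=sv4borg10:2888:3888
server.2=sv4borg11:2888:3888
server.3=sv4borg12:2888:3888
server.4=sv4borg13:2888:3888
"""


def parse_zoo_cfg(text):
    """Parse zoo.cfg-style 'key=value' lines into settings and a server table."""
    settings, servers = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        if key.startswith("server."):
            host, quorum_port, election_port = value.split(":")
            server_id = int(key[len("server."):])
            servers[server_id] = (host, int(quorum_port), int(election_port))
        else:
            settings[key] = value
    return settings, servers


if __name__ == "__main__":
    settings, servers = parse_zoo_cfg(CONFIG)
    tick_ms = int(settings["tickTime"])
    print("init limit (ms):", tick_ms * int(settings["initLimit"]))
    print("sync limit (ms):", tick_ms * int(settings["syncLimit"]))
    print("server.0:", servers[0])
```

The "first server" in the report is server.0, that is sv4borg9.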




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-09-28 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915838#action_12915838
 ] 

Patrick Hunt commented on ZOOKEEPER-880:


JD tried moving the data directory to another disk (new datadir); that didn't 
help, same problem. Also note: the snapshot file is ~2 MB in size.

> QuorumCnxManager$SendWorker grows without bounds
> 
>
> Key: ZOOKEEPER-880
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
> Project: Zookeeper
>  Issue Type: Bug
>Affects Versions: 3.2.2
>Reporter: Jean-Daniel Cryans
> Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, 
> hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack
>
>



[jira] Updated: (ZOOKEEPER-702) GSoC 2010: Failure Detector Model

2010-09-28 Thread Abmar Barros (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abmar Barros updated ZOOKEEPER-702:
---

Status: Patch Available  (was: Open)

> GSoC 2010: Failure Detector Model
> -
>
> Key: ZOOKEEPER-702
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-702
> Project: Zookeeper
>  Issue Type: Wish
>Reporter: Henry Robinson
>Assignee: Abmar Barros
> Attachments: bertier-pseudo.txt, bertier-pseudo.txt, chen-pseudo.txt, 
> chen-pseudo.txt, phiaccrual-pseudo.txt, phiaccrual-pseudo.txt, 
> ZOOKEEPER-702-code.patch, ZOOKEEPER-702-doc.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch
>
>



[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-09-28 Thread Benjamin Reed (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915849#action_12915849
 ] 

Benjamin Reed commented on ZOOKEEPER-880:
-

is there an easy way to reproduce this?

> QuorumCnxManager$SendWorker grows without bounds
> 
>
> Key: ZOOKEEPER-880
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
> Project: Zookeeper
>  Issue Type: Bug
>Affects Versions: 3.2.2
>Reporter: Jean-Daniel Cryans
> Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, 
> hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack, 
> TRACE-hbase-hadoop-zookeeper-sv4borg9.log.gz
>
>



[jira] Updated: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-09-28 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated ZOOKEEPER-880:
-

Attachment: TRACE-hbase-hadoop-zookeeper-sv4borg9.log.gz

Here's a new run, at TRACE-level, starting from a fresh log and a cleaned up 
dataDir on a different disk.

> QuorumCnxManager$SendWorker grows without bounds
> 
>
> Key: ZOOKEEPER-880
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
> Project: Zookeeper
>  Issue Type: Bug
>Affects Versions: 3.2.2
>Reporter: Jean-Daniel Cryans
> Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, 
> hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack, 
> TRACE-hbase-hadoop-zookeeper-sv4borg9.log.gz
>
>



[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-09-28 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915850#action_12915850
 ] 

Jean-Daniel Cryans commented on ZOOKEEPER-880:
--

bq. is there an easy way to reproduce this?

Unfortunately none I can see... we have 5 clusters that use the same hardware 
and ZK configurations and we only find this issue on this cluster, on this 
specific node, although all the other nodes of that cluster have the same 
InterruptedExceptions (but aren't leaking SendWorkers).

> QuorumCnxManager$SendWorker grows without bounds
> 
>
> Key: ZOOKEEPER-880
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
> Project: Zookeeper
>  Issue Type: Bug
>Affects Versions: 3.2.2
>Reporter: Jean-Daniel Cryans
> Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, 
> hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack, 
> TRACE-hbase-hadoop-zookeeper-sv4borg9.log.gz
>
>



[jira] Updated: (ZOOKEEPER-822) Leader election taking a long time to complete

2010-09-28 Thread Flavio Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Junqueira updated ZOOKEEPER-822:
---

Status: Open  (was: Patch Available)

> Leader election taking a long time  to complete
> ---
>
> Key: ZOOKEEPER-822
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-822
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.3.0
>Reporter: Vishal K
>Assignee: Vishal K
>Priority: Blocker
> Fix For: 3.3.2, 3.4.0
>
> Attachments: 822.tar.gz, rhel.tar.gz, test_zookeeper_1.log, 
> test_zookeeper_2.log, zk_leader_election.tar.gz, zookeeper-3.4.0.tar.gz, 
> ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822-3.3.2.patch, 
> ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, 
> ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, 
> ZOOKEEPER-822.patch_v1
>
>



[jira] Updated: (ZOOKEEPER-822) Leader election taking a long time to complete

2010-09-28 Thread Flavio Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Junqueira updated ZOOKEEPER-822:
---

Attachment: ZOOKEEPER-822-3.3.2.patch

> Leader election taking a long time  to complete
> ---
>
> Key: ZOOKEEPER-822
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-822
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.3.0
>Reporter: Vishal K
>Assignee: Vishal K
>Priority: Blocker
> Fix For: 3.3.2, 3.4.0
>
> Attachments: 822.tar.gz, rhel.tar.gz, test_zookeeper_1.log, 
> test_zookeeper_2.log, zk_leader_election.tar.gz, zookeeper-3.4.0.tar.gz, 
> ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822-3.3.2.patch, 
> ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822.patch, 
> ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, 
> ZOOKEEPER-822.patch, ZOOKEEPER-822.patch_v1
>
>



[jira] Updated: (ZOOKEEPER-822) Leader election taking a long time to complete

2010-09-28 Thread Flavio Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Junqueira updated ZOOKEEPER-822:
---

Attachment: ZOOKEEPER-822.patch

> Leader election taking a long time  to complete
> ---
>
> Key: ZOOKEEPER-822
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-822
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.3.0
>Reporter: Vishal K
>Assignee: Vishal K
>Priority: Blocker
> Fix For: 3.3.2, 3.4.0
>
> Attachments: 822.tar.gz, rhel.tar.gz, test_zookeeper_1.log, 
> test_zookeeper_2.log, zk_leader_election.tar.gz, zookeeper-3.4.0.tar.gz, 
> ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822-3.3.2.patch, 
> ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822.patch, 
> ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, 
> ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, ZOOKEEPER-822.patch_v1
>
>



[jira] Updated: (ZOOKEEPER-822) Leader election taking a long time to complete

2010-09-28 Thread Flavio Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Junqueira updated ZOOKEEPER-822:
---

Status: Patch Available  (was: Open)

Thanks for the comments, Ben. I have modified zookeeperAdmin and added the 
"zookeeper." prefix to the code.

Regarding your question, initiateConnection is called from two methods: 
testInitiateConnection (used only in tests) and connectOne. connectOne is 
synchronized. Do you still see an issue?

> Leader election taking a long time  to complete
> ---
>
> Key: ZOOKEEPER-822
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-822
> Project: Zookeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.3.0
>Reporter: Vishal K
>Assignee: Vishal K
>Priority: Blocker
> Fix For: 3.3.2, 3.4.0
>
> Attachments: 822.tar.gz, rhel.tar.gz, test_zookeeper_1.log, 
> test_zookeeper_2.log, zk_leader_election.tar.gz, zookeeper-3.4.0.tar.gz, 
> ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822-3.3.2.patch, 
> ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822-3.3.2.patch, ZOOKEEPER-822.patch, 
> ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, 
> ZOOKEEPER-822.patch, ZOOKEEPER-822.patch, ZOOKEEPER-822.patch_v1
>
>
> Created a 3 node cluster.
> 1. Fail the ZK leader
> 2. Let leader election finish. Restart the leader and let it join the 
> 3. Repeat 
> After a few rounds, leader election takes anywhere from 25 to 60 seconds to finish. 
> Note: we didn't have any ZK clients and no new znodes were created.
> zoo.cfg is shown below:
> #Mon Jul 19 12:15:10 UTC 2010
> server.1=192.168.4.12\:2888\:3888
> server.0=192.168.4.11\:2888\:3888
> clientPort=2181
> dataDir=/var/zookeeper
> syncLimit=2
> server.2=192.168.4.13\:2888\:3888
> initLimit=5
> tickTime=2000
> I have attached logs from two nodes that took a long time to form the cluster 
> after failing the leader. The leader was down anyway, so logs from that node 
> shouldn't matter.
> Look for "START HERE". Logs after that point should be of interest.
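
For reference, the limits in the quoted zoo.cfg are expressed in ticks, so the
wall-clock values follow from tickTime. The sketch below only parses the file
as quoted and does that arithmetic; the helper names (parse_properties,
derived_timeouts_ms) are made up for illustration and are not ZooKeeper APIs.
It does not attempt to explain the 25-60 second elections reported above.

```python
# Sketch only: parses the zoo.cfg quoted in the report and derives the
# timeouts that initLimit and syncLimit express in units of tickTime.
ZOO_CFG = r"""
#Mon Jul 19 12:15:10 UTC 2010
server.1=192.168.4.12\:2888\:3888
server.0=192.168.4.11\:2888\:3888
clientPort=2181
dataDir=/var/zookeeper
syncLimit=2
server.2=192.168.4.13\:2888\:3888
initLimit=5
tickTime=2000
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # The file was written in Java properties style, where ':' is
        # escaped as '\:'; undo that escaping here.
        props[key.strip()] = value.strip().replace("\\:", ":")
    return props


def derived_timeouts_ms(props):
    """Return (init_timeout_ms, sync_timeout_ms) from the tick-based limits."""
    tick = int(props["tickTime"])
    return int(props["initLimit"]) * tick, int(props["syncLimit"]) * tick


cfg = parse_properties(ZOO_CFG)
init_ms, sync_ms = derived_timeouts_ms(cfg)
print(init_ms, sync_ms)
```

With the values quoted in the report this works out to 5 x 2000 ms for
initLimit and 2 x 2000 ms for syncLimit.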




[jira] Updated: (ZOOKEEPER-702) GSoC 2010: Failure Detector Model

2010-09-28 Thread Flavio Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Junqueira updated ZOOKEEPER-702:
---

Status: Open  (was: Patch Available)

I forgot to mention that the patch does not apply cleanly. I had to delete the 
first two lines (generated by Eclipse); once I did, it applied cleanly. 
Abmar, could you upload a new patch? My +1 still holds...

> GSoC 2010: Failure Detector Model
> -
>
> Key: ZOOKEEPER-702
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-702
> Project: Zookeeper
>  Issue Type: Wish
>Reporter: Henry Robinson
>Assignee: Abmar Barros
> Attachments: bertier-pseudo.txt, bertier-pseudo.txt, chen-pseudo.txt, 
> chen-pseudo.txt, phiaccrual-pseudo.txt, phiaccrual-pseudo.txt, 
> ZOOKEEPER-702-code.patch, ZOOKEEPER-702-doc.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch
>
>
> Failure Detector Module
> Possible Mentor
> Henry Robinson (henry at apache dot org)
> Requirements
> Java, some distributed systems knowledge, comfort implementing distributed 
> systems protocols
> Description
> ZooKeeper servers detect the failure of other servers and clients by 
> counting the number of 'ticks' for which they don't get a heartbeat from 
> other machines. This is the 'timeout' method of failure detection and works 
> very well; however it is possible that it is too aggressive and not easily 
> tuned for some more unusual ZooKeeper installations (such as in a wide-area 
> network, or even in a mobile ad-hoc network).
> This project would abstract the notion of failure detection to a dedicated 
> Java module, and implement several failure detectors to compare and contrast 
> their appropriateness for ZooKeeper. For example, Apache Cassandra uses a 
> phi-accrual failure detector (http://ddsg.jaist.ac.jp/pub/HDY+04.pdf) which 
> is much more tunable and has some very interesting properties. This is a 
> great project if you are interested in distributed algorithms, or want to 
> help re-factor some of ZooKeeper's internal code.
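
To make the contrast in the description concrete, here is a minimal sketch of
the two styles of detector: a fixed timeout, and a simplified phi-accrual
detector that models heartbeat inter-arrival times as normally distributed.
This is an illustration only and is not the code in the ZOOKEEPER-702 patches;
the class names, the window size and the min_std floor are all assumptions.

```python
import math
from collections import deque


class FixedTimeoutDetector:
    """Suspects a peer once no heartbeat has arrived for `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last = None

    def heartbeat(self, now):
        self.last = now

    def suspect(self, now):
        return self.last is not None and now - self.last > self.timeout


class PhiAccrualDetector:
    """Outputs a suspicion level phi instead of a yes/no answer."""

    def __init__(self, window=100, min_std=0.05):
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.min_std = min_std                 # floor so std is never zero
        self.last = None

    def heartbeat(self, now):
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        n = len(self.intervals)
        mean = sum(self.intervals) / n
        var = sum((x - mean) ** 2 for x in self.intervals) / n
        std = max(math.sqrt(var), self.min_std)
        elapsed = now - self.last
        # P(next heartbeat arrives later than `elapsed`) under a normal model.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # avoid log10(0) on underflow
        return -math.log10(p_later)


fixed = FixedTimeoutDetector(timeout=4)
accrual = PhiAccrualDetector()
for t in range(11):  # heartbeats at t = 0, 1, ..., 10 seconds
    fixed.heartbeat(t)
    accrual.heartbeat(t)
```

The caller picks a phi threshold to trade detection speed against false
suspicions, which is the tunability the description refers to; the fixed
detector has only the single timeout to adjust.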




[jira] Commented: (ZOOKEEPER-820) update c unit tests to ensure "zombie" java server processes don't cause failure

2010-09-28 Thread Michi Mutsuzaki (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915918#action_12915918
 ] 

Michi Mutsuzaki commented on ZOOKEEPER-820:
---

Cygwin doesn't have lsof. 

Most of the time, the 'zombie' process is caused by invalid return code 
checking (it keeps starting the server process in a loop for 2 minutes). I 
think it's OK to use a pid file to keep track of the server process. 

I'll submit a new patch.

--Michi

> update c unit tests to ensure "zombie" java server processes don't cause 
> failure
> 
>
> Key: ZOOKEEPER-820
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-820
> Project: Zookeeper
>  Issue Type: Bug
>Affects Versions: 3.3.1
>Reporter: Patrick Hunt
>Assignee: Michi Mutsuzaki
>Priority: Critical
> Fix For: 3.3.2, 3.4.0
>
> Attachments: ZOOKEEPER-820-1.patch, ZOOKEEPER-820.patch
>
>
> When the C unit tests are run, sometimes the server doesn't shut down at the 
> end of the test; this causes subsequent tests (Hudson especially) to fail.
> 1) we should try harder to make the server shut down at the end of the test, 
> I suspect this is related to test failing/cleanup
> 2) before the tests are run we should see if the old server is still running 
> and try to shut it down
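
The pid-file approach discussed in this thread can be sketched roughly as
below. This is a POSIX-only illustration in Python, not the attached patch;
the function names and the idea of passing the pid-file path as an argument
are assumptions made for the example.

```python
import os
import signal


def read_pid(pid_file):
    """Return the pid stored in pid_file, or None if missing or malformed."""
    try:
        with open(pid_file) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def is_running(pid):
    """POSIX check: signal 0 tests for existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stop_stale_server(pid_file):
    """Terminate a leftover server recorded in pid_file.

    Returns True if a running process was signalled, False otherwise.
    The pid file is removed in either case so the next run starts clean.
    """
    pid = read_pid(pid_file)
    signalled = False
    if pid is not None and is_running(pid):
        os.kill(pid, signal.SIGTERM)
        signalled = True
    if os.path.exists(pid_file):
        os.remove(pid_file)
    return signalled
```

A test harness would call stop_stale_server() before starting a fresh server
(point 2 above) and again during cleanup (point 1), so that a failed test
does not leave a server behind for the next run.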




[jira] Updated: (ZOOKEEPER-820) update c unit tests to ensure "zombie" java server processes don't cause failure

2010-09-28 Thread Michi Mutsuzaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michi Mutsuzaki updated ZOOKEEPER-820:
--

Attachment: ZOOKEEPER-820.patch

Reverted to using a pid file instead of lsof. 

> update c unit tests to ensure "zombie" java server processes don't cause 
> failure
> 
>
> Key: ZOOKEEPER-820
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-820
> Project: Zookeeper
>  Issue Type: Bug
>Affects Versions: 3.3.1
>Reporter: Patrick Hunt
>Assignee: Michi Mutsuzaki
>Priority: Critical
> Fix For: 3.3.2, 3.4.0
>
> Attachments: ZOOKEEPER-820-1.patch, ZOOKEEPER-820.patch, 
> ZOOKEEPER-820.patch
>
>
> When the C unit tests are run, sometimes the server doesn't shut down at the 
> end of the test; this causes subsequent tests (Hudson especially) to fail.
> 1) we should try harder to make the server shut down at the end of the test, 
> I suspect this is related to test failing/cleanup
> 2) before the tests are run we should see if the old server is still running 
> and try to shut it down




[jira] Created: (ZOOKEEPER-881) ZooKeeperServer.loadData loads database twise

2010-09-28 Thread Jared Cantwell (JIRA)
ZooKeeperServer.loadData loads database twise
-

 Key: ZOOKEEPER-881
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-881
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Reporter: Jared Cantwell
Priority: Trivial


zkDb.loadDataBase() is called twice at the beginning of loadData(). It 
shouldn't have any negative effects, but is unnecessary. A patch should be 
trivial.




[jira] Updated: (ZOOKEEPER-881) ZooKeeperServer.loadData loads database twice

2010-09-28 Thread Jared Cantwell (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jared Cantwell updated ZOOKEEPER-881:
-

Summary: ZooKeeperServer.loadData loads database twice  (was: 
ZooKeeperServer.loadData loads database twise)

> ZooKeeperServer.loadData loads database twice
> -
>
> Key: ZOOKEEPER-881
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-881
> Project: Zookeeper
>  Issue Type: Bug
>  Components: server
>Reporter: Jared Cantwell
>Priority: Trivial
>
> zkDb.loadDataBase() is called twice at the beginning of loadData(). It 
> shouldn't have any negative effects, but is unnecessary. A patch should be 
> trivial.




[jira] Updated: (ZOOKEEPER-882) Startup loads last transaction from snapshot

2010-09-28 Thread Jared Cantwell (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jared Cantwell updated ZOOKEEPER-882:
-

Attachment: 882.diff

A simple patch for consideration.

> Startup loads last transaction from snapshot
> 
>
> Key: ZOOKEEPER-882
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-882
> Project: Zookeeper
>  Issue Type: Bug
>  Components: server
>Reporter: Jared Cantwell
>Priority: Minor
> Attachments: 882.diff
>
>
> On startup, the server first loads the latest snapshot, and then loads from 
> the log starting at the last transaction in the snapshot. It should instead 
> begin at the transaction immediately after that one in the log. I will 
> attach a possible patch.




[jira] Created: (ZOOKEEPER-882) Startup loads last transaction from snapshot

2010-09-28 Thread Jared Cantwell (JIRA)
Startup loads last transaction from snapshot


 Key: ZOOKEEPER-882
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-882
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Reporter: Jared Cantwell
Priority: Minor
 Attachments: 882.diff

On startup, the server first loads the latest snapshot, and then loads from the 
log starting at the last transaction in the snapshot. It should instead begin 
at the transaction immediately after that one in the log. I will attach a 
possible patch.
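
The intended behaviour can be illustrated with a small sketch: replay only
the log entries whose transaction id is strictly greater than the last one
covered by the snapshot. This is a toy model in Python and not the attached
882.diff; the restore() function, the (zxid, key, value) tuple layout and the
dictionary state are assumptions made for the example.

```python
def restore(snapshot_state, snapshot_zxid, txn_log):
    """Rebuild state from a snapshot plus the transactions logged after it.

    txn_log is a list of (zxid, key, value) tuples in increasing zxid order.
    Returns (state, replayed_zxids).
    """
    state = dict(snapshot_state)
    replayed = []
    for zxid, key, value in txn_log:
        # Start one past the snapshot's last zxid: everything up to and
        # including snapshot_zxid is already reflected in the snapshot.
        if zxid <= snapshot_zxid:
            continue
        state[key] = value
        replayed.append(zxid)
    return state, replayed


# Snapshot taken after transaction 3; the log still contains 2 and 3.
log = [(2, "/a", "0"), (3, "/a", "1"), (4, "/b", "2"), (5, "/a", "3")]
state, replayed = restore({"/a": "1"}, 3, log)
print(state, replayed)
```

Using `zxid < snapshot_zxid` in the comparison would reproduce the reported
behaviour, where the snapshot's last transaction is loaded a second time.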




[jira] Updated: (ZOOKEEPER-702) GSoC 2010: Failure Detector Model

2010-09-28 Thread Abmar Barros (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abmar Barros updated ZOOKEEPER-702:
---

Attachment: ZOOKEEPER-702.patch

Fixed the patch-generation issue that Flavio mentioned.

> GSoC 2010: Failure Detector Model
> -
>
> Key: ZOOKEEPER-702
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-702
> Project: Zookeeper
>  Issue Type: Wish
>Reporter: Henry Robinson
>Assignee: Abmar Barros
> Attachments: bertier-pseudo.txt, bertier-pseudo.txt, chen-pseudo.txt, 
> chen-pseudo.txt, phiaccrual-pseudo.txt, phiaccrual-pseudo.txt, 
> ZOOKEEPER-702-code.patch, ZOOKEEPER-702-doc.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch
>
>
> Failure Detector Module
> Possible Mentor
> Henry Robinson (henry at apache dot org)
> Requirements
> Java, some distributed systems knowledge, comfort implementing distributed 
> systems protocols
> Description
> ZooKeeper servers detect the failure of other servers and clients by 
> counting the number of 'ticks' for which they don't get a heartbeat from 
> other machines. This is the 'timeout' method of failure detection and works 
> very well; however it is possible that it is too aggressive and not easily 
> tuned for some more unusual ZooKeeper installations (such as in a wide-area 
> network, or even in a mobile ad-hoc network).
> This project would abstract the notion of failure detection to a dedicated 
> Java module, and implement several failure detectors to compare and contrast 
> their appropriateness for ZooKeeper. For example, Apache Cassandra uses a 
> phi-accrual failure detector (http://ddsg.jaist.ac.jp/pub/HDY+04.pdf) which 
> is much more tunable and has some very interesting properties. This is a 
> great project if you are interested in distributed algorithms, or want to 
> help re-factor some of ZooKeeper's internal code.




[jira] Commented: (ZOOKEEPER-702) GSoC 2010: Failure Detector Model

2010-09-28 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916021#action_12916021
 ] 

Patrick Hunt commented on ZOOKEEPER-702:


Patch applied cleanly for me. I ran the tests and they all passed.

I did notice that the Java test time for me went from ~18 minutes to ~28 minutes. 
I don't think we should hold up this patch for this; however, going forward we 
should think about how to enable a quicker "commit test" (run by devs before 
submitting/committing), one that hopefully runs in under 10 minutes, alongside 
the longer-running Hudson and integration tests. We will probably need to use 
something like JUnit test category support.

Nice job, thanks.


> GSoC 2010: Failure Detector Model
> -
>
> Key: ZOOKEEPER-702
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-702
> Project: Zookeeper
>  Issue Type: Wish
>Reporter: Henry Robinson
>Assignee: Abmar Barros
> Attachments: bertier-pseudo.txt, bertier-pseudo.txt, chen-pseudo.txt, 
> chen-pseudo.txt, phiaccrual-pseudo.txt, phiaccrual-pseudo.txt, 
> ZOOKEEPER-702-code.patch, ZOOKEEPER-702-doc.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, 
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch
>
>
> Failure Detector Module
> Possible Mentor
> Henry Robinson (henry at apache dot org)
> Requirements
> Java, some distributed systems knowledge, comfort implementing distributed 
> systems protocols
> Description
> ZooKeeper servers detect the failure of other servers and clients by 
> counting the number of 'ticks' for which they don't get a heartbeat from 
> other machines. This is the 'timeout' method of failure detection and works 
> very well; however it is possible that it is too aggressive and not easily 
> tuned for some more unusual ZooKeeper installations (such as in a wide-area 
> network, or even in a mobile ad-hoc network).
> This project would abstract the notion of failure detection to a dedicated 
> Java module, and implement several failure detectors to compare and contrast 
> their appropriateness for ZooKeeper. For example, Apache Cassandra uses a 
> phi-accrual failure detector (http://ddsg.jaist.ac.jp/pub/HDY+04.pdf) which 
> is much more tunable and has some very interesting properties. This is a 
> great project if you are interested in distributed algorithms, or want to 
> help re-factor some of ZooKeeper's internal code.
