Flavio Paiva Junqueira commented on ZOOKEEPER-702:
Here are a couple of comments:
# It sounds like we will have to sample the heartbeats instead of considering
all of them. If we use all messages as heartbeats, then we may end up in a
situation in which we have bursts of traffic interleaved with periods of
silence. In such cases, the failure detector might get confused for some time
when it transitions from burst to silence. Using the terminology of Chen et al.,
I was thinking that we could take \eta and only consider one heartbeat for
every period determined by \eta. In ZooKeeper today, we have such \eta for both
client-side detection and server-side detection. Alternatively, we could simply
work with the phi accrual detector on the server side for now, make sure it
works, and then revisit the other two.
# The choice of an exponential distribution in the math commons implementation
is curious. The original paper does assume a normal distribution, so I wonder
why it uses exponential instead.
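
To make both points concrete, here is a minimal Java sketch. It is not taken from the attached patches: the class name SampledPhiDetector, its methods, and the rule "accept a message only if at least \eta has elapsed since the last accepted one" are assumptions for illustration, and that rule is only one possible reading of "one heartbeat per period". The normal tail uses the Abramowitz-Stegun 7.1.26 approximation of erfc, because the JDK has no built-in error function.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch; not ZooKeeper code and not the attached patch. */
final class SampledPhiDetector {
    private final long etaMillis;        // sampling period, playing the role of \eta
    private final int windowSize;        // how many inter-arrival samples to keep
    private final Deque<Long> intervals = new ArrayDeque<>();
    private long lastSampleMillis = -1;  // arrival time of the last accepted heartbeat

    SampledPhiDetector(long etaMillis, int windowSize) {
        this.etaMillis = etaMillis;
        this.windowSize = windowSize;
    }

    /** Accepts at most one message per eta period; returns true if it was sampled. */
    boolean onMessage(long nowMillis) {
        if (lastSampleMillis >= 0 && nowMillis - lastSampleMillis < etaMillis) {
            return false;                // burst traffic inside the current period is ignored
        }
        if (lastSampleMillis >= 0) {
            intervals.addLast(nowMillis - lastSampleMillis);
            if (intervals.size() > windowSize) {
                intervals.removeFirst(); // keep a sliding window of recent intervals
            }
        }
        lastSampleMillis = nowMillis;
        return true;
    }

    /** Mean of the sampled inter-arrival times, or NaN if there are none yet. */
    double meanInterval() {
        double sum = 0;
        for (long i : intervals) sum += i;
        return intervals.isEmpty() ? Double.NaN : sum / intervals.size();
    }

    /** Population standard deviation of the sampled inter-arrival times. */
    double stdDevInterval() {
        double mean = meanInterval();
        double sq = 0;
        for (long i : intervals) sq += (i - mean) * (i - mean);
        return intervals.isEmpty() ? Double.NaN : Math.sqrt(sq / intervals.size());
    }

    /** phi under an exponential model: P(later) = exp(-elapsed / mean). */
    static double phiExponential(double elapsed, double mean) {
        return elapsed / (mean * Math.log(10.0));
    }

    /** phi under a normal model: P(later) = 1 - CDF((elapsed - mean) / stdDev). */
    static double phiNormal(double elapsed, double mean, double stdDev) {
        double z = (elapsed - mean) / stdDev;   // caller must ensure stdDev > 0
        double pLater = 0.5 * erfc(z / Math.sqrt(2.0));
        return -Math.log10(pLater);             // Infinity once pLater underflows to 0
    }

    /** Abramowitz-Stegun 7.1.26 approximation of erfc (absolute error about 1.5e-7). */
    static double erfc(double x) {
        double ax = Math.abs(x);
        double t = 1.0 / (1.0 + 0.3275911 * ax);
        double poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
                + t * (-1.453152027 + t * 1.061405429))));
        double tail = poly * Math.exp(-ax * ax);
        return x >= 0 ? tail : 2.0 - tail;
    }
}
```

Under the exponential model phi grows linearly with the elapsed time (for example, one mean interval of silence gives phi = 1/ln 10, about 0.43). Under the normal model phi grows roughly quadratically in the tail and depends on the variance, so the same threshold can behave quite differently between the two.
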
> GSoC 2010: Failure Detector Model
> Key: ZOOKEEPER-702
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-702
> Project: Zookeeper
> Issue Type: Wish
> Reporter: Henry Robinson
> Assignee: Abmar Barros
> Attachments: bertier-pseudo.txt, bertier-pseudo.txt, chen-pseudo.txt,
> chen-pseudo.txt, phiaccrual-pseudo.txt, phiaccrual-pseudo.txt,
> ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch
> Failure Detector Module
> Possible Mentor
> Henry Robinson (henry at apache dot org)
> Java, some distributed systems knowledge, comfort implementing distributed
> systems protocols
> ZooKeeper servers detect the failure of other servers and clients by
> counting the number of 'ticks' for which they don't get a heartbeat from
> other machines. This is the 'timeout' method of failure detection and works
> very well; however it is possible that it is too aggressive and not easily
> tuned for some more unusual ZooKeeper installations (such as in a wide-area
> network, or even in a mobile ad-hoc network).
> This project would abstract the notion of failure detection to a dedicated
> Java module, and implement several failure detectors to compare and contrast
> their appropriateness for ZooKeeper. For example, Apache Cassandra uses a
> phi-accrual failure detector (http://ddsg.jaist.ac.jp/pub/HDY+04.pdf) which
> is much more tunable and has some very interesting properties. This is a
> great project if you are interested in distributed algorithms, or want to
> help re-factor some of ZooKeeper's internal code.
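
For reference, the tick-counting timeout described in the issue could sit behind a small detector abstraction along the following lines. This is only a sketch: the interface FailureDetector, the class TickTimeoutDetector, and their parameters are hypothetical names chosen for illustration, not the API of ZooKeeper or of the attached patches.

```java
/** Hypothetical interface; names are illustrative, not taken from the patch. */
interface FailureDetector {
    /** Records that a heartbeat from the monitored peer arrived at nowMillis. */
    void heartbeat(long nowMillis);

    /** Returns true if the monitored peer should be suspected at nowMillis. */
    boolean isSuspected(long nowMillis);
}

/** Fixed-timeout detector: suspect the peer after maxMissedTicks ticks of silence. */
final class TickTimeoutDetector implements FailureDetector {
    private final long tickMillis;     // length of one tick
    private final int maxMissedTicks;  // ticks of silence tolerated before suspicion
    private long lastHeartbeatMillis;  // arrival time of the most recent heartbeat

    TickTimeoutDetector(long tickMillis, int maxMissedTicks, long startMillis) {
        this.tickMillis = tickMillis;
        this.maxMissedTicks = maxMissedTicks;
        this.lastHeartbeatMillis = startMillis;
    }

    @Override
    public void heartbeat(long nowMillis) {
        lastHeartbeatMillis = nowMillis;
    }

    @Override
    public boolean isSuspected(long nowMillis) {
        // Count whole ticks elapsed since the last heartbeat.
        long missedTicks = (nowMillis - lastHeartbeatMillis) / tickMillis;
        return missedTicks >= maxMissedTicks;
    }
}
```

An accrual-style detector could implement the same interface by comparing its suspicion level against a threshold in isSuspected, which would let the two approaches be swapped and compared.
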