[jira] [Updated] (CASSANDRA-3830) gossip-to-seeds is not obviously independent of failure detection algorithm
[ https://issues.apache.org/jira/browse/CASSANDRA-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

C. Scott Andreas updated CASSANDRA-3830:
    Component/s: Streaming and Messaging

> gossip-to-seeds is not obviously independent of failure detection algorithm
>
>                 Key: CASSANDRA-3830
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3830
>             Project: Cassandra
>          Issue Type: Task
>          Components: Distributed Metadata
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>            Priority: Minor
>
> The failure detector, ignoring all the theory, boils down to an
> extremely simple algorithm. The FD keeps a sliding window (currently of
> size 1000) of heartbeat intervals for a given host. Meaning, we have a
> track record of the last 1000 times we saw an updated heartbeat for
> that host.
>
> At any given moment, a host has a score which is simply the time since
> the last heartbeat, divided by the *mean* interval in the sliding
> window. For historical reasons a simple scaling factor is applied to
> this prior to checking the phi conviction threshold.
> (CASSANDRA-2597 has details, but thanks to Paul's work there it's now
> trivial to understand what it does based on gut feeling.)
>
> So in effect, a host is considered down if we haven't heard from it for
> a time which is significantly longer than the "average" time we expect
> to hear from it.
>
> This seems reasonable, but it does assume that under normal conditions
> the average time between heartbeats does not change for reasons other
> than those that would be plausible reasons to think a node is
> unhealthy.
>
> This assumption *could* be violated by the gossip-to-seed feature.
> There is an argument to avoid gossip-to-seed for other reasons (see
> CASSANDRA-3829), but this is a concrete case in which gossip-to-seed
> could cause a negative side-effect of the general kind mentioned in
> CASSANDRA-3829 (see notes at the end about the case w/o seeds not
> being continuously tested). Normally, due to gossip to seed, everyone
> essentially sees the latest information within very few heartbeats
> (assuming only 2-3 seeds). But should all seeds be down, we suddenly
> flip a switch and start relying on generalized propagation in the
> gossip system rather than the seed special case.
>
> The potential problem I foresee here is that if the average propagation
> time suddenly spikes when all seeds become unavailable, it could cause
> bogus flapping of nodes into the down state.
>
> In order to test this, I deployed a ~180 node cluster with a version
> that logs heartbeat information on each interpret(), similar to:
>
> INFO [GossipTasks:1] 2012-02-01 23:29:58,746 FailureDetector.java (line 187)
> ep /XXX.XXX.XXX.XXX is at phi 0.0019521638443084342, last interval 7.0, mean
> is 1557.27778
>
> It turns out that, at least at 180 nodes with 4 seed nodes, whether or
> not the seeds are running *does not* seem to matter significantly. In
> both cases, the mean interval is around 1500 milliseconds.
>
> I don't feel I have a good grasp of whether this is incidental or
> guaranteed, and it would be good to at least empirically test
> propagation time w/o seeds at different cluster sizes; it is supposed
> to be unaffected by cluster size ({{RING_DELAY}} is static for this
> reason, is my understanding). Would be nice to see that this is the
> case.
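To make the scoring described in the issue concrete, below is a minimal Java sketch of an interval-window failure detector of the kind outlined above. It is not Cassandra's actual FailureDetector: the class name SimpleArrivalWindow, the window-trimming details, the scaling factor and the conviction threshold value are illustrative assumptions; only the overall shape (sliding window of heartbeat intervals, score = time since last heartbeat over the mean interval, scaled and compared to a threshold) follows the description.

{code:java}
import java.util.ArrayDeque;

/**
 * Illustrative sketch only, not Cassandra's FailureDetector. Keeps a bounded
 * window of heartbeat intervals per host and scores the host by
 * (time since last heartbeat) / (mean interval), scaled before comparison
 * against a phi conviction threshold. Constants are assumptions.
 */
public class SimpleArrivalWindow
{
    private static final int WINDOW_SIZE = 1000;            // sliding window of intervals
    private static final double PHI_SCALING_FACTOR = 0.4;   // illustrative constant
    private static final double PHI_CONVICT_THRESHOLD = 8;  // illustrative threshold

    private final ArrayDeque<Long> intervals = new ArrayDeque<>();
    private long lastHeartbeatMillis = -1;

    /** Record that a new heartbeat (gossip update) was observed for this host. */
    public synchronized void report(long nowMillis)
    {
        if (lastHeartbeatMillis >= 0)
        {
            if (intervals.size() == WINDOW_SIZE)
                intervals.removeFirst();
            intervals.addLast(nowMillis - lastHeartbeatMillis);
        }
        lastHeartbeatMillis = nowMillis;
    }

    /** Score = time since last heartbeat over the mean interval, scaled. */
    public synchronized double phi(long nowMillis)
    {
        if (intervals.isEmpty() || lastHeartbeatMillis < 0)
            return 0.0;
        double sum = 0;
        for (long interval : intervals)
            sum += interval;
        double mean = sum / intervals.size();
        double sinceLast = nowMillis - lastHeartbeatMillis;
        return PHI_SCALING_FACTOR * (sinceLast / mean);
    }

    /** A host is convicted (considered down) once phi exceeds the threshold. */
    public synchronized boolean shouldConvict(long nowMillis)
    {
        return phi(nowMillis) > PHI_CONVICT_THRESHOLD;
    }
}
{code}

In this model a node is only convicted once the gap since its last heartbeat is several times the historical mean interval in the window. The failure mode the issue worries about is therefore a sudden increase in actual propagation time (e.g. after losing the seed fast-path) that the historical mean does not yet reflect, which could make healthy nodes look convictable.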
[jira] [Updated] (CASSANDRA-3830) gossip-to-seeds is not obviously independent of failure detection algorithm
[ https://issues.apache.org/jira/browse/CASSANDRA-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

C. Scott Andreas updated CASSANDRA-3830:
    Component/s:     (was: Streaming and Messaging)
                     Distributed Metadata
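For anyone reproducing the measurement described in the issue, a throwaway helper along these lines could aggregate the logged mean intervals and compare runs with and without seeds running. The regex is keyed to the example log line quoted above (assumed to be on one line in the actual log); the default file path and class name are assumptions.

{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Scans a log for the per-endpoint "phi / last interval / mean" lines emitted
 * from interpret() and prints the average of the reported mean intervals.
 * Sketch for illustration only.
 */
public class MeanIntervalSummary
{
    // Matches e.g. "ep /10.0.0.1 is at phi 0.0019, last interval 7.0, mean is 1557.27778"
    private static final Pattern LINE = Pattern.compile(
        "ep (\\S+) is at phi (\\S+), last interval (\\S+), mean is (\\S+)");

    public static void main(String[] args) throws IOException
    {
        String path = args.length > 0 ? args[0] : "system.log"; // assumed default
        double sum = 0;
        long count = 0;
        try (BufferedReader in = Files.newBufferedReader(Paths.get(path)))
        {
            String line;
            while ((line = in.readLine()) != null)
            {
                Matcher m = LINE.matcher(line);
                if (m.find())
                {
                    sum += Double.parseDouble(m.group(4));
                    count++;
                }
            }
        }
        System.out.printf("%d samples, average mean interval %.1f ms%n",
                          count, count == 0 ? 0.0 : sum / count);
    }
}
{code}

Running this against logs captured with all seeds up and again with all seeds stopped would show whether the ~1500 ms mean interval observed at 180 nodes holds at other cluster sizes.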