[jira] [Updated] (CASSANDRA-6486) Latency Measurement
[ https://issues.apache.org/jira/browse/CASSANDRA-6486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

C. Scott Andreas updated CASSANDRA-6486:
----------------------------------------
    Component/s: Metrics

> Latency Measurement
> -------------------
>
>          Key: CASSANDRA-6486
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-6486
>      Project: Cassandra
>   Issue Type: Improvement
>   Components: Metrics
>     Reporter: Benedict
>     Priority: Minor
>
> Latency measurement in Cassandra is currently suboptimal. Exactly what the
> latency measurements tell you isn't intuitively clear due to their
> exponential decay, but they amount to some view of the latency per
> (unweighted) operation over approximately the past 10-minute period, with
> greater weight given to more recent operations. This has some obvious
> flaws, the most notable being that, due to probabilistic sampling, large
> outlier events (e.g. GC pauses) can easily be lost over a multi-minute
> time horizon, and even when caught are unlikely to appear even in the
> 99.9th percentile, because they account for only a tiny fraction of events
> numerically.
> I'm generally thinking about how we might improve on this, and want to
> dump my ideas here for discussion. I think the following things should be
> targeted:
> 1) Ability to see uniform latency measurements for different time horizons
> stretching back from the present, e.g. the last 1s, 1m, 1hr and 1day
> 2) Ability to bound the error margin of statistics for all of these intervals
> 3) Protection against losing outlier measurements
> 4) Possibly, the ability to weight statistics, so that longer latencies
> are not underplayed even if they are counted
> 5) Preferably non-blocking, memory-efficient, and relatively garbage-free
> (3) and (4) are the trickiest, as a theoretically sound and general
> approach isn't immediately obvious.
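[Editorial note: the outlier-loss flaw described above can be made concrete with a small back-of-the-envelope sketch. This is illustrative code, not Cassandra or Metrics-library code; the formula is the retention probability for a plain uniform reservoir, which the exponentially decaying reservoir roughly resembles over its effective window. The reservoir size 1028 is the Coda Hale Metrics default.]

```java
// Illustration (hypothetical, not project code): probability that a single
// rare outlier (e.g. one GC pause) is still present in a size-bounded
// uniform reservoir after N events have been observed.
public class ReservoirRetention {
    public static double retentionProbability(int reservoirSize, long totalEvents) {
        // With uniform reservoir sampling, once totalEvents exceeds the
        // reservoir size, each individual event survives with probability
        // reservoirSize / totalEvents.
        if (totalEvents <= reservoirSize)
            return 1.0;
        return (double) reservoirSize / totalEvents;
    }

    public static void main(String[] args) {
        int size = 1028; // default reservoir size in Coda Hale Metrics
        // 10 minutes at 10k ops/sec = 6,000,000 events in the window
        System.out.printf("p(retain outlier) = %.5f%n",
                          retentionProbability(size, 6_000_000L));
    }
}
```

At realistic op-rates the chance of a single outlier surviving the window is a fraction of a percent, which is the motivation for goal (3).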
> There are a number of possibilities that spring to mind:
> 1) Ensure that we have enough sample points that we are probabilistically
> guaranteed not to lose them; but over large time horizons this is
> problematic due to memory constraints, and it doesn't address (4).
> 2) Count large events multiple times (or sub-slices of the events), based
> on e.g. the average op-rate. I am not a fan of this idea because it makes
> possibly bad assumptions about behaviour and doesn't seem very
> theoretically sound.
> 3) Weight the probability of retaining an event by its length. The problem
> with this approach is that it ties you into (4) without offering the
> current view of statistics (i.e. unweighted operations), and it also
> doesn't lend itself to efficient implementation.
> I'm currently leaning towards a fourth approach, which attempts to
> hybridise uniform sampling and histogram behaviour by separating the
> sample space into ranges, each some multiple of the last (say 2x the
> size). Each range has a uniform sample of the events that occurred in that
> range, plus a count of total events. Ideally the size of the sample will
> vary with the number of events occurring in each range, but with a
> calculated lower bound to ensure we do not lose events.
> This approach lends itself to all 5 goals above:
> 1) by maintaining the same structure for each time horizon, and uniformly
> sampling from all of the directly lower-order time horizons to maintain it;
> 2) by imposing minimum sample sizes for each range;
> 3) ditto (2);
> 4) by producing time/frequency-weighted statistics using the samples and
> counts from each range;
> 5) with thread-local array-based timers that are synchronised with the
> global timer once every minimum period, by the owning thread.
> This also extends reasonably nicely the timers I have already written for
> CASSANDRA-6199, so some of the work is already done.
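[Editorial note: the "fourth approach" above can be sketched roughly as follows. This is a hypothetical single-threaded illustration, not the CASSANDRA-6199 timers: latencies are bucketed into ranges whose bounds double, and each range keeps a total count plus its own uniform reservoir sample. All names and sizes here are invented for the sketch.]

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of a range-bucketed reservoir histogram. Because each power-of-two
// range samples independently, a rare multi-second outlier lands in its own
// sparsely populated range and is retained rather than being crowded out by
// millions of fast operations (goal 3).
public class RangedReservoirHistogram {
    private static final int SAMPLE_SIZE = 64; // per-range minimum sample (goal 2)
    private static final int RANGES = 40;      // covers up to ~2^40 ns (~18 min)

    private final long[] counts = new long[RANGES];
    private final long[][] samples = new long[RANGES][SAMPLE_SIZE];

    public void update(long latencyNanos) {
        // Range index = position of the highest set bit, so range bounds double.
        int range = 64 - Long.numberOfLeadingZeros(Math.max(1, latencyNanos));
        range = Math.min(range, RANGES - 1);
        long n = counts[range]++;          // n = events seen in this range so far
        if (n < SAMPLE_SIZE) {
            samples[range][(int) n] = latencyNanos;   // fill phase
        } else {
            // Uniform reservoir replacement (Algorithm R): keep each event
            // with probability SAMPLE_SIZE / (n + 1).
            long j = ThreadLocalRandom.current().nextLong(n + 1);
            if (j < SAMPLE_SIZE)
                samples[range][(int) j] = latencyNanos;
        }
    }

    // Goal (4): frequency-weighted statistics, scaling each range's sample
    // mean by that range's true event count.
    public double weightedMean() {
        double sum = 0;
        long total = 0;
        for (int r = 0; r < RANGES; r++) {
            long c = counts[r];
            if (c == 0) continue;
            int k = (int) Math.min(c, SAMPLE_SIZE);
            double rangeSum = 0;
            for (int i = 0; i < k; i++) rangeSum += samples[r][i];
            sum += (rangeSum / k) * c;
            total += c;
        }
        return total == 0 ? 0 : sum / total;
    }
}
```

Merging such structures across time horizons (goal 1) would then mean uniformly subsampling each range's reservoir from the shorter horizons into the longer ones, keeping the per-range counts additive.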
> Thoughts / discussion would be welcome, especially if you think I've
> missed another obvious approach.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-6486) Latency Measurement
[ https://issues.apache.org/jira/browse/CASSANDRA-6486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-6486:
--------------------------------------
    Priority: Minor  (was: Major)
    Assignee: (was: Benedict)

Let's put this on the back burner. Coda Hale Metrics represents the industry
standard and is used by hundreds of projects. It's not perfect, but it's
Good Enough.

> Latency Measurement
> -------------------
>
>          Key: CASSANDRA-6486
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-6486
>      Project: Cassandra
>   Issue Type: Improvement
>     Reporter: Benedict
>     Priority: Minor
>
> [Full issue description quoted above.]

--
This message was sent by Atlassian JIRA
(v6.1.4#6159)