We import all of our Unbound 1.5.10 statistics (“stats_noreset”) into Prometheus (a TSDB) every 15 seconds. We then generate many different reports out of the collected data. The current test system has very low QPS - 100-300. I admit that the problem I describe below might be an issue with Prometheus or the low QPS, but I’d like to confirm my assumptions about the counters are correct before I start digging into a system which has worked quite well for all my other data so far.

I noticed something a bit off with one of the graphs, and I’m trying to understand what I’m doing wrong. I added up the incremental increases of all the histogram buckets over a particular interval (i.e. subtracting the previous sample from the current sample for each bucket) and compared that against what I expected to be the same (or extremely close) number derived from the “total.num.queries” counter using the same method over the same time interval. They are very, very different numbers - an unexpected result. (1)

Sample query for the histogram is below. Hopefully it is mostly self-evident even to a non-Prometheus user: “irate” gives the per-second rate computed from the last two samples, and “[1m]” limits how far back it will look for those samples. Time spans are not shown here for brevity, but the queries are run with the same time bounds (typically 1 minute buckets). I limit the queries to “prod”uction systems, and only for particular POPs via a templated variable.
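(For concreteness, a single irate() term works out like this - same metric and labels as the queries that follow, and nothing here is specific to our setup beyond the 15 second scrape interval mentioned above:)

  # irate() takes the last two samples found inside the 1 minute window and
  # returns (newer value - older value) / (seconds between the samples),
  # i.e. a per-second rate; with a 15 second scrape interval the window
  # always holds at least two samples.
  irate(unbound_total_num_queries {env="prod",loc=~"$POP"}[1m])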

  sum(
    irate(unbound_histogram_000000_000000_to_000000_000001 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_000001_to_000000_000002 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_000002_to_000000_000004 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_000004_to_000000_000008 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_000008_to_000000_000016 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_000016_to_000000_000032 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_000032_to_000000_000064 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_000064_to_000000_000128 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_000128_to_000000_000256 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_000256_to_000000_000512 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_000512_to_000000_001024 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_001024_to_000000_002048 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_002048_to_000000_004096 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_004096_to_000000_008192 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_008192_to_000000_016384 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_016384_to_000000_032768 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_032768_to_000000_065536 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_065536_to_000000_131072 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_131072_to_000000_262144 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_262144_to_000000_524288 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_524288_to_000001_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000001_000000_to_000002_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000002_000000_to_000004_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000004_000000_to_000008_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000008_000000_to_000016_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000016_000000_to_000032_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000032_000000_to_000064_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000064_000000_to_000128_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000128_000000_to_000256_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000256_000000_to_000512_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000512_000000_to_001024_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_001024_000000_to_002048_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_002048_000000_to_004096_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_004096_000000_to_008192_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_008192_000000_to_016384_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_016384_000000_to_032768_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_032768_000000_to_065536_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_065536_000000_to_131072_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_131072_000000_to_262144_000000 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_262144_000000_to_524288_000000 {env="prod",loc=~"$POP"}[1m])
  )
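(As an aside, I believe the same sum can be written far more compactly by regex-matching on the metric name itself. This is just a sketch: it assumes nothing else in our Prometheus matches the unbound_histogram_ prefix and that the Prometheus version in use allows regex matchers on __name__:)

  sum(
    irate({__name__=~"unbound_histogram_.*", env="prod", loc=~"$POP"}[1m])
  )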

The above query, for an example 1 minute period, gives a value of 159 total queries per second. I then compare that result to the following query, using the exact same time bounds:

 sum(irate(unbound_total_num_queries {env="prod",loc=~"$POP"}[1m]))

The above query gives 241 queries per second for that same 1 minute interval. I would expect the two values to be within 5% of each other, but they are quite dissimilar. The ratio between them stays fairly level across QPS variation - around 0.7:1 - though it spikes and dips between 0.5:1 and 0.8:1 if I graph it over time.
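(The ratio graph is just the first expression divided by the second; written with the compact selector sketched above it looks roughly like this, with the same caveat that the regex selector assumes only the histogram metrics match that prefix:)

  sum(irate({__name__=~"unbound_histogram_.*", env="prod", loc=~"$POP"}[1m]))
    /
  sum(irate(unbound_total_num_queries {env="prod",loc=~"$POP"}[1m]))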

Am I misunderstanding how the histogram.* and total.num.queries counters count queries and replies? My initial assumption was that they would be almost identical over the same timeframe.

Do the histogram counters not include NXDOMAIN, nodata, and SERVFAIL results? It seems inherently wrong not to measure the reply time for those rcodes (those delays are still visible to the end user!), but it was worth checking whether that led to more sensible matching. Even after subtracting those rcodes out of the larger number (unbound_total_num_queries), the aggregated histogram values are still 10-20% lower, though they are closer; I can’t tell whether that is simply chance, and it may be the wrong path to a solution.
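(The “subtract those rcodes out” experiment looked roughly like the query below. The rcode metric names are only my guess at how the num.answer.rcode.* stats get mapped on import - substitute whatever your exporter actually emits:)

  sum(irate(unbound_total_num_queries {env="prod",loc=~"$POP"}[1m]))
    - sum(irate(unbound_num_answer_rcode_NXDOMAIN {env="prod",loc=~"$POP"}[1m]))
    - sum(irate(unbound_num_answer_rcode_SERVFAIL {env="prod",loc=~"$POP"}[1m]))
    - sum(irate(unbound_num_answer_rcode_nodata {env="prod",loc=~"$POP"}[1m]))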

I could “look at the code” but that means farming it out to an actual developer, because I’m merely a pointy-haired boss/graphmonger who wants to build KPI-type statistics and alerts. I’m hoping there is a document or quick answer someone can point me towards which will clear up the discrepancy I am finding.

JT



(1) Background story: I’m trying to build a graph of “percentile bands” of response time, and I needed a denominator for my calculations. I could have added up all of the histogram buckets to get the denominator for the percentage graph, but that made my query too big, so I just used “total.num.queries” - which led to a gap in the stacked graph, meaning down the rabbit hole I went and ended up here.
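(For anyone curious, one “band” in that graph looks roughly like this - here the share of answers that took between ~0.26 s and ~1 s, using total.num.queries as the denominator that produced the gap. This is just a sketch, abbreviated to two buckets:)

  100 *
  sum(
    irate(unbound_histogram_000000_262144_to_000000_524288 {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
    irate(unbound_histogram_000000_524288_to_000001_000000 {env="prod",loc=~"$POP"}[1m])
  )
    /
  sum(irate(unbound_total_num_queries {env="prod",loc=~"$POP"}[1m]))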

