Thank you for the details - this makes more sense now. I had not understood how the documentation ties those statistics together, so I missed that interpretation.


The cache miss rate is high because this example server receives only the cache misses forwarded to it by other recursive resolvers; it’s just a testbed, so the absolute numbers and the hit/miss balance were not particularly important - we only needed some ‘real’ data, even though it is skewed.

We are not looking for the median; we are attempting to summarize with more detail than a single statistic provides, yet without looking at 40 different buckets. Our intention is to create a set of summary bands that is less granular than what unbound provides natively. Having 40 lines in a graph is distracting and difficult to read, so we wanted to condense those ranges into five bands we arbitrarily labelled “Fast”, “OK”, “Marginal”, “Slow”, and “Very slow”. How we label them and how we divide the 40 existing buckets among those summary bands has been a matter of subjective iteration as we gained experience with the monitoring.
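
To make that concrete, here is a rough sketch in PromQL of what one of those summary bands looks like; the boundary (roughly 0.26-1 second for something like “Marginal”) is purely illustrative, not the actual cut-offs we settled on:

# One hypothetical summary band: recursive answers in ~0.26-1 s,
# built by summing the per-second rates of the matching buckets.
sum(
    irate(unbound_histogram_000000_262144_to_000000_524288{env="prod",loc=~"$POP"}[1m])
  + irate(unbound_histogram_000000_524288_to_000001_000000{env="prod",loc=~"$POP"}[1m])
)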

I’m pleased to see that using num.cachemiss as the new denominator makes the graphs look as I would expect, and overlaying the resolution speed on the data we get from other recursive resolver software running in the same POPs now matches very closely. Now that we have good instrumentation that is consistent across resolvers, we can move forward with trying changes to increase the speed (i.e., reduce the latency) of query responses.
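
For anyone building the same kind of graph, here is a sketch of the corrected band percentage; note that the denominator metric name unbound_total_num_cachemiss is my guess based on how the exporter names total.num.queries, so check what your exporter actually exposes:

# Illustrative band percentage: share of cache-miss (recursive)
# queries answered in 1-8 seconds, with num.cachemiss as denominator.
100 *
sum(
    irate(unbound_histogram_000001_000000_to_000002_000000{env="prod",loc=~"$POP"}[1m])
  + irate(unbound_histogram_000002_000000_to_000004_000000{env="prod",loc=~"$POP"}[1m])
  + irate(unbound_histogram_000004_000000_to_000008_000000{env="prod",loc=~"$POP"}[1m])
)
/
sum(irate(unbound_total_num_cachemiss{env="prod",loc=~"$POP"}[1m]))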

JT



On 20 Jan 2017, at 1:01, W.C.A. Wijngaards via Unbound-users wrote:

Hi John,

The numbers add up to num.cachemiss.  For your statistics you need
to add the num.cachehits queries to the 0 bucket, or create another
bucket with a zero response time for those cache-hit queries.  Then
your percentile bands can be calculated.
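
In PromQL terms that is roughly the sketch below; it assumes the
exporter exposes num.cachehits as unbound_total_num_cachehits,
following the same naming as unbound_total_num_queries, so adjust to
whatever your exporter actually provides:

# Fold the cache hits into the fastest (sub-microsecond) bucket so
# that total.num.queries can serve as the denominator for the bands.
sum(
    irate(unbound_total_num_cachehits{env="prod",loc=~"$POP"}[1m])
  + irate(unbound_histogram_000000_000000_to_000000_000001{env="prod",loc=~"$POP"}[1m])
)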

It is also documented here:
http://unbound.net/documentation/unbound-control.html
(it says that the cachemiss queries are the ones that use recursive
processing, and that the histogram only tracks queries that needed
recursive processing).

You have a low cache hit ratio, it looks like.  Perhaps increase cache
memory size in unbound.conf.  Or it could be an artifact of artificial
query inputs.
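
One way to check that ratio directly from the counters already in
Prometheus (again assuming an unbound_total_num_cachehits metric,
named in the same style as unbound_total_num_queries):

# Fraction of all queries answered from the cache.
sum(irate(unbound_total_num_cachehits{env="prod",loc=~"$POP"}[1m]))
/
sum(irate(unbound_total_num_queries{env="prod",loc=~"$POP"}[1m]))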

If the percentile band you are looking for is the median, this is
already calculated for you in recursion.time.median (that number is
only for the queries that needed recursion, it does not include the
cachehits that were answered immediately).
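
If the exporter carries that statistic through, plotting it is a
one-liner; the metric name unbound_recursion_time_median below is
only a guess based on the naming of the other exported metrics:

# Median recursion time in seconds per POP (cache hits excluded).
avg by (loc) (unbound_recursion_time_median{env="prod",loc=~"$POP"})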

Best regards, Wouter

On 20/01/17 00:13, John Todd via Unbound-users wrote:

We import all of our Unbound 1.5.10 statistics (“stats_noreset”)
into Prometheus (a time-series database) every 15 seconds.  We then
generate many different reports from the collected data. The current
test system has very low QPS - 100-300.  I admit that the problem I
describe below might be an issue with Prometheus or the low QPS, but
I’d like to ensure my assumptions about the counters are correct
before I start digging into a system which has worked quite well for
all my other data thus far.

I noticed something a bit off with one of the graphs, and I’m
trying to understand what I’m doing wrong to cause the problem. I
added up the incremental increases of all the histogram buckets for
a particular interval (i.e. subtracting the last sample from the
current sample) and then compared that against what I thought should
be the same (or extremely close) number contained in the
“total.num.queries” counter, using the same method over the same
time intervals.  The two numbers are very, very different - an
unexpected result.  (1)

A sample query for the histogram is below. Hopefully it is
mostly self-evident even to a non-Prometheus user - “irate” means
“give the per-second rate over time”, and “[1m]” looks back at most
one minute from a sample to get the last value prior. Timespans are
not shown here for brevity, but the queries are run with the same
time bounds (typically 1-minute buckets). I limit the queries to
“prod”uction systems, and only to particular POPs via a templated
variable.

sum( irate(unbound_histogram_000000_000000_to_000000_000001
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_000001_to_000000_000002
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_000002_to_000000_000004
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_000004_to_000000_000008
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_000008_to_000000_000016
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_000016_to_000000_000032
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_000032_to_000000_000064
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_000064_to_000000_000128
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_000128_to_000000_000256
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_000256_to_000000_000512
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_000512_to_000000_001024
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_001024_to_000000_002048
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_002048_to_000000_004096
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_004096_to_000000_008192
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_008192_to_000000_016384
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_016384_to_000000_032768
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_032768_to_000000_065536
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_065536_to_000000_131072
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_131072_to_000000_262144
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_262144_to_000000_524288
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000000_524288_to_000001_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000001_000000_to_000002_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000002_000000_to_000004_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000004_000000_to_000008_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000008_000000_to_000016_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000016_000000_to_000032_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000032_000000_to_000064_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000064_000000_to_000128_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000128_000000_to_000256_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000256_000000_to_000512_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_000512_000000_to_001024_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_001024_000000_to_002048_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_002048_000000_to_004096_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_004096_000000_to_008192_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_008192_000000_to_016384_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_016384_000000_to_032768_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_032768_000000_to_065536_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_065536_000000_to_131072_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_131072_000000_to_262144_000000
{env="prod",loc=~"$POP"}[1m]) + ignoring(label)
irate(unbound_histogram_262144_000000_to_524288_000000
{env="prod",loc=~"$POP"}[1m]) )

The above query, for an example 1-minute period, yields a value of
159 total queries per second. I then compare that result to the
following query, using the exact same time bounds:

sum(irate(unbound_total_num_queries {env="prod",loc=~"$POP"}[1m]))

The above query results in 241 queries per second for that same
1-minute interval. I would expect the values to be within 5% of each
other, but they are quite dissimilar. The mismatched ratio I see
stays fairly level across QPS variation - around .7:1, though it
does spike and dip between .5:1 and .8:1 if I graph the ratio over
time.
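
For reference, the ratio graphed above is simply the first
(histogram-sum) expression divided by the second; one way to keep
that readable would be to precompute the long 40-term sum as a
Prometheus recording rule - job:unbound_histogram_rate:sum below is
a made-up rule name, not something Unbound or the exporter provides:

# job:unbound_histogram_rate:sum = the 40-bucket sum() shown above,
# stored as a recording rule so the ratio expression stays short.
sum(job:unbound_histogram_rate:sum{loc=~"$POP"})
/
sum(irate(unbound_total_num_queries{env="prod",loc=~"$POP"}[1m]))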

Am I misunderstanding how the histogram.* and total.num.queries
are counting queries and replies?  My initial assumption was that
they would be almost identical for the same timeframe.

Do the histogram counters not include NXDOMAIN, NODATA, and SERVFAIL
results? It seems inherently wrong not to measure the reply time for
those rcodes (those intervals are still visible to the end user!),
but it was worth checking whether excluding them led to more
sensible matching. Even after subtracting those rcodes from the
larger number (unbound_total_num_queries), the aggregated histogram
values are still 10-20% lower than that value, though they are
closer; I can’t tell whether this is simply by chance and possibly
the wrong path to a solution.

I could “look at the code” but that means farming it out to an
actual developer, because I’m merely a pointy-haired
boss/graphmonger who wants to build KPI-type statistics and alerts.
I’m hoping there is a document or quick answer someone can point me
towards which will clear up the discrepancy I am finding.

JT



(1) Background story: I’m trying to build a graph of “percentile
bands” of response time, and I needed a denominator for my
calculations. I could have added up all of the histogram results to
get the denominator for my percentage graph, but that made my query
too big, so I just used “total.num.queries” - which led to a gap in
the stacked graph, and down that rabbit hole I went and ended up
here.
