My 2 cents:

* We try to structure our stats in an Errors -> Saturation -> Utilization
  way, which is consistent with the USE methodology[1] or Google's Four
  Golden Signals[2]. In the case of Unbound that is:
  * Errors: SERVFAILs, availability measured by blackbox tests
  * Saturation: queue sizes, queue drops, blackbox test latency
  * Utilization: number of queries with breakdowns by query type and
    answer code, cache hit rate
* We also graph some basic internal stats, like memory usage, CPU usage,
  restart rate, etc.
* Breakdowns and drill-downs are very useful for reducing MTTR.
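As a rough illustration of the grouping above, here is a minimal Python sketch that buckets a few counters from `unbound-control stats` output into the three groups. The counter names are the ones Unbound prints in its `key=value` stats output; the sample numbers, the function names, and the choice of which counter goes into which group are my own assumptions, not something prescribed by Unbound.

```python
# Sketch: bucket a few `unbound-control stats` counters into
# Errors / Saturation / Utilization groups.
# SAMPLE_STATS is invented illustrative data, not a real measurement.

SAMPLE_STATS = """\
total.num.queries=1000
total.num.cachehits=850
total.num.cachemiss=150
total.requestlist.avg=2.5
total.requestlist.exceeded=3
num.answer.rcode.SERVFAIL=12
num.query.type.A=700
num.query.type.AAAA=250
"""


def parse_stats(text):
    """Parse `key=value` lines into a dict of floats, skipping bad lines."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        try:
            stats[key.strip()] = float(value)
        except ValueError:
            continue  # non-numeric value, ignore it
    return stats


def use_summary(stats):
    """Group selected counters (hypothetical grouping, adjust to taste)."""
    queries = stats.get("total.num.queries", 0.0)
    hits = stats.get("total.num.cachehits", 0.0)

    # Per-query-type breakdown, keyed by the type name (A, AAAA, ...).
    by_qtype = {}
    prefix = "num.query.type."
    for key, value in stats.items():
        if key.startswith(prefix):
            by_qtype[key[len(prefix):]] = value

    return {
        "errors": {
            "servfail": stats.get("num.answer.rcode.SERVFAIL", 0.0),
        },
        "saturation": {
            "requestlist_avg": stats.get("total.requestlist.avg", 0.0),
            "requestlist_exceeded": stats.get("total.requestlist.exceeded", 0.0),
        },
        "utilization": {
            "queries": queries,
            "cache_hit_ratio": hits / queries if queries else 0.0,
            "by_qtype": by_qtype,
        },
    }


if __name__ == "__main__":
    print(use_summary(parse_stats(SAMPLE_STATS)))
```

In practice the text would come from running `unbound-control stats` (or `stats_noreset`) rather than from a hard-coded string, and the blackbox availability and latency checks would have to come from a separate probe.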
[1] http://www.brendangregg.com/usemethod.html
[2] https://landing.google.com/sre/book.html

> On Nov 23, 2016, at 11:55 AM, John Todd via Unbound-users
> <[email protected]> wrote:
>
> On 23 Nov 2016, at 0:49, Jaap Akkerhuis via Unbound-users wrote:
>
>> Alexander via Unbound-users writes:
>>
>>> Hi to every one, can you help to monitor unbound dns with cacti?
>>> I'm tried to set up unbound and cacti, but the graphs are empty. I'm
>>> installed Dmitriy Demidov package.
>>
>> Once I set-up cacti to do this, but I'm not really happy with that.
>>
>>> Can you tell me others tools for monitoring dns queues? Some tips
>>> for monitoring DNS?
>>
>> I really prefer using munin. See the user contributed directory.
>
> [snip]
>
> I know it’s not a direct answer to the top part of the original question, but
> perhaps it does answer the second part about monitoring queues. We’ve
> recently created an exporter for the Unbound resolver for importation into
> Prometheus, which seems to work quite well. We then use Grafana to extract
> and visualize information from Prometheus. Building charts once you get the
> hang of the query language is quite easy, and allows on-the-fly regeneration
> of data visualization and complex comparisons/aggregations if you have
> multiple servers, locations, or services. Here is an example chart that took
> about 30 seconds to build. There are also monitoring components for
> Prometheus and/or Grafana which can generate alerts based on metrics in a
> more complex way than just visualization, but that perhaps is outside
> the scope of this thread. There are a number of tools for importing other
> system-level data into Prometheus, and it may be a good idea to investigate
> those other components to complement or replace your existing monitoring
> systems if they do what you need.
> It is not trivial to learn - the query
> language is mostly unlike SQL, and there are quite a few ways to fail
> silently with what seem to be legitimate queries, but if you know the ground
> truth of one system you can start iteratively trying to draw graphs until you
> figure out the right way to do it.
>
> If there is interest, we can try to work on getting the exporter we wrote into
> a condition where it could be provided in the contrib directory. It uses the
> “push gateway” method, which is not ideal but does work well enough. (Note:
> “Prometheus Unbound” is also a lyrical drama by Percy Bysshe Shelley, which
> makes keyword searching for prior work on this a bit difficult, so apologies
> if someone has already done this project. :-)
>
>
> Prometheus overview:
>
> To give an example of how a graph is built, this is the simplest query that I
> performed to get the component of the chart that generates the “A” QTYPE
> component line. I just cut/pasted this into a number of other queries in the
> same graph to create the other lines, replacing “A” with “AAAA”, “MX”, etc.
> This aggregates all of the Unbound servers I am running (I have many) with
> the “sum” operator, then uses the “irate” function, which gives the
> per-second rate of change based on the last two samples within a 1-minute
> window.
>
>   sum(irate(unbound_num_query_type_A[1m]))
>
> I then specified that this is a stacked chart, percentage-measured, with 60%
> as the lower bound. I could command-click any of the labels shown and they
> would disappear from the graph and it would be re-drawn without that
> statistic instantly. Alternately, I could click on just one of the labels and
> only that graph line would be shown, re-drawing instantly.
>
> A more complex query, limiting to systems that are tagged with “prod” (vs.
> “dev”) and limiting to specific POPs is shown below.
> The “env” and “loc” tags are made up by us, and the contents of those tags
> are set on the remote server before the metrics are collected.
> This allows
> arbitrary tagging of each metric so that it is possible to filter (think of
> it as a modified “SELECT WHERE” statement.) The $POP string specification
> (created by us, again another arbitrary tag name) is consumed by Grafana
> using a concept called “templates”, which puts a pull-down list at the top of
> the graph page with a list of all of the POPs we have. I can then select one
> OR MORE POPs and the system will automatically aggregate all the data across
> all those metrics and display it. I could put other filters in here that
> would be parsed at the moment the graph is drawn.
>
>   sum(irate(unbound_num_query_type_A{env="prod",loc=~"$POP"}[1m]))
>
> In summary: Once you start putting your monitoring data into a TSDB or
> TSDB-ish system like Prometheus (or InfluxDB, or OpenTSDB) and creating
> visualizations with Grafana, you will wonder how you possibly survived
> without it. Even just using the most basic features is a huge win over older
> systems, in my opinion, and moving up into the automation methods and
> alerting methods as you get more experience is another win. If you’re looking
> for a short intro to Prometheus, see the following presentation from
> Monitorama 2015 by Jamie Wilkinson.
>
> Video: https://vimeo.com/131581353
> Slides:
> https://docs.google.com/presentation/d/1X1rKozAUuF2MVc1YXElFWq9wkcWv3Axdldl8LOH9Vik/edit#slide=id.ga150a40c0_0_193
>
> If you’re looking for an introduction to Grafana, there are many - Google
> will be a better guide than I.
>
> JT
>
> <Screen Shot 2016-11-23 at 10.28.43 AM.png>
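For readers new to the query language, here is a rough Python sketch of what the `sum(irate(...[1m]))` queries quoted above compute: for each matching series, take the last two samples inside the window and turn them into a per-second rate, then add the rates up across servers. This is a simplification under my own assumptions, not Prometheus code: the sample data and the "sfo"/"ams" location names are invented, only exact-match label filters are modeled (the `=~` regex matcher is left out), and samples are assumed to be sorted by time.

```python
# Sketch of sum(irate(metric{label="value"}[window])) on in-memory samples.
# All series data below is invented for illustration.


def irate(samples, now, window):
    """Per-second rate from the last two samples in [now - window, now].

    `samples` is a time-sorted list of (timestamp_seconds, counter_value).
    Returns None when fewer than two samples fall inside the window.
    """
    recent = [(t, v) for t, v in samples if now - window <= t <= now]
    if len(recent) < 2:
        return None
    (t0, v0), (t1, v1) = recent[-2], recent[-1]
    if v1 < v0:
        # Counter went backwards, so assume the server restarted and the
        # counter began again from zero.
        v0 = 0.0
    return (v1 - v0) / (t1 - t0)


def sum_irate(series, matchers, now, window=60.0):
    """Sum irate() over every series whose labels equal all `matchers`."""
    total = 0.0
    for labels, samples in series:
        if any(labels.get(name) != value for name, value in matchers.items()):
            continue  # label filter did not match, skip this series
        rate = irate(samples, now, window)
        if rate is not None:
            total += rate
    return total


# One entry per server: (labels, samples of a query counter).
SERIES = [
    ({"env": "prod", "loc": "sfo"}, [(0, 100), (15, 130), (30, 190)]),
    ({"env": "prod", "loc": "ams"}, [(0, 50), (15, 80), (30, 95)]),
    ({"env": "dev", "loc": "sfo"}, [(0, 10), (15, 20), (30, 500)]),
]

if __name__ == "__main__":
    # Queries per second across all "prod" servers at t=30.
    print(sum_irate(SERIES, {"env": "prod"}, now=30))
```

Because only the last two samples are used, the result reacts quickly to spikes but is noisy. The window only has to be wide enough to contain two scrapes.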
