-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi Tom,
On 27/01/16 20:02, Tom Ritter wrote:
> [feel free to reply adding tor-project or whomever]
Sure, let me copy tor-dev@.
> Remember a while ago I lamented that I wished there was some
> monitoring service that could tell me when my metrics service;
> relay; or bwauth went down? I finally built one. I'm still kicking
> the tires on it, and I intend to improve it more over the next week
> or two - but I think it's here to stay.
>
> https://github.com/tomrittervg/checker
>
> Right now I have it running several monitoring jobs, with a second
> instance running with no jobs but serving as a peer. I have it
> checking a number of TCP ports (to see if my relays are still up),
> and I have custom jobs for metrics and the bwauth. They're in the
> samplejobs folder. They're very simplistic and bare-bones. My
> hope is that they can be fleshed out over time to account for more
> imaginative ways things could fail.
>
> I'm already discovering that my bwauth file sometimes gets more
> than two hours behind
>
> But I think the most useful thing here is that now I have a
> minimal framework for writing simplistic python jobs and having it
> monitor things for me. Maybe it would be useful for more people?
Yes! Well, I can't speak for other people, but having a monitoring
system for Metrics-related services would be very useful for me. In
fact, it's been on my list for a long time now. This seems like a
great opportunity to give it more thought.
I'm not sure if I mentioned this before, but we're using Nagios to
monitor Onionoo. The Nagios script we're using makes a tiny request
to Onionoo to see whether the contained timestamps are still recent.
That's an indirect way to notice problems with the data back-end, and
it has helped with detecting numerous problems in the past. More
details here:
https://gitweb.torproject.org/admin/tor-nagios.git/tree/tor-nagios-checks/checks/tor-check-onionoo
So, I don't know Nagios well enough to say how it compares to your
system.
But I could imagine that we write a similar check for CollecTor that
runs on your system and notifies us of problems. And maybe it's
possible to write that script in a way that it can also be deployed on
Nagios.
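To make "deployable on Nagios" concrete, here is a small sketch of the plugin convention such a script would have to follow: print a one-line, human-readable status and exit with a well-defined code. The `report` helper and the "COLLECTOR" service prefix are my own invention, not part of any existing script.

```python
# Sketch of the Nagios plugin convention: a plugin prints a one-line
# status and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or
# 3 (UNKNOWN). All names here are illustrative placeholders.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def report(status, message):
    """Print a Nagios-style status line and return the exit code."""
    label = {OK: "OK", WARNING: "WARNING",
             CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}[status]
    print("COLLECTOR %s - %s" % (label, message))
    return status
```

A Nagios deployment would end with `sys.exit(report(status, message))`, while a checker job could call the same function and inspect the return value instead of exiting.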
Here's what I imagine the script would do: every 10 minutes, say, it
would fetch CollecTor's index.json, which is specified here:
https://collector.torproject.org/#index-json
The script would then run a series of checks and report one of the
statuses OK, WARNING, CRITICAL, or UNKNOWN:
- host is unreachable or index.json cannot be found for at least 30
minutes (CRITICAL)
- index.json has contained invalid JSON for all checks in the last 30
minutes (CRITICAL)
- the contained "index_created" timestamp is older than 30 minutes
(WARNING) or older than 3 hours (CRITICAL)
- when concatenating the "path" fields of nested objects, the most
recent "last_modified" timestamp for a path prefix is more than X
behind "index_created" (CRITICAL):
- /archive/bridge-descriptors/: 5 days
- /archive/exit-lists/: 5 days
- /archive/relay-descriptors/certs.tar.xz: 5 days
- /archive/relay-descriptors/consensuses/: 5 days
- /archive/relay-descriptors/extra-infos/: 5 days
- /archive/relay-descriptors/microdescs/: 5 days
- /archive/relay-descriptors/server-descriptors/: 5 days
- /archive/relay-descriptors/votes/: 5 days
- /archive/torperf/: 5 days
- /recent/torperf/: 12 hours
- /recent/bridge-descriptors/extra-infos/: 3 hours
- /recent/bridge-descriptors/server-descriptors/: 3 hours
- /recent/bridge-descriptors/statuses/: 3 hours
- /recent/exit-lists/: 3 hours
- /recent/relay-descriptors/consensuses/: 1.5 hours
- /recent/relay-descriptors/extra-infos/: 1.5 hours
- /recent/relay-descriptors/microdescs/consensus-microdesc/: 1.5 hours
- /recent/relay-descriptors/microdescs/micro/: 1.5 hours
- /recent/relay-descriptors/server-descriptors/: 1.5 hours
- /recent/relay-descriptors/votes/: 1.5 hours
In the detailed checks above, the script would not warn if index.json
does not contain any files with a given prefix (so that you can run
the script on your CollecTor instance that doesn't collect all the
things). And ideally, the script would include all warnings in its
output, not just the first.
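To make the above more tangible, here is a minimal Python sketch of that check. It is a sketch only, under several assumptions: I'm assuming index.json nests "directories" and "files" objects whose "path" fields concatenate as described, and that timestamps look like "2016-01-27 20:02"; the fetch-failure handling and the 30-minute persistence across runs are left out, and all names are placeholders rather than an existing script.

```python
import json
import time
import urllib.request

# Nagios-style status codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Maximum allowed age, in hours, of the most recent "last_modified"
# timestamp behind "index_created", keyed by path prefix (the
# thresholds from the list above).
THRESHOLDS = {
    "/archive/bridge-descriptors/": 5 * 24,
    "/archive/exit-lists/": 5 * 24,
    "/archive/relay-descriptors/certs.tar.xz": 5 * 24,
    "/archive/relay-descriptors/consensuses/": 5 * 24,
    "/archive/relay-descriptors/extra-infos/": 5 * 24,
    "/archive/relay-descriptors/microdescs/": 5 * 24,
    "/archive/relay-descriptors/server-descriptors/": 5 * 24,
    "/archive/relay-descriptors/votes/": 5 * 24,
    "/archive/torperf/": 5 * 24,
    "/recent/torperf/": 12,
    "/recent/bridge-descriptors/extra-infos/": 3,
    "/recent/bridge-descriptors/server-descriptors/": 3,
    "/recent/bridge-descriptors/statuses/": 3,
    "/recent/exit-lists/": 3,
    "/recent/relay-descriptors/consensuses/": 1.5,
    "/recent/relay-descriptors/extra-infos/": 1.5,
    "/recent/relay-descriptors/microdescs/consensus-microdesc/": 1.5,
    "/recent/relay-descriptors/microdescs/micro/": 1.5,
    "/recent/relay-descriptors/server-descriptors/": 1.5,
    "/recent/relay-descriptors/votes/": 1.5,
}

def fetch_index(url):
    """Download and parse index.json (not exercised below)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))

def parse_time(timestamp):
    """Parse an assumed 'YYYY-MM-DD HH:MM' timestamp to epoch seconds."""
    return time.mktime(time.strptime(timestamp, "%Y-%m-%d %H:%M"))

def walk(node, prefix=""):
    """Yield (concatenated path, last_modified) for every file entry."""
    path = prefix + node.get("path", "")
    if "last_modified" in node:
        yield path, parse_time(node["last_modified"])
    for child in node.get("directories", []) + node.get("files", []):
        for item in walk(child, path + "/"):
            yield item

def check_index(index):
    """Return (status, problem list) for a parsed index.json object."""
    status, problems = OK, []
    created = parse_time(index["index_created"])
    age = time.time() - created
    if age > 3 * 3600:
        status = CRITICAL
        problems.append("index_created is older than 3 hours")
    elif age > 30 * 60:
        status = WARNING
        problems.append("index_created is older than 30 minutes")
    # Most recent last_modified per known prefix; prefixes without any
    # files are simply skipped, so a partial CollecTor instance that
    # doesn't collect all the things won't warn.
    newest = {}
    for path, last_modified in walk(index):
        for prefix in THRESHOLDS:
            if path.startswith(prefix):
                newest[prefix] = max(newest.get(prefix, 0), last_modified)
    for prefix, last_modified in sorted(newest.items()):
        if created - last_modified > THRESHOLDS[prefix] * 3600:
            status = CRITICAL
            problems.append("%s is %.1f hours behind index_created" % (
                prefix, (created - last_modified) / 3600.0))
    return status, problems
```

Note that `check_index` collects every problem it finds rather than stopping at the first, matching the "include all warnings" point above.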
That's one check, and it would probably catch most problems where
things get stale. I would like to add more checks, but those would
need more access to the CollecTor host than its publicly available
index.json. (That could mean that we export more status in a
debug.json or in another public place.) Some examples follow; I'm
only listing them here for later:
- Is the host soon going to run out of disk space or inodes? (This
can easily be done with Nagios, I think.)
- Did the relay-descriptor part of CollecTor fail to parse a
descriptor using metrics-lib and hence not store it to disk? (I'm
receiving hourly cron mails in this case, but I'd