#25196: Cut off recent dates from several CSV files
Reporter: karsten | Owner: karsten
Type: defect | Status: needs_review
Priority: Medium | Milestone:
Component: Metrics/Statistics | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: iwakeh | Sponsor:
Changes (by karsten):
* status: needs_revision => needs_review
I set up a local metrics-web instance and modified it to run once per hour
and not cut off any dates at all. I'm attaching a PDF file
(recent-dates-2018-03-06.pdf) showing how statistics
for given dates (colors) change (y axis) over the UTC day of March 6 (x
axis). If a colored line changes much over the day, we cannot reasonably
include it yet and need to cut off that date. There's a trade-off of
holding back a statistic that is still changing too much vs. delaying a
statistic more than necessary and not being able to act on the data.
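
One way to make "changes much over the day" concrete is to compare the hourly snapshots of a statistic for a single date against its last observed value. The following is only a minimal sketch: the function names and the 1% threshold are hypothetical and are not part of metrics-web.

```python
def relative_spread(snapshots):
    """Return (max - min) of the hourly snapshots, relative to the last snapshot.

    `snapshots` holds the values computed for one date over the course of
    the day, in the order in which they were produced.
    """
    final = snapshots[-1]
    spread = max(snapshots) - min(snapshots)
    if final == 0:
        # Avoid dividing by zero: identical values are stable, anything else is not.
        return 0.0 if spread == 0 else float("inf")
    return spread / abs(final)


def is_stable(snapshots, threshold=0.01):
    """Treat a date as safe to publish if its value moved by at most `threshold`.

    The default of 1% is an arbitrary assumption chosen for illustration.
    """
    return relative_spread(snapshots) <= threshold
```

A date whose snapshots go from 50 to 100 over the day has a relative spread of 0.5 and would be held back, whereas a date that only moves from 99.5 to 100 would pass a 1% threshold.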
Here's what I think we should do for all current statistics files:
- `servers.csv`: We currently cut off 2 days (today = 2018-03-06 and the
day before = 2018-03-05), but it would be sufficient to cut off just 1 day
(today). The reason is that this file is based on consensuses and
referenced server descriptors, all of which are typically available at the
end of a day.
- `ipv6servers.csv`: Same as `servers.csv`, except that we don't cut off
anything yet, though I think we should, following the same rationale as above.
- `advbwdist.csv`: Same as `servers.csv`, except that we already cut off
just 1 day, so there's no need to change anything here.
- `bandwidth.csv`: This file is based on statistics reported in extra-
info descriptors, and those might take more time to come in. We're also
not doing any estimates on the numbers we got so far, but we're simply
adding up what we have. So, if 5% of statistics are still missing, those
missing statistics will still change the end result by 5%. I suggest
waiting 3 days. We currently cut off 4, but I think 3 should be sufficient.
The better (long-term) solution would be to compensate missing data by
extrapolating what we have, but we're not there yet.
- `connbidirect2.csv`: Same as for `bandwidth.csv`, except that we're
providing averages where missing descriptors don't affect the result as
much. Cutting off 2 days will be fine (today and yesterday).
- `clients.csv` and `userstats-combined.csv`: Same as for
`connbidirect2.csv`, except that we're being smarter about estimating
numbers from given reports. Cutting off 2 days will be enough (today and
yesterday).
- `hidserv.csv`: Same as `clients.csv` et al., except we're being quite
smart about extrapolating reported statistics, so that we might even cut
off just 1 day. But let's do 2 days as before to be on the safe side.
- `torperf-1.1.csv`: OnionPerf only provides completed days, so it
depends on when we get those files and whether we get all of them at once.
I'm less certain here, but I think we're doing okay by cutting off 2 days.
- `webstats.csv`: I don't have good data, because webstats.tp.o has been
down for a couple of days now. This might also change after switching to
CollecTor's webstats module. I'd say we don't touch this now and revisit
it after switching to CollecTor.
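
The proposed cut-offs can be summarized as a small lookup table plus a date filter. This is only a sketch for review purposes, not the code in the branch: the names `CUTOFF_DAYS`, `last_included_date` and `filter_rows` are hypothetical, and it assumes each row carries an ISO-formatted `date` field. `webstats.csv` is left out because its cut-off stays untouched for now.

```python
from datetime import date, timedelta

# Number of most recent days to cut off, counting today as the first day.
CUTOFF_DAYS = {
    "servers.csv": 1,
    "ipv6servers.csv": 1,
    "advbwdist.csv": 1,
    "bandwidth.csv": 3,
    "connbidirect2.csv": 2,
    "clients.csv": 2,
    "userstats-combined.csv": 2,
    "hidserv.csv": 2,
    "torperf-1.1.csv": 2,
}


def last_included_date(filename, today):
    """Return the most recent date that may still appear in `filename`."""
    return today - timedelta(days=CUTOFF_DAYS[filename])


def filter_rows(rows, filename, today):
    """Keep only rows whose `date` is not within the cut-off window."""
    newest = last_included_date(filename, today)
    return [row for row in rows if date.fromisoformat(row["date"]) <= newest]
```

With today = 2018-03-06, `servers.csv` would include data up to 2018-03-05 and `bandwidth.csv` up to 2018-03-03.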
Please review [https://gitweb.torproject.org/karsten/metrics-
commit 450d9f1 in my updated task-25196 branch]. If possible, I'd like to
make changes tomorrow (Thursday).
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/25196#comment:9>