#25196: Cut off recent dates from several CSV files
 Reporter:  karsten             |          Owner:  karsten
     Type:  defect              |         Status:  needs_review
 Priority:  Medium              |      Milestone:
Component:  Metrics/Statistics  |        Version:
 Severity:  Normal              |     Resolution:
 Keywords:                      |  Actual Points:
Parent ID:                      |         Points:
 Reviewer:  iwakeh              |        Sponsor:
Changes (by karsten):

 * status:  needs_revision => needs_review


 I set up a local metrics-web instance and modified it to run once per hour
 and not cut off any dates at all. I'm attaching a PDF file
 (recent-dates-2018-03-06.pdf) showing how statistics for given dates
 (colors) change (y axis) over the UTC day of March 6 (x axis). If a colored
 line changes much over the day, we cannot reasonably include it yet and
 need to cut off that date. There's a trade-off between holding back a
 statistic that is still changing too much and delaying a statistic more
 than necessary and not being able to act on the data.
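
 As a rough illustration of the "changes much over the day" criterion, the
 following sketch compares hourly snapshots of one statistic for one date
 against its last observed value. The function names and the 5% tolerance
 are assumptions made for this example only; they are not taken from
 metrics-web.

```python
def max_relative_change(snapshots):
    """Return the largest relative deviation of any snapshot from the last one.

    `snapshots` is a list of values of one statistic for one date, as observed
    at successive hours of the day (hypothetical input format).
    """
    final = snapshots[-1]
    if final == 0:
        raise ValueError("final value must be non-zero")
    return max(abs(value - final) / abs(final) for value in snapshots)


def is_stable(snapshots, tolerance=0.05):
    """Assumed rule of thumb: a date can be published if all snapshots stay
    within `tolerance` of the last observed value."""
    return max_relative_change(snapshots) <= tolerance
```

 A date whose snapshots fail such a check would be a candidate for cutting
 off; where exactly to draw the line remains a judgment call.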

 Here's what I think we should do for all current statistics files:
  - `servers.csv`: We currently cut off 2 days (today = 2018-03-06 and the
 day before = 2018-03-05), but it would be sufficient to cut off just 1 day
 (today). The reason is that this file is based on consensuses and
 referenced server descriptors, all of which are typically available at the
 end of a day.
  - `ipv6servers.csv`: Same as `servers.csv`, except that we don't cut off
 anything yet, though I think we should, following the same rationale as
 for `servers.csv`.
  - `advbwdist.csv`: Same as `servers.csv`, except that we already cut off
 just 1 day, so there's no need to change anything here.
  - `bandwidth.csv`: This file is based on statistics reported in extra-
 info descriptors, and those might take more time to come in. We're also
 not doing any estimates on the numbers we got so far, but we're simply
 adding up what we have. So, if 5% of statistics are still missing, those
 missing statistics will still change the end result by 5%. I suggest
 waiting 3 days. We currently cut off 4, but I think 3 should be sufficient.
 The better (long-term) solution would be to compensate for missing data by
 extrapolating what we have, but we're not there yet.
  - `connbidirect2.csv`: Same as for `bandwidth.csv`, except that we're
 providing averages where missing descriptors don't affect the result as
 much. Cutting off 2 days will be fine (today and yesterday).
  - `clients.csv` and `userstats-combined.csv`: Same as for
 `connbidirect2.csv`, except that we're being smarter about estimating
 numbers from given reports. Cutting off 2 days will be enough (today and
 yesterday).
  - `hidserv.csv`: Same as `clients.csv` et al., except we're being quite
 smart about extrapolating reported statistics, so that we might even cut
 off just 1 day. But let's do 2 days as before to be on the safe side.
  - `torperf-1.1.csv`: OnionPerf only provides completed days, so it
 depends on when we get those files and whether we get all of them at once.
 I'm less certain here, but I think we're doing okay by cutting off 2 days.
  - `webstats.csv`: I don't have good data, because webstats.tp.o has been
 down for a couple of days now. This might also change after switching to
 CollecTor's webstats module. I'd say we don't touch this now and revisit
 it after switching to CollecTor.
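
 The proposed cut-offs can be summarized as a small lookup table. This is
 only a sketch of the list above, not the code in the branch under review:
 `CUTOFF_DAYS`, `last_included_date`, `cut_off_recent`, and the row format
 (dicts with a `date` key) are hypothetical names chosen for this example.
 The value for `ipv6servers.csv` is inferred from "same rationale as
 `servers.csv`".

```python
from datetime import date, timedelta

# Number of most recent days to cut off per file, counting today as day 1.
# webstats.csv is omitted because the proposal is to leave it unchanged.
CUTOFF_DAYS = {
    "servers.csv": 1,
    "ipv6servers.csv": 1,  # inferred: same rationale as servers.csv
    "advbwdist.csv": 1,
    "bandwidth.csv": 3,
    "connbidirect2.csv": 2,
    "clients.csv": 2,
    "userstats-combined.csv": 2,
    "hidserv.csv": 2,
    "torperf-1.1.csv": 2,
}


def last_included_date(filename, today):
    """Return the most recent date that is still written to `filename`."""
    return today - timedelta(days=CUTOFF_DAYS[filename])


def cut_off_recent(rows, filename, today):
    """Drop rows whose date falls within the cut-off window.

    `rows` is assumed to be a list of dicts with a `date` key holding a
    datetime.date; this format is an assumption for illustration.
    """
    last = last_included_date(filename, today)
    return [row for row in rows if row["date"] <= last]
```

 With today = 2018-03-06, `servers.csv` would then end at 2018-03-05 and
 `bandwidth.csv` at 2018-03-03.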

 Please review [https://gitweb.torproject.org/karsten/metrics-
 commit 450d9f1 in my updated task-25196 branch]. If possible, I'd like to
 make changes tomorrow (Thursday).

Ticket URL: <https://trac.torproject.org/projects/tor/ticket/25196#comment:9>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
tor-bugs mailing list