I am trying to reproduce the "frac" computation from the Reproducible
Metrics instructions:
https://metrics.torproject.org/reproducible-metrics.html#relay-users

This is also Section 3 in the tech report on counting bridge users:
https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf#page=4
         h(R∧H) * n(H) + h(H) * n(R\H)
  frac = -----------------------------
                  h(H) * n(N)

My minor goal is to reproduce the "frac" column from the Metrics web
site (which I assume is the same as the frac above, expressed as a
percentage):
https://metrics.torproject.org/userstats-relay-country.csv?start=2022-04-01&end=2022-04-08&country=all&events=off

  date,country,users,lower,upper,frac
  2022-04-01,,2262557,,,92
  2022-04-02,,2181639,,,92
  2022-04-03,,2179544,,,93
  2022-04-04,,2350360,,,93
  2022-04-05,,2388772,,,93
  2022-04-06,,2356170,,,93
  2022-04-07,,2323184,,,93
  2022-04-08,,2310170,,,91

I'm having trouble with the computation of n(R\H) and h(R∧H).

I understand that R is the subset of relays that report directory
request counts (i.e. that have dirreq-stats-end in their extra-info
descriptors) and H is the subset of relays that report directory
request byte counts (i.e. that have dirreq-write-history in their
extra-info descriptors). R and H partially overlap: there are relays
that are in R but not H, others that are in H but not R, and others
that are in both.

The computations depend on some values that come directly from
descriptors:

  n(R) = sum of hours, for relays with directory request counts
  n(H) = sum of hours, for relays with directory write histories
  h(H) = sum of written bytes, for relays with directory write histories

> Compute n(R\H) as the number of hours for which responses have been
> reported but no written directory bytes. This fraction is determined
> by summing up all interval lengths and then subtracting the written
> directory bytes interval length from the directory response interval
> length. Negative results are discarded.

I interpret this to mean: add up all the dirreq-stats-end intervals
(this is n(R)), add up all the dirreq-write-history intervals (this is
n(H)), and compute n(R\H) as n(R) − n(H). This seems wrong: it would
only be true when H is a subset of R.
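To illustrate the concern, here is a minimal Python sketch on made-up
data (the relay names and hours are hypothetical, not taken from any
descriptor). It models R and H as sets of (relay, hour) pairs and
compares the exact size of R\H with the aggregate estimate
max(0, n(R) − n(H)) from my reading of the instructions.

```python
# Hypothetical toy data: each set holds (relay, hour) pairs.
# Relays "a" and "b" report request counts (R);
# relays "b" and "c" report write histories (H).
R = {(relay, hour) for relay in ("a", "b") for hour in range(10)}
H = {(relay, hour) for relay in ("b", "c") for hour in range(10)}

n_R = len(R)  # relay-hours with request counts
n_H = len(H)  # relay-hours with write histories

# Exact n(R\H): relay-hours that are in R but not in H.
exact = len(R - H)

# Aggregate estimate: subtract the sums, discard negative results.
estimate = max(0, n_R - n_H)

# Subset case: H_sub is fully contained in R.
H_sub = {("b", hour) for hour in range(10)}
exact_subset = len(R - H_sub)
estimate_subset = max(0, n_R - len(H_sub))

print(exact, estimate, exact_subset, estimate_subset)
```

With these toy sets the exact value is 10 relay-hours while the
aggregate estimate is 0; the two agree only in the subset case.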
> Compute h(R∧H) as the number of written directory bytes for the
> fraction of time when a server was reporting both written directory
> bytes and directory responses. As above, this fraction is determined
> by first summing up all interval lengths and then computing the
> minimum of both sums divided by the sum of reported written directory
> bytes.

This seems to be saying to compute h(R∧H) (a count of bytes) as
min(n(R), n(H)) / h(H). This is dimensionally wrong: the units are
hours / bytes. What would be more natural to me is
min(n(R), n(H)) / max(n(R), n(H)) × h(H); i.e., divide the smaller of
n(R) and n(H) by the larger, then multiply this ratio by the observable
byte count. But this, too, only works when H is a subset of R.

Where is this computation done in the metrics code? I would like to
refer to it, but I could not find it.

Using the formulas and assumptions above, here's my attempt at
computing recent "frac" values:

  date         n(N)    n(H)     h(H)    n(R)  n(R\H)   h(R∧H)   frac
  2022-04-01 166584 177638.  2.24e13 125491.       0  1.59e13  0.753
  2022-04-02 166951 177466.  2.18e13 125686.       0  1.54e13  0.753
  2022-04-03 167100 177718.  2.27e13 127008.       0  1.62e13  0.760
  2022-04-04 166970 177559.  2.43e13 126412.       0  1.73e13  0.757
  2022-04-05 166729 177585.  2.44e13 125389.       0  1.72e13  0.752
  2022-04-06 166832 177470.  2.39e13 127077.       0  1.71e13  0.762
  2022-04-07 166532 177210.  2.48e13 127815.       0  1.79e13  0.768
  2022-04-08 167695 176879.  2.52e13 127697.       0  1.82e13  0.761

The "frac" column does not match the CSV. Also notice that n(N) < n(H),
which should be impossible because H is supposed to be a subset of N
(N is the set of all relays). But this is what I get when I estimate
n(N) from a network-status-consensus-3 and n(H) from extra-info
documents. Also notice that n(R) < n(H), which means that H cannot be a
subset of R, the one case in which the formulas as I read them would be
valid.
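The table values above follow from a computation along these lines.
This is a sketch of my interpretation, not the Metrics code; the
function name frac_from_aggregates and its parameter names are my own,
and the inputs are the rounded 2022-04-01 values from the table.

```python
def frac_from_aggregates(n_N, n_H, h_H, n_R):
    """Compute frac from daily aggregates, under my reading of the text.

    n_N: relay-hours for all relays (estimated from the consensus)
    n_H: relay-hours covered by dirreq-write-history
    h_H: bytes reported in dirreq-write-history
    n_R: relay-hours covered by dirreq-stats-end
    """
    # n(R\H): difference of the interval sums, negative results discarded.
    n_R_minus_H = max(0.0, n_R - n_H)
    # h(R∧H): scale the written bytes by the smaller/larger interval ratio
    # (my guess at a dimensionally consistent reading).
    h_R_and_H = min(n_R, n_H) / max(n_R, n_H) * h_H
    return (h_R_and_H * n_H + h_H * n_R_minus_H) / (h_H * n_N)


# Rounded inputs for 2022-04-01, copied from the table.
frac = frac_from_aggregates(n_N=166584, n_H=177638, h_H=2.24e13, n_R=125491)
print(round(frac, 3))
```

When n(R) ≤ n(H), this reduces to n(R) / n(N): n(R\H) is clamped to
zero and h(H) cancels, which is why the frac column in my table tracks
n(R) / n(N).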