Re: [tor-dev] Metrics: Estimating fraction of reported directory-request statistics

2022-06-26 Thread David Fifield
On Thu, Apr 21, 2022 at 05:47:12PM +0200, Silvia/Hiro wrote:
> On 17/4/22 2:16, David Fifield wrote:
> > I am trying to reproduce the "frac" computation from the Reproducible
> > Metrics instructions:
> > https://metrics.torproject.org/reproducible-metrics.html#relay-users
> > Which is also Section 3 in the tech report on counting bridge users:
> > https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf#page=4
> > 
> >          h(R∧H) * n(H) + h(H) * n(R\H)
> >   frac = -----------------------------
> >                   h(H) * n(N)
> > 
> > My minor goal is to reproduce the "frac" column from the Metrics web
> > site (which I assume is the same as the frac above, expressed as a
> > percentage):
> > 
> > https://metrics.torproject.org/userstats-relay-country.csv?start=2022-04-01&end=2022-04-08&country=all&events=off
> > date,country,users,lower,upper,frac
> > 2022-04-01,,2262557,,,92
> > 2022-04-02,,2181639,,,92
> > 2022-04-03,,2179544,,,93
> > 2022-04-04,,2350360,,,93
> > 2022-04-05,,2388772,,,93
> > 2022-04-06,,2356170,,,93
> > 2022-04-07,,2323184,,,93
> > 2022-04-08,,2310170,,,91
> > 
> > I'm having trouble with the computation of n(R\H) and h(R∧H). I
> > understand that R is the subset of relays that report directory request
> > counts (i.e. that have dirreq-stats-end in their extra-info descriptors)
> > and H is the subset of relays that report directory request byte counts
> > (i.e. that have dirreq-write-history in their extra-info descriptors).
> > R and H partially overlap: there are relays that are in R but not H,
> > others that are in H but not R, and others that are in both.
> > 
> > The computations depend on some values that are directly from
> > descriptors:
> > n(R) = sum of hours, for relays with directory request counts
> > n(H) = sum of hours, for relays with directory write histories
> > h(H) = sum of written bytes, for relays with directory write histories
> > 
> > > Compute n(R\H) as the number of hours for which responses have been
> > > reported but no written directory bytes. This fraction is determined
> > > by summing up all interval lengths and then subtracting the written
> > > directory bytes interval length from the directory response interval
> > > length. Negative results are discarded.
> > I interpret this to mean: add up all the dirreq-stats-end intervals
> > (this is n(R)), add up all the dirreq-write-history intervals
> > (this is n(H)), and compute n(R\H) as n(R) − n(H). This seems wrong: it
> > would only be true when H is a subset of R.
> > 
> > > Compute h(R∧H) as the number of written directory bytes for the
> > > fraction of time when a server was reporting both written directory
> > > bytes and directory responses. As above, this fraction is determined
> > > by first summing up all interval lengths and then computing the
> > > minimum of both sums divided by the sum of reported written directory
> > > bytes.
> > This seems to be saying to compute h(R∧H) (a count of bytes) as
> > min(n(R), n(H)) / h(H). This is dimensionally wrong: the units are
> > hours / bytes. What would be more natural to me is
> > min(n(R), n(H)) / max(n(R), n(H)) × h(H); i.e., divide the smaller of
> > n(R) and n(H) by the larger, then multiply this ratio by the observable
> > byte count. But this, too, only works when H is a subset of R.
> > 
> > Where is this computation done in the metrics code? I would like to
> > refer to it, but I could not find it.
> > 
> > Using the formulas and assumptions above, here's my attempt at computing
> > recent "frac" values:
> > 
> > date        n(N)    n(H)     h(H)     n(R)     n(R\H)  h(R∧H)   frac
> > 2022-04-01  166584  177638.  2.24e13  125491.  0       1.59e13  0.753
> > 2022-04-02  166951  177466.  2.18e13  125686.  0       1.54e13  0.753
> > 2022-04-03  167100  177718.  2.27e13  127008.  0       1.62e13  0.760
> > 2022-04-04  166970  177559.  2.43e13  126412.  0       1.73e13  0.757
> > 2022-04-05  166729  177585.  2.44e13  125389.  0       1.72e13  0.752
> > 2022-04-06  166832  177470.  2.39e13  127077.  0       1.71e13  0.762
> > 2022-04-07  166532  177210.  2.48e13  127815.  0       1.79e13  0.768
> > 2022-04-08  167695  176879.  2.52e13  127697.  0       1.82e13  0.761
> > 
> > The "frac" column does not match the CSV. Also notice that n(N) < n(H),
> > which should be impossible because H is supposed to be a subset of N
> > (N is the set of all relays). But this is what I get when I estimate
> > n(N) from a network-status-consensus-3 and n(H) from extra-info
> > documents. Also notice that n(R) < n(H), which means that H cannot be a
> > subset of R, contrary to the observations above.
> 
> These computations are a bit hidden in metrics code. Specifically these are
> in the website repository but in the sql init scripts.
> 
> This is the view that is responsible for computing the data that are then
> published in the csv:
> 
> https://gitlab.torproject.org/tpo/network-health/metrics/website/-/blob/master/src/main/sql/clients/init-userstats.sql#L695

Thank 

Re: [tor-dev] Metrics: Estimating fraction of reported directory-request statistics

2022-04-21 Thread Silvia/Hiro


On 17/4/22 2:16, David Fifield wrote:

> I am trying to reproduce the "frac" computation from the Reproducible
> Metrics instructions:
> https://metrics.torproject.org/reproducible-metrics.html#relay-users
> Which is also Section 3 in the tech report on counting bridge users:
> https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf#page=4
>
>          h(R∧H) * n(H) + h(H) * n(R\H)
>   frac = -----------------------------
>                   h(H) * n(N)
>
> My minor goal is to reproduce the "frac" column from the Metrics web
> site (which I assume is the same as the frac above, expressed as a
> percentage):
>
> https://metrics.torproject.org/userstats-relay-country.csv?start=2022-04-01&end=2022-04-08&country=all&events=off
> date,country,users,lower,upper,frac
> 2022-04-01,,2262557,,,92
> 2022-04-02,,2181639,,,92
> 2022-04-03,,2179544,,,93
> 2022-04-04,,2350360,,,93
> 2022-04-05,,2388772,,,93
> 2022-04-06,,2356170,,,93
> 2022-04-07,,2323184,,,93
> 2022-04-08,,2310170,,,91
>
> I'm having trouble with the computation of n(R\H) and h(R∧H). I
> understand that R is the subset of relays that report directory request
> counts (i.e. that have dirreq-stats-end in their extra-info descriptors)
> and H is the subset of relays that report directory request byte counts
> (i.e. that have dirreq-write-history in their extra-info descriptors).
> R and H partially overlap: there are relays that are in R but not H,
> others that are in H but not R, and others that are in both.
>
> The computations depend on some values that are directly from
> descriptors:
> n(R) = sum of hours, for relays with directory request counts
> n(H) = sum of hours, for relays with directory write histories
> h(H) = sum of written bytes, for relays with directory write histories
>
> > Compute n(R\H) as the number of hours for which responses have been
> > reported but no written directory bytes. This fraction is determined
> > by summing up all interval lengths and then subtracting the written
> > directory bytes interval length from the directory response interval
> > length. Negative results are discarded.
>
> I interpret this to mean: add up all the dirreq-stats-end intervals
> (this is n(R)), add up all the dirreq-write-history intervals
> (this is n(H)), and compute n(R\H) as n(R) − n(H). This seems wrong: it
> would only be true when H is a subset of R.
>
> > Compute h(R∧H) as the number of written directory bytes for the
> > fraction of time when a server was reporting both written directory
> > bytes and directory responses. As above, this fraction is determined
> > by first summing up all interval lengths and then computing the
> > minimum of both sums divided by the sum of reported written directory
> > bytes.
>
> This seems to be saying to compute h(R∧H) (a count of bytes) as
> min(n(R), n(H)) / h(H). This is dimensionally wrong: the units are
> hours / bytes. What would be more natural to me is
> min(n(R), n(H)) / max(n(R), n(H)) × h(H); i.e., divide the smaller of
> n(R) and n(H) by the larger, then multiply this ratio by the observable
> byte count. But this, too, only works when H is a subset of R.
>
> Where is this computation done in the metrics code? I would like to
> refer to it, but I could not find it.
>
> Using the formulas and assumptions above, here's my attempt at computing
> recent "frac" values:
>
> date        n(N)    n(H)     h(H)     n(R)     n(R\H)  h(R∧H)   frac
> 2022-04-01  166584  177638.  2.24e13  125491.  0       1.59e13  0.753
> 2022-04-02  166951  177466.  2.18e13  125686.  0       1.54e13  0.753
> 2022-04-03  167100  177718.  2.27e13  127008.  0       1.62e13  0.760
> 2022-04-04  166970  177559.  2.43e13  126412.  0       1.73e13  0.757
> 2022-04-05  166729  177585.  2.44e13  125389.  0       1.72e13  0.752
> 2022-04-06  166832  177470.  2.39e13  127077.  0       1.71e13  0.762
> 2022-04-07  166532  177210.  2.48e13  127815.  0       1.79e13  0.768
> 2022-04-08  167695  176879.  2.52e13  127697.  0       1.82e13  0.761
>
> The "frac" column does not match the CSV. Also notice that n(N) < n(H),
> which should be impossible because H is supposed to be a subset of N
> (N is the set of all relays). But this is what I get when I estimate
> n(N) from a network-status-consensus-3 and n(H) from extra-info
> documents. Also notice that n(R) < n(H), which means that H cannot be a
> subset of R, contrary to the observations above.


Hi David,

These computations are a bit hidden in metrics code. Specifically these 
are in the website repository but in the sql init scripts.


This is the view that is responsible for computing the data that are 
then published in the csv:


https://gitlab.torproject.org/tpo/network-health/metrics/website/-/blob/master/src/main/sql/clients/init-userstats.sql#L695


Personally I am not sure what was the rationale behind this. I will try 
to go through the SQL myself and the reproducible metrics page and give 
you an answer.



Meanwhile I have opened an issue to track this: 
https://gitlab.torproject.org/tpo/network-health/analysis/-/issues/35




___
tor-dev mailing list
tor-dev@lists.torproject.org

Re: [tor-dev] Metrics: Estimating fraction of reported directory-request statistics

2022-04-18 Thread David Fifield
On Mon, Apr 18, 2022 at 03:45:29PM -0600, David Fifield wrote:
> I was initially interested in this for the purpose of better estimating
> the number of Snowflake users. But now I've decided "frac" is not useful
> for that purpose: since there is only one bridge we care about, it does
> not make sense to adjust the numbers to account for other bridges that
> may not report the same set of statistics. I don't plan to take this
> investigation any further for the time being, but here is source code to
> reproduce the above tables. You will need:
> https://collector.torproject.org/archive/relay-descriptors/consensuses/consensuses-2022-04.tar.xz
> https://collector.torproject.org/archive/relay-descriptors/extra-infos/extra-infos-2022-04.tar.xz
> 
> ./relay_uptime.py consensuses-2022-04.tar.xz > relay_uptime.csv
> ./relay_dir.py extra-infos-2022-04.tar.xz > relay_dir.csv
> ./frac.py relay_uptime.csv relay_dir.csv

Missed one of the source files.
import datetime

NUM_PROCESSES = 4

# "If the contained statistics end time is more than 1 week older than the
# descriptor publication time in the "published" line, skip this line..."
END_THRESHOLD = datetime.timedelta(days = 7)

# "Also skip statistics with an interval length other than 1 day."
# We set the threshold higher, because some descriptors have an interval a few
# seconds larger than 86400.
INTERVAL_THRESHOLD = datetime.timedelta(seconds = 90000)

# Round a datetime down to midnight at the start of its day.
def datetime_floor(d):
    return d.replace(hour = 0, minute = 0, second = 0, microsecond = 0)

TIMEDELTA_1DAY = datetime.timedelta(seconds = 86400)

# Split the interval [begin, end) at day boundaries. For each piece, yield
# (date, piece length as a fraction of the whole interval, piece length in
# days).
def segment_datetime_interval(begin, end):
    cur = begin
    while cur < end:
        nxt = min(datetime_floor(cur + TIMEDELTA_1DAY), end)
        delta = nxt - cur
        yield (cur.date(), delta / (end - begin), delta / TIMEDELTA_1DAY)
        cur = nxt
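
As a quick illustration of how segment_datetime_interval apportions one
reporting interval across days, here is a minimal sketch. The two helper
functions are restated so that the snippet runs on its own, and the
timestamps are made up for the example; they do not come from any
descriptor.

```python
import datetime

TIMEDELTA_1DAY = datetime.timedelta(days=1)

def datetime_floor(d):
    # Round down to midnight at the start of the same day.
    return d.replace(hour=0, minute=0, second=0, microsecond=0)

def segment_datetime_interval(begin, end):
    # Cut [begin, end) at each midnight and report every piece.
    cur = begin
    while cur < end:
        nxt = min(datetime_floor(cur + TIMEDELTA_1DAY), end)
        delta = nxt - cur
        yield (cur.date(), delta / (end - begin), delta / TIMEDELTA_1DAY)
        cur = nxt

# A hypothetical 24-hour interval that starts at 18:00 and so spans two days.
begin = datetime.datetime(2022, 4, 1, 18, 0, 0)
end = datetime.datetime(2022, 4, 2, 18, 0, 0)
segments = list(segment_datetime_interval(begin, end))
for date, frac_of_interval, frac_of_day in segments:
    print(date, frac_of_interval, frac_of_day)
# 2022-04-01 0.25 0.25
# 2022-04-02 0.75 0.75
```

The second element of each tuple looks suited to splitting a per-interval
count (such as written bytes) across days, and the third to accumulating
time per day; that reading is an assumption based on the code shown here,
since the scripts that call it are not included in this message.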


Re: [tor-dev] Metrics: Estimating fraction of reported directory-request statistics

2022-04-18 Thread David Fifield
On Sat, Apr 16, 2022 at 06:16:23PM -0600, David Fifield wrote:
> I am trying to reproduce the "frac" computation from the Reproducible
> Metrics instructions:
> https://metrics.torproject.org/reproducible-metrics.html#relay-users
> Which is also Section 3 in the tech report on counting bridge users:
> https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf#page=4
> 
>          h(R∧H) * n(H) + h(H) * n(R\H)
>   frac = -----------------------------
>                   h(H) * n(N)
> 
> My minor goal is to reproduce the "frac" column from the Metrics web
> site (which I assume is the same as the frac above, expressed as a
> percentage):
> 
> https://metrics.torproject.org/userstats-relay-country.csv?start=2022-04-01&end=2022-04-08&country=all&events=off
> date,country,users,lower,upper,frac
> 2022-04-01,,2262557,,,92
> 2022-04-02,,2181639,,,92
> 2022-04-03,,2179544,,,93
> 2022-04-04,,2350360,,,93
> 2022-04-05,,2388772,,,93
> 2022-04-06,,2356170,,,93
> 2022-04-07,,2323184,,,93
> 2022-04-08,,2310170,,,91
> 
> I'm having trouble with the computation of n(R\H) and h(R∧H). I
> understand that R is the subset of relays that report directory request
> counts (i.e. that have dirreq-stats-end in their extra-info descriptors)
> and H is the subset of relays that report directory request byte counts
> (i.e. that have dirreq-write-history in their extra-info descriptors).
> R and H partially overlap: there are relays that are in R but not H,
> others that are in H but not R, and others that are in both.
>
> The computations depend on some values that are directly from
> descriptors:
> n(R) = sum of hours, for relays with directory request counts
> n(H) = sum of hours, for relays with directory write histories
> h(H) = sum of written bytes, for relays with directory write histories
>
> ...
> 
> Using the formulas and assumptions above, here's my attempt at computing
> recent "frac" values:
> 
> date        n(N)    n(H)     h(H)     n(R)     n(R\H)  h(R∧H)   frac
> 2022-04-01  166584  177638.  2.24e13  125491.  0       1.59e13  0.753
> 2022-04-02  166951  177466.  2.18e13  125686.  0       1.54e13  0.753
> 2022-04-03  167100  177718.  2.27e13  127008.  0       1.62e13  0.760
> 2022-04-04  166970  177559.  2.43e13  126412.  0       1.73e13  0.757
> 2022-04-05  166729  177585.  2.44e13  125389.  0       1.72e13  0.752
> 2022-04-06  166832  177470.  2.39e13  127077.  0       1.71e13  0.762
> 2022-04-07  166532  177210.  2.48e13  127815.  0       1.79e13  0.768
> 2022-04-08  167695  176879.  2.52e13  127697.  0       1.82e13  0.761

I tried computing n(R\H) and h(R∧H) from the definitions, rather than by
using the formulas in the Reproducible Metrics guide. This achieves an
almost matching "frac" column, though it is still about 1% too high.

date        n(N)    n(H)     h(H)     n(R)     n(R\H)  h(R∧H)   frac
2022-04-01  166584  177638.  2.24e13  125491.   90.9   1.96e13  0.930
2022-04-02  166951  177466.  2.18e13  125686.  181.    1.92e13  0.937
2022-04-03  167100  177718.  2.27e13  127008.  154.    2.00e13  0.942
2022-04-04  166970  177559.  2.43e13  126412.  134.    2.14e13  0.936
2022-04-05  166729  177585.  2.44e13  125389.   94.6   2.15e13  0.938
2022-04-06  166832  177470.  2.39e13  127077.  162.    2.11e13  0.940
2022-04-07  166532  177210.  2.48e13  127815.  102.    2.18e13  0.938
2022-04-08  167695  176879.  2.52e13  127697.  158.    2.21e13  0.926

I got this by taking an explicit set intersection between the R and H
time intervals. So, for example, if the intervals making up n(R) and
n(H) are (with their lengths):

n(R)    [---10---]  [----12----]          [---9---]
n(H)         [----12----]    [------16------]      [--7--]

Then the intersection n(R∧H) is:

n(R∧H)       [-5-]  [-5-]    [3]          [3]

h(R∧H) comes from pro-rating the n(H) intervals, each of which is
associated with an h(H) byte count. Suppose the [----12----] interval of
n(H) represents 1000 bytes. Then each of the [-5-] intervals that result
from it in the intersection is worth 5/12 × 1000 ≈ 417 bytes.

We get n(R\H) from n(R) − n(R∧H):

n(R\H)  [-5-]            [4-]                [-6--]
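
The intersection and pro-rating steps can be sketched in a few lines of
Python. This is only an illustration of the idea, not the code I used for
the tables: the interval positions are hypothetical values chosen to give
the lengths in the example, only the 1000-byte figure comes from the
example, and the other two byte counts are invented.

```python
# R intervals as (start_hour, end_hour); positions are hypothetical.
R = [(0, 10), (12, 24), (34, 43)]
# H intervals as (start_hour, end_hour, written_bytes). Only the 1000 bytes
# on the first interval comes from the example; 2000 and 700 are made up.
H = [(5, 17, 1000), (21, 37, 2000), (43, 50, 700)]

def overlaps(r_intervals, h_intervals):
    # Yield (overlap_hours, prorated_bytes) for every overlapping R/H pair.
    # Assumes the intervals within each list do not overlap one another.
    for r_start, r_end in r_intervals:
        for h_start, h_end, h_bytes in h_intervals:
            start, end = max(r_start, h_start), min(r_end, h_end)
            if start < end:
                hours = end - start
                # Pro-rate: the overlap gets its share of the H byte count.
                yield hours, h_bytes * hours / (h_end - h_start)

pieces = list(overlaps(R, H))
n_R = sum(end - start for start, end in R)      # 10 + 12 + 9
n_RH = sum(hours for hours, _ in pieces)        # hours in both R and H
h_RH = sum(b for _, b in pieces)                # bytes attributed to R∧H
n_R_minus_H = n_R - n_RH                        # hours in R but not H

print([hours for hours, _ in pieces])  # [5, 5, 3, 3]
print(n_R_minus_H)                     # 15
print(round(pieces[0][1]))             # 417
```

In a real run this would be done per relay, and the per-relay results summed
for each day.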

This seems overall more correct, though it required a more elaborate
computation than the Reproducible Metrics guide prescribes. I'm still
not sure why it does not match exactly, and I would still appreciate a
pointer to where Tor Metrics does the "frac" computation.
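
For reference, plugging one row of the second table into the frac formula
looks like the following sketch. The inputs are the rounded values as
printed in the 2022-04-01 row, so the result agrees with the table's frac
column only to about two decimal places; the function and parameter names
are mine.

```python
def frac(h_RH, n_H, h_H, n_R_minus_H, n_N):
    # frac = (h(R∧H) * n(H) + h(H) * n(R\H)) / (h(H) * n(N))
    return (h_RH * n_H + h_H * n_R_minus_H) / (h_H * n_N)

# 2022-04-01 row, values as printed in the table (i.e. already rounded).
value = frac(h_RH=1.96e13, n_H=177638, h_H=2.24e13,
             n_R_minus_H=90.9, n_N=166584)
print(value)
```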

I was initially interested in this for the purpose of better estimating
the number of Snowflake users. But now I've decided "frac" is not useful
for that purpose: since there is only one bridge we care about, it does
not make sense to adjust the numbers to account for other bridges that
may not report the same set of statistics. I don't plan to take this
investigation any further for the time being, but here is source code to
reproduce the above tables. You will need:
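https://collector.torproject.org/archive/relay-descriptors/consensuses/consensuses-2022-04.tar.xz
https://collector.torproject.org/archive/relay-descriptors/extra-infos/extra-infos-2022-04.tar.xz

./relay_uptime.py consensuses-2022-04.tar.xz > relay_uptime.csv
./relay_dir.py extra-infos-2022-04.tar.xz > relay_dir.csv
./frac.py relay_uptime.csv relay_dir.csv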

[tor-dev] Metrics: Estimating fraction of reported directory-request statistics

2022-04-16 Thread David Fifield
I am trying to reproduce the "frac" computation from the Reproducible
Metrics instructions:
https://metrics.torproject.org/reproducible-metrics.html#relay-users
Which is also Section 3 in the tech report on counting bridge users:
https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf#page=4

         h(R∧H) * n(H) + h(H) * n(R\H)
  frac = -----------------------------
                  h(H) * n(N)

My minor goal is to reproduce the "frac" column from the Metrics web
site (which I assume is the same as the frac above, expressed as a
percentage):

https://metrics.torproject.org/userstats-relay-country.csv?start=2022-04-01&end=2022-04-08&country=all&events=off
date,country,users,lower,upper,frac
2022-04-01,,2262557,,,92
2022-04-02,,2181639,,,92
2022-04-03,,2179544,,,93
2022-04-04,,2350360,,,93
2022-04-05,,2388772,,,93
2022-04-06,,2356170,,,93
2022-04-07,,2323184,,,93
2022-04-08,,2310170,,,91

I'm having trouble with the computation of n(R\H) and h(R∧H). I
understand that R is the subset of relays that report directory request
counts (i.e. that have dirreq-stats-end in their extra-info descriptors)
and H is the subset of relays that report directory request byte counts
(i.e. that have dirreq-write-history in their extra-info descriptors).
R and H partially overlap: there are relays that are in R but not H,
others that are in H but not R, and others that are in both.

The computations depend on some values that are directly from
descriptors:
n(R) = sum of hours, for relays with directory request counts
n(H) = sum of hours, for relays with directory write histories
h(H) = sum of written bytes, for relays with directory write histories

> Compute n(R\H) as the number of hours for which responses have been
> reported but no written directory bytes. This fraction is determined
> by summing up all interval lengths and then subtracting the written
> directory bytes interval length from the directory response interval
> length. Negative results are discarded.

I interpret this to mean: add up all the dirreq-stats-end intervals
(this is n(R)), add up all the dirreq-write-history intervals
(this is n(H)), and compute n(R\H) as n(R) − n(H). This seems wrong: it
would only be true when H is a subset of R.

> Compute h(R∧H) as the number of written directory bytes for the
> fraction of time when a server was reporting both written directory
> bytes and directory responses. As above, this fraction is determined
> by first summing up all interval lengths and then computing the
> minimum of both sums divided by the sum of reported written directory
> bytes.

This seems to be saying to compute h(R∧H) (a count of bytes) as
min(n(R), n(H)) / h(H). This is dimensionally wrong: the units are
hours / bytes. What would be more natural to me is
min(n(R), n(H)) / max(n(R), n(H)) × h(H); i.e., divide the smaller of
n(R) and n(H) by the larger, then multiply this ratio by the observable
byte count. But this, too, only works when H is a subset of R.

Where is this computation done in the metrics code? I would like to
refer to it, but I could not find it.

Using the formulas and assumptions above, here's my attempt at computing
recent "frac" values:

date        n(N)    n(H)     h(H)     n(R)     n(R\H)  h(R∧H)   frac
2022-04-01  166584  177638.  2.24e13  125491.  0       1.59e13  0.753
2022-04-02  166951  177466.  2.18e13  125686.  0       1.54e13  0.753
2022-04-03  167100  177718.  2.27e13  127008.  0       1.62e13  0.760
2022-04-04  166970  177559.  2.43e13  126412.  0       1.73e13  0.757
2022-04-05  166729  177585.  2.44e13  125389.  0       1.72e13  0.752
2022-04-06  166832  177470.  2.39e13  127077.  0       1.71e13  0.762
2022-04-07  166532  177210.  2.48e13  127815.  0       1.79e13  0.768
2022-04-08  167695  176879.  2.52e13  127697.  0       1.82e13  0.761

The "frac" column does not match the CSV. Also notice that n(N) < n(H),
which should be impossible because H is supposed to be a subset of N
(N is the set of all relays). But this is what I get when I estimate
n(N) from a network-status-consensus-3 and n(H) from extra-info
documents. Also notice that n(R) < n(H), which means that H cannot be a
subset of R, contrary to the observations above.
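
The sum-based reading described above can be written out as a short Python
sketch. It implements my interpretation of the guide's wording (n(R\H) as
n(R) − n(H) with negatives discarded, and h(R∧H) as the min/max ratio times
h(H)), not necessarily what Tor Metrics actually does; the function name is
mine, and the inputs are two rows of the table with h(H) as printed.

```python
def naive_frac(n_N, n_H, h_H, n_R):
    # n(R\H): total R hours minus total H hours, negative results discarded.
    n_R_minus_H = max(0.0, n_R - n_H)
    # h(R∧H): scale h(H) by the ratio of the smaller hour total to the larger.
    h_RH = min(n_R, n_H) / max(n_R, n_H) * h_H
    frac = (h_RH * n_H + h_H * n_R_minus_H) / (h_H * n_N)
    return n_R_minus_H, h_RH, frac

# (n(N), n(H), h(H), n(R)) as printed in the table above.
rows = {
    "2022-04-01": (166584, 177638, 2.24e13, 125491),
    "2022-04-08": (167695, 176879, 2.52e13, 127697),
}
results = {date: naive_frac(*row) for date, row in rows.items()}
for date, (n_R_minus_H, h_RH, frac) in results.items():
    print(date, n_R_minus_H, f"{h_RH:.3g}", round(frac, 3))
```

Whenever n(R) ≤ n(H), as in every row here, n(R\H) is 0 and h(H) cancels, so
under this reading frac reduces to n(R) / n(N); for 2022-04-01 that is
125491 / 166584 ≈ 0.753.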