commit d4b17e906f55b553ac16bcf902967157e07a234d
Author: Karsten Loesing <[email protected]>
Date: Fri Nov 30 16:54:17 2018 +0100
Update Reproducible Metrics document.
Reflects code changes made in #28116 and #28305.
---
.../resources/web/jsps/reproducible-metrics.jsp | 51 +++++-----------------
1 file changed, 11 insertions(+), 40 deletions(-)
diff --git a/src/main/resources/web/jsps/reproducible-metrics.jsp b/src/main/resources/web/jsps/reproducible-metrics.jsp
index 939b42e..b6df6c3 100644
--- a/src/main/resources/web/jsps/reproducible-metrics.jsp
+++ b/src/main/resources/web/jsps/reproducible-metrics.jsp
@@ -15,15 +15,6 @@
<div class="container">
-<div class="panel panel-danger">
-<div class="panel-heading">
-<h5 class="panel-title">Work in progress notice</h5>
-</div>
-<div class="panel-body">
-<p>As of July 2018, this page is still a work in progress. Handle with
care!</p>
-</div>
-</div>
-
<h1>Reproducible Metrics
<a href="#reproducible-metrics" name="reproducible-metrics"
class="anchor">#</a></h1>
@@ -103,7 +94,7 @@ Split observations to the covered UTC dates by assuming a linear distribution of
<h4>Step 3: Estimate fraction of reported directory-request statistics</h4>
<p>The next step after parsing descriptors is to estimate the fraction of
reported directory-request statistics on a given day.
-This fraction, a value between <var>0%</var> and <var>100%</var>, will be used
in the next step to extrapolate observed request numbers to expected network
totals.
+This fraction will be used in the next step to extrapolate observed request
numbers to expected network totals.
For further background on the following calculation method, refer to the
technical report titled <a
href="https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf">"Counting
daily bridge users"</a> which also applies to relay users.
In the following, we're using the term server instead of relay or bridge,
because the estimation method is exactly the same for relays and bridges.</p>
@@ -139,7 +130,7 @@ This approach also works with <var>r(R)</var> being the sum of requests from <em
<pre>r(N) = floor(r(R) / frac / 10)</pre>
<p>A client that is connected 24/7 makes about 15 requests per day, but not
all clients are connected 24/7, so we picked the number 10 for the average
client. We simply divide directory requests by 10 and consider the result as
the number of users. Another way of looking at it, is that we assume that each
request represents a client that stays online for one tenth of a day, so 2
hours and 24 minutes.</p>
-<p>Skip dates where <var>frac</var> is smaller than 10% and hence too low for
a robust estimate, or where <var>frac</var> is greater than 100%, which would
indicate an issue in the previous step.</p>
+<p>Skip dates where <var>frac</var> is smaller than 10% and hence too low for
a robust estimate. Also skip dates where <var>frac</var> is greater than 110%,
which would indicate an issue in the previous step. We picked 110% as upper
bound, not 100%, because there can be relays reporting statistics that
temporarily didn't make it into the consensus, and we accept up to 10% of those
additional statistics. However, there needs to be some upper bound to exclude
obvious outliers with fractions of 120%, 150%, or even 200%.</p>
<h4>Step 5: Compute ranges of expected clients per day to detect potential
censorship events</h4>
@@ -278,14 +269,12 @@ Refer to the <a href="https://gitweb.torproject.org/torspec.git/tree/dir-spec.tx
<li>Relay flags: Parse relay flags from the <code>"s"</code> line. If there is
no <code>"Running"</code> flag, skip this consensus entry. This ensures that we
only consider running relays. Also parse any other relay flags from the
<code>"s"</code> line that the relay had assigned.</li>
</ul>
-<p>If a consensus contains zero running relays, we skip it in the <a
href="/relays-ipv6.html">Relays by IP version</a> graph, but not in the other
graphs (simply because we didn't get around to changing those graphs).
+<p>If a consensus contains zero running relays, we skip it.
This is mostly to rule out a rare edge case when only a minority of <a
href="/glossary.html#directory-authority">directory authorities</a> voted on
the <code>"Running"</code> flag.
In those cases, such a consensus would skew the average, even though relays
were likely running.</p>
<h4>Step 2: Parse relay server descriptors</h4>
-<p>Parsing relay server descriptors is an optional step. You only need to do
this if you want to break down the number of running relays by something that
relays report in their server descriptors. This includes, among other things,
the relay's platform string containing tor software version and operating
system and whether the relay announced an IPv6 OR address or permitted exiting
to IPv6 targets.</p>
-
<p>Obtain relay server descriptors from <a
href="/collector.html#type-server-descriptor">CollecTor</a>.
Again, refer to the <a
href="https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt">Tor
directory protocol, version 3</a> for details on the descriptor format.</p>
@@ -307,21 +296,16 @@ If the platform line is missing, we skip this descriptor, which later leads to n
<h4>Step 3: Compute daily averages</h4>
-<p>Optionally, match consensus entries with server descriptors by SHA-1 digest.
+<p>Match consensus entries with server descriptors by SHA-1 digest.
Every consensus entry references exactly one server descriptor, and a server
descriptor may be referenced from an arbitrary number of consensus entries.
-We handle missing server descriptors differently in the graphs covered in this
section:</p>
-
-<ul>
-<li><a href="/versions.html">Relays by tor version</a> and <a
href="/platforms.html">Relays by platform</a>: If a referenced server
descriptor is missing, we also skip the consensus entry. We are aware that this
is slightly wrong, because we should either exclude a consensus with too few
matching server descriptors from the overall result, or at least count these
relays as unknown tor version or unknown platform.</li>
-<li><a href="/relays-ipv6.html">Relays by IP version</a>: If at least 0.1% of
referenced server descriptors are missing, we skip the consensus. We chose this
threshold as low, because missing server descriptors may easily skew the
results. However, a small number of missing server descriptors per consensus is
acceptable and also unavoidable.</li>
-</ul>
+If at least 0.1% of referenced server descriptors are missing, we skip the
consensus. We chose this threshold as low, because missing server descriptors
may easily skew the results. However, a small number of missing server
descriptors per consensus is acceptable and also unavoidable.</p>
<p>Go through all previously processed consensuses by valid-after UTC date.
Compute the arithmetic mean of running relays, possibly broken down by relay
flag, tor version, platform, or IPv6 capabilities, as the sum of all running
relays divided by the number of consensuses.
Round down to the next integer number.</p>
<p>Skip the last day of the results if it matches the current UTC date,
because those averages may still change throughout the day.
-For the <a href="/relays-ipv6.html">Relays by IP version</a> graph we further
skip days for which fewer than 12 consensuses are known. The goal is to avoid
over-representing a few consensuses during periods when the directory
authorities had trouble producing a consensus for at least half of the day.</p>
+Further skip days for which fewer than 12 consensuses are known. The goal is
to avoid over-representing a few consensuses during periods when the directory
authorities had trouble producing a consensus for at least half of the day.</p>
<h3 id="running-bridges" class="hover">Running bridges
<a href="#running-bridges" class="anchor">#</a>
@@ -360,9 +344,6 @@ This timestamp is used to uniquely identify the status while processing, and the
<h4>Step 2: Parse bridge server descriptors.</h4>
-<p>Parsing bridge server descriptors is an optional step. You only need to do
this if you want to break down the number of running bridges by something that
bridges report in their server descriptors.
-This includes, among other things, whether the bridge announced an IPv6 OR
address.</p>
-
<p>Obtain bridge server descriptors from <a
href="/collector.html#type-bridge-server-descriptor">CollecTor</a>.
As above, refer to the <a href="/bridge-descriptors.html">Tor bridge
descriptors page</a> for details on the descriptor format.</p>
@@ -375,21 +356,16 @@ As above, refer to the <a href="/bridge-descriptors.html">Tor bridge descriptors
<h4>Step 3: Compute daily averages</h4>
-<p>Optionally, match status entries with server descriptors by SHA-1 digest.
+<p>Match status entries with server descriptors by SHA-1 digest.
Every status entry references exactly one server descriptor, and a server
descriptor may be referenced from an arbitrary number of status entries.
If at least 0.1% of referenced server descriptors are missing, we skip the
status.
We chose this threshold as low, because missing server descriptors may easily
skew the results.
However, a small number of missing server descriptors per status is acceptable
and also unavoidable.</p>
-<p>We compute averages differently in the graphs covered in this section:</p>
-
-<ul>
-<li><a href="/networksize.html">Relays and bridges</a>: For each bridge
authority, compute the arithmetic mean of running bridges as the sum of all
running bridges divided by the number of statuses; sum up averages for all
bridge authorities per day and round down to the next integer number.</li>
-<li><a href="/bridges-ipv6.html">Bridges by IP version</a>: Compute the
arithmetic mean of running bridges as the sum of all running bridges divided by
the number of statuses and round down to the next integer number. We are aware
that this approach does not correctly reflect that bridges typically register
at a single bridge authority only.</li>
-</ul>
+<p>Compute the arithmetic mean of running bridges as the sum of all running
bridges divided by the number of statuses and round down to the next integer
number. We are aware that this approach does not correctly reflect that bridges
typically register at a single bridge authority only.</p>
<p>Skip the last day of the results if it matches the current UTC date,
because those averages may still change throughout the day.
-For the <a href="/bridges-ipv6.html">Bridges by IP version</a> graph we
further skip days for which fewer than 12 statuses are known.
+Further skip days for which fewer than 12 statuses are known.
The goal is to avoid over-representing a few statuses during periods when the
bridge directory authority had trouble producing a status for at least half of
the day.</p>
<h3 id="consensus-weight" class="hover">Consensus weight
@@ -483,12 +459,7 @@ We consider a relay with the <code>"Guard"</code> flag as guard and a relay with
<p>In order to compute these averages, first match consensus entries with
server descriptors by SHA-1 digest.
Every consensus entry references exactly one server descriptor, and a server
descriptor may be referenced from an arbitrary number of consensus entries.
-We handle missing server descriptors differently in the graphs covered in this
section:</p>
-
-<ul>
-<li><a href="/bandwidth.html">Total relay bandwidth</a> and <a
href="/bandwidth-flags.html">Advertised and consumed bandwidth by relay
flag</a>: If a referenced server descriptor is missing, we also skip the
consensus entry. We are aware that this is slightly wrong, because we should
rather exclude a consensus with too few matching server descriptors from the
overall result than including it with an advertised bandwidth sum that is too
low.</li>
-<li><a href="/advbw-ipv6.html">Advertised bandwidth by IP version</a>: If at
least 0.1% of referenced server descriptors are missing, we skip the consensus.
We chose this threshold as low, because missing server descriptors may easily
skew the results. However, a small number of missing server descriptors per
consensus is acceptable and also unavoidable.</li>
-</ul>
+If at least 0.1% of referenced server descriptors are missing, we skip the
consensus. We chose this threshold as low, because missing server descriptors
may easily skew the results. However, a small number of missing server
descriptors per consensus is acceptable and also unavoidable.</p>
<p>Go through all previously processed consensuses by valid-after UTC date.
Compute the arithmetic mean of advertised bandwidth as the sum of all
advertised bandwidth values divided by the number of consensuses.
@@ -497,7 +468,7 @@ Round down to the next integer number.</p>
<p>Break down numbers by guards and/or exits by taking into account which <a
href="/glossary.html#relay-flag">relay flags</a> a consensus entry had that
referenced a server descriptor.</p>
<p>Skip the last day of the results if it matches the current UTC date,
because those averages may still change throughout the day.
-For the <a href="/advbw-ipv6.html">Advertised bandwidth by IP version</a>
graph we further skip days for which fewer than 12 consensuses are known.
+Further skip days for which fewer than 12 consensuses are known.
The goal is to avoid over-representing a few consensuses during periods when
the directory authorities had trouble producing a consensus for at least half
of the day.</p>
<h4>Step 4: Compute ranks and percentiles</h4>
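
Illustration (not part of the patch): the thresholds this patch documents can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions, not the Tor Metrics code. The function names, constant names, and the (running, referenced, missing) tuple layout are hypothetical. It also assumes that "fewer than 12 consensuses are known" refers to the documents that remain after the other skip rules, and that the zero-running rule from the relay section is applied to every document.

```python
import math
from typing import Optional, Sequence, Tuple

# Thresholds taken from the patched text; the constant names are hypothetical.
MIN_FRAC = 0.10            # below this, frac is too low for a robust estimate
MAX_FRAC = 1.10            # above this, frac indicates an issue in the previous step
REQUESTS_PER_USER = 10     # assumed average number of requests per client per day
MIN_DOCS_PER_DAY = 12      # skip days with fewer consensuses/statuses than this


def estimate_users(requests: float, frac: float) -> Optional[int]:
    """Return r(N) = floor(r(R) / frac / 10), or None if the date is skipped."""
    if frac < MIN_FRAC or frac > MAX_FRAC:
        return None
    return math.floor(requests / frac / REQUESTS_PER_USER)


def daily_average(docs: Sequence[Tuple[int, int, int]]) -> Optional[int]:
    """Average running servers over one UTC day, rounded down.

    Each element of docs is (running, referenced, missing) for one consensus
    or status; this layout is an assumption made for the sketch.
    """
    kept = [
        running
        for running, referenced, missing in docs
        # Skip documents with zero running servers, and documents where at
        # least 0.1% of referenced descriptors are missing (integer math
        # avoids floating-point edge cases at exactly 0.1%).
        if running > 0 and missing * 1000 < referenced
    ]
    # Assumption: the 12-document minimum is checked after the skips above.
    if len(kept) < MIN_DOCS_PER_DAY:
        return None
    return sum(kept) // len(kept)
```

For example, estimate_users(1000, 0.5) yields 200, while a frac of 0.05 or 1.2 causes the date to be skipped (None).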