[tor-commits] [metrics-web/master] Update Reproducible Metrics document.

karsten Fri, 30 Nov 2018 08:04:15 -0800

commit d4b17e906f55b553ac16bcf902967157e07a234d
Author: Karsten Loesing <[email protected]>
Date:   Fri Nov 30 16:54:17 2018 +0100


    Update Reproducible Metrics document.
    
    Reflects code changes made in #28116 and #28305.
---
 .../resources/web/jsps/reproducible-metrics.jsp    | 51 +++++-----------------
 1 file changed, 11 insertions(+), 40 deletions(-)

diff --git a/src/main/resources/web/jsps/reproducible-metrics.jsp b/src/main/resources/web/jsps/reproducible-metrics.jsp
index 939b42e..b6df6c3 100644
--- a/src/main/resources/web/jsps/reproducible-metrics.jsp
+++ b/src/main/resources/web/jsps/reproducible-metrics.jsp
@@ -15,15 +15,6 @@
 
 <div class="container">
 
-<div class="panel panel-danger">
-<div class="panel-heading">
-<h5 class="panel-title">Work in progress notice</h5>
-</div>
-<div class="panel-body">
-<p>As of July 2018, this page is still a work in progress. Handle with 
care!</p>
-</div>
-</div>
-
 <h1>Reproducible Metrics
 <a href="#reproducible-metrics" name="reproducible-metrics" 
class="anchor">#</a></h1>
 
@@ -103,7 +94,7 @@ Split observations to the covered UTC dates by assuming a linear distribution of
 <h4>Step 3: Estimate fraction of reported directory-request statistics</h4>
 
 <p>The next step after parsing descriptors is to estimate the fraction of 
reported directory-request statistics on a given day.
-This fraction, a value between <var>0%</var> and <var>100%</var>, will be used 
in the next step to extrapolate observed request numbers to expected network 
totals.
+This fraction will be used in the next step to extrapolate observed request 
numbers to expected network totals.
 For further background on the following calculation method, refer to the 
technical report titled <a 
href="https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf">"Counting
 daily bridge users"</a> which also applies to relay users.
 In the following, we're using the term server instead of relay or bridge, 
because the estimation method is exactly the same for relays and bridges.</p>
 
@@ -139,7 +130,7 @@ This approach also works with <var>r(R)</var> being the sum of requests from <em
 <pre>r(N) = floor(r(R) / frac / 10)</pre>
 
 <p>A client that is connected 24/7 makes about 15 requests per day, but not 
all clients are connected 24/7, so we picked the number 10 for the average 
client. We simply divide directory requests by 10 and consider the result as 
the number of users. Another way of looking at it, is that we assume that each 
request represents a client that stays online for one tenth of a day, so 2 
hours and 24 minutes.</p>
-<p>Skip dates where <var>frac</var> is smaller than 10% and hence too low for 
a robust estimate, or where <var>frac</var> is greater than 100%, which would 
indicate an issue in the previous step.</p>
+<p>Skip dates where <var>frac</var> is smaller than 10% and hence too low for 
a robust estimate. Also skip dates where <var>frac</var> is greater than 110%, 
which would indicate an issue in the previous step. We picked 110% as upper 
bound, not 100%, because there can be relays reporting statistics that 
temporarily didn't make it into the consensus, and we accept up to 10% of those 
additional statistics. However, there needs to be some upper bound to exclude 
obvious outliers with fractions of 120%, 150%, or even 200%.</p>
 
 <h4>Step 5: Compute ranges of expected clients per day to detect potential 
censorship events</h4>
 
@@ -278,14 +269,12 @@ Refer to the <a href="https://gitweb.torproject.org/torspec.git/tree/dir-spec.tx
 <li>Relay flags: Parse relay flags from the <code>"s"</code> line. If there is 
no <code>"Running"</code> flag, skip this consensus entry. This ensures that we 
only consider running relays. Also parse any other relay flags from the 
<code>"s"</code> line that the relay had assigned.</li>
 </ul>
 
-<p>If a consensus contains zero running relays, we skip it in the <a 
href="/relays-ipv6.html">Relays by IP version</a> graph, but not in the other 
graphs (simply because we didn't get around to changing those graphs).
+<p>If a consensus contains zero running relays, we skip it.
 This is mostly to rule out a rare edge case when only a minority of <a 
href="/glossary.html#directory-authority">directory authorities</a> voted on 
the <code>"Running"</code> flag.
 In those cases, such a consensus would skew the average, even though relays 
were likely running.</p>
 
 <h4>Step 2: Parse relay server descriptors</h4>
 
-<p>Parsing relay server descriptors is an optional step. You only need to do 
this if you want to break down the number of running relays by something that 
relays report in their server descriptors. This includes, among other things, 
the relay's platform string containing tor software version and operating 
system and whether the relay announced an IPv6 OR address or permitted exiting 
to IPv6 targets.</p>
-
 <p>Obtain relay server descriptors from <a 
href="/collector.html#type-server-descriptor">CollecTor</a>.
 Again, refer to the <a 
href="https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt">Tor 
directory protocol, version 3</a> for details on the descriptor format.</p>
 
@@ -307,21 +296,16 @@ If the platform line is missing, we skip this descriptor, which later leads to n
 
 <h4>Step 3: Compute daily averages</h4>
 
-<p>Optionally, match consensus entries with server descriptors by SHA-1 digest.
+<p>Match consensus entries with server descriptors by SHA-1 digest.
 Every consensus entry references exactly one server descriptor, and a server 
descriptor may be referenced from an arbitrary number of consensus entries.
-We handle missing server descriptors differently in the graphs covered in this 
section:</p>
-
-<ul>
-<li><a href="/versions.html">Relays by tor version</a> and <a 
href="/platforms.html">Relays by platform</a>: If a referenced server 
descriptor is missing, we also skip the consensus entry. We are aware that this 
is slightly wrong, because we should either exclude a consensus with too few 
matching server descriptors from the overall result, or at least count these 
relays as unknown tor version or unknown platform.</li>
-<li><a href="/relays-ipv6.html">Relays by IP version</a>: If at least 0.1% of 
referenced server descriptors are missing, we skip the consensus. We chose this 
threshold as low, because missing server descriptors may easily skew the 
results. However, a small number of missing server descriptors per consensus is 
acceptable and also unavoidable.</li>
-</ul>
+If at least 0.1% of referenced server descriptors are missing, we skip the 
consensus. We chose this threshold as low, because missing server descriptors 
may easily skew the results. However, a small number of missing server 
descriptors per consensus is acceptable and also unavoidable.</p>
 
 <p>Go through all previously processed consensuses by valid-after UTC date.
 Compute the arithmetic mean of running relays, possibly broken down by relay 
flag, tor version, platform, or IPv6 capabilities, as the sum of all running 
relays divided by the number of consensuses.
 Round down to the next integer number.</p>
 
 <p>Skip the last day of the results if it matches the current UTC date, 
because those averages may still change throughout the day.
-For the <a href="/relays-ipv6.html">Relays by IP version</a> graph we further 
skip days for which fewer than 12 consensuses are known. The goal is to avoid 
over-representing a few consensuses during periods when the directory 
authorities had trouble producing a consensus for at least half of the day.</p>
+Further skip days for which fewer than 12 consensuses are known. The goal is 
to avoid over-representing a few consensuses during periods when the directory 
authorities had trouble producing a consensus for at least half of the day.</p>
 
 <h3 id="running-bridges" class="hover">Running bridges
 <a href="#running-bridges" class="anchor">#</a>
@@ -360,9 +344,6 @@ This timestamp is used to uniquely identify the status while processing, and the
 
 <h4>Step 2: Parse bridge server descriptors.</h4>
 
-<p>Parsing bridge server descriptors is an optional step. You only need to do 
this if you want to break down the number of running bridges by something that 
bridges report in their server descriptors.
-This includes, among other things, whether the bridge announced an IPv6 OR 
address.</p>
-
 <p>Obtain bridge server descriptors from <a 
href="/collector.html#type-bridge-server-descriptor">CollecTor</a>.
 As above, refer to the <a href="/bridge-descriptors.html">Tor bridge 
descriptors page</a> for details on the descriptor format.</p>
 
@@ -375,21 +356,16 @@ As above, refer to the <a href="/bridge-descriptors.html">Tor bridge descriptors
 
 <h4>Step 3: Compute daily averages</h4>
 
-<p>Optionally, match status entries with server descriptors by SHA-1 digest.
+<p>Match status entries with server descriptors by SHA-1 digest.
 Every status entry references exactly one server descriptor, and a server 
descriptor may be referenced from an arbitrary number of status entries.
 If at least 0.1% of referenced server descriptors are missing, we skip the 
status.
 We chose this threshold as low, because missing server descriptors may easily 
skew the results.
 However, a small number of missing server descriptors per status is acceptable 
and also unavoidable.</p>
 
-<p>We compute averages differently in the graphs covered in this section:</p>
-
-<ul>
-<li><a href="/networksize.html">Relays and bridges</a>: For each bridge 
authority, compute the arithmetic mean of running bridges as the sum of all 
running bridges divided by the number of statuses; sum up averages for all 
bridge authorities per day and round down to the next integer number.</li>
-<li><a href="/bridges-ipv6.html">Bridges by IP version</a>: Compute the 
arithmetic mean of running bridges as the sum of all running bridges divided by 
the number of statuses and round down to the next integer number. We are aware 
that this approach does not correctly reflect that bridges typically register 
at a single bridge authority only.</li>
-</ul>
+<p>Compute the arithmetic mean of running bridges as the sum of all running 
bridges divided by the number of statuses and round down to the next integer 
number. We are aware that this approach does not correctly reflect that bridges 
typically register at a single bridge authority only.</p>
 
 <p>Skip the last day of the results if it matches the current UTC date, 
because those averages may still change throughout the day.
-For the <a href="/bridges-ipv6.html">Bridges by IP version</a> graph we 
further skip days for which fewer than 12 statuses are known.
+Further skip days for which fewer than 12 statuses are known.
 The goal is to avoid over-representing a few statuses during periods when the 
bridge directory authority had trouble producing a status for at least half of 
the day.</p>
 
 <h3 id="consensus-weight" class="hover">Consensus weight
@@ -483,12 +459,7 @@ We consider a relay with the <code>"Guard"</code> flag as guard and a relay with
 
 <p>In order to compute these averages, first match consensus entries with 
server descriptors by SHA-1 digest.
 Every consensus entry references exactly one server descriptor, and a server 
descriptor may be referenced from an arbitrary number of consensus entries.
-We handle missing server descriptors differently in the graphs covered in this 
section:</p>
-
-<ul>
-<li><a href="/bandwidth.html">Total relay bandwidth</a> and <a 
href="/bandwidth-flags.html">Advertised and consumed bandwidth by relay 
flag</a>: If a referenced server descriptor is missing, we also skip the 
consensus entry. We are aware that this is slightly wrong, because we should 
rather exclude a consensus with too few matching server descriptors from the 
overall result than including it with an advertised bandwidth sum that is too 
low.</li>
-<li><a href="/advbw-ipv6.html">Advertised bandwidth by IP version</a>: If at 
least 0.1% of referenced server descriptors are missing, we skip the consensus. 
We chose this threshold as low, because missing server descriptors may easily 
skew the results. However, a small number of missing server descriptors per 
consensus is acceptable and also unavoidable.</li>
-</ul>
+If at least 0.1% of referenced server descriptors are missing, we skip the 
consensus. We chose this threshold as low, because missing server descriptors 
may easily skew the results. However, a small number of missing server 
descriptors per consensus is acceptable and also unavoidable.</p>
 
 <p>Go through all previously processed consensuses by valid-after UTC date.
 Compute the arithmetic mean of advertised bandwidth as the sum of all 
advertised bandwidth values divided by the number of consensuses.
@@ -497,7 +468,7 @@ Round down to the next integer number.</p>
 <p>Break down numbers by guards and/or exits by taking into account which <a 
href="/glossary.html#relay-flag">relay flags</a> a consensus entry had that 
referenced a server descriptor.</p>
 
 <p>Skip the last day of the results if it matches the current UTC date, 
because those averages may still change throughout the day.
-For the <a href="/advbw-ipv6.html">Advertised bandwidth by IP version</a> 
graph we further skip days for which fewer than 12 consensuses are known.
+Further skip days for which fewer than 12 consensuses are known.
 The goal is to avoid over-representing a few consensuses during periods when 
the directory authorities had trouble producing a consensus for at least half 
of the day.</p>
 
 <h4>Step 4: Compute ranks and percentiles</h4>
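
Not part of the commit: a minimal Python sketch of the two rules this patch unifies, namely the frac bounds for user estimates (10% to 110%) and the filters for daily averages (skip a consensus with zero running relays or with at least 0.1% missing server descriptors, skip the current UTC date, skip days with fewer than 12 consensuses). The function names, the tuple layout, and the choice to count only consensuses that survive the filters toward the minimum of 12 are assumptions made for illustration; the actual metrics-web code may be structured differently.

```python
import math
from collections import defaultdict


def estimate_users(requests, frac):
    """Return floor(r(R) / frac / 10), or None if the date should be skipped.

    frac is a plain fraction (0.10 means 10%). Dates with frac below 10%
    or above 110% are skipped, as described in Step 4 of the document.
    """
    if frac < 0.10 or frac > 1.10:
        return None
    return math.floor(requests / frac / 10)


def daily_averages(consensuses, today, min_count=12):
    """Average running relays per valid-after UTC date.

    consensuses: iterable of (date, running, referenced, missing) tuples
    (a hypothetical layout), where `referenced` is the number of referenced
    server descriptors and `missing` is how many could not be found.
    """
    totals = defaultdict(lambda: [0, 0])  # date -> [sum of running, count]
    for date, running, referenced, missing in consensuses:
        if running == 0:
            continue  # consensus with zero running relays
        if missing * 1000 >= referenced:
            continue  # at least 0.1% of descriptors missing
        totals[date][0] += running
        totals[date][1] += 1

    result = {}
    for date, (total, count) in sorted(totals.items()):
        if date == today or count < min_count:
            continue  # day still in progress, or too few consensuses
        result[date] = total // count  # round down to an integer
    return result
```

The missing-descriptor check uses integer arithmetic (missing * 1000 >= referenced) so that the 0.1% threshold is not subject to floating-point rounding.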
