Hello tor-dev,

While helping design ways to publish statistics about hidden services in a 
privacy-preserving
manner, it has become clear to me that certain statistics cannot be safely 
reported using the
current  method of having each relay collect and report measurements. I am 
going to describe a
couple of simple protocols to handle this problem that I think should be 
implementable without much
effort. I'd be happy to get feedback in particular about the security or 
ease-of-implementation of
these protocols.

Two HS statistics that we (i.e. people working on Sponsor R) are interested in 
collecting are:
  1. The number of descriptor fetches received by a hidden-service directory 
(HSDir)
  2. The number of client introduction requests at an introduction points (IPs)
The privacy issue with #1 is that the set of HSDirs is (likely) unique to an 
HS, and so
the number of descriptor fetches at its HSDirs could reveal the number of 
clients it had during a
measurement period. Similarly, the privacy issue with #2 is that the set of IPs 
are (likely)
unique to an HS, and so the number of client introductions at its IPs could 
reveal the number of
client connections it received.

A approach to solve this problem would be to anonymize the reported statistics. 
Doing so raises
a couple of challenges, however:
  1. Anonymous statistics should be authenticated as coming from some relay. 
Otherwise, statistics
  could be polluted by any malicious actor.
  2. Statistical inference should be made robust to outliers. Without the relay 
identities, it will
  be difficult to detect and remove values that are incorrect, whether due to 
faulty measurement or
  malicious action by a relay.

I propose some simple cryptographic techniques to privately collect the above 
statistics while
handling the above challenges. I assume that there exists a set of Statistics 
Authorities
(StatAuths), of which at least one must be honest for the protocol to be 
secure, but all of which
can be "curious" (that is, we want to maintain privacy from them as well and 
allow their action to
be completely public). The Directory
Authorities could serve as StatAuths. A single server could as well, if you 
trust it to be honest-
but-curious. If you have a trusted non-curious server, things become much 
simpler, of course: just
have relays report their values to the server and then have it publish a global 
aggregate only.

The AnonStats1 protocol to privately publish both statistics if we trust relays 
not to pollute the
statistics (i.e. #2 is not a problem) is as follows:
  1. Each StatAuth provides 2k partially-blind signatures on authentication 
tokens to each relay in
  a consensus during the measurement period. The blind part of a signed token 
is simply a random
  number chosen by the relay. The non-blind part of a token consists of the 
start time of the
  current measurement period. The 2k tokens will allow the relay to submit k 
values to the
  StatAuths. Note that we could avoid using partially-blind signatures by 
changing keys at the
  StatAuths every measurement period and then simply providing blind signatures 
on random numbers.
  2. At the end of the measurement period, for each statistic, each relay 
uploads the following
  each on its own Tor circuit and accompanied by a unique token from each 
StatAuth:
    1. The count
    2. The ``statistical weight'' of the relay (1/(# HSDirs) for statistic #1 
and the probability of
    selection as an IP for statistic #2)
  3. The StatAuths verify that each uploaded value is accompanied by a unique 
token from each
  StatAuth that is valid for the current measurement period. To infer the 
global statistic from
  the anonymous per-relay statistic, the StatAuths add the counts, add the 
weights, and divide
  the former by the latter.

The AnonStats1 protocol is vulnerable to a relay that publishes manipulated 
statistics (i.e.
challenge #2). If this is a concern, the AnonStats2 protocol mitigates it by 
using a median
statistic intead of a sum:
  1. Relays are divided into b bins by consensus weight. The bins have 
exponentially-increasing
  length, and the rate of increase c is set such that the
  ith bin by increasing weights has at least r relays each of weight at least 
some minimum
  min_weight (say, r=500, min_weight=100). The first bin has weights in [0, 
min_weight), and the ith
  bin has weights in [min_weight*c^(i-2), min_weight*c^(i-1)).
  2. Each StatAuth provides k partially-blind signatures on authentication 
tokens to each relay in
  a consensus during the measurement period. The blind part of a signed token 
is simply a random
  number chosen by the relay. The non-blind part of a token consists of the 
start time of the
  current measurement period and the bin containing the relay's consensus 
weight.
  3. At the end of the measurement period, for each statistic, each relay 
divides each statistic
  by the relay's ``statistical weight'' (1/(# HSDirs) for statistic #1 and the 
probability of
  selection as an IP for statistic #2). The result is the relay's estimate of 
the global
  value of the statistic. The relay then uploads this value via a new Tor 
circuit, accompanied by a
  unique token from each StatAuth.
  4. The StatAuths verify that each uploaded value is accompanied by a unique 
token from each
  StatAuth that is valid for the current measurement period and that contains 
the same relay-weight
  bin. To infer a final global statistic from the anonymous per-relay 
estimates, the StatAuths
  use the weighted median of the received estimates, where the weight of an 
estimates is taken to be
  the smallest value of its accompanying bin (i.e. the bin's left edge).
_______________________________________________
tor-dev mailing list
[email protected]
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev

Reply via email to