Hi George,

I’m glad you’re putting serious thought into these stats. I’ll give you my 
perspective on some of the issues you raise.

> I will now enumerate the stats that Aaron considers interesting and
> low-hanging-fruit:

I should mention that all of these came from a list that Roger suggested, so
you might ask him for further thoughts.

> This time I'm going to put extra focus on how to use these statistics
> and _what questions they help us answer_. If these stats don't help us
> answer any interesting questions, they are not that useful.

I think that, overall, many statistics are useful just to check for abuse,
misconfiguration, or bugs. If a statistic is way out of line with what we would
expect, especially when compared to other statistics, that would reveal
unexpected and potentially problematic behavior.

> Also, this
> time we should have an *exact strategy* on how to use specific stats
> to derive the results we want, so that we don't spend 2 months after
> we write the code to figure out how to do extrapolations.

I agree that it is important to be confident that we can use the data that we
collect. Paul and I actually went through many of the desired statistics early
on (during the kickoff meeting in mid-September), sketching out how
extrapolation would work. I attached that document to Trac ticket #13509,
although it may be hard to follow.

>> (1) Number of descriptor updates (total count and distribution) (Sec. 4.2.4)
...
> I'm not yet convinced this is a useful stat. What is its use and which
> *questions* would it help us answer?

In addition to revealing whether somebody is sending way too many updates, it
would help us understand the general level of churn among hidden services. Are
there lots of short-lived services?

> I'm assuming that we would total count here, since revealing the exact
> distribution could leak information about specific hidden services.

I believe that the distribution can be revealed to some extent safely. You 
choose a small number of bins chopping up the possible numbers of updates, and 
then publish the counts for each bin in the same way that you would publish a 
single overall count. The details are in the stats tech report.
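
To make the binning idea concrete, here is a minimal sketch in Python. The bin
edges, the function names, and the `obfuscate` callback are all made up for
illustration; they are not taken from the tech report or from the tor code.

```python
from bisect import bisect_right

# Hypothetical bin edges for "number of descriptor updates per service".
# They define the bins [0,1), [1,5), [5,25), and [25, infinity).
BIN_EDGES = [1, 5, 25]

def bin_update_counts(updates_per_service, edges=BIN_EDGES):
    """Count how many services fall into each bin of update counts."""
    counts = [0] * (len(edges) + 1)
    for n in updates_per_service:
        # bisect_right gives the index of the bin that contains n.
        counts[bisect_right(edges, n)] += 1
    return counts

def publish_histogram(counts, obfuscate):
    """Treat every bin like a single overall count before publishing.

    `obfuscate` stands in for whatever rounding-and-noise step is
    already applied to a single count (an assumption of this sketch).
    """
    return [obfuscate(c) for c in counts]
```

Only the per-bin totals would ever leave the relay; the per-service numbers
passed to `bin_update_counts` stay local.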

> Also, this is related to the "Number of unique HSes per HSDir"
> statistic that we are already doing. This means, that we can do the
> division and arrive to "Average number of descriptor updates per HS".
> I'm not sure if I like this, since there are *specific* HSes
> corresponding to each HSDir. Are we sure that there are not edge-cases
> that this can be exploited to learn their uptime? I'm not.

I do think that if you know of a specific HS, then you can watch the descriptor 
update stats from its HSDir over time and gradually learn about how many times 
that HS updates its descriptors. But if you know of a specific HS, you can do 
that anyway simply by fetching the descriptors. Thus this doesn’t seem like a 
problem to me.

>> (2) Number of RPs established on relays
...
> OK, I can see how this stat would give us the number of "connection
> attempts there are by clients to services that are running". Is this a
> number we are interested in? I guess so maybe.

I think this is very interesting. How much traffic tends to flow over a typical 
HS circuit? Are there a huge number of established RPs relative to the amount 
of traffic (this could indicate either DoS or botnet clients)? Do clients make 
lots of little connections or fewer large ones?

>> Number of circuits using TAP and nTor
...
> This statistic can reveal other information too since it's basically a
> circuit count. For example, if you count and publish the number of
> circuits containing ESTABLISH_INTRO, you get the "Number of IPs
> established on the network" statistic.  If you count and publish the
> number of circuits containing ESTABLISH_RENDEZVOUS, you get the
> "Number of RPs established on relays" statistic I discussed in the
> previous section.

Agreed.

> Also, why do we care how many hidden services are using older versions
> of Tor? And why do we care how many clients are using older versions
> of Tor?  Is this to specifically detect botnet activity?

Roger has mentioned this a couple of times, both times in the context of
identifying botnet activity. More generally, I think it would help Tor to
understand the distribution of software versions in active use among clients
and HSes. That would let them better target upgrade efforts where needed to
improve user security, and it could reveal when older versions have fallen out
of use and can be safely end-of-lifed.

> Also, why do this just for hidden services? 

It is especially interesting for HSes because it would help figure out how much
HS activity is from botnets. I agree that it is interesting more generally as
well.

>> Number of descriptors with encrypted introduction points
...
> This seems like a stat that would answer a very concete question "How
> many hidden services are using authorization currently?".
> 
> Answering this question seems useful for evaluating the user base and
> popularity of this feature.

Yes, agreed. Among other things, this could help direct Tor to improve the 
usability of such a feature.

> However, I'm not sure if I want to learn this information at
> all. People who use hidden service authorization are cautious users,
> and it seems weird to count them like this. It might be okay if there
> are 10000 of these hidden services, but if there are only 100, I
> wouldn't want to out them like this. More thinking required.

I agree that no individual service should be revealed. That is why we would 
round and add noise as usual. That would hide the existence of any small number 
of services (we have used 8 for similar purposes).
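
A minimal sketch of "round and add noise" in Python follows. It assumes that
the 8 serves as both the rounding bin size and the noise sensitivity, that
counts are rounded up, and that the noise is Laplace; the epsilon value and all
names are illustrative and not taken from this thread.

```python
import random

def round_up(count, bin_size=8):
    """Round a count up to the next multiple of bin_size."""
    return -(-count // bin_size) * bin_size

def obfuscate_count(true_count, bin_size=8, epsilon=0.3, rng=random):
    """Round up, then add Laplace noise with scale bin_size / epsilon.

    epsilon=0.3 is an arbitrary illustrative value (an assumption).
    The result can be negative when the true count is small.
    """
    scale = bin_size / epsilon
    # The difference of two i.i.d. exponential samples with mean `scale`
    # is a Laplace(0, scale) sample.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return round(round_up(true_count, bin_size) + noise)
```

With these assumptions, any true count from 1 to 8 rounds to the same value
before the noise is added, so a handful of services cannot be told apart from
a single one.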

Cheers,
Aaron
_______________________________________________
tor-dev mailing list
[email protected]
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
