On 20/09/16 15:43, Aaron Johnson wrote:
>> Good thinking! I summarized the methodology on the graph page
>> as: The graph above is based on sanitized Tor web server logs.
>> These are a stripped-down version of Apache's "combined" log
>> format without IP addresses, log times, HTTP parameters,
>> referers, and user agent strings.
>> If you spot anything in the data that you think should be
>> sanitized more thoroughly, please let us know!
> Interesting, thanks. Here are some thoughts based on looking
> through one of these logs (from archeotrichon.torproject.org
> on 2015-09-20):
> 1. The order of requests appears to be preserved. If so, this
> allows an adversary to determine fine-grained timing information
> by inserting requests of his own at known times.
Log files are sorted as part of the sanitizing procedure, so that
request order should not be preserved. If you find a log file that is
not sorted, please let us know, because that would be a bug.
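To illustrate what sorting buys us, here is a minimal sketch. It is not
the actual sanitizer code, and the function names are made up for this
example. Once lines are ordered by content alone, arrival order is gone,
and anyone can check a published file for the bug mentioned above.

```python
def sort_sanitized_lines(lines):
    """Return log lines in lexicographic order, discarding arrival order."""
    # Strip trailing newlines so that ordering depends only on line content.
    return sorted(line.rstrip("\n") for line in lines)


def is_sorted(lines):
    """Return True if no line sorts before its predecessor."""
    stripped = [line.rstrip("\n") for line in lines]
    return all(a <= b for a, b in zip(stripped, stripped[1:]))
```

Running is_sorted() over a downloaded log file is a quick way to spot a
file that slipped through unsorted.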
> 2. The size of the response is included, which potentially allows
> an adversary observing the client side to perform a correlation
> attack (combined with #1 above). This could allow the adversary to
> learn interesting things like (i) this person is downloading arm
> and thus is probably running a relay or (ii) this person is
> creating Trac tickets with onion-service bugs and is likely running
> an onion service somewhere (or is Trac excluded from these logs?).
> The size could also be used as a time-stamping mechanism
> alternative to #1 if the size of the request can be changed by the
> adversary (e.g. by blog comments).
This seems like less of a problem given that request order is not
preserved.
And actually, the logged size is the size of the object on the server,
not the number of bytes written to the client. Even if these sizes
were scrubbed, it would be quite easy for an attacker to find out most
of these sizes by simply requesting objects themselves. On the other
hand, not including them would make some analyses unnecessarily hard.
I'd say it's reasonable to keep them.
> 3. Even without fine-grained timing information, daily per-server
> logs might include data from few enough clients that multiple
> requests can be reasonably inferred to be from the same client,
> which can collectively reveal lots of information (e.g. country
> based on browser localization used, platform, blog posts
> viewed/commented on if the blog server also releases logs).
We're removing almost all user data from request logs and only
preserving data about the requested object. For example, we're
throwing away user agent strings and request parameters. I don't
really see the problem you're describing here.
> I also feel compelled to raise the question of whether or not
> releasing these logs went through Tor’s own recommended procedure
> for producing data on its users
Git history says that those guidelines were put up in April 2016
whereas the rewrite of the web server log sanitizing code happened in
November 2015, with the original sanitizing process being written in
December 2011. So, no, we didn't go through that procedure yet, but
let's do that now:
> • Only collect data that is safe to make public.
We're only using data after making it public, so we're not collecting
anything that we think wouldn't be safe to make public.
> • Don't collect data you don't need (minimization).
I can see us using sanitized web logs from all Tor web servers, not
limited to Tor Browser/Tor Messenger downloads and Tor main website
hits. I used these logs to learn whether Atlas or Globe had more
users, and I just recently looked at Metrics logs to see which graphs
are requested most often.
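As a rough sketch of that kind of analysis: the snippet below counts
requests per path. It assumes each sanitized line still contains the
quoted request field ("GET /path HTTP/1.1"); the function name and the
regular expression are illustrative, not taken from any existing tool.

```python
import re
from collections import Counter

# Matches the quoted request field of a GET or HEAD request.
REQUEST = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*"')


def most_requested(lines, top=3):
    """Return the `top` most frequently requested paths with their counts."""
    counts = Counter()
    for line in lines:
        match = REQUEST.search(line)
        if match:
            counts[match["path"]] += 1
    return counts.most_common(top)
```

Feeding it the lines of a daily log file gives a ranked list of (path,
count) pairs.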
> • Take reasonable security precautions, e.g. about who has access
> to your data sets or experimental systems.
We're doing that. For example, I personally don't have access to
non-sanitized web logs, just to the sanitized ones like everyone else.
> • Limit the granularity of data (e.g. use bins or add noise).
We're throwing out time information and removing request order.
> • The benefits should outweigh the risks.
I'd say this is the case. As you say below yourself, there is value
in analyzing these logs, and I agree. I have also been thinking a lot
about possible risks, which resulted in the sanitizing procedure that
is in place. That procedure comes on top of the very restrictive
logging policy of Tor's Apache processes, which throws away client IP
addresses and other sensitive data right at the logging step. All in
all, yes, benefits do outweigh the risks here, in my opinion.
> • Consider auxiliary data (e.g. third-party data sets) when
> assessing the risks.
I don't see a convincing scenario where this data set would make a
third-party data set more dangerous.
> • Consider whether the user meant for that data to be private.
We're removing the user's IP address, request parameters, and user
agent string, and we're throwing out requests that resulted in a 404
or that used a method other than GET or HEAD. I can't see how a
user would have meant the remaining parts to be private.
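To make those rules concrete, here is a minimal sketch of the filtering
applied to one line in Apache's "combined" format. This is not the
actual sanitizer: the function name, the regular expression, and the
layout of the output line are assumptions made for illustration only.

```python
import re

# Apache "combined": host ident user [time] "request" status size "referer" "agent"
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def sanitize_line(line):
    """Return a stripped-down line, or None if the request is dropped."""
    match = COMBINED.match(line)
    if match is None:
        return None  # unparseable lines are discarded
    if match["method"] not in ("GET", "HEAD") or match["status"] == "404":
        return None  # drop 404s and methods other than GET or HEAD
    # Cut off request parameters; keep only the requested object.
    path = match["target"].split("?", 1)[0]
    # Host, time, referer, and user agent are deliberately not copied over.
    return f'"{match["method"]} {path} {match["proto"]}" {match["status"]} {match["size"]}'
```

Everything that identifies the client is left behind; what remains
describes only the requested object and the outcome of the request.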
> I definitely see the value of analyzing these logs, though, and it
> definitely helps that some sanitization was applied :-)
Glad to hear that.
We shall specify the sanitizing procedure in more detail as soon as
these logs are provided by CollecTor. I could imagine that we'll
write down the process similar to the bridge descriptor sanitizing
process. However, the current plan is to keep using the data provided
by webstats.torproject.org for the next 9 months while we're busy with
other things. Just saying, don't hold your breath.
> Best, Aaron
All the best,