On 8/25/11 10:08 AM, Karsten Loesing wrote:
> we have been discussing sanitizing and publishing our web server logs
> for quite a while now. The idea is to remove all potentially sensitive
> parts from the logs, publish them in monthly tarballs on the metrics
> website, and analyze them for top visited pages, top downloaded
> packages, etc. See the tickets #1641 and #2489 for details.
>
> Here's a suggested sanitizing procedure for our web logs, which are in
> Apache's combined log format:
>
> - Ignore everything except GET requests.
> - Ignore all requests that resulted in a 404 status code.
> - Rewrite log lines so that they only contain the following fields:
>   - IP address 0.0.0.0 for HTTP request or 0.0.0.1 for HTTPS requests
>     (as logged by our Apache configuration),
>   - the request date (with the time part set to 00:00:00),
>   - the requested URL (cut off at the first encountered "?"),
>   - the HTTP version,
>   - the server's HTTP status code, and
>   - the size of the returned object.
> - Write all lines from a given virtual host and day to a single output
>   file.
> - Sort the output file alphanumerically to conceal the original order
>   of requests.
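For illustration, the quoted procedure could be sketched in Python roughly as follows. This is only a sketch under assumptions, not necessarily the actual sanitizing script: the names sanitize_line and sanitize_log, the regular expression, the exact output layout, and the "+0000" timezone written after the zeroed time are all hypothetical. The sketch handles the log of a single virtual host and returns the sorted lines per day instead of writing files. As a defensive assumption, it also drops any line whose address field is not one of the two placeholder addresses.

```python
import re
from collections import defaultdict

# Hypothetical pattern for Apache's combined log format; only the
# fields that survive sanitizing are captured.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<date>[^:\]]+):[^\]]*\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<version>HTTP/[\d.]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

# Placeholder addresses: 0.0.0.0 for HTTP, 0.0.0.1 for HTTPS.
ALLOWED_IPS = {"0.0.0.0", "0.0.0.1"}


def sanitize_line(line):
    """Return (day, sanitized_line), or None if the line is dropped."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # unparseable lines are dropped (assumption)
    if m.group("ip") not in ALLOWED_IPS:
        return None  # never pass a real address through (assumption)
    if m.group("method") != "GET":
        return None  # ignore everything except GET requests
    if m.group("status") == "404":
        return None  # ignore requests that resulted in a 404
    url = m.group("url").split("?", 1)[0]  # cut off at the first "?"
    day = m.group("date")
    sanitized = '%s - - [%s:00:00:00 +0000] "GET %s %s" %s %s' % (
        m.group("ip"), day, url, m.group("version"),
        m.group("status"), m.group("size"))
    return day, sanitized


def sanitize_log(lines):
    """Group sanitized lines by day and sort each group."""
    by_day = defaultdict(list)
    for line in lines:
        result = sanitize_line(line)
        if result is not None:
            day, sanitized = result
            by_day[day].append(sanitized)
    # Sorting conceals the original order of requests within a day.
    return {day: sorted(entries) for day, entries in by_day.items()}
```

Each value of the returned dictionary would then be written to one output file per virtual host and day.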
Pushing this forward.

Here are the sanitized web logs that we'd like to publish on a daily
basis for all our web servers and virtual domains for all of 2010 (155M):

http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-01.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-02.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-03.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-04.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-05.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-06.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-07.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-08.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-09.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-10.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-11.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-12.tar

The webalizer output for www.torproject.org can be viewed here:

http://freehaven.net/~karsten/volatile/www.torproject.org-webalizer/

So. Is it safe to publish these logs on a daily basis? The same
questions from my original mail apply here:

> Is there still anything sensitive in that log file that we should
> remove? For example:
> - Do the logs reveal how many pages were cached already on the
>   requestor's site (e.g. as repeat accesses)? Note that log files are
>   sorted before being published.
> - Are there other concerns about making these sanitized log files
>   publicly available?
>
> Are the decisions to remove parts from the logs reasonable? In particular:
> - Do we have to take out all requests with 404 status codes? Some of
>   these requests for non-existing URLs contain typos which may not be safe
>   to make public. Should we instead put in some placeholder for the URL
>   part and keep the 404 lines to know how many 404's we have per day?
> - Is there any good reason to keep the portion of a URL after a "?"?
> - Is it possible to leave some part of Referers in the logs that helps
>   us figure out where our traffic originates and what search terms people
>   use to find us?
> - Can we resolve client IP addresses to country codes and include those
>   in the logs instead of our 0.0.0.0/0.0.0.1 code for HTTP/HTTPS? How
>   would we handle countries with only a few users per day, e.g., should
>   there be a threshold below which we consider requests to come from "a
>   country with less than XY users?"

The next steps will be to make these sanitized logs available on a daily
basis and to publish the sanitized archives from 2008, 2009, and 2011.
I'm going to wait another week (probably longer) for feedback before
taking these next steps.

Best,
Karsten
