On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohde<[email protected]> wrote:
> On Fri, Jun 5, 2009 at 6:38 PM, Tim Starling<[email protected]> wrote:
>> Peter Gervai wrote:
>>> Is there a possibility to write a code which process raw squid data?
>>> Who do I have to bribe? :-/
>>
>> Yes it's possible. You just need to write a script that accepts a log
>> stream on stdin and builds the aggregate data from it. If you want
>> access to IP addresses, it needs to run on our own servers with only
>> anonymised data being passed on to the public.
>>
>> http://wikitech.wikimedia.org/view/Squid_logging
>> http://wikitech.wikimedia.org/view/Squid_log_format
>>
>
> How much of that is really considered private? IP addresses
> obviously, anything else?
>
> I'm wondering if a cheap and dirty solution (at least for the low
> traffic wikis) might be to write a script that simply scrubs the
> private information and makes the rest available for whatever
> applications people might want.
There is a lot of private data in user agents ("MSIE 4.123; WINNT 4.0;
bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34" may be
uniquely identifying). There is even private data in titles if you
don't sanitize carefully
(/wiki/search?lookup=From%20rarohde%20To%20Gmaxwell%20OMG%20secret%20stuff%20lemme%20accidently%20paste%20it%20into%20the%20search%20box).
There is private data in referrers
(http://rarohde.com/url_that_only_rarohde_would_have_comefrom).
Things which individually do not appear to disclose anything private
can disclose private things (look at the people uniquely identified by
AOL's 'anonymized' search data).
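To make the scrubbing idea concrete, here is a minimal sketch of what such a scrubber might look like. The tab-delimited field layout below is made up for illustration; the real layout is documented on the Squid_log_format page, and a real scrubber would have to match it exactly. The key moves are dropping the IP, user agent, and referrer fields outright, and stripping query strings from URLs (search-box paste accidents live there):

```python
from urllib.parse import urlsplit

def scrub(line):
    # Hypothetical field order: timestamp, ip, status, url, referrer, user_agent.
    # The actual squid log layout differs; adjust to the real format.
    timestamp, ip, status, url, referrer, user_agent = line.split("\t")
    parts = urlsplit(url)
    # Drop the query string entirely: search terms and form data leak there.
    safe_url = parts.scheme + "://" + parts.netloc + parts.path
    # Emit only the fields that don't identify the visitor.
    return "\t".join([timestamp, status, safe_url])
```

Even this is only a first pass: as noted above, page titles themselves can carry private data, so a cautious version would also restrict output to titles of existing articles.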
On the flip side, aggregation can take private things (e.g. user
agents, IP info, referrers) and convert them to non-private data: top
user agents, top referrers, highest-traffic ASNs... but it becomes
potentially revealing if not done carefully: the 'top' network and
user agent info for a single obscure article in a short time window
may come from only one or two users, which is not really an
aggregation.
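One way to guard against that failure mode is to refuse to release any bucket backed by fewer than k distinct users. A sketch (the threshold of 10 is an arbitrary placeholder, and `top_user_agents` is a made-up helper, not anything that exists today):

```python
from collections import Counter

K = 10  # minimum distinct users before a bucket is released; needs tuning

def top_user_agents(records, k=K):
    """records: iterable of (user_id, user_agent) pairs.

    Release only user agents seen from at least k distinct users, so a
    'top' list for an obscure article can't single out one or two people.
    """
    seen = {}  # user_agent -> set of user ids observed with it
    for user, ua in records:
        seen.setdefault(ua, set()).add(user)
    counts = Counter({ua: len(users) for ua, users in seen.items()
                      if len(users) >= k})
    return counts.most_common()
```

The same shape works for referrers or ASNs; the point is that the cutoff applies per released bucket, not to the dataset as a whole.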
Things like common paths through the site should be safe so long as
they are not provided with too much temporal resolution, are limited
to existing articles, and either cover only really common paths or
break paths into two- or three-node chains and skip releasing the
least common of those.
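The chain-breaking approach might look like this. Everything here is illustrative: the function name, the session representation (a list of click paths), and the release floor of 50 are all assumptions:

```python
from collections import Counter

MIN_COUNT = 50  # release floor for a chain; the right cutoff needs tuning

def safe_path_segments(sessions, existing_articles, min_count=MIN_COUNT):
    """Break each session's click path into two-node chains (bigrams),
    keep only hops between existing articles, and drop rare chains so
    an unusual route can't identify its one traveller."""
    bigrams = Counter()
    for path in sessions:
        path = [p for p in path if p in existing_articles]
        for a, b in zip(path, path[1:]):
            bigrams[(a, b)] += 1
    return {pair: n for pair, n in bigrams.items() if n >= min_count}
```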
Generally, when dealing with private data you must approach it with
the same attitude a C coder must take to avoid buffer overflows:
treat all data as hostile, and assume all actions are potentially
dangerous. Try to figure out how to break it, and think deviously.
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l