Scrubbing log files to make the data private is hard work. You'd be impressed by what researchers have been able to do: taking purportedly anonymous data sets and identifying users en masse by correlating them with publicly available data from other sites such as Amazon, Facebook and Netflix. Make no mistake: if you don't do it carefully, you will become the target of, in the best of cases, an academic researcher who wants to prove that you don't understand statistics.
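To make the point concrete, here is a minimal sketch of the kind of "cheap and dirty" stdin scrubber discussed below. It is not the actual Wikimedia setup; it assumes a squid-native-style log where the client IP is the third whitespace-separated field (IP_FIELD is an assumption to adjust for the real log format), and it replaces the IP with a salted hash. Note that even this leaves a stable pseudonym per run, so the correlation attacks above still apply to everything it does NOT remove:

```python
#!/usr/bin/env python
"""Scrub client IPs from a squid-style access log read on stdin.

Assumes the client IP is the third whitespace-separated field (as in
squid's native log format); adjust IP_FIELD for other layouts.
"""
import hashlib
import os
import sys

IP_FIELD = 2           # zero-based index of the client-IP column (assumed)
SALT = os.urandom(16)  # per-run salt, discarded afterwards, so the hashes
                       # can't be reversed later by a dictionary attack

def scrub(line, salt=SALT):
    """Return the log line with the IP field replaced by a pseudonym."""
    fields = line.rstrip("\n").split()
    if len(fields) > IP_FIELD:
        digest = hashlib.sha256(salt + fields[IP_FIELD].encode()).hexdigest()
        fields[IP_FIELD] = digest[:12]  # same IP -> same pseudonym this run
    return " ".join(fields)

if __name__ == "__main__":
    for line in sys.stdin:
        print(scrub(line))
```

Because the pseudonym is stable within a run, per-IP aggregates (request counts, distinct pages) still work downstream, which is exactly why the output must still be treated with care.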
On Fri, Jun 5, 2009 at 8:13 PM, Robert Rohde <[email protected]> wrote:
> On Fri, Jun 5, 2009 at 6:38 PM, Tim Starling<[email protected]> wrote:
> > Peter Gervai wrote:
> >> Is there a possibility to write a code which process raw squid data?
> >> Who do I have to bribe? :-/
> >
> > Yes it's possible. You just need to write a script that accepts a log
> > stream on stdin and builds the aggregate data from it. If you want
> > access to IP addresses, it needs to run on our own servers with only
> > anonymised data being passed on to the public.
> >
> > http://wikitech.wikimedia.org/view/Squid_logging
> > http://wikitech.wikimedia.org/view/Squid_log_format
>
> How much of that is really considered private? IP addresses
> obviously, anything else?
>
> I'm wondering if a cheap and dirty solution (at least for the low
> traffic wikis) might be to write a script that simply scrubs the
> private information and makes the rest available for whatever
> applications people might want.
>
> -Robert Rohde
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
