Some articles are referred to very seldom, and those can be used to
uniquely identify a machine. Then there are all the readers who do
something that goes into public logs. The latter are very difficult to
obfuscate, but the former can be solved by setting a time frame long
enough that sufficient other traffic falls within the same window.
Unfortunately this time frame is pretty long for some articles; from
some tests it seems to be weeks on the Norwegian (Bokmål) Wikipedia.
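A minimal sketch of that windowing idea (Python; the threshold K and all
names are illustrative, not anything actually deployed): a (page, window)
bucket is only released once enough distinct clients have hit it.

```python
from collections import defaultdict

K = 10  # assumed minimum number of distinct clients before release

def releasable_buckets(views, window_seconds):
    """views: iterable of (timestamp, page, client_id) tuples.
    Return the (page, window) buckets safe to publish, i.e. those
    seen by at least K distinct clients within one time window."""
    clients = defaultdict(set)
    for ts, page, client in views:
        clients[(page, ts // window_seconds)].add(client)
    return {bucket for bucket, seen in clients.items() if len(seen) >= K}
```

For a seldom-read article the window has to grow until K distinct readers
fall into it, which is exactly why the safe window can stretch to weeks.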
John

Robert Rohde skrev:
> On Fri, Jun 5, 2009 at 9:20 PM, Gregory Maxwell<[email protected]> wrote:
>> On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohde<[email protected]> wrote:
>> There is a lot of private data in user agents ("MSIE 4.123; WINNT 4.0;
>> bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34" may be
>> uniquely identifying). There is even private data in titles if you
>> don't sanitize carefully
>> (/wiki/search?lookup=From%20rarohde%20To%20Gmaxwell%20OMG%20secret%20stuff%20lemme%20accidently%20paste%20it%20into%20the%20search%20box).
>>  There is private data in referrers
>> (http://rarohde.com/url_that_only_rarohde_would_have_comefrom).
>>
>> Things which individually do not appear to disclose anything private
>> can disclose private things (look at the people uniquely identified by
>> AOL's 'anonymized' search data).
>>
>> On the flip side, aggregation can take private things (i.e.
>> useragents; IP info; referrers) and convert it to non-private data:
>> Top user agents; top referrers; highest-traffic ASNs... but it becomes
>> potentially revealing if not done carefully: The 'top' network and
>> user agent info for a single obscure article in a short time window
>> may be information from only one or two users, not really an
>> aggregation.
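A sketch of the thresholded aggregation Maxwell describes (Python;
MIN_REQUESTS and MIN_SHARE are invented cut-offs for illustration):
publish user-agent counts only when the sample is large, and drop agents
seen too rarely to be a true aggregate.

```python
from collections import Counter

MIN_REQUESTS = 100  # assumed: below this, 'top' entries may be one person
MIN_SHARE = 0.01    # assumed: drop agents under 1% of traffic

def top_user_agents(agents):
    """agents: list of raw user-agent strings for one article/window.
    Return aggregate counts, or nothing if the sample is too small."""
    counts = Counter(agents)
    total = sum(counts.values())
    if total < MIN_REQUESTS:
        return {}
    return {ua: n for ua, n in counts.items()
            if n >= max(2, MIN_SHARE * total)}
```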
>>
>> Things like common paths through the site should be safe so long as
>> they are not provided with too much temporal resolution, are limited
>> to existing articles, and are limited either to really common paths
>> or to paths broken into two- or three-node chains, with the least
>> common of those withheld.
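The chain-splitting idea can be sketched like this (Python; MIN_COUNT is
an invented threshold): split each user's path into short n-node chains
and release only the chains seen often enough to be a genuine aggregate.

```python
from collections import Counter

MIN_COUNT = 50  # assumed: withhold chains seen fewer times than this

def common_chains(sessions, n=2):
    """sessions: list of per-user page sequences. Break each path into
    overlapping n-node chains and keep only the frequent ones."""
    counts = Counter()
    for path in sessions:
        for i in range(len(path) - n + 1):
            counts[tuple(path[i:i + n])] += 1
    return {chain: c for chain, c in counts.items() if c >= MIN_COUNT}
```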
>>
>> Generally when dealing with private data you must approach it with the
>> same attitude that a C coder must take to avoid buffer overflows.
>> Treat all data as hostile, assume all actions are potentially
>> dangerous. Try to figure out how to break it, and think deviously.
> 
> On reflection I agree with you, though I think the biggest problem
> would actually be a case you didn't mention.  If one provided timing
> and page view information, then one can almost certainly single out
> individual users by correlating the view timing with edit histories.
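That correlation attack is straightforward to sketch (Python; `delta` is
an arbitrary illustrative window): any view landing shortly before an
edit to the same page can often be tied to the editing account.

```python
def views_matching_edits(view_times, edit_times, delta=60):
    """Return the view timestamps (seconds) that fall within `delta`
    seconds before some edit; such views are likely the editor's own."""
    return [v for v in view_times
            if any(0 <= e - v <= delta for e in edit_times)]
```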
> 
> Okay, so no stripped logs.  The next question becomes what is the
> right way to aggregate.  We can A) reinvent the wheel, or B) adapt a
> pre-existing log analyzer in a mode to produce clean aggregate data.
> While I respect the work of Zachte and others, this might be a case
> where B is a better near-term solution.
> 
> Looking at http://stats.wikipedia.hu/cgi-bin/awstats.pl (the page that
> started this mess), his AWStats config already suppresses IP info and
> aggregates everything into groups from which it is very hard to
> identify anything personal.  (There is still a small risk in allowing
> users to drill down to pages / requests that are almost never made,
> but perhaps that could be turned off.)  AWStats has native support for
> Squid logs and is open source.
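For reference, the sort of awstats.conf lines involved might look like
the following. The directive names are from the stock AWStats
configuration file, but the values here are illustrative and should be
checked against the AWStats documentation before relying on them:

```
LogType=W            # web server log
LogFormat=1          # NCSA combined; AWStats also understands Squid logs
DNSLookup=0          # never resolve visitor IPs
ShowHostsStats=0     # suppress the per-host (per-IP) report entirely
```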
> 
> This is not necessarily the only option, but I suspect that if we gave
> it some thought it would be possible to find an off-the-shelf tool
> that would be good enough to support many wikis and configurable
> enough to satisfy even the GMaxwells of the world ;-).  huwiki is
> actually the 20th largest wiki (by number of edits), so if it worked
> for them, then a tool like AWStats can probably work for most of the
> projects (which are not EN).
> 
> -Robert Rohde
> 

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
