mkroetzsch added a comment.

I agree with Stas: regular data releases are desirable, but need further thought. The task is easier for our current case, since we already know what is in the data. For a regular process, one would have to monitor carefully for issues that only arise in future data. By releasing only historic data, we also avoid exploits that would in theory be possible for someone with detailed knowledge of the processing methodology: these queries were asked before anyone could have tailored them to the obfuscation scheme.

Regarding external IDs, one could whitelist unproblematic IDs that can be preserved, and obfuscate all others. I agree that authority control IDs might identify humans: they cover so many items tied to particular humans (books, authors, etc.) that one can construct hypothetical situations where the interest in a particular item would already suggest who asked the query. I don't think anything similar is even theoretically plausible for other IDs (e.g., for proteins or stars). Even for book IDs, the lack of user traces makes it very hard to exploit this data further (the certainty you get from a single query being asked can hardly be high, and a query that helps you guess who asked it will often not be interesting in its own right -- most likely you would want to know what else the identified person has asked). Anyway, we could restrict the "numerical strings are ok" rule to whitelisted properties for our current release. The main reason we have that rule at all is to handle things like BlazeGraph's "radius" service parameter, which has to be a number but is given as a string (I think the gas service has similar cases).
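
As a concrete illustration, here is a minimal sketch of such a whitelist-based pass, in Python. All specifics are assumptions made for the example: the whitelist entries (P352, the UniProt protein ID property, plus the radius parameter), the placeholder scheme, and the regex, which is a crude stand-in for working on the parsed query rather than the raw text.

    import hashlib
    import re

    # Hypothetical whitelist of predicates whose string values may be kept;
    # P352 (UniProt protein ID) and wikibase:radius are illustrative picks,
    # not a vetted list.
    SAFE_PREDICATES = {"wdt:P352", "wikibase:radius"}

    # Matches a predicate followed by a quoted string literal -- a deliberate
    # simplification; a real pass would operate on the parsed query.
    PRED_LITERAL = re.compile(r'(\S+)\s+"([^"]*)"')

    def obfuscate(literal: str) -> str:
        # Stable placeholder, so repeated literals remain joinable in a query.
        return "str_" + hashlib.sha256(literal.encode("utf-8")).hexdigest()[:12]

    def scrub(query: str) -> str:
        def repl(m: re.Match) -> str:
            predicate, literal = m.group(1), m.group(2)
            if predicate in SAFE_PREDICATES:
                return m.group(0)  # whitelisted: keep the original value
            return f'{predicate} "{obfuscate(literal)}"'
        return PRED_LITERAL.sub(repl, query)

    print(scrub('?p wdt:P352 "P01308" . ?p rdfs:label "some label"'))
    # -> ?p wdt:P352 "P01308" . ?p rdfs:label "str_<hash>"

Hashing rather than deleting keeps identical literals identical within a query, so the query structure stays analysable without the values being recoverable.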

There is a general limitation to potential exploits of SPARQL logs for breaching someone's privacy. If you don't control the software that formulated the query, then you can only connect a query to a person if you already knew that only this person would ask this query. But then you learn very little by observing the query! If, on the other hand, you do control the software, then it would usually be easy to gather user data more directly, without the detour through SPARQL logs released months later. One exception that might become relevant in the future is the use of SPARQL from Lua built-ins or MediaWiki tags on Wikipedia pages, which could in theory expose some page traffic. This is not relevant for our historic logs, and it would be hard to exploit fully due to parser caches and crawler-based hits, but it might become a theoretical issue nonetheless. To avoid it, one could either filter all Wikipedia servers from the logs, or use a separate SPARQL service for such requests (as discussed in Berlin) whose logs would not be released.
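
To make the first mitigation concrete, a sketch of what such log filtering could look like, assuming (purely hypothetically) that raw log records carry a user agent and a client host. The field names, marker strings, and sample records are all illustrative; a real filter would match against the known list of Wikimedia-internal hosts rather than substring heuristics.

    # Hypothetical record schema and markers; not the actual log format.
    INTERNAL_HINTS = ("mediawiki", "wikimedia", ".wmnet")

    def keep_for_release(record: dict) -> bool:
        """Drop requests that appear to come from Wikipedia's own servers."""
        fields = (record.get("agent", "") + " " + record.get("host", "")).lower()
        return not any(hint in fields for hint in INTERNAL_HINTS)

    raw_log = [  # hypothetical sample records
        {"agent": "SomeBrowser/1.0", "host": "example.org", "query": "..."},
        {"agent": "MediaWiki/1.31 Scribunto", "host": "mw1234.eqiad.wmnet", "query": "..."},
    ]
    released = [r for r in raw_log if keep_for_release(r)]
    # Only the first record survives the filter.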

Considering our current dataset, it seems that even the obfuscation of strings is more than one would strictly have to do, but in the future one might indeed have to obfuscate external URLs as well if they become more common in queries.


TASK DETAIL
https://phabricator.wikimedia.org/T183020
