If you want to hear about the results of this research collaboration, or have additional questions about the data collection approach or the analysis, I invite you to come and join us at our upcoming showcase on *Wednesday 11/16*.
https://lists.wikimedia.org/pipermail/analytics/2016-November/005504.html

On Tue, Nov 8, 2016 at 10:42 AM, Dario Taraborelli <[email protected]> wrote:

>
> On Tue, Nov 8, 2016 at 9:10 AM, James Salsman <[email protected]> wrote:
>
>> I assumed that when an affiliated researcher apart from Foundation
>> staff says, "we have the complete server logs for Wikipedia,"
>> amounting to 17 terabytes per month, that means they possess the
>> information. I am glad to be wrong about that, but I object to the
>> implication that such an assumption based on the plain language of
>> the statement could possibly be made in bad faith.
>>
>
> I am glad we cleared up that confusion.
>
>
>> > the terms of our formal collaborations
>> > https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations
>> > prohibit the sharing of any raw data containing PII (such as
>> > webrequest logs) outside of WMF operated servers,
>>
>> There is nothing on that page which suggests that prohibition.
>>
>
> You're correct that that document doesn't describe the data access
> process in detail. When we start a formal collaboration under an NDA, we
> have an onboarding process that gives researchers restricted access to our
> cluster and covers server access responsibilities and best practices around
> the handling of private data. I'll check with our Legal and Security team
> if we can better document this process.
>
>
>> > as well as the retention of any such data past our data retention
>> > period https://meta.wikimedia.org/wiki/Data_retention_guidelines
>>
>> That page says, "Information (including personal information)
>> collected through participation in a survey or other research
>> conducted by the Wikimedia Foundation will be retained indefinitely
>> for educational, development, or other related purposes, unless
>> otherwise indicated in the privacy policy or statement of such
>> survey or research."
>>
>
> This is for surveys requesting explicit (*opt in*) consent to collect and
> retain specific types of data (such as demographic information) from
> participants, not for data collected by default via our webrequest logs.
> Webrequest logs and instrumentation data are purged/sanitized by default
> within the 90-day retention window; most often the data sits on our
> servers for a much shorter time before it is removed.
>
>
>> https://meta.wikimedia.org/w/index.php?title=Talk:2016_Strategy/Draft_WMF_Strategy&diff=15467086&oldid=15466763
>> says that the Foundation's standard research NDAs include an
>> "obligation to return or destroy any copies of confidential
>> information the individual may have upon request by WMF"
>>
>> Does that not imply that such copies are allowed in general?
>>
>
> IANAL so I can't comment on that, but I believe this is a clause that's
> part of our NDA to prevent confidential information (not specifically PII)
> from being retained by third parties past the terms of the NDA.
>
>
>> I hope we can move forward to a solution to the general problem.
>>
>> Is there any legitimate research or any other need to save IP
>> addresses associated with HTTP GET web logs to disk prior to
>> creating a secure hash of them?
>>
>
> These are considerations that the analytics / ops team are best suited to
> answer. I encourage you to relay them to analytics-l if you want to have a
> more technical discussion.
>
> HTH,
> Dario
>

--
*Dario Taraborelli*
Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter <http://twitter.com/readermeter>
