Re: [Analytics] Analytics project request

2017-07-24 Thread Leila Zia
Hi Daniel,

I reviewed your request.

== Context ==
* The data you're asking for is one of our most frequently requested
datasets. We also receive quite a bit of interest in that data
specifically for the general research direction you're interested in.
* Resources are highly limited on our end. Every formal collaboration will
need to be created taking into account this constraint and the commitments
we have already made.


== When can Research sign up for formal collaborations? ==
At least one of the conditions below should hold for us to be able to
consider creating a new formal collaboration at this point in time:
* The outside research is (tightly) aligned with one of our annual plan
commitments (for the period of July 1, 2017 to June 30, 2018). [1]
* A researcher on the Research team picks up a specific direction for
exploration based on their expertise/interest.
* Access to the data is broadly agreed upon as strategic for humanity.
Examples in this direction are rare, but to give you a sense: if there is
an epidemic and we know, with some certainty, that the data we have can
help control it or help understand the research and development in that
space.

== Access to data ==
At this point, unfortunately, we cannot create a formal collaboration for
your request. I hope this email conveys how disappointed we are to deliver
this message. :(

The above being said, I think there is one dataset that can be helpful for
your research: the Wikipedia Clickstream dataset. [2] You can use that
dataset to compute the transition probability of moving from one English
Wikipedia article to another. The data is not refreshed frequently, but
refreshing it at specific snapshots in time is something we can consider.
Please work with the dataset, if you haven't already, and let us know
whether it can be of help to you.
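
As a rough illustration of the computation described above, here is a
minimal Python sketch. It assumes the clickstream file is tab-separated
with four columns (prev, curr, type, n) and that internal
article-to-article clicks carry the type "link"; please check the dataset
documentation [2] for the layout of the release you download. The function
names and the sample rows are made up for the example.

```python
import csv
import io
from collections import defaultdict


def read_clickstream(handle):
    """Yield rows from a tab-separated clickstream file.

    Assumed column order (verify against the release you use):
    prev, curr, type, n
    """
    return csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def transition_probabilities(rows):
    """Estimate P(curr | prev) from (prev, curr, type, n) rows.

    Only rows whose type is "link" (assumed to mean an internal
    article-to-article click) are counted.
    """
    counts = defaultdict(dict)
    totals = defaultdict(int)
    for prev, curr, link_type, n in rows:
        if link_type != "link":
            continue
        n = int(n)
        counts[prev][curr] = counts[prev].get(curr, 0) + n
        totals[prev] += n
    # Normalise each source article's outgoing counts so they sum to 1.
    return {
        prev: {curr: c / totals[prev] for curr, c in targets.items()}
        for prev, targets in counts.items()
    }


# Made-up sample rows, for illustration only.
SAMPLE = (
    "A\tB\tlink\t30\n"
    "A\tC\tlink\t10\n"
    "other-search\tA\texternal\t500\n"
    "B\tC\tlink\t5\n"
)

probs = transition_probabilities(read_clickstream(io.StringIO(SAMPLE)))
print(probs["A"])
```

Under these assumptions the result is conditional on a reader following an
internal link: the sketch ignores arrivals from outside Wikipedia, and the
input has no row for a reader who stops browsing.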

Best,
Leila


[1] All programs Research has committed to are listed below. The specific
objectives within each program that Research has signed up for are at
https://phabricator.wikimedia.org/tag/research-programs/

Program 4
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/Final/Programs/Technology#Program_4:_Technical_community_building

Program 7
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/Final/Programs/Technology#Program_7._Smart_tools_for_better_data

Program 9
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/Final/Programs/Technology#Program_9:_Growing_Wikipedia_across_languages_via_recommendations

Program 11
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/Final/Programs/Technology#Program_11:_Improving_citations_across_Wikimedia_projects

Program 12
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/Final/Programs/Technology#Program_12:_Grow_contributor_diversity

CD - Community Health
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/Final/Community_Health#Segment_3:_Research_on_harassment

CD - Structured Data
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/Final/Structured_Data#Segment_4:_Programs


[2] https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream


--
Leila Zia
Senior Research Scientist
Wikimedia Foundation


Re: [Analytics] Analytics project request

2017-07-24 Thread Leila Zia
I'll review Daniel's email and will get back to him/you on this list
in the next day or so.

Leila

--
Leila Zia
Senior Research Scientist
Wikimedia Foundation



___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Analytics project request

2017-07-24 Thread Nuria Ruiz
Daniel,

Signing an NDA is not enough to get access to the data; you also need to
be part of a formal research collaboration with our research team. They
have a number of those and are not likely to accept any more soon, but
you can contact them in that regard:
https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations

Thanks,

Nuria





[Analytics] Analytics project request

2017-07-24 Thread Daniel Oberski
Dear list, 

I'm posting a recent conversation with Dan below, as well as a few
follow-up questions.

Dan was kind enough to point out this list. I apologize that the post is
"backward" (in email-thread format) due to my ignorance; I will use this
list from now on.

Thanks, Daniel


 

Hi Dan


Thanks for getting back to me so quickly! 

>Thanks for writing.  In general these questions are best asked on our
>public list, so other people can see and benefit from any answers:
>https://lists.wikimedia.org/mailman/listinfo/analytics

Thanks, I've joined this list and will ask subsequent questions there. 

>* pairs of pages: we have two datasets that are mentioned in this task
>https://phabricator.wikimedia.org/T158972 which should be very
>interesting for this purpose.  They aren't being updated right now, and
>the task is to do just that.  We'll probably get to that within the next
>3 months, but a bunch of us are on paternity leave this summer, so
>things are a little slower than normal

This seems close to what I need. From the descriptions I gather the
linkage is by session. Is there also a linkage by IP (with the IPs
removed, of course)?

>* country data for pageviews: for privacy reasons we only allow access
>to this with an NDA.  We have good data on it, but you need to sign this
>NDA and use our cluster to access it, being careful about what you
>report about it to the world at large.  Here's information on that:
>https://wikitech.wikimedia.org/wiki/Volunteer_NDA

I've read this and am happy to sign an NDA. I understand it is best to be
as specific as possible about the reasoning, intentions with the data, and
permissions required. For me to figure this out it would be useful to know
the relevant parts of the database schema, and perhaps a hint as to which
data might be most interesting there. Would you be able to point me
towards that?

>Hope that helps, and feel free to write back to the public list in the future.

Definitely, very helpful and thank you!

Best, Daniel


On Wed, Jul 19, 2017 at 9:51 AM, Oberski, D.L. (Daniel) wrote:
Dear Dan,


My name is Daniel Oberski. I'm an associate professor of data science
methodology in the department of statistics at Utrecht University in the
Netherlands.

I've been using your incredibly useful pageviews API to study correlations
between the amount of interest people show in a topic (pageviews) and
other data such as political party preference over time. That has yielded
some interesting results (which I have yet to write up).

However, to do a better study it would be very helpful to have slightly
more information than is in the API. Specifically, it would be very useful
to be able to query, for each _pair_ of pages, how many people (or IPs)
viewed _both_ of those pages. That way I can find out which pages are
really indicative of interest in a specific common topic, rather than just
correlated by accident. In addition, I've found it hard to figure out
pageviews for specific pages by country rather than language.
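
To make the pair query concrete, here is a minimal Python sketch of the
count I have in mind. It assumes a hypothetical mapping from an anonymous
visitor (or session) key to the pages that visitor viewed; data at that
level is not available through the public API, and the keys, page titles
and function name below are made up for the example.

```python
from collections import Counter
from itertools import combinations


def coview_counts(pages_by_visitor):
    """Count, for each unordered pair of pages, how many visitors viewed both.

    pages_by_visitor: dict mapping a (hypothetical) anonymous visitor key
    to an iterable of page titles that visitor viewed.
    """
    counts = Counter()
    for pages in pages_by_visitor.values():
        # De-duplicate so that repeat views by one visitor count once,
        # and sort so that each pair has a single canonical ordering.
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    return counts


# Made-up example data, for illustration only.
visits = {
    "v1": ["Election", "Parliament", "Election"],
    "v2": ["Election", "Parliament"],
    "v3": ["Election", "Football"],
}

pairs = coview_counts(visits)
print(pairs.most_common())
```

Comparing such pair counts with each page's own total would then show
which pages go together more often than their individual popularity
suggests.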

My question is: would you happen to know if there is any way to obtain
this information? (It does not necessarily have to be through the API.) Or
do you know if there are people to whom I might talk about this?

Thanks for reading (to) the end and best regards,

Daniel


