Re: [Wiki-research-l] [Analytics] question about Pageviews dumps
Aye, as Joseph says, the time-on-page or time-leaving is not collected,
except as an extension of session reconstruction work. If you want a
concrete time, you're not gonna get it. While PC-based data is more
reliable than mobile, that does not necessarily mean "reliable".

I'm sort of confused, I guess, as to why the datasets I linked (unless
I'm misremembering them?) don't help: you would have to do the
calculation yourself, but they should contain all the data necessary to
make that calculation (unless you want to have the pageID or title
associated with the time-on-page, in which case... yeah, that's an
issue).

On Wed, Jun 29, 2016 at 3:16 AM, Marc Miquel <marcmiq...@gmail.com> wrote:
> Thanks for the answer, Oliver. But I am not sure it answers my
> questions. I'd like to study aspects like how much time is spent in
> certain pages, as a proxy of how content is approached/read/understood.
> I'd be happy with time of entering the page, time of leaving. This is
> not entirely centered on 'user activity', but I said that because I
> imagined data would be stored in a similar way to editor sessions, or
> in a database and I would need to do the time calculations.
>
> Cheers,
>
> Marc
>
> On Wed, 29 June 2016 at 03:11, Oliver Keyes <ironho...@gmail.com> wrote:
>
>> If historic data is okay, there's already a dataset released
>> (https://figshare.com/articles/Activity_Sessions_datasets/1291033)
>> that was designed specifically to answer questions around how to best
>> calculate session length with regards to Wikipedia
>> (http://arxiv.org/abs/1411.2878)
>>
>> On Tue, Jun 28, 2016 at 3:42 PM, Marc Miquel <marcmiq...@gmail.com> wrote:
>>
>>> Hello!
>>>
>>> I was thinking about user sessions, yes, so this would mean to
>>> aggregate pageviews visited by a user during a short amount of time
>>> (I should check the cutoff, but it could be around an hour or less).
>>>
>>> I am particularly interested in understanding the order in which
>>> pages are seen (start, end), duration, etc.
>>> I wouldn't need data from a long period either, but I think data
>>> from multiple languages would be helpful.
>>>
>>> I imagined reader data could be sensitive to privacy, but would an
>>> NDA with my university and some sort of data encoding help with
>>> this? As I said, it is for a scientific purpose.
>>>
>>> Thanks,
>>>
>>> Marc
>>>
>>> On Tue, 28 June 2016 at 21:09, Nuria Ruiz (<nu...@wikimedia.org>) wrote:
>>>
>>>> Hello!
>>>>
>>>> > I am considering studying reader engagement for different article
>>>> > topics in different languages. Because of this, I would like to
>>>> > know if there is any plan to make available pageviews dumps
>>>> > detailing activity log at session level per user - in a similar
>>>> > way to editor sessions.
>>>>
>>>> Are you thinking of "all-pageviews-visited-by-a-certain-user"? If
>>>> so, no, we do not have any projects to provide that data as, due to
>>>> privacy concerns, we neither have nor keep that information.
>>>>
>>>> Thanks,
>>>>
>>>> Nuria
>>>>
>>>> On Tue, Jun 28, 2016 at 6:55 PM, Leila Zia <le...@wikimedia.org> wrote:
>>>>
>>>>> + Analytics
>>>>>
>>>>> On Tue, Jun 28, 2016 at 6:36 AM, Marc Miquel <marcmiq...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a question for you regarding pageviews datadumps.
>>>>>>
>>>>>> I am considering studying reader engagement for different article
>>>>>> topics in different languages. Because of this, I would like to
>>>>>> know if there is any plan to make available pageviews dumps
>>>>>> detailing activity log at session level per user - in a similar
>>>>>> way to editor sessions.
>>>>>>
>>>>>> Since this would be for a research project I might ask funding
>>>>>> for it, I would like to know if I could count on that, what is
>>>>>> the nature of the available data, and what would be the procedure
>>>>>> to obtain this data, and if there would be any implication
>>>>>> because of privacy concerns.
>>>>>>
>>>>>> Thank you very much!
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Marc Miquel
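The calculation Oliver describes, deriving time-on-page from the timestamps in a session dataset yourself, can be sketched roughly as follows. This is a minimal illustration only: the `(timestamp, page)` event format, the one-hour inactivity cutoff Marc mentions, and all function names here are assumptions, not the schema of the figshare dataset linked in the thread.

```python
from datetime import datetime, timedelta

# Inactivity gap that closes a session; Marc suggests "an hour or less".
CUTOFF = timedelta(hours=1)

def sessionize(events):
    """Split (timestamp, page) pairs, sorted by time, into sessions
    wherever the gap between consecutive views reaches CUTOFF."""
    sessions, current = [], []
    for ts, page in events:
        if current and ts - current[-1][0] >= CUTOFF:
            sessions.append(current)
            current = []
        current.append((ts, page))
    if current:
        sessions.append(current)
    return sessions

def time_on_page(session):
    """Dwell time per view: the gap to the next view in the same session.
    The last page of a session has no measurable dwell time, which is
    exactly the limitation discussed above."""
    return [(page, (t1 - t0).total_seconds())
            for (t0, page), (t1, _) in zip(session, session[1:])]

# Toy example: two views ten minutes apart, then one two hours later.
events = [(datetime(2016, 6, 28, 12, 0), "Helix"),
          (datetime(2016, 6, 28, 12, 10), "DNA"),
          (datetime(2016, 6, 28, 14, 30), "RNA")]
sessions = sessionize(events)
dwell = time_on_page(sessions[0])
```

Note that this only yields time-on-page for pages that were followed by another view in the same session; final pages, and anything involving back-button or multi-tab behaviour, remain invisible to request logs.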
Re: [Wiki-research-l] Training for administrators
Hey Pine, That's good to hear; I am glad to hear people will be doing outreach on this. I'd argue that this is one of those areas where experts are needed in structuring the nature of the proposal, not just the materials used, and so in the future it would be nice if they were involved in the design of the grant request itself. On Fri, Jun 10, 2016 at 3:06 PM, Pine W <wiki.p...@gmail.com> wrote: > Hi Oliver, > > In terms of concrete plans to involve at least one expert, I've made a > couple of initial queries to see if there is a professor at the University > of Washington's Department of Psychology who would be qualified and > interested to work on this proposal. Outreach and screening of potential > consultants will take on greater importance if this idea gets traction on > the Wikimedia side. At this point I think it's unlikely that I will be the > project lead, so whoever does become the project lead on the Wikimedia side > will likely need to do further work on outreach and screening to select the > final expert(s). > > Pine > > On Tue, Jun 7, 2016 at 8:58 AM, Oliver Keyes <ironho...@gmail.com> wrote: >> >> Well, my feedback would be that for this to be useful I would expect >> researchers *from* that subject-matter background to be involved. I >> don't see this (nor a concrete plan for their involvement). >> >> On Mon, Jun 6, 2016 at 11:02 PM, Pine W <wiki.p...@gmail.com> wrote: >> > Hi folks, >> > >> > Related to discussions that some of us have had previously about (1) >> > developing training for Wikipedia administrators, (2) increasing the >> > community's capacity to address incivility and harassment, and (3) >> > improving >> > community health, I've proposed >> > >> > https://meta.wikimedia.org/wiki/Grants:IdeaLab/Training_for_administrators >> > as a part of the current Inspire campaign. >> > >> > This project could become an extension of my current video project, or >> > it >> > might be handled completely independently by a different project leader. 
>> > >> > Regardless of who eventually leads the project, I would appreciate your >> > comments about the proposal, whether positive, negative, or indifferent. >> > Please discuss on the IdeaLab pages. (: >> > >> > Thanks! >> > >> > Pine >> > >> > ___ >> > Wiki-research-l mailing list >> > Wiki-research-l@lists.wikimedia.org >> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l >> > >> >> ___ >> Wiki-research-l mailing list >> Wiki-research-l@lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > > > > ___ > Wiki-research-l mailing list > Wiki-research-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] [Wikitech-l] Wired article about machine learning
+100. Last I checked the Dartmouth Conference's premise still hadn't
been satisfied, so calling anything ML can do "AI" is just clickbait
froth. But I'm agreed that the non-AI "AI" stuff is both the power and
the danger here, and this kind of overselling is... risky.

As an example - this weekend ProPublica published the results of a
study on automated model generation used for determining prisoner
reoffence risk. To the surprise of nobody, they found the models trend
towards automated racism little better than a coinflip.[0] It's never
going to write your code for you, or have a conversation about the
weather, or Codsworth it up,[1] but it's here nonetheless, lurking in
the background, determining the course of human lives, with more ink
spent 'explaining' how Robots Are Going To Eliminate Programming than
Robots Are Going To Automate Bigotry.

Agreed on "scary" :|

[0] https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
[1] https://www.youtube.com/watch?v=3kacrYB8Li0

On Mon, May 23, 2016 at 10:33 AM, Aaron Halfaker wrote:
> Just a quick thought that I shared in IRC earlier.
>
>> AI isn't magical. It's pretty cool, but you're not going to have a
>> conversation with ORES.
>
> It's not false that we are closer to strong "conversational" AI than
> ever before. Still, in practical terms, we're pretty far away from not
> needing to program anymore. I find that articles like this are more
> fantastical than informative. I guess it is interesting to think about
> where we'll be when we can have an abstract conversation with a
> computer system rather than the rigid specifics of programming, but
> I'm with Brian -- this seems to be a cycle. Though, I'd say the media
> does boom and bust, but the research carries on relatively
> consistently, since AI researchers are usually less interested in the
> hype.
>
> In the ORES project, we're using the most simplistic "AIs" available
> -- classifiers. Still, these dumb AIs can help us to do amazing things
> (e.g. review all of RecentChanges 50x faster, or augment article
> histories with information about the *type of change* made). IMO, it's
> these amazing and powerful things that dumb, non-conversational AIs
> can do that is very powerful and a little scary. We're hardly taking
> advantage of that at all. I think that's where the next big revolution
> with AI is taking place right now. It's going to change a lot of
> things and infect many aspects of our life (and in many ways it
> already has).
>
> -Aaron
>
> On Fri, May 20, 2016 at 2:43 PM, Purodha Blissenbach wrote:
>>
>> I see only an ad to support Wired.
>> Purodha
>>
>> On 20.05.2016 20:11, Pine W wrote:
>>>
>>> Seems like a good summary: http://www.wired.com/2016/05/the-end-of-code/
>>>
>>> Comments welcome, especially from Wikimedia AI experts who are
>>> working on ORES.
>>>
>>> Pine

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
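The "dumb classifier" pattern Aaron describes, scoring edits so patrollers can triage RecentChanges by risk rather than reviewing everything, can be illustrated with a toy model. To be clear about assumptions: the features, labels, and numbers below are fabricated for illustration, and scikit-learn stands in for the real stack; production ORES models are built with the revscoring library on far richer feature sets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy features per edit:
# [chars added, chars removed, is anonymous, badword count]
X = np.array([
    [120,   5, 0, 0],   # ordinary good-faith addition
    [  0, 900, 1, 0],   # large anonymous blanking
    [ 15,   2, 1, 3],   # short edit full of profanity
    [300,  10, 0, 0],
    [  2, 450, 1, 1],
    [ 80,  20, 0, 0],
])
y = np.array([0, 1, 1, 0, 1, 0])  # 1 = hand-labeled as damaging

model = LogisticRegression().fit(X, y)

# A probability (rather than a hard yes/no) is what enables the triage
# speedup: sort incoming edits by score and review the riskiest first.
incoming = np.array([[5, 600, 1, 2]])
risk = model.predict_proba(incoming)[0, 1]
```

The design point is the probabilistic output: a patroller who only reviews edits above some risk threshold trades a small recall loss for a large reduction in review volume, which is the "50x faster" effect described above.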
Re: [Wiki-research-l] Wikipedia and SMEs; another article about the Stanford Encyclopedia of Philosophy
It did, yes, but that wasn't its primary focus - AFT is an example of
expert engagement in the same way it's an example of PHP: sure, it uses
it, but that's not necessarily what comes to mind when you think of it.
(I appreciate I've left myself open to quite a lot of comments about
precisely what does come to mind for people when they think of AFT.
Mostly obscenities, I suspect.)

I quite like the GLAM+STEM idea - is it being discussed on a list
somewhere? (Absent here, which may not be the right location.)

On Mon, May 23, 2016 at 2:30 PM, Pine W wrote:
> AFT did try to engage readers, but if I recall correctly it had a
> checkbox saying something like "I am an expert on this subject and I
> want to provide feedback." This is reaching far back in my hazy
> memory, but I think that similar features were present in both AFT3
> and AFT5.
>
> That's an interesting idea about getting GLAM to focus on review in
> addition to content creation. FloNight and I have also been talking
> about expanding the GLAM concept to what I'm calling GLAM+STEM,
> meaning that we're interested in engaging STEM institutions as well as
> GLAM institutions in content creation (and potentially content quality
> review.)
>
> Pine
>
> On Mon, May 23, 2016 at 11:17 AM, WereSpielChequers wrote:
>>
>> I thought AFT was an attempt to engage readers, not Subject Matter
>> Experts.
>>
>> In my experience two of our most effective ways to outreach to those
>> experts who are not already in the community are the GLAM program and
>> potentially the education program.
>>
>> This was one of the areas that Johnbod explored in his time as
>> Wikimedian in Residence at Cancer Research UK. You might want to talk
>> to him as to how that went and the extent to which it could be
>> replicated. The focus of a lot of residents has been more on getting
>> openly licensed digital material, but I don't see why we couldn't
>> have more residencies focussed on expert review, providing of course
>> that the articles in that area are already at a stage worthy of
>> review.
>>
>> On 23 May 2016 at 18:34, Pine W wrote:
>>>
>>> Another article on the Stanford Encyclopedia of Philosophy. [1] I
>>> wonder, could any of the practices described here be implemented on
>>> Wikipedia in a way that would be helpful? WMF tried to engage SMEs
>>> through the now mothballed AFT, and I believe that there is an
>>> ongoing effort to get SME comments with the assistance of a bot
>>> facilitating communications from SMEs to article talk pages (Aaron,
>>> do you remember the name of that project, and if so could we get an
>>> update about it?)
>>>
>>> Thanks,
>>> Pine
>>>
>>> [1] http://qz.com/480741/this-free-online-encyclopedia-has-achieved-what-wikipedia-can-only-dream-of/

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] design patterns for peer learning and peer production, with wikimedia case study - preprint
On 28 December 2015 at 10:03, Joe Corneli <holtzerman...@gmail.com> wrote: > On Mon, Dec 28 2015, Oliver Keyes wrote: > >> My big question is how these pedagogic maps factor in the negatives of >> peer production communities - harassment, toxicity - and route around >> or solve for them. > > Hi Oliver, > > Thanks for the speedy and thought-provoking reply! > > The question above is a good one. We did have a basic collection of > "antipatterns", but didn't develop them in this paper, because thinking > about antipatterns adds some complexity and we wanted to get the > "positive" vision more firmly in mind first. With that accomplished, > I'd love to write a sequel sometime about "Antipatterns of Peeragogy"! > Cool! This makes sense and is one of the concerns I've heard about including antipatterns and patterns together; that it leads to claims of a work "lacking focus". I would argue (just for myself, and editorial boards probably feel very very differently) that not including antipatterns makes a design pattern or template of limited applicability and so said editorial boards should be approving of it - but that's, again, just for me ;p. > Still, the current catalog should definitely help surface and do > something about concerns. The strategy would be something like: start > with the Scrapbook pattern and existing critiques, develop a short list > of criticisms into A specific project, and build a Roadmap that involves > others in addressing the issue that was identified. 
>
> A recent thread kicked off by Pine seems to be an example along those
> lines:
> https://lists.wikimedia.org/pipermail/wiki-research-l/2015-December/004927.html
>
>> I do wonder about the generalisability of some of the examples; in
>> particular while Wikiprojects are _ideally_ a good starting point for
>> a lot of newcomers I don't have the data to hand about whether, in
>> practice, it is the starting point for a large proportion of users,
>> and I don't see citations to that effect in your paper (although I do
>> see the claim). It would be good if someone more informed about this
>> particular question than I could chip in with what they've
>> measured/observed in detail (I know some people have been studying
>> Wikiprojects specifically, particularly James Hare)
>
> I've been impressed with some of my own earlier common-sensical
> guesswork that turned out not to hold water, and accordingly have
> tried to be careful to cite or footnote the Wikimedia evidence, but
> indeed that is one of the intuitive claims that is ^[citation needed].
> Even though there are "many" users involved with Wikiprojects, the
> population might be oldtimers rather than new users. I'll look around
> a bit more, and/or adjust the claim to focus on the current population
> of Wikiproject contributors rather than on the hypothesis that the
> projects are used for wiki onramping.

Yeah; from my own subjective experiences it's more oldtimers than
newtimers, but this may also be common-sensical-but-not-holding-water!

> Joe

--
Oliver Keyes
Count Logula
Wikimedia Foundation

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] design patterns for peer learning and peer production, with wikimedia case study - preprint
Hey Joe,

My big question is how these pedagogic maps factor in the negatives of
peer production communities - harassment, toxicity - and route around
or solve for them. The inclusion of carrying capacity, and explicit
recognition of the costs of labour overall, is great to see. But I
would love to see roadmaps that factor in the "dark side" here, and the
specific emotional labour costs of dealing with that dark side.

Without factoring those things in, the practical utility of the
roadmaps - outside of publishing - is likely to be somewhat constrained
and difficult to scale. And in a year where we have learned more and
more about the costs around a lot of collaborative and communicative
environments, from Wikipedia to Twitter, including these things (or
recognising them) is really not optional. I don't see it discussed in
your work (I admit that I may have just missed it, and please let me
know if so!)

The patterns themselves are excellent, however, and I really like the
structure of the work. I do wonder about the generalisability of some
of the examples; in particular, while Wikiprojects are _ideally_ a good
starting point for a lot of newcomers, I don't have the data to hand
about whether, in practice, they are the starting point for a large
proportion of users, and I don't see citations to that effect in your
paper (although I do see the claim). It would be good if someone more
informed about this particular question than I could chip in with what
they've measured/observed in detail (I know some people have been
studying Wikiprojects specifically, particularly James Hare)

On 28 December 2015 at 09:17, Joe Corneli <holtzerman...@gmail.com> wrote:
>
> http://metameso.org/~joe/docs/peeragogy_pattern_catalog_proceedings.pdf
>
> is a preprint of the paper "Patterns of Peeragogy" to appear in
> Proceedings of Pattern Languages of Programs 2015.
> > Abstract: We describe nine design patterns that we have developed in our > work on the Peeragogy project, in which we aim to help design the future > of learning, inside and outside of institutions. We use these patterns > to build an “emergent roadmap” for the project. > > This paper may be of interest to people here, particularly since we > trace through the ways in which the patterns manifest in Wikimedia > projects. > > The final revision is due January 15th so comments before then still > have a chance to improve the final document. > > When it appears, the bibtex citation will be: > > @inproceedings{patterns-of-peeragogy, > title={Patterns of {P}eeragogy}, > author={Corneli, Joseph and Danoff, Charles Jeffrey and Pierce, Charlotte and > Ricuarte, Paola and Snow MacDonald, Lisa}, > booktitle={Pattern {L}anguages of {P}rograms {C}onference 2015 ({PLoP'15}), > {P}ittsburgh, {PA}, {USA}, {O}ctober 24-26, 2015}, > editor={Correia, Filipe}, > year={2015}, > publisher={ACM}} > > > ___ > Wiki-research-l mailing list > Wiki-research-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Count Logula Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] What Wikimedia Research is up to in the next quarter
Awesome; thanks! On 21 December 2015 at 12:12, Leila Zia <le...@wikimedia.org> wrote: > Hi Oliver, > > On Sat, Dec 19, 2015 at 12:01 AM, Oliver Keyes <oke...@wikimedia.org> wrote: >> >> So what's happening with the link recommendation system? Is that >> rolled into article-creation recommendations, or was the paper the >> final product? > > > The short answer is: the paper is ideally not the final product (although > I'm thrilled with the reviews we received for it) as we like to see this > system used. > > From the research perspective, we want to have a tool where we can collect > data to learn more about and improve the link recommendation system. We've > had extensive conversations with Pau about this. The model we're working > towards is a landing page where the user can get different types of > recommendations: link recommendations* and article-creation recommendations, > and maybe more forms of recommendations in the future. In the past three > months, Ashwin has worked closely with Pau, Nirzar, and Ed to bring us > closer to having such a tool. The tool is not ready to be used yet (really! > you'll see it as soon as you click Add Links), but if you're curious to see > where we are with it, please check > http://tools.wmflabs.org/navlink-recommendation/ > > From the product perspective, the Editing team has set their goal to test > the tool (via the same tool on wmflabs) and if successful, figuring out the > next steps for it. Please reach out to the team directly if you like to know > more. > > Best, > Leila > > * Initially, the tool was going to have link recommendations for two cases: > where the anchor text existed in the article text, and where it didn't. We > learned that when the anchor text does not exist, the task becomes a much > harder one from the user's perspective, since the decision is no longer > whether the link should be added or not but where it should be added and in > what context. 
The tool that you see now will only have link recommendations > where the anchor text exists. > >> >> >> On 19 December 2015 at 01:53, Dario Taraborelli >> <dtarabore...@wikimedia.org> wrote: >> > >> > On Dec 18, 2015, at 10:13 PM, Gerard Meijssen >> > <gerard.meijs...@gmail.com> >> > wrote: >> > >> > Hoi, >> > Where does it say what languages are covered >> > >> > >> > >> > https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service#Support_table >> > >> > and, what languages are planned for support? >> > >> > >> > >> > https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service#Progress_report:_2015-11-28 >> > >> > although what gets in production will depend on many factors, such as >> > community support to generate labeled data, performance of the model >> > etc. >> > >> > Dario >> > >> > Thanks, >> > GerardM >> > >> > On 19 December 2015 at 05:16, Dario Taraborelli >> > <dtarabore...@wikimedia.org> >> > wrote: >> >> >> >> Hey all, >> >> >> >> I’m glad to announce that the Wikimedia Research team’s goals for the >> >> next >> >> quarter (January - March 2016) are up on wiki. >> >> >> >> The Research and Data team will continue to work with our volunteers >> >> and >> >> collaborators on revision scoring as a service adding support for 5 new >> >> languages and prototyping new models (including an edit type >> >> classifier). We >> >> will also continue to iterate on the design of article creation >> >> recommendations, running a dedicated campaign in coordination with >> >> existing >> >> editathons to improve the quality of these recommendations. Finally, we >> >> will >> >> extend a research project we started in November aimed at understanding >> >> the >> >> behavior of Wikipedia readers, by combining qualitative survey data >> >> with >> >> behavioral analysis from our HTTP request logs. >> >> >> >> The Design Research team will conduct an in-depth study of user needs >> >> (particularly readers) on the ground in February. 
>> >> We will continue to work with other Wikimedia Engineering teams
>> >> throughout the quarter to ensure the adoption of human-centered
>> >> design principles and pragmatic personas in our product
>> >> development cycle.
Re: [Wiki-research-l] What Wikimedia Research is up to in the next quarter
So what's happening with the link recommendation system? Is that rolled into article-creation recommendations, or was the paper the final product? On 19 December 2015 at 01:53, Dario Taraborelli <dtarabore...@wikimedia.org> wrote: > > On Dec 18, 2015, at 10:13 PM, Gerard Meijssen <gerard.meijs...@gmail.com> > wrote: > > Hoi, > Where does it say what languages are covered > > > https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service#Support_table > > and, what languages are planned for support? > > > https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service#Progress_report:_2015-11-28 > > although what gets in production will depend on many factors, such as > community support to generate labeled data, performance of the model etc. > > Dario > > Thanks, > GerardM > > On 19 December 2015 at 05:16, Dario Taraborelli <dtarabore...@wikimedia.org> > wrote: >> >> Hey all, >> >> I’m glad to announce that the Wikimedia Research team’s goals for the next >> quarter (January - March 2016) are up on wiki. >> >> The Research and Data team will continue to work with our volunteers and >> collaborators on revision scoring as a service adding support for 5 new >> languages and prototyping new models (including an edit type classifier). We >> will also continue to iterate on the design of article creation >> recommendations, running a dedicated campaign in coordination with existing >> editathons to improve the quality of these recommendations. Finally, we will >> extend a research project we started in November aimed at understanding the >> behavior of Wikipedia readers, by combining qualitative survey data with >> behavioral analysis from our HTTP request logs. >> >> The Design Research team will conduct an in-depth study of user needs >> (particularly readers) on the ground in February. 
We will continue to work >> with other Wikimedia Engineering teams throughout the quarter to ensure the >> adoption of human-centered design principles and pragmatic personas in our >> product development cycle. We’re also excited to start a collaboration with >> students at the University of Washington to understand what free online >> information resources (including, but not limited to, Wikimedia projects) >> students use. >> >> I am also glad to report that two papers on link and article >> recommendations (the result of a formal collaboration with a team at >> Stanford) were accepted for presentation at WSDM '16 and WWW ’16 (preprints >> will be made available shortly). An overview on revision scoring as a >> service was published a few weeks ago on the Wikimedia blog, and got some >> good media coverage. >> >> We're constantly looking for contributors and as usual we welcome feedback >> on these projects via the corresponding talk pages on Meta. You can contact >> us for any question on IRC via the #wikimedia-research channel and follow >> @WikiResearch on Twitter for the latest Wikipedia and Wikimedia research >> updates hot off the press. >> >> Wishing you all happy holidays, >> >> Dario and Abbey on behalf of the team >> >> >> Dario Taraborelli Head of Research, Wikimedia Foundation >> wikimediafoundation.org • nitens.org • @readermeter >> >> >> ___ >> Wiki-research-l mailing list >> Wiki-research-l@lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l >> > > ___ > Wiki-research-l mailing list > Wiki-research-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > > > > ___ > Wiki-research-l mailing list > Wiki-research-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > -- Oliver Keyes Count Logula Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
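Leila's footnote above distinguishes link recommendations where the anchor text already exists in the article from those where it doesn't, noting that the existing-anchor case is the easier one: the user only decides yes/no per occurrence, rather than where and in what context to add text. That easier case can be sketched roughly as below; the function name, the `(anchor, target)` pair format, and the simple bracket-based "already linked" check are all illustrative assumptions, not the research team's actual implementation.

```python
import re

def find_link_candidates(text, recommendations):
    """Return (anchor, target, offset) for each occurrence of a
    recommended anchor that is not already wrapped in [[...]] wikilink
    brackets. Each hit is a yes/no decision for the editor."""
    hits = []
    for anchor, target in recommendations:
        # Negative lookbehind/lookahead skip text already inside a link.
        pattern = r"(?<!\[)\b" + re.escape(anchor) + r"\b(?!\])"
        for m in re.finditer(pattern, text):
            hits.append((anchor, target, m.start()))
    return hits

# Toy example: "France" is recommended, but one mention is already linked.
text = "Paris is the capital of France. [[France]] borders Spain."
recs = [("Paris", "Paris"), ("France", "France")]
candidates = find_link_candidates(text, recs)
```

A real system would of course work on parsed wikitext rather than regexes, and would rank candidates by model confidence; the sketch only shows why the existing-anchor case reduces to a per-occurrence accept/reject decision.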
Re: [Wiki-research-l] Community policing, New Page Patrol, Articles for Creation, and editor retention
We can probably talk about the nature of new page patrol without resorting to comparisons to violent, real-world overreactions with multiple serious injuries. To be perfectly honest as a new page patroller the biggest issue I've seen is toxic senior members of the community making the prospect of patrolling particularly unpleasant. It doesn't do much for patroller numbers. On 15 December 2015 at 18:28, Pine W <wiki.p...@gmail.com> wrote: > Yesterday I gave a presentation about community policing at the Cascadia > Wikimedians' end of year event with Seattle TA3M [1][2][3]. An issue that > came up for discussion is the extent to which, on English Wikipedia, > experienced Wikipedians conducting New Page Patrol create collateral damage > during their well-intentioned efforts to protect Wikipedia. Another subject > that came up is the need for more human resources for mentoring of newbies > who create articles using the Articles for Creation system [4]; one comment > I've heard previously is that the length of time between submission and > review may be long enough for the newbie to give up and disappear, and > another comment that I've heard is that newbies may not understand the > instructions that they're given when their article is reviewed. These > comments correlate with the community SWOT analysis that was done at > WikiConference USA this year, in which "biting the newbies", NPP, and > "onboarding/training" were identified as weaknesses [5] > > Personally, I would like the interaction of experienced editors with the > newbies in places like NPP and AFC to look more like this and less like > this. Granted, it's hard for a relatively small number of experienced > Wikipedians to keep all the junk and vandals out while also mentoring the > newbies and avoiding collateral damage, so one strategy could be to increase > the quantity of skilled human resources that are devoted to these domains. > Any thoughts on how to make that happen? 
> > I am currently especially interested in this topic because of my IEG project > which officially starts this week. [6] It would be very helpful to retain > the new editors that are trained through these videos, so improving editor > retention via improved newbie experiences at NPP and/or AFC would be most > welcome. > > Pine > > [1] https://en.wikipedia.org/wiki/Community_policing > [2] https://en.wikipedia.org/wiki/Police_reform_in_the_United_States > [3] > https://commons.wikimedia.org/wiki/File:Presentations_at_Cascadia_Wikimedians_and_Seattle_TA3M_meetup,_December_2015.jpg > [4] https://en.wikipedia.org/wiki/Wikipedia:Articles_for_creation > [5] > https://commons.wikimedia.org/wiki/File:SWOT_analysis_of_Wikipedia_in_2015.jpg > [6] > https://meta.wikimedia.org/wiki/Grants:IEG/Motivational_and_educational_video_to_introduce_Wikimedia > > > _______ > Wiki-research-l mailing list > Wiki-research-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > -- Oliver Keyes Count Logula Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Community policing, New Page Patrol, Articles for Creation, and editor retention
Well, we don't really have a judicial approach either; judges get booted when they're biased or refusing to apply the law ;). I would agree that it is a small circle of people, and I would agree that they have a far larger impact than numbers would suggest. Community Advocacy is currently running a harassment consultation at https://meta.wikimedia.org/wiki/Harassment_consultation_2015 - I suggest looking at the proposals there. On 15 December 2015 at 19:00, Pine W <wiki.p...@gmail.com> wrote: > Maybe it's just the circles that I happen to circulate in, but it seems to > me that a very small percentage of Wikipedians tend to be consistently harsh > or toxic, and that small number of people tends to have disproportionately > negative influence on the atmosphere in the community. Aligned with Jimbo's > comments at Wikimania 2014 in London, I do wonder if their caustic nature > rises to the level where they should be excluded from the community, and if > so, on what grounds we would make that exclusion. Being a relentless critic > doesn't necessarily rise to the level of harassment if it's done broadly > rather than directed at a particular individual or group, but looking at the > problem from an HR perspective rather than a judicial one, I agree that > maybe more should be done to exclude toxic personalities. I wonder, though, > how we can do that; our process for excluding people from the community is > more like a judicial process than like an HR process. Maybe we need more of > an HR approach? > > Pine > > On Tue, Dec 15, 2015 at 3:51 PM, Oliver Keyes <oke...@wikimedia.org> wrote: >> >> We can probably talk about the nature of new page patrol without >> resorting to comparisons to violent, real-world overreactions with >> multiple serious injuries. >> >> To be perfectly honest as a new page patroller the biggest issue I've >> seen is toxic senior members of the community making the prospect of >> patrolling particularly unpleasant. 
It doesn't do much for patroller >> numbers. >> >> On 15 December 2015 at 18:28, Pine W <wiki.p...@gmail.com> wrote: >> > Yesterday I gave a presentation about community policing at the Cascadia >> > Wikimedians' end of year event with Seattle TA3M [1][2][3]. An issue >> > that >> > came up for discussion is the extent to which, on English Wikipedia, >> > experienced Wikipedians conducting New Page Patrol create collateral >> > damage >> > during their well-intentioned efforts to protect Wikipedia. Another >> > subject >> > that came up is the need for more human resources for mentoring of >> > newbies >> > who create articles using the Articles for Creation system [4]; one >> > comment >> > I've heard previously is that the length of time between submission and >> > review may be long enough for the newbie to give up and disappear, and >> > another comment that I've heard is that newbies may not understand the >> > instructions that they're given when their article is reviewed. These >> > comments correlate with the community SWOT analysis that was done at >> > WikiConference USA this year, in which "biting the newbies", NPP, and >> > "onboarding/training" were identified as weaknesses [5] >> > >> > Personally, I would like the interaction of experienced editors with the >> > newbies in places like NPP and AFC to look more like this and less like >> > this. Granted, it's hard for a relatively small number of experienced >> > Wikipedians to keep all the junk and vandals out while also mentoring >> > the >> > newbies and avoiding collateral damage, so one strategy could be to >> > increase >> > the quantity of skilled human resources that are devoted to these >> > domains. >> > Any thoughts on how to make that happen? >> > >> > I am currently especially interested in this topic because of my IEG >> > project >> > which officially starts this week. 
[6] It would be very helpful to >> > retain >> > the new editors that are trained through these videos, so improving >> > editor >> > retention via improved newbie experiences at NPP and/or AFC would be >> > most >> > welcome. >> > >> > Pine >> > >> > [1] https://en.wikipedia.org/wiki/Community_policing >> > [2] https://en.wikipedia.org/wiki/Police_reform_in_the_United_States >> > [3] >> > >> > https://commons.wikimedia.org/wiki/File:Presentations_at_Cascadia_Wikimedians_and_Seattle_TA3M_meetup,_December_2015.jpg >> > [4] https://en.wikipedia.o
Re: [Wiki-research-l] Community policing, New Page Patrol, Articles for Creation, and editor retention
Well, what is and isn't a reliable source is discussed at various noticeboards and set into stone, so it's more like saying "you published this in a journal on Beall's list" On 15 December 2015 at 20:35, Kerry Raymond <kerry.raym...@gmail.com> wrote: > I agree with Pine. It’s often patterns of behaviour that are more > significant than some individual incident. The drip-drip-drip of constant > criticism from a colleague can wear out most people. And if it’s done with > AWB or other tool, it’s very easy to grind down other people down, > especially as most people don’t know what ways they have to complain about > such behaviour and, in any case, most complaints have to lodged on-wiki > (which presumably discourages most people from doing it). Why do we allow > the bullies to write the rules of this playground? > > > > For example, there is a user account that removes the word “comprises”, a > word their user page says they don’t like for various reasons (but none of > which appear to relate to Wikipedia policy) . Why is this one user through > their persistence allowed to decide what words are used in Wikipedia > articles? Another bully (and I can see no other way to describe their > behaviour) has a long edit history full of reversions with the edit summary > “no source provided” or “not a reliable source” (which seems to be something > you can say about just about anything – rather like the way you can > criticise most research with “but, with a larger longer study, it might show > different results?”). 
> > > > Kerry > > > > > > From: Wiki-research-l [mailto:wiki-research-l-boun...@lists.wikimedia.org] > On Behalf Of Pine W > Sent: Wednesday, 16 December 2015 10:11 AM > To: Research into Wikimedia content and communities > <wiki-research-l@lists.wikimedia.org> > Subject: Re: [Wiki-research-l] Community policing, New Page Patrol, Articles > for Creation, and editor retention > > > > The problems that I'm contemplating here are, for better and for worse, > outside the scope of what I would consider harassment. I think that they > could be described as toxic interactions in general, and/or a shortage of or > long-delayed positive interactions at places like NPP and AFC. > > Pine > > > > On Tue, Dec 15, 2015 at 4:02 PM, Oliver Keyes <oke...@wikimedia.org> wrote: > > Well, we don't really have a judicial approach either; judges get > booted when they're biased or refusing to apply the law ;). I would > agree that it is a small circle of people, and I would agree that they > have a far larger impact than numbers would suggest. Community > Advocacy is currently running a harassment consultation at > https://meta.wikimedia.org/wiki/Harassment_consultation_2015 - I > suggest looking at the proposals there. > > > On 15 December 2015 at 19:00, Pine W <wiki.p...@gmail.com> wrote: >> Maybe it's just the circles that I happen to circulate in, but it seems to >> me that a very small percentage of Wikipedians tend to be consistently >> harsh >> or toxic, and that small number of people tends to have disproportionately >> negative influence on the atmosphere in the community. Aligned with >> Jimbo's >> comments at Wikimania 2014 in London, I do wonder if their caustic nature >> rises to the level where they should be excluded from the community, and >> if >> so, on what grounds we would make that exclusion. 
Being a relentless >> critic >> doesn't necessarily rise to the level of harassment if it's done broadly >> rather than directed at a particular individual or group, but looking at >> the >> problem from an HR perspective rather than a judicial one, I agree that >> maybe more should be done to exclude toxic personalities. I wonder, >> though, >> how we can do that; our process for excluding people from the community is >> more like a judicial process than like an HR process. Maybe we need more >> of >> an HR approach? >> >> Pine >> >> On Tue, Dec 15, 2015 at 3:51 PM, Oliver Keyes <oke...@wikimedia.org> >> wrote: >>> >>> We can probably talk about the nature of new page patrol without >>> resorting to comparisons to violent, real-world overreactions with >>> multiple serious injuries. >>> >>> To be perfectly honest as a new page patroller the biggest issue I've >>> seen is toxic senior members of the community making the prospect of >>> patrolling particularly unpleasant. It doesn't do much for patroller >>> numbers. >>> >>> On 15 December 2015 at 18:28, Pine W <wiki.p...@gmail.com> wrote: >>> > Yesterday I gave a presentation
[Wiki-research-l] R client for the new Pageviews API
Hey! As y'all may have seen, we have a new pageviews API, with much finer granularity and better recall than the existing data. Since I had advance notice of the release, I was able to put together an R client already - you can get it at https://github.com/Ironholds/pageviews if R is your language of choice, and it'll be up on CRAN shortly. Thanks, -- Oliver Keyes Count Logula Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
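For anyone not working in R, the same data is reachable over plain HTTP. A minimal Python sketch of building a per-article request URL (the URL template follows the public Wikimedia Pageviews REST API; the project, article, and date values are just examples):

```python
# Build a request URL for the Wikimedia Pageviews REST API
# (per-article endpoint). Parameter values below are illustrative.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageviews_url(project, article, start, end,
                  access="all-access", agent="user", granularity="daily"):
    """Return the per-article pageviews URL for the given parameters."""
    return (f"{BASE}/{project}/{access}/{agent}/{article}/"
            f"{granularity}/{start}/{end}")

url = pageviews_url("en.wikipedia", "R_(programming_language)",
                    "20151101", "20151117")
# Fetch with e.g. urllib.request.urlopen(url); the response is JSON with
# one record per day in the requested range.
```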
Re: [Wiki-research-l] [Analytics] R client for the new Pageviews API
Huh, did Dan not send it to the research list? Curses! See https://lists.wikimedia.org/pipermail/analytics/2015-November/004529.html On 17 November 2015 at 22:12, Giovanni Luca Ciampaglia <gciam...@indiana.edu> wrote: > Interesting! I didn't know a new API had been released :-) > > When can I find more documentation about it? > > Cheers, > > G > > > Giovanni Luca Ciampaglia ∙ Assistant Research Scientist, Indiana University > > > On Tue, Nov 17, 2015 at 9:55 PM, Oliver Keyes <oke...@wikimedia.org> wrote: >> >> Shall do! I'm already linking in the internal documentation :) >> >> On 17 November 2015 at 21:11, Madhumitha Viswanathan >> <mviswanat...@wikimedia.org> wrote: >> > Woot! Nice :) Would be cool to link to the API docs from your README >> > too. >> > >> > On Tue, Nov 17, 2015 at 5:54 PM, Oliver Keyes <oke...@wikimedia.org> >> > wrote: >> >> >> >> Hey! >> >> >> >> As y'all may have seen, we have a new pageviews API, with much finer >> >> granularity and better recall than the existing data. Since I had >> >> advance notice of the release, I was able to put together an R client >> >> already - you can get it at https://github.com/Ironholds/pageviews if >> >> R is your language of choice, and it'll be up on CRAN shortly. 
>> >> >> >> Thanks, >> >> >> >> -- >> >> Oliver Keyes >> >> Count Logula >> >> Wikimedia Foundation >> >> >> >> ___ >> >> Analytics mailing list >> >> analyt...@lists.wikimedia.org >> >> https://lists.wikimedia.org/mailman/listinfo/analytics >> > >> > >> > >> > >> > -- >> > --Madhu :) >> > >> > ___ >> > Analytics mailing list >> > analyt...@lists.wikimedia.org >> > https://lists.wikimedia.org/mailman/listinfo/analytics >> > >> >> >> >> -- >> Oliver Keyes >> Count Logula >> Wikimedia Foundation >> >> ___ >> Wiki-research-l mailing list >> Wiki-research-l@lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > > > > ___ > Analytics mailing list > analyt...@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/analytics > -- Oliver Keyes Count Logula Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] [Analytics] R client for the new Pageviews API
Shall do! I'm already linking in the internal documentation :) On 17 November 2015 at 21:11, Madhumitha Viswanathan <mviswanat...@wikimedia.org> wrote: > Woot! Nice :) Would be cool to link to the API docs from your README too. > > On Tue, Nov 17, 2015 at 5:54 PM, Oliver Keyes <oke...@wikimedia.org> wrote: >> >> Hey! >> >> As y'all may have seen, we have a new pageviews API, with much finer >> granularity and better recall than the existing data. Since I had >> advance notice of the release, I was able to put together an R client >> already - you can get it at https://github.com/Ironholds/pageviews if >> R is your language of choice, and it'll be up on CRAN shortly. >> >> Thanks, >> >> -- >> Oliver Keyes >> Count Logula >> Wikimedia Foundation >> >> ___ >> Analytics mailing list >> analyt...@lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/analytics > > > > > -- > --Madhu :) > > _______ > Analytics mailing list > analyt...@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/analytics > -- Oliver Keyes Count Logula Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Verifying claims about ENWP project size
Re: [Wiki-research-l] Has the recent increase in English wikipedia's core community gone beyond a statistical blip?
"Until we can prove it is good data we should treat it as good data" is not how data works. Absent exactly that analysis it is almost certainly a bad idea for us to declare this to be good news; validate, /then/ celebrate. On 24 August 2015 at 12:26, WereSpielChequers werespielchequ...@gmail.com wrote: 100 edits a month does indeed have the disadvantage that all edits are not equal; there may be some people for whom that represents 100 hours contributed, others a single hour. So an individual month could be inflated by something as trivial as a vandal-fighting bot going down for a couple of days and a bunch of oldtimers responding to a call on IRC by coming back and running Huggle for an hour. But 7 months in a row where the total is higher than the same month the previous year looks to me like a pattern. Across the 3,000 or so editors on English Wikipedia who contribute over a hundred edits per month there could be a hidden pattern of an increase in Huggle, STiki and AWB users more than offsetting a decline in manual editing, but unless anyone analyses that and reruns those stats on some metric such as unique calendar hours in which someone saves an edit, I think it best to treat this as an imperfect indicator of community health. I'm not suggesting that we are out of the woods - there are other indicators that are still looking bad, and I would love to see a better proxy for active editors. But this is good news. On 23 August 2015 at 19:31, Mark J. Nelson m...@anadrome.org wrote: WereSpielChequers werespielchequ...@gmail.com writes: Could you be more specific re "In general I'm not sure the 100+ count is among the most reliable"? What in particular do you think is unreliable about that metric? The main thing I have questions about with that metric is whether it's a good proxy for editing activity in general, or is dominated by fluctuations in bookkeeping contributions, i.e. 
people doing mass-moves of categories and that kind of thing (which makes it quite easy to get to 100 edits). This has long been a complaint about edit counts as a metric, which have never really been solidly validated. Looking through my own personal editing history, it looks like there's an anti-correlation between hitting the 100-edit threshold and making more substantial edits. In months when I work on article-writing I typically have only 20-30 edits, because each edit takes a lot of library research, so I can't make more than one or two a day. In months where I do more bookkeeping-type edits I can easily have 500 or 1000 edits. But that's just for me; it's certainly possible that Wikipedia-wide, there's a good correlation between raw edit count and other kinds of desirable activity measures. But is there evidence of that? -- Mark J. Nelson Anadrome Research http://www.kmjn.org ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Count Logula Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
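The "unique calendar hours in which someone saves an edit" metric WereSpielChequers proposes is cheap to compute from revision timestamps. A minimal Python sketch (the sample timestamps are invented for illustration; in practice they would come from the revision history in a dump or the API):

```python
from datetime import datetime

def unique_edit_hours(timestamps):
    """Count the distinct calendar hours in which at least one edit was saved.

    `timestamps` is an iterable of ISO-8601 timestamp strings.
    """
    hours = {datetime.fromisoformat(ts).strftime("%Y-%m-%d %H")
             for ts in timestamps}
    return len(hours)

# Three edits, but only two distinct hours of activity:
edits = ["2015-08-01T10:05:00", "2015-08-01T10:59:00", "2015-08-01T11:02:00"]
print(unique_edit_hours(edits))  # → 2
```

Unlike a raw edit count, this would score a 500-edit AWB run in one sitting the same as a single hand-written edit in that hour, which is exactly the deflation of bookkeeping activity Mark is asking about.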
Re: [Wiki-research-l] How to read blobs in text table?
If we're talking Wikimedia MediaWiki instances, yes, the API is your only way forward - for performance reasons the text content is stored in a totally different set of servers that (to my knowledge) even paid researchers don't get to mess around with. Alternatively, you could take a look at https://dumps.wikimedia.org if slightly outdated information is okay for you. On 29 July 2015 at 18:58, Srijan Kumar srijanke...@gmail.com wrote: Hi! I want to read the text stored in the text tables[1], but the old_text field stores what seems to be the path to the blob. How can I get the content of the blob? Alternatively, is there any other way to access all text content (including deleted content) without requiring global rights to the API? Thanks! Srijan [1] https://www.mediawiki.org/wiki/Manual:Text_table ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
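To make the API route concrete: current (non-deleted) revision text is retrievable through the standard MediaWiki action API without any special rights, which sidesteps the blob storage entirely. A minimal Python sketch, assuming the public en.wikipedia endpoint (parameters per `action=query` with `prop=revisions`; fetching *deleted* text would still require elevated rights, as noted above):

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def revision_text_query(title):
    """Build a MediaWiki API URL for the latest revision text of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    return API + "?" + urlencode(params)

url = revision_text_query("Text table")
# Fetch with e.g. urllib.request.urlopen(url); the JSON response carries
# the wikitext, so you never touch the old_text blob pointers directly.
```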
[Wiki-research-l] Trying to find a paper..
If anyone has a copy of Rehurek & Kolkus's "Language Identification on the Web: Extending the Dictionary Method" from 2009, could they send it to me? Thanks! -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Trying to find a paper..
Woah, now resolved! This community is awesome :D On 28 July 2015 at 15:01, Oliver Keyes oke...@wikimedia.org wrote: If anyone has a copy of Rehurek & Kolkus's "Language Identification on the Web: Extending the Dictionary Method" from 2009, could they send it to me? Thanks! -- Oliver Keyes Research Analyst Wikimedia Foundation -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Aidez à améliorer l'exhaustivité de Wikipédia en français [Help improve the completeness of French Wikipedia]
have an account with the same username in the source language, have made at least one edit in both the source and target Wikipedias, have made at least one edit in either language within the last year and have matching email addresses for the two accounts. Based on the feedback from the test, it is clear that we need to raise the bar on the contributions to source/destination languages for the future steps. We initially had a 100 byte limit in each of the source and destination language in the past year as a bar, but that one somehow didn't get to the code (code issue) and we didn't realize this until we received the feedback. Based on the feedback, we may want to consider even higher bars for choosing editors, one thing we do not want to do is to ignore those with few edits completely. Those may be people who have contributed few times and recommendations can encourage them to contribute more and come back. Any feedback on how we can improve this aspect further is appreciated. On Fri, Jun 26, 2015 at 6:58 AM, Samuel Klein meta...@gmail.com wrote: Interesting, I figured I received the mail because of joining translation projects. It seems that it's enough to have made a single edit in both language wikipedias in the last year. we changed the wording of the page to make it clearer. I think there was a confusion caused by our wording. please read here. I hope you will do this in both directions for each language pair (both suggestions from FR -- EN and from EN -- FR.) the way the algorithm makes the final recommendations is language agnostic so we can easily expand them to other language pairs. the goal is to have them for the top 30 languages (to and from), the top 50 if we have enough data to make good enough recommendations. We do hope that the engineering aspect of receiving these recommendations can also move as fast so we can offer the editors the recommendations in a way that works smoothly with their workflow. 
On Fri, Jun 26, 2015 at 8:32 AM, Jim tro...@gmail.com wrote: I strongly disagree that this is spamming. Like others have mentioned, I was not offended by the email (though I wasn't delighted) by it either, I think it is a reasonable attempt to encourage editors to put some efforts into languages other than English. Plus it is easy to unsubscribe from the research mailing list. Thanks for sharing your point of view and happy to hear we did not bother you by it. As mentioned earlier, I hope that we (all parties involved, not just research) can resolve the email conversation in a way that more people are happier with the outcome. On Fri, Jun 26, 2015 at 9:38 AM, Ziko van Dijk zvand...@gmail.com wrote: Spamming - a question what the e-mail function of WP is ment for. I was very surprised to get the request though my French is limited, I hardly ever edited on fr.WP, The feedback about limited French language knowledge is a great feedback that we have heard clearly. Thank you for sharing that and sorry that you were chosen. This is something we have already changed in our code to increase the threshold on the way we choose future participants. and the suggested topics have totally nothing to do with what I do on Wikipedia. So I do think that the mail was not quite appropriate, and it gives me a not so favorable impression about the people or initiative behind. I'm sorry if the recommendation has disappointed you. As mentioned in the recommendation email, you will be in one of the two groups: those who receive random but still important (with the algorithm's definition of importance) recommendations or those who receive personalized and important recommendations. Since we have not finalized the analysis of the test I cannot look to see which group you were in since that may have impact on the results. I hope this helps us build more trust, and hopefully we can learn much more when the results are out. Thank you for your time. Thanks again everyone. 
I will continue monitoring this list. We are also busy with the talk page so you may experience some delay. Apologize in advance if that happens. Just be sure that we will get back to you. :-) Best, Leila ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
[Wiki-research-l] Anti-harassment policies within the R community
Hey all, This isn't directly WP-related, except in the sense that it relates to a toolset I think a lot of researchers in HCI tend to use (heck, a lot of researchers, full stop) As any of you who've had the misfortune to spend more than 5 minutes talking to me in the last few years will know, I'm kind of fanatical about the R programming language. It's commonly used within statistical analysis and even Python users wander over to R land for graphing ;). While the Python Software Foundation has an anti-harassment policy for conferences they run or sponsor, the R Foundation does not - and so we've started an open letter asking for one to be instituted. The letter lives at https://docs.google.com/document/d/1C1oPhup72lPHJXbpyJZNIo1BdCzfxZ_VoiWJWzmufjU/edit If you're an R programmer or someone with an interest in that field, and you're interested in signing, simply: 1. Put your name in as a suggested edit, in the format existing signers are using; 2. Email me so I can confirm that the person signing as Foo Bar is genuinely Foo Bar; 3. Done! Many thanks, -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Community health (retitled thread)
We should concentrate on factual data for research in a long email about how everything is ruined forever because a moderator couldn't find anything of value in an uncited claim that Jan-Bart actively drove people away? This must be what people mean by mixed methods ;) On 4 June 2015 at 18:29, Juergen Fenn jf...@gmx.net wrote: Am 04.06.2015 um 19:11 schrieb Federico Leva (Nemo) nemow...@gmail.com: Reduced traffic on Wikimedia-l is mostly due to list moderation. That's plausible. Most people on wikimedia-l are moderated by now; I and others unsubscribed due to tyrannical moderation, too. Well, not exactly tyrannical, as there obviously is a plan behind it not to let anything critical sound on a list that no longer serves the community, but that is just another tool for corporate communication. Of course, I also unsubscribed from the list, and I will read it from the archive only because I don't subscribe to corporate communication lists. I agree to Aaron that we should concentrate on factual data for research. However, I'd like to give you an idea of what nowadays is no longer possible on Wikimedia-l because this also serves as an indicator for the health of Wikipedia. The text of my censored email read: Jan-Bart, you might be aware that it was you that drove many talented candidates out of the movement last year. So, no more comment. And the moderator gave this comment for his decision: I could not find anything positive in your message. How would it help the goals of the Wikimedia movement? Regards, Richard. That's the way Wikipedia works in 2015. So, no more comments about community health or whoever's health. I'm off. Best, Jürgen. ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Community health (retitled thread)
Anecdata, but: as someone who no longer posts to wikimedia-l, I stopped posting because I find it a fundamentally toxic place to be. On 4 June 2015 at 13:55, Aaron Halfaker aaron.halfa...@gmail.com wrote: Hi Claudia, which of Juergen's statements do you mean? All of them, but mostly the explanation for the drop in traffic. do you have any evidence for the contrary? Please don't assume that my call for evidence suggests my disagreement. I'm an empiricist and this is the research mailing list. IMO, claims need evidence or should be carefully framed as speculation or hypothesis. In this context, it is good practice to request that those making statements of fact produce justification. It seems like it would be helpful to move this conversation forward if someone were to find the dates of the policy change and compare it to the rate of moderated messages. I also suggest that any further discussion about WMF policies or board decisions (outside of their measurable effects, theoretical implications, etc.) be taken to a more appropriate forum. -Aaron On Thu, Jun 4, 2015 at 10:05 AM, koltzenb...@w4w.net wrote: Hi Aaron, which of Juergen's statements do you mean? my question is: do you have any evidence for the contrary? best, Claudia -- Original Message --- From:Aaron Halfaker aaron.halfa...@gmail.com To:Research into Wikimedia content and communities wiki-research- l...@lists.wikimedia.org Sent:Thu, 4 Jun 2015 09:55:02 -0500 Subject:Re: [Wiki-research-l] Community health (retitled thread) Hi Juergen, That's an interesting hypothesis. Do you have any evidence to support it? On Thu, Jun 4, 2015 at 9:50 AM, Juergen Fenn jf...@gmx.net wrote: Am 04.06.2015 um 16:33 schrieb Samuel Klein meta...@gmail.com: Context: reduced traffic on wikimedia-l. Is this a sign of poor community health? Reduced traffic on Wikimedia-l is mostly due to list moderation. All critical content has been filtered for a while. 
I became aware of it only recently when I posted a critical remark which was rejected. The Foundation has not only introduced Superprotect and Superban, it has also got a firm grip on all communication channels whatsoever. This will intensify as, we have just learned, new staff will be hired for communication. The WMF no longer needs the community, it does all the traffic itself (staff, chapters, etc.). Of course this shift from crowdsourcing to staff has to be paid for, hence the interest in an ever-increasing flow of donations and hence the interest in the Alexa ranking of Wikipedia. Best, Jürgen. ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l --- End of Original Message --- ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Two articles of interest regarding Wikipedia research
"But the most interesting finding, at least for Greenstein, was that Wikipedia articles with more revision in them had less bias and were less likely to lean Democratic." IOW, articles that are genuinely crowdsourced are more neutral. Good job, clickbait headline writer. On 3 June 2015 at 16:48, Pine W wiki.p...@gmail.com wrote: Can Wikipedia Be Trusted? http://insight.kellogg.northwestern.edu/article/can-wikipedia-be-trusted The Unknown Perils of Mining Wikipedia https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-wikipedia/ Pine ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] Wiki-research-l Digest, Vol 117, Issue 14
I can happily check the sampled logs for hits to those pages prior to and on those dates, if that'd help? On 11 May 2015 at 23:08, R. Stuart Geiger sgei...@gmail.com wrote: Going from 86,000,000 a month to 31,000 a month is quite a drop, and the shift is pretty dramatic. It goes from 1.7 million one day to 715 the next and stays flat (http://stats.grok.se/en/201410/Special:Random). I was also thinking there could be a bot or something that is scraping Special:Random, but the drop also happens for Special:Random/Talk -- which hardly anybody uses, but it still drops flat the same day (http://stats.grok.se/en/201410/Special:Random/Talk). It doesn't happen for Special:Upload or Special:Log though. October 16th, 2014 is the day it changes. Anybody know of something that might have changed that day with logging? Also, there have to be way more than ~1,000 hits a day to Special:Random. Perhaps pageviews started to be counted for the page that it got redirected to, rather than the Special:Random page itself. But then why wouldn't it go to 0? What are those ~1,000 hits a day? ~~ it is a mystery ~~ On Mon, May 11, 2015 at 7:44 PM, Oliver Keyes oke...@wikimedia.org wrote: A reduction or alteration in automata activity, possibly? Erik's dumps contain literally no filtering for scammers or crawlers, and we're a hot locale for spammer activity. 
On 11 May 2015 at 08:09, Alex Druk alex.d...@gmail.com wrote: I just grep the monthly totals from Erik Zachte's dumps, http://dumps.wikimedia.org/other/pagecounts-ez/merged/ (grep "^en.z Special:Random"). On Mon, May 11, 2015 at 2:00 PM, wiki-research-l-requ...@lists.wikimedia.org wrote: Message: 1 Date: Sun, 10 May 2015 08:30:37 -0400 From: Oliver Keyes oke...@wikimedia.org To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org Subject: Re: [Wiki-research-l] How to explain drop in random searches Using what data? On 10 May 2015 at 05:29, Alex Druk alex.d...@gmail.com wrote: Hi everyone, I am trying to study the dynamics of random searches (Special:Random) on English Wikipedia. From 01/2012 to 10/2014 the average number of random searches per month was about 86 million, or about 30% of Main_Page pageviews, but from November 2014 it dropped to 31,000 per month (or 0.008% of Main_Page). How can such a dramatic drop be explained? Any ideas? -- Thank you. 
Alex Druk, PhD wikipediatrends.com alex.d...@gmail.com (775) 237-8550 Google voice ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation -- ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l End of Wiki-research-l Digest, Vol 117, Issue 14 -- Thank you. Alex Druk alex.d...@gmail.com (775) 237-8550 Google voice ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
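The grep-and-sum approach Alex describes against the pagecounts-ez merged dumps can be sketched in a few lines of Python. This is a sketch only: it assumes each merged-file line has the shape `project title monthly_total hourly_encoding`, which should be checked against the format notes shipped alongside the dumps before relying on it.

```python
import re

def monthly_total(lines, project="en.z", title="Special:Random"):
    """Sum the monthly view totals for one page in one project from
    pagecounts-ez 'merged' dump lines.

    Assumed line layout (verify against the dump's own format notes):
        <project> <title> <monthly_total> <hourly_encoding>
    """
    # Anchor on project and exact title, then capture the count field.
    pattern = re.compile(r"^%s %s (\d+)" % (re.escape(project), re.escape(title)))
    total = 0
    for line in lines:
        m = pattern.match(line)
        if m:
            total += int(m.group(1))
    return total

# Toy input standing in for a real (decompressed) dump file:
sample = [
    "en.z Main_Page 86000000 AxBy...",
    "en.z Special:Random 715 Cz...",
    "en.z Special:Random/Talk 12 Dq...",
]
```

Note that anchoring on the full title plus a following space keeps `Special:Random` from also matching `Special:Random/Talk`, the distinction Stuart's stats.grok.se links turn on.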
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
Makes sense! I actually hadn't factored in that sort of action (although it does happen), more: the order of the main page links on the root www.wikipedia.org page. On 7 May 2015 at 03:51, Scott Hale computermacgy...@gmail.com wrote: The accept-language header is the obvious place to start, but there is ample scope to combine multiple approaches. In addition to accept-language and geolocation data, any logged-in user will have view/edit history related to multiple editions. If the user is requesting a specific article (e.g., https://www.wikipedia.org/wiki/普天間飛行場), we can also take account of which editions actually have the article --- the vast majority of content on Wikipedia exists in only one language or a few languages. (I.e., the above link redirects me to create the article on en-wiki, although it exists on ja-wiki, Japanese is my second preferred language by my accept-language header, and ja-wiki is an edition I edit, as captured in my edit history.) This isn't an either-or question of which to use, but rather a question of how all these indicators can be used together to create the best experience. I would venture that most users don't change their accept-language header (not even possible on some mobile browsers!) and hence probably list only one language. If so, geography and edit history can be signals for possible second languages beyond the one language in the accept-language header when hitting the homepage without a specific article. Cheers, Scott P.S. It looks like the Universal Language Selector already uses the accept-language header for its preference screen. On Thu, May 7, 2015 at 5:58 AM, Oliver Keyes oke...@wikimedia.org wrote: As I've now said...4 times, I don't think we'd be using geolocation. We'd be using the accept-language header. 
See https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Accept-Language On 7 May 2015 at 00:52, WereSpielChequers werespielchequ...@gmail.com wrote: When a reader comes to Wikipedia from the web we can detect their IP address and that usually geolocates them to a country. More often than not that then tells you the dominant language of that country. If we were to default to official or dominant languages then I predict endless arguments as to which language(s) should be the default in which countries. The large expat community in some parts of the Arab world might prefer English over Arabic. India would want to do things by state, and a whole new front would emerge in the Israeli Palestine debate. Regards Jonathan Cardy ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
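Scott's several-signals idea can be sketched as a toy scoring function. Everything here is illustrative and invented for the example — the function name, the inputs, and especially the +1.0 availability bonus are not any existing WMF heuristic:

```python
def rank_editions(accept_language, editions_with_article):
    """Rank language editions for a request by combining Accept-Language
    q-values with whether the requested article exists in each edition.

    accept_language: list of (language_tag, q) pairs from the header.
    editions_with_article: set of base language codes where the article
    already exists.

    The +1.0 bonus for availability is an arbitrary illustrative weight:
    it makes "the article exists here" outrank any q-value difference.
    """
    scores = {}
    for tag, q in accept_language:
        base = tag.split("-")[0].lower()   # 'en-GB' counts toward 'en'
        score = q + (1.0 if base in editions_with_article else 0.0)
        scores[base] = max(scores.get(base, 0.0), score)
    # Highest-scoring editions first.
    return sorted(scores, key=scores.get, reverse=True)
```

With header preferences `[("en", 1.0), ("ja", 0.9)]` and an article that exists only on ja-wiki, ja outranks en (1.9 vs 1.0) — the shape of Scott's Futenma example, where the existing ja-wiki article should beat a red link on en-wiki.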
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
Thanks for the bugs, Nemo! (search team: should we take those over?) On 7 May 2015 at 03:08, Federico Leva (Nemo) nemow...@gmail.com wrote: Thanks for looking into www.wikipedia.org traffic from India; I've been complaining about it for a while. :) See also: * https://phabricator.wikimedia.org/T26767 * https://phabricator.wikimedia.org/T5665 Mark J. Nelson, 07/05/2015 04:24: But for the average Copenhagener, the following order is far more likely: * Danish, English, Norwegian Bokmål, ... This is something you can help fix. Please do! https://www.mediawiki.org/wiki/ULS/FAQ#language-territory Nemo ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
Interesting! This I didn't know; I'll factor it in :). On 7 May 2015 at 04:48, Stuart A. Yeates syea...@gmail.com wrote: Accept-language is systematically broken for minority languages within dominant language communities. In New Zealand, a country with three official languages and a textbook case of language revivalism, I've never met anyone without a degree in computer science who sets accept-language, and I've never seen a computer system which ships with all three official languages selectable. Most computer systems ship with en or en-us as the default. If there were silver bullets in this area, the solution would be obvious and we wouldn't even be thinking about having this conversation. cheers stuart On Thursday, May 7, 2015, Oliver Keyes oke...@wikimedia.org wrote: As I've now said...4 times, I don't think we'd be using geolocation. We'd be using the accept-language header. See https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Accept-Language On 7 May 2015 at 00:52, WereSpielChequers werespielchequ...@gmail.com wrote: When a reader comes to Wikipedia from the web we can detect their IP address and that usually geolocates them to a country. More often than not that then tells you the dominant language of that country. If we were to default to official or dominant languages then I predict endless arguments as to which language(s) should be the default in which countries. The large expat community in some parts of the Arab world might prefer English over Arabic. India would want to do things by state, and a whole new front would emerge in the Israeli Palestine debate. Regards Jonathan Cardy On 7 May 2015, at 05:06, Sam Katz smk...@gmail.com wrote: hey guys, you can't guess geolocation, because occasionally you'd be wrong. this happens to me all the time. I want to read a site in spanish... and then it thinks I'm in Latin America, when I'm not. --Sam On Wed, May 6, 2015 at 10:07 PM, Oliver Keyes oke...@wikimedia.org wrote: Possibly. 
But that sounds potentially wooly and sometimes inaccurate. When a browser makes a web request, it sends a header called the accept_language header (https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Accept-Language) which indicates what languages the browser finds ideal - i.e., what languages the user and system are using. If we're going to make modifications here (I hope we will. But again; early days) I don't see a good argument for using geolocation, which is, as you've noted, flawed without substantial time and energy being applied to map those countries to probable languages. The data the browser already sends to the server contains the /certain/ languages. We can just use that. On 6 May 2015 at 22:50, Stuart A. Yeates syea...@gmail.com wrote: This seems like a great place to use analytics data, for each division in the geo-location classification, rank each of the languages by usage and present the top N as likely candidates (+ browser settings) when we need the user to pick a language. cheers stuart -- ...let us be heard from red core to black sky On Thu, May 7, 2015 at 2:24 PM, Mark J. Nelson m...@anadrome.org wrote: Stuart A. Yeates syea...@gmail.com writes: Reading that excellent presentation, the thought that struck me was: If I wanted to subvert the assumption that Wikipedia == en.wiki, linking to http://www.wikipedia.org/ is what I'd do. A smarter http://www.wikipedia.org/ might guess geo-location and thus local languages. I'd also like to see something smarter done at the main page, but the and thus bit here is notoriously tricky. For example most geolocation-based things, like Wikidata by default, tend to produce funny results in Denmark. A Copenhagener is offered something like this choice, in order: * Danish, Greelandic, Faroese, Swedish, German, ... 
The reasoning here is that Danish, Greenlandic, and Faroese are official languages of the Danish Realm, which includes both Denmark proper and two autonomous territories, Greenland and the Faroe Islands. And then Sweden and Germany are the two neighboring countries. But for the average Copenhagener, the following order is far more likely: * Danish, English, Norwegian Bokmål, ... The reason here is that Norwegian Bokmål is very close to Danish in written form (more than Swedish is, and especially more than Faroese is), while English is a widely used semi-official language in business, government, and education (for example, about half of university theses are now written in English, and several major companies use it as their official workplace language). I think it's possible to come up with something that better aligns with readers' actual preferences, but it's not easy! -Mark -- Mark J. Nelson Anadrome Research http://www.kmjn.org
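For reference, the header Oliver keeps pointing at is a weighted list. A minimal sketch of parsing it into (language, q) pairs, following the basic q-value syntax; real code should also validate the tags and handle wildcards and malformed input more defensively:

```python
def parse_accept_language(header):
    """Parse an Accept-Language header value into (language_tag, q)
    pairs, sorted by descending q (ties keep header order).

    Follows the basic shape 'da, en-gb;q=0.8, en;q=0.7'; tag
    validation and the '*' wildcard are deliberately omitted.
    """
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        if ";" in item:
            lang, _, params = item.partition(";")
            try:
                q = float(params.split("=", 1)[1])
            except (IndexError, ValueError):
                q = 1.0   # malformed q-value: fall back to the default
            prefs.append((lang.strip(), q))
        else:
            prefs.append((item, 1.0))   # no q-value means q=1.0
    prefs.sort(key=lambda p: p[1], reverse=True)
    return prefs
```

So the Copenhagener in Mark's example, sending "da, en-gb;q=0.8, en;q=0.7", declares exactly the Danish-then-English ordering that geolocation-based guessing gets wrong.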
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
Traffic through Wikipedia zero; apologies for not being clear. On 6 May 2015 at 19:56, Sam Katz smk...@gmail.com wrote: hey oliver, I don't mean to be a help vampire... but what is zero traffic? you think the traffic is being proxied? perhaps even reverse proxied? --Sam On Wed, May 6, 2015 at 1:40 PM, Oliver Keyes oke...@wikimedia.org wrote: Cross-posting to research and analytics, too! -- Forwarded message -- From: Oliver Keyes oke...@wikimedia.org Date: 6 May 2015 at 13:11 Subject: Traffic to the portal from Zero providers To: wikimedia-sea...@lists.wikimedia.org Hey all, (Throwing this to the public list, because transparency is Good) I recently did a presentation on a traffic analysis to the Wikipedia home page - www.wikipedia.org.[1] One of the biggest visualisations, in impact terms, showed that a lot of portal traffic - far more, proportionately, than traffic to Wikipedia overall - is coming from India and Brazil.[2] One of the hypotheses was that this could be Zero traffic. I've done a basic analysis of the traffic, looking specifically at the zero headers,[3] and this hypothesis turns out to be incorrect - almost no zero traffic is hitting the portal. The traffic we're seeing from Brazil and India is not zero-based. This makes a lot of sense (the reason mobile traffic redirects to the enwiki home page from the portal is the Zero extension, so presumably this happens specifically to Zero traffic) but it does mean that our null hypothesis - that this traffic is down to ISP-level or device-level design choices and links - is more likely to be correct. 
[1] http://ironholds.org/misc/homepage_presentation.html [2] http://ironholds.org/misc/homepage_presentation.html#/11 [3] https://phabricator.wikimedia.org/T98076 -- Oliver Keyes Research Analyst Wikimedia Foundation -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
[Wiki-research-l] Fwd: Traffic to the portal from Zero providers
Cross-posting to research and analytics, too! -- Forwarded message -- From: Oliver Keyes oke...@wikimedia.org Date: 6 May 2015 at 13:11 Subject: Traffic to the portal from Zero providers To: wikimedia-sea...@lists.wikimedia.org Hey all, (Throwing this to the public list, because transparency is Good) I recently did a presentation on a traffic analysis to the Wikipedia home page - www.wikipedia.org.[1] One of the biggest visualisations, in impact terms, showed that a lot of portal traffic - far more, proportionately, than traffic to Wikipedia overall - is coming from India and Brazil.[2] One of the hypotheses was that this could be Zero traffic. I've done a basic analysis of the traffic, looking specifically at the zero headers,[3] and this hypothesis turns out to be incorrect - almost no zero traffic is hitting the portal. The traffic we're seeing from Brazil and India is not zero-based. This makes a lot of sense (the reason mobile traffic redirects to the enwiki home page from the portal is the Zero extension, so presumably this happens specifically to Zero traffic) but it does mean that our null hypothesis - that this traffic is down to ISP-level or device-level design choices and links - is more likely to be correct. [1] http://ironholds.org/misc/homepage_presentation.html [2] http://ironholds.org/misc/homepage_presentation.html#/11 [3] https://phabricator.wikimedia.org/T98076 -- Oliver Keyes Research Analyst Wikimedia Foundation -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
Agreed! That's one of the changes I'd really like to push ahead with, although we're going to do some more in-depth data collection before any redesign :). On 6 May 2015 at 20:27, Stuart A. Yeates syea...@gmail.com wrote: Reading that excellent presentation, the thought that struck me was: If I wanted to subvert the assumption that Wikipedia == en.wiki, linking to http://www.wikipedia.org/ is what I'd do. A smarter http://www.wikipedia.org/ might guess geo-location and thus local languages. cheers stuart -- ...let us be heard from red core to black sky On Thu, May 7, 2015 at 6:40 AM, Oliver Keyes oke...@wikimedia.org wrote: Cross-posting to research and analytics, too! -- Forwarded message -- From: Oliver Keyes oke...@wikimedia.org Date: 6 May 2015 at 13:11 Subject: Traffic to the portal from Zero providers To: wikimedia-sea...@lists.wikimedia.org Hey all, (Throwing this to the public list, because transparency is Good) I recently did a presentation on a traffic analysis to the Wikipedia home page - www.wikipedia.org.[1] One of the biggest visualisations, in impact terms, showed that a lot of portal traffic - far more, proportionately, than traffic to Wikipedia overall - is coming from India and Brazil.[2] One of the hypotheses was that this could be Zero traffic. I've done a basic analysis of the traffic, looking specifically at the zero headers,[3] and this hypothesis turns out to be incorrect - almost no zero traffic is hitting the portal. The traffic we're seeing from Brazil and India is not zero-based. This makes a lot of sense (the reason mobile traffic redirects to the enwiki home page from the portal is the Zero extension, so presumably this happens specifically to Zero traffic) but it does mean that our null hypothesis - that this traffic is down to ISP-level or device-level design choices and links - is more likely to be correct. 
[1] http://ironholds.org/misc/homepage_presentation.html [2] http://ironholds.org/misc/homepage_presentation.html#/11 [3] https://phabricator.wikimedia.org/T98076 -- Oliver Keyes Research Analyst Wikimedia Foundation -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
Totally! As said, I think accept-language is a better variable to operate from. But these are early days; we're just beginning to understand the space. Realistically, software changes will come a lot later :) On 6 May 2015 at 22:24, Mark J. Nelson m...@anadrome.org wrote: Stuart A. Yeates syea...@gmail.com writes: Reading that excellent presentation, the thought that struck me was: If I wanted to subvert the assumption that Wikipedia == en.wiki, linking to http://www.wikipedia.org/ is what I'd do. A smarter http://www.wikipedia.org/ might guess geo-location and thus local languages. I'd also like to see something smarter done at the main page, but the and thus bit here is notoriously tricky. For example most geolocation-based things, like Wikidata by default, tend to produce funny results in Denmark. A Copenhagener is offered something like this choice, in order: * Danish, Greelandic, Faroese, Swedish, German, ... The reasoning here is that Danish, Greenlandic, and Faroese are official languages of the Danish Realm, which includes both Denmark proper, and two autonomous territories, Greeland and the Faroe Islands. And then Sweden and Germany are the two neighboring countries. But for the average Copenhagener, the following order is far more likely: * Danish, English, Norwegian Bokmål, ... The reason here is that Norwegian Bokmål is very close to Danish in written form (more than Swedish is, and especially more than Faroese is) while English is a widely used semi-official language in business, government, and education (for example about half of university theses are now written in English, and several major companies use it as their official workplace language). I think it's possible to come up with something that better aligns with readers' actual preferences, but it's not easy! -Mark -- Mark J. 
Nelson Anadrome Research http://www.kmjn.org ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
Possibly. But that sounds potentially wooly and sometimes inaccurate. When a browser makes a web request, it sends a header called the accept_language header (https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Accept-Language) which indicates what languages the browser finds ideal - i.e., what languages the user and system are using. If we're going to make modifications here (I hope we will. But again; early days) I don't see a good argument for using geolocation, which is, as you've noted, flawed without substantial time and energy being applied to map those countries to probable languages. The data the browser already sends to the server contains the /certain/ languages. We can just use that. On 6 May 2015 at 22:50, Stuart A. Yeates syea...@gmail.com wrote: This seems like a great place to use analytics data, for each division in the geo-location classification, rank each of the languages by usage and present the top N as likely candidates (+ browser settings) when we need the user to pick a language. cheers stuart -- ...let us be heard from red core to black sky On Thu, May 7, 2015 at 2:24 PM, Mark J. Nelson m...@anadrome.org wrote: Stuart A. Yeates syea...@gmail.com writes: Reading that excellent presentation, the thought that struck me was: If I wanted to subvert the assumption that Wikipedia == en.wiki, linking to http://www.wikipedia.org/ is what I'd do. A smarter http://www.wikipedia.org/ might guess geo-location and thus local languages. I'd also like to see something smarter done at the main page, but the and thus bit here is notoriously tricky. For example most geolocation-based things, like Wikidata by default, tend to produce funny results in Denmark. A Copenhagener is offered something like this choice, in order: * Danish, Greelandic, Faroese, Swedish, German, ... 
The reasoning here is that Danish, Greenlandic, and Faroese are official languages of the Danish Realm, which includes both Denmark proper and two autonomous territories, Greenland and the Faroe Islands. And then Sweden and Germany are the two neighboring countries. But for the average Copenhagener, the following order is far more likely: * Danish, English, Norwegian Bokmål, ... The reason here is that Norwegian Bokmål is very close to Danish in written form (more than Swedish is, and especially more than Faroese is), while English is a widely used semi-official language in business, government, and education (for example, about half of university theses are now written in English, and several major companies use it as their official workplace language). I think it's possible to come up with something that better aligns with readers' actual preferences, but it's not easy! -Mark -- Mark J. Nelson Anadrome Research http://www.kmjn.org ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
One thing we could also do is check the accept_language header and prioritise around that; that way we'd be prioritising specifically the language the user's browser thinks they want. On 6 May 2015 at 21:28, Stuart A. Yeates syea...@gmail.com wrote: Probably also an excellent time to consider whether we can do anything for those languages which don't have wikis yet. For example, I'm in .nz, which has en, mi and nzs as official languages, but we're a long way from an nzs.wiki, given that ase.wiki is still in incubator. With the release of Unicode 8 with Sutton SignWriting in June, these may or may not kick off in a big way. cheers stuart -- ...let us be heard from red core to black sky On Thu, May 7, 2015 at 12:34 PM, Oliver Keyes oke...@wikimedia.org wrote: Agreed! That's one of the changes I'd really like to push ahead with, although we're going to do some more in-depth data collection before any redesign :). On 6 May 2015 at 20:27, Stuart A. Yeates syea...@gmail.com wrote: Reading that excellent presentation, the thought that struck me was: If I wanted to subvert the assumption that Wikipedia == en.wiki, linking to http://www.wikipedia.org/ is what I'd do. A smarter http://www.wikipedia.org/ might guess geo-location and thus local languages. cheers stuart -- ...let us be heard from red core to black sky On Thu, May 7, 2015 at 6:40 AM, Oliver Keyes oke...@wikimedia.org wrote: Cross-posting to research and analytics, too! 
-- Forwarded message -- From: Oliver Keyes oke...@wikimedia.org Date: 6 May 2015 at 13:11 Subject: Traffic to the portal from Zero providers To: wikimedia-sea...@lists.wikimedia.org Hey all, (Throwing this to the public list, because transparency is Good) I recently did a presentation on a traffic analysis to the Wikipedia home page - www.wikipedia.org.[1] One of the biggest visualisations, in impact terms, showed that a lot of portal traffic - far more, proportionately, than traffic to Wikipedia overall - is coming from India and Brazil.[2] One of the hypotheses was that this could be Zero traffic. I've done a basic analysis of the traffic, looking specifically at the zero headers,[3] and this hypothesis turns out to be incorrect - almost no zero traffic is hitting the portal. The traffic we're seeing from Brazil and India is not zero-based. This makes a lot of sense (the reason mobile traffic redirects to the enwiki home page from the portal is the Zero extension, so presumably this happens specifically to Zero traffic) but it does mean that our null hypothesis - that this traffic is down to ISP-level or device-level design choices and links - is more likely to be correct. 
[1] http://ironholds.org/misc/homepage_presentation.html [2] http://ironholds.org/misc/homepage_presentation.html#/11 [3] https://phabricator.wikimedia.org/T98076 -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
As I've now said...4 times, I don't think we'd be using geolocation. We'd be using the accept-language header. See https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Accept-Language On 7 May 2015 at 00:52, WereSpielChequers werespielchequ...@gmail.com wrote: When a reader comes to Wikipedia from the web we can detect their IP address and that usually geolocates them to a country. More often than not that then tells you the dominant language of that country. If we were to default to official or dominant languages then I predict endless arguments as to which language(s) should be the default in which countries. The large expat community in some parts of the Arab world might prefer English over Arabic. India would want to do things by state, and a whole new front would emerge in the Israeli Palestine debate. Regards Jonathan Cardy On 7 May 2015, at 05:06, Sam Katz smk...@gmail.com wrote: hey guys, you can't guess geolocation, because occasionally you'd be wrong. this happens to me all the time. I want to read a site in spanish... and then it thinks I'm in Latin America, when I'm not. --Sam On Wed, May 6, 2015 at 10:07 PM, Oliver Keyes oke...@wikimedia.org wrote: Possibly. But that sounds potentially wooly and sometimes inaccurate. When a browser makes a web request, it sends a header called the accept_language header (https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Accept-Language) which indicates what languages the browser finds ideal - i.e., what languages the user and system are using. If we're going to make modifications here (I hope we will. But again; early days) I don't see a good argument for using geolocation, which is, as you've noted, flawed without substantial time and energy being applied to map those countries to probable languages. The data the browser already sends to the server contains the /certain/ languages. We can just use that. On 6 May 2015 at 22:50, Stuart A. 
Yeates syea...@gmail.com wrote: This seems like a great place to use analytics data, for each division in the geo-location classification, rank each of the languages by usage and present the top N as likely candidates (+ browser settings) when we need the user to pick a language. cheers stuart -- ...let us be heard from red core to black sky On Thu, May 7, 2015 at 2:24 PM, Mark J. Nelson m...@anadrome.org wrote: Stuart A. Yeates syea...@gmail.com writes: Reading that excellent presentation, the thought that struck me was: If I wanted to subvert the assumption that Wikipedia == en.wiki, linking to http://www.wikipedia.org/ is what I'd do. A smarter http://www.wikipedia.org/ might guess geo-location and thus local languages. I'd also like to see something smarter done at the main page, but the "and thus" bit here is notoriously tricky. For example most geolocation-based things, like Wikidata by default, tend to produce funny results in Denmark. A Copenhagener is offered something like this choice, in order: * Danish, Greenlandic, Faroese, Swedish, German, ... The reasoning here is that Danish, Greenlandic, and Faroese are official languages of the Danish Realm, which includes both Denmark proper, and two autonomous territories, Greenland and the Faroe Islands. And then Sweden and Germany are the two neighboring countries. But for the average Copenhagener, the following order is far more likely: * Danish, English, Norwegian Bokmål, ... The reason here is that Norwegian Bokmål is very close to Danish in written form (more than Swedish is, and especially more than Faroese is) while English is a widely used semi-official language in business, government, and education (for example about half of university theses are now written in English, and several major companies use it as their official workplace language). I think it's possible to come up with something that better aligns with readers' actual preferences, but it's not easy! -Mark -- Mark J.
Nelson Anadrome Research http://www.kmjn.org -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
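Stuart's suggestion of combining analytics data with browser settings could look something like the sketch below. The function name and data shapes are invented for illustration; nothing here reflects a real Wikimedia API.

```python
def candidate_languages(country_usage, browser_langs, n=3):
    """Rank language candidates for a language picker: browser-declared
    languages first (the most reliable signal), then the top n languages
    by observed usage for the geolocated country. Purely illustrative."""
    by_usage = sorted(country_usage, key=country_usage.get, reverse=True)
    ranked = list(browser_langs)
    for lang in by_usage:
        if lang not in ranked:
            ranked.append(lang)
        if len(ranked) >= len(browser_langs) + n:
            break
    return ranked
```

With made-up usage figures for Denmark, a browser declaring ["da", "en"] would see those first, followed by the next-most-used languages for the country, which sidesteps the official-languages-only ordering Mark describes.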
[Wiki-research-l] [Announce] a new release of Pageviews data
Hey all, We've just released a count of pageviews to the English-language Wikipedia from 2015-03-16T00:00:00 to 2015-04-25T15:59:59, grouped by timestamp (down to a one-second resolution level) and site (mobile or desktop). The smallest number of events in a group is 645; because of this, we are confident there should not be privacy implications of releasing this data. We checked with legal first ;p. If you're interested in getting your mitts on it, you can find it at DataHub (http://datahub.io/dataset/english-wikipedia-pageviews-by-second) or FigShare (http://figshare.com/articles/English_Wikipedia_pageviews_by_second/1394684) -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
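For anyone wanting to work with the per-second dump, a short tallying sketch follows. The column names here (timestamp, site, pageviews) and the excerpt values are assumptions for illustration; check the actual file's header before relying on them.

```python
import csv
import io

# A hypothetical excerpt of the per-second dump; the layout is assumed,
# not taken from the real file.
sample = (
    "timestamp\tsite\tpageviews\n"
    "2015-03-16T00:00:01\tdesktop\t4500\n"
    "2015-03-16T00:00:01\tmobile\t1200\n"
    "2015-03-16T00:00:02\tdesktop\t4700\n"
)

def totals_by_site(tsv_text):
    """Sum pageviews per site (desktop/mobile) from a TSV excerpt."""
    totals = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        totals[row["site"]] = totals.get(row["site"], 0) + int(row["pageviews"])
    return totals
```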
Re: [Wiki-research-l] Wiki-research-l Digest, Vol 116, Issue 16
Message: 2 Date: Wed, 8 Apr 2015 11:19:10 + From: Flöck, Fabian fabian.flo...@gesis.org To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org Subject: Re: [Wiki-research-l] Research on Wikidata's content coverage Hi Oliver, from the top of my head, two on gender coverage: the one Max just sent around: http://ijoc.org/index.php/ijoc/article/view/777/631 and another one, with a different approach, but a similar goal: http://arxiv.org/abs/1501.06307 We had one on diversity that also has a small section about representativeness of the editor base, although it might not be exactly what you are looking for: http://journal.webscience.org/432/1/112_paper.pdf Gruß, Fabian On 07.04.2015, at 21:50, Oliver Keyes oke...@wikimedia.org wrote: Hey all, Is anyone aware of research on the completeness of Wikidata, in terms of coverage and systemic bias? This seems like the sort of thing Max Klein might know ;). Papers, blog posts, anything. -- Oliver Keyes Research Analyst Wikimedia Foundation Cheers, Fabian -- Fabian Flöck Research Associate Computational Social Science department @GESIS Unter Sachsenhausen 6-8, 50667 Cologne, Germany Tel: +49 (0) 221-47694-208 fabian.flo...@gesis.org www.gesis.org www.facebook.com/gesis.org
End of Wiki-research-l Digest, Vol 116, Issue 16 -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Research on Wikidata's content coverage
Thanks both! I'm specifically looking at Wikidata's coverage, rather than Wikipedia's - in other words, work done on deficiencies in the mapping of wikimedia content onto wikidata content. On 8 April 2015 at 07:19, Flöck, Fabian fabian.flo...@gesis.org wrote: Hi Oliver, from the top of my head, two on gender coverage: the one Max just sent around: http://ijoc.org/index.php/ijoc/article/view/777/631 and another one, with a different approach, but a similar goal: http://arxiv.org/abs/1501.06307 We had one on diversity that also has a small section about representativeness of the editor base, although it might not be exactly what you are looking for: http://journal.webscience.org/432/1/112_paper.pdf Gruß, Fabian On 07.04.2015, at 21:50, Oliver Keyes oke...@wikimedia.org wrote: Hey all, Is anyone aware of research on the completeness of Wikidata, in terms of coverage and systemic bias? This seems like the sort of thing Max Klein might know ;). Papers, blog posts, anything. -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l Cheers, Fabian -- Fabian Flöck Research Associate Computational Social Science department @GESIS Unter Sachsenhausen 6-8, 50667 Cologne, Germany Tel: + 49 (0) 221-47694-208 fabian.flo...@gesis.org www.gesis.org www.facebook.com/gesis.org ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Research on Wikidata's content coverage
Perfect; thank you! On 8 April 2015 at 09:53, Finn Årup Nielsen f...@imm.dtu.dk wrote: Dear Oliver, On 04/08/2015 03:38 PM, Oliver Keyes wrote: Thanks both! I'm specifically looking at Wikidata's coverage, rather than Wikipedia's - in other words, work done on deficiencies in the mapping of wikimedia content onto wikidata content. Oh, I didn't see it was Wikidata instead of Wikipedia. Wikipedia research and tools: Review and comments. http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/6012/pdf/imm6012.pdf contains pointers to the Max Klein/Piotr Konieczny studies and Magnus Manske's Mix’n’match (presently page 11). Magnus Manske has a blog post recently: http://magnusmanske.de/wordpress/?p=278 Sex and artists If I remember correctly wikidata-l had some discussion about that. Probably you know that already. best Finn Årup Nielsen On 8 April 2015 at 07:19, Flöck, Fabian fabian.flo...@gesis.org wrote: Hi Oliver, from the top of my head, two on gender coverage: the one Max just sent around: http://ijoc.org/index.php/ijoc/article/view/777/631 and another one, with a different approach, but a similar goal: http://arxiv.org/abs/1501.06307 We had one on diversity that also has a small section about representativeness of the editor base, although it might not be exactly what you are looking for: http://journal.webscience.org/432/1/112_paper.pdf Gruß, Fabian On 07.04.2015, at 21:50, Oliver Keyes oke...@wikimedia.org wrote: Hey all, Is anyone aware of research on the completeness of Wikidata, in terms of coverage and systemic bias? This seems like the sort of thing Max Klein might know ;). Papers, blog posts, anything.
-- Oliver Keyes Research Analyst Wikimedia Foundation Cheers, Fabian -- Fabian Flöck Research Associate Computational Social Science department @GESIS Unter Sachsenhausen 6-8, 50667 Cologne, Germany Tel: +49 (0) 221-47694-208 fabian.flo...@gesis.org www.gesis.org www.facebook.com/gesis.org -- Finn Årup Nielsen http://people.compute.dtu.dk/faan/ ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
[Wiki-research-l] Research on Wikidata's content coverage
Hey all, Is anyone aware of research on the completeness of Wikidata, in terms of coverage and systemic bias? This seems like the sort of thing Max Klein might know ;). Papers, blog posts, anything. -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] rc stream
Well, if you got all your learning out of the way in the first email, I'm really confused as to what you thought a backhanded "none of you are helping" would do 3 hours later. You asked an honest question, you got a very reasonable and perfectly friendly reply, and then decided, I guess, that the thread really wouldn't be complete without denigrating the people who were trying to help. I'd like to expect more from list subscribers than that. On 7 April 2015 at 16:50, Ed Summers e...@pobox.com wrote: Ok, my apologies if this is coming out garbled. Here’s a list of things I think I’ve learned as part of this discussion: 1) currently there is no plan to do away with the IRC stream //Ed On Apr 7, 2015, at 4:45 PM, Aaron Halfaker ahalfa...@wikimedia.org wrote: Oh! Well if you understood Yuvi right away, it seems that you *did* get a clear answer out of us all. On Tue, Apr 7, 2015 at 3:42 PM, Ed Summers e...@pobox.com wrote: On Apr 7, 2015, at 4:07 PM, Aaron Halfaker ahalfa...@wikimedia.org wrote: Really, RCStream is what the IRC feed ought to have been -- and probably would have been if those standards were available at the time of its construction. RCStream solves the same problem better. Actually, I understood the first time and I agree with this assessment. I still don’t find it to be a compelling reason to go and do the work. But it probably would’ve taken about as long as it has to try to get a clear answer out of you all :-) //Ed ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] rc stream
Thanks; that's most appreciated. For what it's worth, I'll try not to instinctively go at anyone who's mean to Yuvi.[0] [0] Only I get to be mean to Yuvi. Me and whoever was cruel enough to make him run Labs ;p On 7 April 2015 at 16:58, Ed Summers e...@pobox.com wrote: On Apr 7, 2015, at 4:54 PM, Oliver Keyes oke...@wikimedia.org wrote: Well, if you got all your learning out of the way in the first email, I'm really confused as to what you thought a backhanded none of you are helping would do 3 hours later. You asked an honest question, you got a very reasonable and perfectly friendly reply, and then decided, I guess, that the thread really wouldn't be complete without denigrating the people who were trying to help. I'd like to expect more from list subscribers than that. I apologize if what I said was denigrating. I can see how it could’ve been seen that way, and I regret it. I sincerely appreciate the help that has been offered. And you’re right to expect more of list subscribers. I’ll do better. //Ed ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Wikipedia search logs needed
+1. Valerio, I assume you're a researcher familiar with anonymisation; you should cast your eye over the AOL search log debacle. The only way to completely sanitise the logs is to remove all the query strings. Simon, it sounds like performing this kind of sanitisation would undermine the work you're doing, which is unfortunate :(. However, if you can make a strong pitch for this being of value to the Wikimedia communit(y|ies) I would encourage you to make that pitch; maybe we can look into an NDA! You might want to check out https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_edits for an example of an anonymisation proposal Shilad submitted to accompany a request for dataset releases (mind you, I generally think everyone should read everything Shilad writes, but that's beside the point ;p) On 3 April 2015 at 11:03, Aaron Halfaker aaron.halfa...@gmail.com wrote: It turns out that anonymization is hard (see [1,2,3]). A quick web search would have made that clear. We do sometimes provide researchers with NDAs for the purposes of anonymizing data. Again, we have limited time and energy, so such NDAs have been (1) limited to work that is immediately relevant to our own and (2) exist for the purpose of anonymizing and making the data public -- so that everyone can benefit. For example, see a project aimed at releasing anonymized view logs[4]. That proposal has been in process for more than a year though because legal agreements with national research labs are Hard. It seems like search logs are a candidate for that process, but we'd need to see an anonymization proposal before moving forward. 1. https://en.wikipedia.org/wiki/K-anonymity 2. https://en.wikipedia.org/wiki/L-diversity 3. https://en.wikipedia.org/wiki/T-closeness 4.
https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pageviews -Aaron On Fri, Apr 3, 2015 at 9:47 AM, Valerio Schiavoni valerio.schiav...@gmail.com wrote: Those logs could have been cleaned up further and re-released, especially since the privacy issues had an impact only on a small percentage of queries. Frankly, it's a pity that after the initial announcement they had to quickly retract. Nonetheless Wikimedia could release them for research purposes, asking interested users to sign an NDA or such. I would be very surprised to discover that in 2015 there are no means to properly anonymize datasets and release them to the public. best, Valerio On Fri, Apr 3, 2015 at 4:38 PM, Aaron Halfaker aaron.halfa...@gmail.com wrote: Maybe someone managed to grab those logs before they took them offline. If they did, I hope they won't share. They were taken offline due to privacy issues. From the blog post: We’ve temporarily taken down this data to make additional improvements to the anonymization protocol related to the search queries. -Aaron On Fri, Apr 3, 2015 at 9:32 AM, Valerio Schiavoni valerio.schiav...@gmail.com wrote: There has been at least one attempt to release such data: http://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia-search-data-now-available/ Maybe someone managed to grab those logs before they took them offline. Similar but older logs are available here: http://www.wikibench.eu/ best, valerio On Fri, Apr 3, 2015 at 4:09 PM, Pine W wiki.p...@gmail.com wrote: Hi Oliver, Do we even record search logs? It might be a good idea if we didn't. Pine On Apr 3, 2015 6:16 AM, Simon Givoli givo...@gmail.com wrote: Thanks Oliver, Sorry if I wasn't clear enough. My dissertation will involve consented participants. Their search logs will be recorded while searching Wikipedia. The search logs will then be analyzed in order to find recurrent search patterns across participants.
Before beginning the experiment, I want to check that I can indeed find patterns in search logs, using several different algorithms. The idea is to check these algorithms on Wikipedia search logs already available. Hence my request. Simon Message: 5 Date: Fri, 3 Apr 2015 09:37:33 +0300 From: Simon Givoli givo...@gmail.com To: wiki-research-l@lists.wikimedia.org Subject: [Wiki-research-l] Wikipedia search logs needed Hi, I'm looking for a dump or db of Wikipedia users' search logs. I would like it to be with recent data, but it doesn't have to be extensive; even a small sample size would be sufficient. I aim to use this db to test a new research tool I'm developing for my dissertation. Can anyone point me to a relevant source? Thanks, Simon -- Message: 6 Date: Fri, 3 Apr 2015 02:47:54 -0400 From: Oliver Keyes
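For readers unfamiliar with the k-anonymity property Aaron links above, a toy check illustrates the idea. This is a sketch of the definition only, not an anonymisation pipeline, and the field names are invented.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at
    least k records -- the core of the k-anonymity property. A toy check
    for intuition; real anonymisation also needs generalisation and
    suppression steps, plus properties like l-diversity."""
    combos = Counter(
        tuple(record[qi] for qi in quasi_identifiers) for record in records
    )
    return all(count >= k for count in combos.values())
```

A single record with a unique (country, language) pair is enough to break 2-anonymity, which is exactly why raw query strings are so hard to release safely.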
Re: [Wiki-research-l] Anyone have access to this article?
I have to say that a WMF staffer using their official WMF account to imply they're doing legitimate work to study, understand and improve community dynamics is not a good look Cheers, Oliver On 1 April 2015 at 18:02, Jonathan Morgan jmor...@wikimedia.org wrote: Well, I was going to print out a bunch of copies and then sell them down on the corner, but I guess now I'll just use it to inform the development of a coding scheme for rating civility in Wikipedia talkpage comments. - J On Wed, Apr 1, 2015 at 3:00 PM, Stuart A. Yeates syea...@gmail.com wrote: I think you mean might have been permissible, if the original request had included the intended use. cheers stuart -- ...let us be heard from red core to black sky On Thu, Apr 2, 2015 at 10:57 AM, Nicole Askin nask...@alumni.uwo.ca wrote: Stuart, this is permissible per Wiley's terms of use - Authorized Users may also transmit such material to a third-party colleague in hard copy or electronically for personal use or scholarly, educational, or scientific research or professional use. Nicole On Wed, Apr 1, 2015 at 2:52 PM, Stuart A. Yeates syea...@gmail.com wrote: I have to say that a WMF staffer using their official WMF account to ask community members to commit copyright infringement is not a good look. cheers stuart -- ...let us be heard from red core to black sky On Thu, Apr 2, 2015 at 10:48 AM, Jonathan Morgan jmor...@wikimedia.org wrote: http://onlinelibrary.wiley.com/doi/10./jcom.12123/abstract What Creates Interactivity in Online News Discussions? An Exploratory Analysis of Discussion Factors in User Comments on News Items If you have access, and can send me a PDF offline, I would be very grateful :) Cheers, Jonathan -- Jonathan T. 
Morgan Community Research Lead Wikimedia Foundation User:Jmorgan (WMF) jmor...@wikimedia.org -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] [Technical][Request for Comment] A new format for the pageview dumps
Thanks all for the awesome comments :). Will get to it tomorrow morning![1] [1] East coast time. On 19 March 2015 at 20:37, aaron shaw aarons...@northwestern.edu wrote: Adding to Giovanni's points (all of which I agree with 100%): - This would be awesome! The pageviews are super useful for many of us and cleaning them up a bit would save a lot of redundant work for many of us down the road. - If you don't have to collapse page views incoming from mobile and zero, I would recommend keeping them separate. That said, I haven't spent any time looking into it, and so I confess complete ignorance on this front. - I agree with you that page ids are better than titles. Great idea. - I don't think the byte information is/was useful in this dataset, so I agree with dumping that. - Backfill would be totally great. Happy to chat more if it seems helpful... a On Thu, Mar 19, 2015 at 7:13 PM, Giovanni Luca Ciampaglia gciam...@indiana.edu wrote: Hi Oliver, Tab-separation would be welcomed. Title normalisation would be *very* useful too. Another thing that could potentially save a lot of space would be to throw out all malformed requests, pieces of javascript, and similar junk. Not sure how difficult that would be though, without doing an actual query on the DB for the page id.
For example, an excerpt from 20140101-00.gz (with only the title and views fields): 'اÙ�ØاÙ�Â_Ù�شباب'_Â_Ù�Ù�اطعÂ_Ù�ضØÙ�ة 1 '/javascript:document.location.href='/'_encodeURIComponent(document.getElementById('txt_input_text').value) 9 '03_Bonnie__Clyde 18 A_Night_at_the_Opera_(Queen_album) 57 '40s_on_4 2 '50s_on_5 1 '71_(film) 4 '74_Jailbreak 3 '77 1 '79-00_é�å�¶å�©åºÃ¯Â¿Â½å_±é��vol.8_ACå�¬å�±åºÃ¯Â¿Â½å��æ©Ã¯Â¿Â½æ§Ã¯Â¿Â½ 1 Cheers, G Giovanni Luca Ciampaglia ✎ 919 E 10th ∙ Bloomington 47408 IN ∙ USA ☞ http://www.glciampaglia.com/ ✆ +1 812 855-7261 ✉ gciam...@indiana.edu 2015-03-13 12:06 GMT-07:00 Oliver Keyes oke...@wikimedia.org: So, we've got a new pageviews definition; it's nicely integrated and spitting out TRUE/FALSE values on each row with the best of em. But what does that mean for third-party researchers? Well...not much, at the moment, because the data isn't being released somewhere. But one resource we do have that third-parties use a heck of a lot, is the per-page pageviews dumps on dumps.wikimedia.org. Due to historical size constrains and decision-making (and by historical I mean: last decade) these have a number of weirdnesses in formatting terms; project identification is done using a notation style not really used anywhere else, mobile/zero/desktop appear on different lines, and the files are space-separated. I'd like to put some volunteer time into spitting out dumps in an easier-to-work-with format, using the new definition, to run in /parallel/ with the existing logs. *The new format* At the moment we have the format: project_notation - encoded_title - pageviews - bytes This puts zero and mobile requests to pageX in a different place to desktop requests, requires some reconstruction of project_notation, and contains (for some use cases) extraneous information - that being the byte-count. The files are also headerless, unquoted and space-separated, which saves space but is sometimes...I think the term is h-inducing. 
What I'd like to use as a new format is: full_project_url - encoded_title - desktop_pageviews - mobile_and_zero_pageviews This file would: 1. Include a header row; 2. Be formatted as a tab-separated, rather than space-separated, file; 3. Exclude bytecounts; 4. Include desktop and mobile pageview counts on the same line; 5. Use the full project URL (en.wikivoyage.org) instead of the pagecounts-specific notation (en.v) So, as a made-up example, instead of: de.m.v Florence 32 9024 de.v Florence 920 7570 we'd end up with: de.wikivoyage.org Florence 920 32 In the future we could also work to /normalise/ the title - replacing it with the page title that refers to the actual pageID. This won't impact legacy files, and is currently blocked on the Apps team, but should be viable as soon as that blocker goes away. I've written a script capable of parsing and reformatting the legacy files, so we should be able to backfill in this new format too, if that's wanted (see below). *The size constraints* There really aren't any. Like I said, the historical rationale for a lot of these decisions seems to have been keeping the files small. But by putting requests to the same title from different site versions on the same line, and dropping byte-count, we save enough space that the resulting files are approximately the same size as the old ones - or in many cases, actually smaller. *What I'm asking for* Feedback! What do people think of the new format? What would they like to see that they don't? What don't they need, here? How
Re: [Wiki-research-l] (no subject)
Awesome work! It's interesting to see Finnish as the outlier here. Do we have any fi-users on the list who can comment on this and might know what's going on? (And, in the absence of Finns: Jan, heard anything from across the border? :p) The only caution I'd raise is that these numbers don't include spider filtering. Why is this important? Well, a lot of traffic is driven by crawlers and spiders and automata, particularly on smaller projects, and it can lead to weirdness as a result. With the granular pagecount files there's some work that can be done to detect this (for example, using burst detection and a few heuristics around concentration measures to eliminate pages that are clearly driven by automated traffic - see the recent analytics mailing list thread) but only some. I appreciate this is a flaw in the data we are releasing, not in your work, which is an excellent read and highly interesting :). I agree that understanding the lack of development in the PRC and ROK is crucial - we keep talking about the next billion readers but only talking :( On 16 March 2015 at 02:21, h hant...@gmail.com wrote: Dear all, I have some findings to show that the page views per Internet user measurement may help in comparing different language editions of Wikipedia. Criticism and suggestions are welcome. - http://people.oii.ox.ac.uk/hanteng/2015/03/15/comparing-language-development-in-wikipedia-in-terms-of-page-views-per-internet-users/ Which language version of Wikipedia enjoys more page views per language Internet user than expected? It is Finnish. In terms of absolute positive and negative gap, English has the widest positive gap whereas Chinese has the largest negative gap. .. In particular, it is known that Wikipedia (and Google, which often favours Wikipedia) faces local competition in the People's Republic of China and South Korea.
Therefore it is understandable that the page views may be lower in Chinese and Korean Wikipedia language projects, simply because some users' need to read user-generated encyclopedias is satisfied by other websites. However, it remains an important question to examine why these particular Latin and Asian languages are under-developed for Wikipedia projects. -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
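The page-views-per-Internet-user comparison hanteng describes comes down to a ratio and its gap from the cross-language average. A few lines show the arithmetic; the figures used below are invented, purely for illustration.

```python
def pageview_gap(pageviews, internet_users):
    """For each language, pageviews per Internet user, plus that rate's
    gap from the all-language average rate. Positive gaps mean more
    views per user than expected; negative gaps mean fewer."""
    average = sum(pageviews.values()) / sum(internet_users.values())
    return {
        lang: (views / internet_users[lang],
               views / internet_users[lang] - average)
        for lang, views in pageviews.items()
    }
```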
[Wiki-research-l] [Technical][Request for Comment] A new format for the pageview dumps
So, we've got a new pageviews definition; it's nicely integrated and spitting out TRUE/FALSE values on each row with the best of 'em. But what does that mean for third-party researchers? Well...not much, at the moment, because the data isn't being released anywhere. But one resource we do have that third parties use a heck of a lot is the per-page pageview dumps on dumps.wikimedia.org. Due to historical size constraints and decision-making (and by historical I mean: last decade) these have a number of weirdnesses in formatting terms: project identification is done using a notation style not really used anywhere else, mobile/zero/desktop appear on different lines, and the files are space-separated. I'd like to put some volunteer time into spitting out dumps in an easier-to-work-with format, using the new definition, to run in /parallel/ with the existing logs.

*The new format*

At the moment we have the format: project_notation - encoded_title - pageviews - bytes. This puts zero and mobile requests to pageX in a different place to desktop requests, requires some reconstruction of project_notation, and contains (for some use cases) extraneous information - that being the byte count. The files are also headerless, unquoted and space-separated, which saves space but is sometimes...I think the term is headache-inducing. What I'd like to use as a new format is: full_project_url - encoded_title - desktop_pageviews - mobile_and_zero_pageviews. This file would: 1. Include a header row; 2. Be formatted as a tab-separated, rather than space-separated, file; 3. Exclude byte counts; 4. Include desktop and mobile pageview counts on the same line; 5.
Use the full project URL (en.wikivoyage.org) instead of the pagecounts-specific notation (en.v). So, as a made-up example, instead of:

de.m.v Florence 32 9024
de.v Florence 920 7570

we'd end up with:

de.wikivoyage.org Florence 920 32

In the future we could also work to /normalise/ the title - replacing it with the page title that refers to the actual pageID. This won't impact legacy files, and is currently blocked on the Apps team, but should be viable as soon as that blocker goes away. I've written a script capable of parsing and reformatting the legacy files, so we should be able to backfill in this new format too, if that's wanted (see below).

*The size constraints*

There really aren't any. Like I said, the historical rationale for a lot of these decisions seems to have been keeping the files small. But by putting requests to the same title from different site versions on the same line, and dropping the byte count, we save enough space that the resulting files are approximately the same size as the old ones - or in many cases, actually smaller.

*What I'm asking for*

Feedback! What do people think of the new format? What would they like to see that they don't? What don't they need here? How useful would normalisation be? How useful would backfilling be?

*What I'm not asking for*

WMF time! Like I said, this is a spare-time project; I've also got volunteers for code review and checking (Yuvi and Otto). The replacement of the old files! Too many people depend on that format and that definition, and I don't want to make them sad.

Thoughts? -- Oliver Keyes Research Analyst Wikimedia Foundation
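A sketch of what such a reformatting pass might look like, reimplemented from the format description above (this is not Oliver's actual script, and the notation map covers only the suffixes used in the example):

```python
# Illustrative converter from the legacy pagecounts format
# ("notation title views bytes", space-separated) to the proposed
# "url<TAB>title<TAB>desktop<TAB>mobile_and_zero" format. The real
# pagecounts notation has more project suffixes than mapped here.
from collections import defaultdict

SUFFIX_TO_SITE = {"v": "wikivoyage.org", "": "wikipedia.org"}

def parse_notation(notation):
    """Split legacy notation (e.g. 'de.m.v') into (full URL, is_mobile)."""
    parts = notation.split(".")
    mobile = "m" in parts[1:] or "zero" in parts[1:]
    suffix = next((p for p in parts[1:] if p not in ("m", "zero")), "")
    return "%s.%s" % (parts[0], SUFFIX_TO_SITE[suffix]), mobile

def reformat(legacy_lines):
    """Merge desktop and mobile/zero rows for the same title into one row."""
    counts = defaultdict(lambda: [0, 0])  # (url, title) -> [desktop, mobile]
    for line in legacy_lines:
        notation, title, views, _bytes = line.split(" ")
        url, mobile = parse_notation(notation)
        counts[(url, title)][1 if mobile else 0] += int(views)
    out = ["full_project_url\tencoded_title\tdesktop_pageviews\tmobile_and_zero_pageviews"]
    for (url, title), (desktop, mobile) in sorted(counts.items()):
        out.append("%s\t%s\t%d\t%d" % (url, title, desktop, mobile))
    return out

# The two example rows collapse into one tab-separated line:
# de.wikivoyage.org<TAB>Florence<TAB>920<TAB>32
lines = reformat(["de.m.v Florence 32 9024", "de.v Florence 920 7570"])
```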
Re: [Wiki-research-l] [Release]
That is the question, and I agree with your conclusion. I'm hoping to do more research into this; getting buy-in internally has been tough, but I'm confident of making progress on that front over the next few weeks and months. On 4 March 2015 at 04:13, Cristian Consonni kikkocrist...@gmail.com wrote: 2015-03-04 8:44 GMT+01:00 Dario Taraborelli dtarabore...@wikimedia.org: yay, shiny! The map is a pretty compelling way to show how dominant traffic from the US is, even for very minor languages (say bi.wikipedia.org), I wonder how many requests from US-based bots/automata we’re still failing to detect. Still, the question could be: are we fulfilling the mission? (hint: probably not) Cristian -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] [Release]
On 4 March 2015 at 04:28, Pine W wiki.p...@gmail.com wrote: I'm not sure how much influence I have, but I would be happy to make whispers in appropriate places to try to get more support, if that's helpful. I think I'm probably good, but thank you. Perhaps you could show your work at the next Research and Data showcase? I for one would be interested in seeing a presentation. That's in 3 weeks; I'm not convinced that a piece of substantive, useful research about global reach could be done in that time period even if I could drop everything I currently have (which I can't). This problem is too big and too important to be scheduled around meetings; things should work the other way around. Scott Hale and I have been working on a paper looking at global reach and how it tracks with internet access growth, in the context of editing, particularly looking at the mobile web. That, we should be done with by then; presenting it could be highly useful (Scott? ;p) Pine This is an Encyclopedia One gateway to the wide garden of knowledge, where lies The deep rock of our past, in which we must delve The well of our future, The clear water we must leave untainted for those who come after us, The fertile earth, in which truth may grow in bright places, tended by many hands, And the broad fall of sunshine, warming our first steps toward knowing how much we do not know. —Catherine Munro -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] [Release]
'Lots, but that's not currently anyone's job' On Wednesday, 4 March 2015, Dario Taraborelli dtarabore...@wikimedia.org wrote: yay, shiny! The map is a pretty compelling way to show how dominant traffic from the US is, even for very minor languages (say bi.wikipedia.org), I wonder how many requests from US-based bots/automata we’re still failing to detect.
Re: [Wiki-research-l] [Release]
Update: the original Shiny instance went down due to server load soon after release. It's now up again at http://datavis.wmflabs.org/where/ on a dedicated Labs machine, where we hope to put...many more visualisations. It also now has mapping, largely thanks to Sarah Laplante (http://sarahlaplante.com/), and soon it will hopefully be /non-hideous/ mapping (the current mass of blue and grey is because my aesthetic tastes are...I don't actually have any aesthetic tastes) On 2 March 2015 at 22:36, Oliver Keyes oke...@wikimedia.org wrote: Indeed! Orienting it that way (pivoting on language rather than project) is something several people have asked for; I plan to spend a chunk of my spare time (that is, recreational time) trying to make it work. Should be fairly trivial. -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] [Release]
Indeed! Orienting it that way (pivoting on language rather than project) is something several people have asked for; I plan to spend a chunk of my spare time (that is, recreational time) trying to make it work. Should be fairly trivial. On 2 March 2015 at 09:55, h hant...@gmail.com wrote: Hello Finn, I do not have a specific answer to your question. However, it might be worthwhile to add Finnish into the comparison, as according to the CLDR 26 territory-language information http://www.unicode.org/cldr/charts/26/supplemental/territory_language_information.html there is a sizable number of Finnish speakers in Sweden: Swedish {O} sv 95.0% 99.0%; Finnish {OR} fi 2.2%. So if a similar query is executed on Finnish, and the results also show some undue proportion of visits from Sweden, then what you observed as an anomaly is not that unique. We probably need many iterations of comparative outcomes and normalization of data (Sweden does have a higher population). Also, it might be handy to have some statistics on immigration or residence, since this is the EU. I would not be surprised if, for example, the visits to the Wikipedia website from Oxford included sizable German-language requests. I am still a bit bothered by the number 1 in the current dataset. It does not feel right, since a difference between 1.4% and 0.6% is notable in this regard. Perhaps we need some high-precision universal percentage number for each territory-language pair. It would also be great to do another set of aggregation: i.e. given a territory, which language versions of Wikipedia are accessed. Best, han-teng liao 2015-03-02 13:54 GMT+01:00 Finn Årup Nielsen f...@imm.dtu.dk: Hi Oliver, Interesting dataset! I am curious about why the Danish Wikipedia is so highly accessed from Sweden. Could it be an error, e.g., with Telia IP-numbers?
In Python:

import pandas as pd
df = pd.read_csv('http://files.figshare.com/1923822/language_pageviews_per_country.tsv', sep='\t')
df.ix[df.project == 'da.wikipedia.org', ['country', 'pageviews_percentage']].set_index('country')

                pageviews_percentage
country
Austria         1
China           1
Denmark         61
Estonia         1
France          1
Germany         2
Netherlands     2
Norway          1
Sweden          18
United Kingdom  3
United States   3
Other           5

MaxMind has some numbers on their own accuracy: https://www.maxmind.com/en/geoip2-city-database-accuracy For Denmark 85% is Correctly Resolved, for Sweden only 68%. I wonder if this really could bias the result so much. If the numbers are correct, why would the Swedes read the Danish Wikipedia so much? Bots? It does not apply the other way around: only 2% of the traffic to Swedish Wikipedia comes from Denmark. best regards Finn On 02/25/2015 10:06 PM, Oliver Keyes wrote: Hey all! We've released a highly-aggregated dataset of readership data - specifically, data about where, geographically, traffic to each of our projects (and all of our projects) comes from. The data can be found at http://dx.doi.org/10.6084/m9.figshare.1317408 - additionally, I've put together an exploration tool for it at https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/ Hope it's useful to people! -- Finn Årup Nielsen http://people.compute.dtu.dk/faan/ -- Oliver Keyes Research Analyst Wikimedia Foundation
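The inverse aggregation han-teng asks for (given a territory, which language versions are accessed) can be sketched against the same dataset. The toy frame below mirrors the columns used in Finn's snippet (project, country, pageviews_percentage) with only the percentages quoted in this thread; it is an illustration, not the full released data:

```python
import pandas as pd

# Toy frame with the columns of the released TSV, using only numbers
# quoted in the thread: 61% of da.wp traffic from Denmark, 18% from
# Sweden, and 2% of sv.wp traffic from Denmark.
df = pd.DataFrame({
    "project": ["da.wikipedia.org", "da.wikipedia.org", "sv.wikipedia.org"],
    "country": ["Denmark", "Sweden", "Denmark"],
    "pageviews_percentage": [61, 18, 2],
})

def languages_for_country(df, country):
    """Given a territory, rank the language versions its traffic goes to."""
    subset = df[df["country"] == country]
    return (subset.set_index("project")["pageviews_percentage"]
                  .sort_values(ascending=False))

sweden = languages_for_country(df, "Sweden")
```

Note that the released percentages are computed per project, not per country, so a full per-territory view would need the underlying counts rather than these row percentages.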
[Wiki-research-l] [Release]
Hey all! We've released a highly-aggregated dataset of readership data - specifically, data about where, geographically, traffic to each of our projects (and all of our projects) comes from. The data can be found at http://dx.doi.org/10.6084/m9.figshare.1317408 - additionally, I've put together an exploration tool for it at https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/ Hope it's useful to people! -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] [Release]
The one major caveat, I think, is that the danger of proportionate data is that it makes small projects very vulnerable to artificial traffic spikes. I'd go out on a limb and say that some of the massive bumps in popularity we see in particular combinations are likely due to either undetected automata or simply a project having so little traffic that a small number of people can sway the results outlandishly. On 25 February 2015 at 16:32, Andrew Lih andrew@gmail.com wrote: Great job. Who knew Esperanto was big in Japan and China at #2 and #3? -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] [Analytics] [Release]
Totally! I'm also going to get together with some NEU hackers tomorrow and work on actually visualising the data on *drumroll* maps, which'd probably be more interesting eye candy than infinite bar plots :) On 25 February 2015 at 16:19, Pine W wiki.p...@gmail.com wrote: Very nice. Do you think that you could pick out a few of your favorite graphs and add them to this week's Recent Research report in a gallery? Thanks! Pine -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] [Analytics] [Release]
Yours is looking at just December, while mine is looking at the entire year, for starters. Also, what's the apps/mobile web inclusion for that report? On 25 February 2015 at 17:34, Erik Zachte ezac...@wikimedia.org wrote: I am surprised that the new data, with crawlers excluded, show more wp:en traffic from US (43%) than the old data (36.4% for 2014), which contained much crawler traffic, presumably most of that from US. Compare https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/ and http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerLanguageBreakdown.htm Any thoughts? Erik -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] Scholarly citations by DOI in Wikipedia
Sweet! Can I ask that we make the 2% explicitly available to wiki gnomes? :) On Monday, 9 February 2015, Aaron Halfaker ahalfa...@wikimedia.org wrote: Hey folks, Dario and I just updated the scholarly citations dataset to include Digital Object Identifiers. We found 742k citations (524k unique DOIs) in 172k articles. Our spot checking suggests that 98% of these DOIs resolve. The remaining 2% were extracted correctly, but they appear to be typos. http://dx.doi.org/10.6084/m9.figshare.1299540 Like the dataset that we released for PubMed Identifiers, this dataset includes the first known occurrence of a DOI citation in an English Wikipedia article and the associated revision metadata, based on the most recent complete content dump of English Wikipedia. Feel free to share this with anyone interested via: https://twitter.com/WikiResearch/status/564908585008627712 We'll be organizing our own work and analysis of these citations here: https://meta.wikimedia.org/wiki/Research:Scholarly_article_citations_in_Wikipedia -Aaron -- Sent from my mobile computing device of Lovecraftian complexity and horror.
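The kind of extraction Aaron describes (finding DOI citations in article wikitext) can be sketched with a simple regex; this pattern is an illustration of the general approach, not the dataset's actual extractor:

```python
import re

# Illustrative DOI pattern: a "10." prefix, a 4-9 digit registrant code,
# a slash, then a suffix that stops at whitespace or wiki/HTML markup.
# Trailing sentence punctuation is trimmed afterwards.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s|}<>"]+')

def extract_dois(wikitext):
    """Return the unique DOIs found in a chunk of wikitext, in order."""
    found = []
    for match in DOI_RE.findall(wikitext):
        doi = match.rstrip(".,;")
        if doi not in found:
            found.append(doi)
    return found

sample = "{{cite journal | doi = 10.6084/m9.figshare.1299540 }}"
dois = extract_dois(sample)  # -> ["10.6084/m9.figshare.1299540"]
```

A real pipeline would also need to resolve each candidate (e.g. against doi.org) to separate valid identifiers from the ~2% of typos Aaron mentions.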
Re: [Wiki-research-l] Altmetric.com now tracks Wikipedia citations
We should be! We can sync templatelinks or externallinks entry timestamps with revision table timestamps. Sounds like a fun project! On 6 February 2015 at 16:15, Kerry Raymond kerry.raym...@gmail.com wrote: I agree it’s a good thing overall. I’m just alerting us to the potential problem it might create. I note it might not just be the academics themselves. In Australia at least, institutional research rankings are heavily based on citation counts. Our “Excellence in Research Assessment” (ERA) process creates massive institutional pressure to track down every possible citation, much of which is done by the library and admin teams. And with Wikipedia, it’s so easy to create a new citation … it’s hard to believe some people won’t be tempted … Are we able to extract the user names associated with adding links to academic papers? Some time downstream analysis of that data might be interesting, especially if there do appear to be clusters of cited papers with common author names added by the same user name or IP address. There’s almost certainly a publication in that! J But as I say, so long as the papers are actually relevant where they are cited in Wikipedia, this is not a bad thing for Wikipedia if academics do decide to promote their work that way. Kerry From: wiki-research-l-boun...@lists.wikimedia.org [mailto:wiki-research-l-boun...@lists.wikimedia.org] On Behalf Of Aaron Halfaker Sent: Saturday, 7 February 2015 1:25 AM To: Research into Wikimedia content and communities Subject: Re: [Wiki-research-l] Altmetric.com now tracks Wikipedia citations I agree that we could be doing something interesting with the social dynamics of Wikipedia editing by releasing this dataset -- and that some new problems may result. However, I think that it's much better to have too much academic interest than not enough. With a little AGF and diligence, we ought to be able to deal with this problem like we've dealt with quality control concerns in the past.
Academics have to be very careful about their reputation, and it's hard to cite your own work unnecessarily without giving up who you are, since your name's going to be on the paper. Either way, this is a useful dataset for library sciences work and it's public anyway. We're just making it easier to work with. Honestly, that's how I got started working in this space -- helping someone get data for their own research. -Aaron On Fri, Feb 6, 2015 at 5:50 AM, mjn m...@anadrome.org wrote: I agree it's not a new worry, but it might change the nature of the problem a bit, and is worth at least being vigilant about. I did have a similar idea some years ago, to compute an impact factor for being-cited-on-Wikipedia, but after discussing it with some colleagues, didn't do so specifically because of the worry that it would encourage more gaming of Wikipedia citations. Of course it's inevitable that someone would eventually do it, but I still think it was probably right on balance to not push that date forward. Regarding the SEO analogy, the external links on Wikipedia are on average not the best part of Wikipedia, so it's not a very heartening comparison. The citations for now are not nearly as spammy as the external links are, and I hope it stays that way! It's of course not new that there is an incentive to spam citations. Even without explicit Wikipedia-citation-tracking, there are incentives to spam marginally relevant citations in order to increase perceived prominence. Maybe being in a Wikipedia article will get your paper in front of more grad students who will end up citing it for real after encountering it on Wikipedia, etc. A direct citation count feels like it's likely to exacerbate that, since now removing an irrelevant citation to someone's article is a direct attack on their metrics! Though it's possible the actual effect on editing patterns will be small.
From a research perspective, the new datasets of citations might be interesting to track over time, and correlate back to editors, to see if there are any interesting patterns. -Mark -- mjn | http://www.anadrome.org
Re: [Wiki-research-l] Altmetric.com now tracks Wikipedia citations
It also requires SEO people to demonstrate a modicum of logical reasoning skills. Sadly, from my work on understanding our traffic trends, this appears to be beyond at least some of them. On Friday, 6 February 2015, Laura Hale la...@fanhistory.com wrote: That's actually a Wikipedia thing: external links are emitted with rel="nofollow" class="external text" in the article source code, while internal links in contrast carry class="internal". That's not a Google goodwill thing. Sincerely, Laura Hale On Fri, Feb 6, 2015 at 9:44 PM, Kerry Raymond kerry.raym...@gmail.com wrote: I thought that Wikipedia addressed the SEO problem by getting Google to not follow the off-wiki links when crawling, so that Wikipedia's PageRank would not flow through to off-Wikipedia links. But I cannot (using Google) find the page where I read that. While that doesn't prevent people from spamming Wikipedia with external links to catch people's eyeballs while reading Wikipedia, it should address the SEO problem somewhat. Kerry ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- twitter: purplepopple -- Sent from my mobile computing device of Lovecraftian complexity and horror. ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
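Laura's point about the markup can be checked mechanically. A minimal illustration (the sample HTML below is invented, only shaped like the two link styles she describes — it is not MediaWiki's exact output):

```python
from html.parser import HTMLParser

# Simplified versions of the two link markups described above: external
# links carry rel="nofollow" so crawlers do not pass PageRank through
# them; internal links carry no such hint.
SAMPLE = (
    '<a rel="nofollow" class="external text" href="https://example.org">ext</a>'
    '<a href="/wiki/Example" class="internal">int</a>'
)

class LinkAudit(HTMLParser):
    """Collect (href, has-nofollow) pairs from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            d = dict(attrs)
            self.links.append((d.get("href"), "nofollow" in d.get("rel", "")))

parser = LinkAudit()
parser.feed(SAMPLE)
for href, nofollow in parser.links:
    print(href, nofollow)
```

Run against real article HTML, a scan like this would show every off-wiki link flagged nofollow, which is the mechanism that blunts the SEO incentive Kerry asks about.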
Re: [Wiki-research-l] Altmetric.com now tracks Wikipedia citations
And SEO spammers will add themselves, too! This is not a new problem. On Thursday, 5 February 2015, Kerry Raymond kerry.raym...@gmail.com wrote: Do I understand this correctly? That Wikipedia articles that cite academic publications will be included in citation counts now (at least for altmetrics). While that’s great recognition for Wikipedia as a corpus of scholarly work, does that mean Wikipedia will be overrun with academic authors adding citations to their academic papers in any Wikipedia article they can get away with in order to improve their citation counts for their CVs? I note that generally we can spot self-citation because the two papers will have an author name in common, but the ability to edit Wikipedia anonymously and pseudonymously means that we cannot spot self-citation. While judging research purely on citation counts is a deeply flawed method of assessment, nonetheless it is a reality, and the pressure on folks to “game” the system is tremendous given the role it can play in appointment, tenure, promotion and grant applications. On the positive side, we might be able to get rid of a lot of citation-needed tags.
Kerry -- *From:* wiki-research-l-boun...@lists.wikimedia.org [mailto:wiki-research-l-boun...@lists.wikimedia.org] *On Behalf Of *Pine W *Sent:* Friday, 6 February 2015 8:13 AM *To:* Wiki Research-l; Raymond Leonard; Wikimedia GLAM collaboration [Public]; North American Cultural Partnerships *Subject:* [Wiki-research-l] Altmetric.com now tracks Wikipedia citations FYI: http://www.altmetric.com/blog/new-source-alert-wikipedia/ Pine This is an Encyclopedia https://www.wikipedia.org/ * One gateway to the wide garden of knowledge, where lies The deep rock of our past, in which we must delve The well of our future, The clear water we must leave untainted for those who come after us, The fertile earth, in which truth may grow in bright places, tended by many hands, And the broad fall of sunshine, warming our first steps toward knowing how much we do not know. —Catherine Munro * -- Sent from my mobile computing device of Lovecraftian complexity and horror. ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] R library for URL handling
Also, version 1.2.0 of the R MW API client library - https://github.com/Ironholds/WikipediR (what can I say, being semi-bedridden makes me a productive little gnome) On 21 January 2015 at 18:47, Oliver Keyes oke...@wikimedia.org wrote: Possibly of interest to any researchers who work with our pageview/requests data: I've just released v1.0.0 of urltools,[0] a library that provides very, very fast vectorised URL decoding and parsing. Might be useful for the useRs in our community! See the associated vignette for functionality.[1] [0] https://github.com/Ironholds/urltools [1] https://github.com/Ironholds/urltools/blob/master/vignettes/urltools.Rmd -- Oliver Keyes Research Analyst Wikimedia Foundation -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
[Wiki-research-l] R library for URL handling
Possibly of interest to any researchers who work with our pageview/requests data: I've just released v1.0.0 of urltools,[0] a library that provides very, very fast vectorised URL decoding and parsing. Might be useful for the useRs in our community! See the associated vignette for functionality.[1] [0] https://github.com/Ironholds/urltools [1] https://github.com/Ironholds/urltools/blob/master/vignettes/urltools.Rmd -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
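urltools itself is an R package, but for readers unfamiliar with the task it performs, here is a rough Python analogue of "URL decoding and parsing" using only the standard library (an illustration of the operation, not of urltools' actual API):

```python
from urllib.parse import unquote, urlsplit

def parse_request_url(url: str) -> dict:
    """Decode a percent-encoded URL and split it into components,
    the way one would when cleaning request-log data. Note: decoding
    before splitting is naive if delimiters themselves are encoded."""
    parts = urlsplit(unquote(url))
    return {
        "scheme": parts.scheme,
        "domain": parts.netloc,
        "path": parts.path,
        "query": parts.query,
    }

example = "https://en.wikipedia.org/wiki/Glass%20frog?action=info"
print(parse_request_url(example))
```

urltools does the same kind of thing vectorised over whole columns of URLs, which is what makes it fast enough for request-log-scale data.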
Re: [Wiki-research-l] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal
On Wed, Jan 14, 2015 at 3:39 AM, John Mark Vandenberg jay...@gmail.com wrote: On Wed, Jan 14, 2015 at 2:25 PM, Oliver Keyes ironho...@gmail.com wrote: I'm confused; John, could you point to the element of the collected data that isn't collected already by default in any Nginx or Apache setup? I agree that there might be a lack of user expectation, but 'silently capturing behavioral data' seems somewhat hyperbolic to describe what's actually going on. The proposed element to be added is geolocation below country level. Default Nginx and Apache log formats do not include geolocation. Which is why this research proposal exists and is being discussed, and rightly so. Gotcha: I thought you were referring to the information we already have. FWIW, the Nginx geoip module is not even included, by default, when compiling the source code. As the paper explicitly describes, and as is a common theme in research proposals, Wikimedia access log information is user reading behaviour being captured. The old privacy and data retention policies gave users the expectation that access log data was destroyed after a set period, assumed to be only three months as that was the limit of Checkuser visibility. The current policies are more like 'yes, we collect a lot of data about users, using tracking technology, and please trust us. And sorry we don't honour Do Not Track, as we presumed that you trust us and the researchers that we allow to access our analytics.' We should be planning for what the effect will be when the WMF servers are hacked and _all_ of the analytics data is suddenly in the hands of a repressive government or similar. Or, imagine the WMF sends the analytics data across an insecure link which is tapped and the data reconstructed, either due to not using secure links at all, or an accidental routing problem. https://lists.wikimedia.org/pipermail/wikimedia-l/2013-December/129357.html The geolocation proposal is to perform it over IP addresses...which are already stored.
So, the only major difference between hacking now and hacking later is that doing it later means you don't have to spend 99 bucks on a geolocation hashtable. If/when that day comes, hopefully they don't have much data to make inferences from, and what data they obtain can be well justified. Having a quick peek, I thought it was odd that browsing Wikimedia sites now causes impressions to be sent back to the WMF servers with the country of the user included. This is a workaround to simplify analytics. https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FCentralNotice/8ee8775a5df9f68857a337efadbb2b5d36811f1a/special%2FSpecialRecordImpression.php CentralNotice and the fundraising banners have done this for absolutely years, yes; that's the code you're looking at. The more you collect, especially using multiple systems to collect similar data, the more likely it is that, if subpoenaed, WMF's various datasets could be used to infer a pretty reliable answer to 'which days in 2013 was John Vandenberg in Indonesia?', or 'when did John Vandenberg first read the Wikipedia article about [bomb-making ingredient]?' The more you publish, even aggregated, the more likely these types of questions can be inferred without a subpoena, at least for users with large enough lists of public contributions, by scientists like yourself with lots of computation power and plenty of time on their hands rifling through the data to *infer* the identity of editors; and if it is a government body, they also have lots of other datasets which can be used to assist in the task. Yep, and that's why we're discussing this. Adding fine-grained geolocation information to published page views is an example of the latter, and the paper wisely suggests not including logged-in users as a possible solution to some of the privacy issues. There is also the problem that many IPs can be easily inferred to be a single cohort of people in some situations, e.g.
in regions where the only large collection of computers is a single facility, e.g. a school. In a repressive regime especially, that could lead to official questions being asked, like: why were so many students at this school reading about [blah] on [date]? And teachers being identified as responsible, etc. The paper considers IP users vs logged-in users to be a binary set. However, there are tools built which exploit the fact that logged-in users sometimes make a logged-out edit which identifies their IP. Add geolocation of pageviews and we can infer the probability that other IPs in their smallest geolocation block are also likely to be edits by the same person, as the algorithm in the paper leaks 'number of active editors in each region each day'. No, it doesn't: the proposal is to aggregate. Where there are few observations (or little variation in observations) within a geographic region, the data will be moved up one level and aggregated, and so on until a sufficient degree
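The roll-up Oliver describes, where sparse regions are merged into their parent until counts are safe to publish, can be sketched as follows. The region tree, the names, and the threshold `K` here are all hypothetical, not taken from the actual proposal:

```python
from collections import Counter

K = 10  # minimum views per published bucket (assumed threshold)

# child -> parent; the top level (country) has no parent. Hypothetical tree.
PARENT = {"Bandung": "West Java", "West Java": "Indonesia"}

def aggregate(views: dict, k: int = K) -> Counter:
    """Roll any region with fewer than k observations up into its
    parent, repeating until every published bucket meets the threshold."""
    counts = Counter(views)

    def depth(region):
        d = 0
        while region in PARENT:
            region, d = PARENT[region], d + 1
        return d

    # Visit deepest regions first, so children merge before parents are judged.
    for region in sorted(counts, key=depth, reverse=True):
        if counts[region] < k and region in PARENT:
            counts[PARENT[region]] += counts.pop(region)
    return counts

print(aggregate({"Bandung": 3, "West Java": 4, "Indonesia": 50}))
```

Here both sub-national buckets fall below the threshold, so everything is published at the country level only; a city with enough traffic would survive at its own granularity. This is only the aggregation step — the actual proposal layers further protections (e.g. excluding logged-in users) on top.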
Re: [Wiki-research-l] How many links did TWL account recipients add to Wikipedia with their access?
Actually, the API has a 'grab all of the external links from [page]' query: I'm not sure if it can be applied to historical revisions, but we can see! On Wed, Jan 14, 2015 at 2:23 AM, Gerard Meijssen gerard.meijs...@gmail.com wrote: Hoi, These same people may have added content to Wikidata ... Obviously it has not been considered. However, you can query for these people there. You can also query how many external references were added by bot. It may provide the groundwork for going to the Wikipedias and finding who did it .. references to external sources are added all the time. Thanks, GerardM On 14 January 2015 at 00:23, Aaron Halfaker ahalfa...@wikimedia.org wrote: I don't think that you can do this with Quarry, since you'll need to parse wiki content in order to extract external links. I don't think they are stored in a table anywhere. However, I think that we can do it fairly easily with a process on the XML dumps. If you can give me the list of editors and partner websites, I could put a script together and talk to you about the bits. -Aaron On Tue, Jan 13, 2015 at 5:01 PM, Jake Orlowitz jorlow...@gmail.com wrote: Hi all, There are 2000 editors who have received access to 20 different online databases. We know the usernames of these editors and the url prefixes of the websites they were given access to. We need to know: - from July 18th 2014 to January 11th 2015 - on English Wikipedia - for the cohort of 2000 TWL editors - ...how many times did they add links to any of the 20 partner websites I have my fingers crossed that Quarry can solve this but I need some help to write a query. Bonus queries: 1) In that date range, how many links did these editors add using partner websites on *all Wikipedias* (any language) 2) What is the baseline change in all external links on English Wikipedia in that date range 3) What is the baseline change in all external links on *all Wikipedias* since July 18th 2014 Thanks so much for any guidance on this!
Jake (Ocaasi) The Wikipedia Library ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
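The query Oliver mentions is the MediaWiki Action API's prop=extlinks. A small sketch of building such a request and pulling the URLs out of the response; note it targets the current revision only, so (per Oliver's caveat) it doesn't by itself answer the historical who-added-what question:

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def extlinks_url(title: str, limit: int = 500) -> str:
    """Build an action=query&prop=extlinks request URL for one page.
    Fetch it with urllib.request.urlopen (not done here) to get JSON."""
    params = {
        "action": "query",
        "prop": "extlinks",
        "titles": title,
        "ellimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)

def parse_extlinks(payload: dict) -> list:
    """Extract bare URLs from an extlinks response. The link key is
    '*' in the legacy JSON format and 'url' with formatversion=2."""
    links = []
    for page in payload.get("query", {}).get("pages", {}).values():
        for el in page.get("extlinks", []):
            links.append(el.get("*") or el.get("url"))
    return links

print(extlinks_url("Glass frog"))
```

For Jake's actual cohort question, one would still need per-revision data (the XML dumps Aaron suggests), since extlinks reflects only what is on the page now.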
Re: [Wiki-research-l] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal
I'm confused; John, could you point to the element of the collected data that isn't collected already by default in any Nginx or Apache setup? I agree that there might be a lack of user expectation, but 'silently capturing behavioral data' seems somewhat hyperbolic to describe what's actually going on. On Tuesday, 13 January 2015, John Mark Vandenberg jay...@gmail.com wrote: On Wed, Jan 14, 2015 at 9:22 AM, Andrew Gray andrew.g...@dunelm.org.uk wrote: Fair enough - I don't use it, and I think I'd got entirely the wrong end of the stick on what it's for! If it's intended to stop tracking by third-party sites then it certainly seems to be of little relevance here. I think you're right to be concerned about this. It is about expectations; people do not expect an NGO providing an encyclopedia to be silently capturing reading behaviour data. If the data is provided to other entities, even for noble research objectives, people expect Do Not Track to cover this. https://cyberlaw.stanford.edu/node/6573 -- John Vandenberg ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Sent from my mobile computing device of Lovecraftian complexity and horror. ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Endowment perpetuity
Speaking of questions that open with an agenda, have you tried asking the fundraising team? On 12 January 2015 at 20:07, James Salsman jsals...@gmail.com wrote: Speaking of fundraising far over budget, did the question about an endowment perpetuity make it on to the last donor survey? If so, what was the result? I seem to remember a favorable response, from somewhere, but can't find anything either way. Was it a Board poll? ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Fwd: $55 million raised in 2014
Bah; dropped a digit when reading the y-axis. My bad. My concerns about straight extrapolation for this model remain, however. On 2 January 2015 at 11:31, Oliver Keyes oke...@wikimedia.org wrote: 3 billion being...above the upper bound of the extrapolation you've made? Uh-huh. Extrapolation is not a particularly useful method to use for the budget, because it assumes endless exponential growth. I can see the budget increasing due to us increasingly taking on the responsibilities we've previously been unable to do anything about, but I can't see what we'd actually /do/ with 3 billion dollars (although if we want to expand the Hadoop cluster with most of that I would, of course, be most grateful ;p) On 2 January 2015 at 04:23, Gerard Meijssen gerard.meijs...@gmail.com wrote: Hoi, It is known that education is a great way to eradicate poverty. We know that Wikipedia brings information and is educational. When the effect of your 3 billion dollar brings education and effectively helps to eradicate poverty it is well worth it. No irony intended. Thanks, GerardM On 2 January 2015 at 09:11, James Salsman jsals...@gmail.com wrote: In ten years time, I predict the Foundation will raise $3 billion: http://i.imgur.com/hdoAIan.jpg -- Forwarded message -- From: James Salsman jsals...@gmail.com Date: Thu, Jan 1, 2015 at 9:01 PM Subject: $55 million raised in 2014 To: Wikimedia Mailing List wikimedi...@lists.wikimedia.org Happy new year: http://i.imgur.com/faPsI9J.jpg Source: http://frdata.wikimedia.org/yeardata-day-vs-ytdsum.csv I don't mind the banners, although I am still saddened that several hundred editor-submitted banners remain untested from six years ago, when the observed variance in the performance of those that were tested indicates that there are likely at least 15 which would do better than any of those which were tested. Why the heck is the fundraising team still ignoring all those untested submissions? 
But as to the intrusiveness of the banners, I would rather have fade-in popups with fuchsia blink/marquee text on an epileptic-seizure-inducing background and auto-play audio than have the fundraising director claim that donations are decreasing to help justify narrowing scope. Best regards, James Salsman ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Fwd: $55 million raised in 2014
On 2 January 2015 at 15:08, James Salsman jsals...@gmail.com wrote: Oliver Keyes wrote: ... Extrapolation is not a particularly useful method to use for the budget, because it assumes endless exponential growth. I agree. Formal budgeting usually shouldn't extend further than three to five years in the nonprofit sector (long-term budgeting is unavoidable in government and some industries.) However, here are a couple of illustrations of some reasons I believe a ten year extrapolation of Foundation fundraising is completely reasonable: http://imgur.com/a/mV72T Words tend to be more useful than contextless images. ... I can't see what we'd actually /do/ with 3 billion dollars I used to be in favor of establishing an endowment with a sufficient perpetuity, and then halting fundraising forever, but I have changed my mind. I think the Foundation should continue to raise money indefinitely to pay people for this task: https://meta.wikimedia.org/wiki/Grants:IEG/Revision_scoring_as_a_service That is equivalent to a general computer-aided instruction system, with the side effects of both improving the encyclopedia and making counter-vandalism bots more accurate. As an anonymous crowdsourced review system based on consensus voting instead of editorial judgement, it leaves the Foundation immunized, with their safe harbor provisions regarding content control intact. It's also not worth 3 billion dollars (no offence, Aaron!) as evidenced by the fact that it can be established with $20k. This is not a discussion for research-l, this is a discussion for (at best) Wikimedia-l - and I have to say that I don't feel it's at all useful even /there/, but it is at least in context.
Spending time discussing pie-in-the-sky what would we do if we had 3 billion dollars ideas is all well and nice, but I prefer to think that time is better spent doing research with the resources we have now, and editing with the resources we have now, and making pitches for additional resources as and when they become available. So on that note: I'm going to go off and do that. -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Fwd: $55 million raised in 2014
On 2 January 2015 at 18:14, James Salsman jsals...@gmail.com wrote: ... This is not a discussion for research-l On the contrary, please see e.g. http://www.wikisym.org/os2014-files/proceedings/p609.pdf this Foundation-sponsored IEG effort can serve as a confirmatory replication of that prior work. Let me rephrase, because I evidently wasn't clear: fanciful and unrealistic discussions of what we'd do with a pot of money that wouldn't be available for a decade, and which only exists in the first place if you assume exponential growth, are not for research-l ... time is better spent doing research with the resources we have now I wish someone would please replicate my measurement of the variance in the distribution of fundraising results using the editor-submitted banners from 2008-9, and explain to the fundraising team that that distribution implies they can do a whole lot better than sticking with the spiel which degrades Foundation employees by implying they typically spend $3 or £3 on coffee. (Although I wouldn't discount the possibility that some donors feel good about sending Foundation employees to boutique coffee shops.) We know donor message- and banner-fatigue exists as a strong effect which limits the useful life of fundraising approaches in some cases, so they have to keep trying to keep up. When are they going to test the remainder of the editors' submissions? Given that you've been asking for that analysis for four years, and it's never been done, and you've been repeatedly told that it's not going to happen, could you take those hints? And by hints, I mean explicit statements. I appreciate that you're operating in good faith, but there comes a point when http://wondermark.com/1k62/ starts proving that life imitates art. Repeatedly having this same conversation is a colossal, ever-draining waste of everyone's time. Please stop bringing it up.
-- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Looking for reader's click log data for Wikipedia
Afraid not. First, we do not have some of those datapoints; we do not currently have unique user IDs. And, second, it would be a tremendous ethical violation for us to release the data that we /do/ have (IP addresses, for example). On 28 December 2014 at 21:00, Ditty Mathew ditty...@gmail.com wrote: Hi, Is reader click-log data (user id/IP, article title, timestamp) available for Wikipedia? with regards Ditty ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Looking for reader's click log data for Wikipedia
I'm not exactly sure how one provides an anonymised dataset that contains IP addresses. But: we don't have those navigation paths and so can't provide them. Sure, we could provide the {referer, URL} tuples associated with specific IP addresses, and replace the IP with some kind of randomly-generated value (or just a salted hash), but this falls apart very quickly with the modern structure of the internet and the scale Wikimedia properties operate on: you can have a lot of distinct people at one IP address, particularly through cellular networks, and so multiple sessions and trails can get inaccurately grouped together. More importantly, the HTTPS protocol involves either sanitising or completely stripping referers, rendering those chains impossible to reconstruct. I believe Leila Zia and Bob West (who will hopefully see this message; I know Leila is on this list!) are currently working on a project that looks at search paths, and they may have additional commentary. But generally speaking: we do not generate this data as a matter of course, we would not be comfortable releasing it (unless exceedingly sanitised), and as the person who deals with our request logs on a day-to-day basis I can think of a half-dozen ways in which it would produce false results (ways whose probability of occurring we have no real way of checking). On 28 December 2014 at 22:53, Ditty Mathew ditty...@gmail.com wrote: The exact user information is not needed. The anonymized data is enough. What exactly we need is the navigation path of Wikipedia readers. with regards Ditty On Sun, Dec 28, 2014 at 9:46 PM, Oliver Keyes oke...@wikimedia.org wrote: Afraid not. First, we do not have some of those datapoints; we do not currently have unique user IDs. And, second, it would be a tremendous ethical violation for us to release that data that we /do/ have (IP addresses, for example).
On 28 December 2014 at 21:00, Ditty Mathew ditty...@gmail.com wrote: Hi, Is the reader's click log data(should contain user id/ip, article title, timestamp) is available for Wikipedia. with regards Ditty ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
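Oliver mentions replacing IPs with "some kind of randomly-generated value (or just a salted hash)". A sketch of why the salted-hash variant is weak as anonymisation: the IPv4 space is small enough that anyone holding the salt can simply enumerate candidates until the published token matches. The salt and IPs below are of course made up:

```python
import hashlib

SALT = b"not-a-real-secret"  # hypothetical salt; in a leak, the attacker has this

def pseudonymise(ip: str) -> str:
    """Replace an IP with a truncated salted SHA-256 digest."""
    return hashlib.sha256(SALT + ip.encode()).hexdigest()[:16]

token = pseudonymise("192.0.2.7")  # this is what a "sanitised" dataset publishes

# With the salt in hand, re-identification is a brute-force loop. Here we
# only sweep one /24 for brevity; the full IPv4 space (~4.3e9 values) is
# still cheap to enumerate on commodity hardware.
for last_octet in range(256):
    candidate = f"192.0.2.{last_octet}"
    if pseudonymise(candidate) == token:
        print("re-identified:", candidate)
        break
```

This is one reason "just hash the IPs" does not make a click-log dataset releasable, quite apart from the session-grouping and referer problems described above.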
[Wiki-research-l] Editor sessions and related metrics
Hey all, Not sure if this would be interesting to researchers or community members, but: you might remember a paper Stuart and Aaron did a while ago about measuring edit sessions - http://www-users.cs.umn.edu/~halfak/publications/Using_Edit_Sessions_to_Measure_Participation_in_Wikipedia/geiger13using-preprint.pdf To me it's really interesting, because it's (as much as anything else) a new metric for measuring participation, and a metric we can extract additional metrics from (e.g., session length). As part of some related work on /reader/ sessions, I wrote a pile of code to handle session reconstruction. I've generalised it (it doesn't care if you've got reader timestamps, editor timestamps, or best buy receipt timestamps) and thrown it up at https://github.com/Ironholds/reconstructr . I figure it could be useful to any researchers or community members looking into sessions. Thanks, -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
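The core idea behind session reconstruction of this kind is simple: sort a user's event timestamps and start a new session whenever the gap between consecutive events exceeds an inactivity cutoff. A minimal Python sketch (the one-hour cutoff is an assumption for illustration; reconstructr itself is the R implementation linked above):

```python
CUTOFF = 3600  # inactivity cutoff in seconds (assumed value)

def reconstruct_sessions(timestamps, cutoff=CUTOFF):
    """Group Unix timestamps into sessions: a gap longer than `cutoff`
    between consecutive events starts a new session. Returns a list of
    sessions, each a sorted list of timestamps."""
    if not timestamps:
        return []
    ordered = sorted(timestamps)
    sessions = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > cutoff:
            sessions.append([cur])      # gap too long: new session
        else:
            sessions[-1].append(cur)    # same session continues
    return sessions

events = [0, 120, 300, 9000, 9050]
print(reconstruct_sessions(events))            # two sessions
print([s[-1] - s[0] for s in reconstruct_sessions(events)])  # session lengths
```

As the email notes, the method doesn't care whether the timestamps are edits, pageviews, or receipts; only the cutoff choice is domain-specific, which is exactly what the linked paper investigates.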
Re: [Wiki-research-l] Editor sessions and related metrics
Totally; already threw it at the internal research list :) On 16 December 2014 at 14:37, Toby Negrin tneg...@wikimedia.org wrote: Awesome work! Can we distribute in the foundation? On Tue, Dec 16, 2014 at 10:40 AM, Oliver Keyes oke...@wikimedia.org wrote: Hey all, Not sure if this would be interesting to researchers or community members, but: you might remember a paper Stuart and Aaron did a while ago about measuring edit sessions - http://www-users.cs.umn.edu/~halfak/publications/Using_Edit_Sessions_to_Measure_Participation_in_Wikipedia/geiger13using-preprint.pdf To me it's really interesting, because it's (as much as anything else) a new metric for measuring participation, and a metric we can extract additional metrics from (e.g., session length). As part of some related work on /reader/ sessions, I wrote a pile of code to handle session reconstruction. I've generalised it (it doesn't care if you've got reader timestamps, editor timestamps, or best buy receipt timestamps) and thrown it up at https://github.com/Ironholds/reconstructr . I figure it could be useful to any researchers or community members looking into sessions. Thanks, -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Pageviews, mobile versus desktop
Yep; same timeframe. On 15 December 2014 at 12:50, Federico Leva (Nemo) nemow...@gmail.com wrote: Oliver Keyes, 13/12/2014 21:15: http://ironholds.org/misc/pageviews_year_and_week.png - fascinating! It reveals a lot of seasonality in the desktop views - again, not replicated on mobile (at least, not so strongly) Does this graph also go from 2013-02-01 to 2014-12-01? Nemo ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] commentary on Wikipedia's community behaviour (Aaron gets a quote)
Communities are who they choose to offer plaudits to. The people getting off largely scot-free in this case are people a highly vocal subgroup has put on a pedestal for years. I don't know if the community as a whole is that ugly and bitter, but I can understand where people would get the impression, from looking at the sort of person we celebrate. On Monday, 15 December 2014, WereSpielChequers werespielchequ...@gmail.com wrote: We have problems, I don't dispute that. But as ugly and bitter as 4chan? That has to be an exaggeration. Regards Jonathan Cardy On 13 Dec 2014, at 01:03, Andrew Lih andrew@gmail.com wrote: I certainly hope you're right Sydney. What a horrible mess. On Fri, Dec 12, 2014 at 5:53 PM, Sydney Poore sydney.po...@gmail.com wrote: I think feminists, especially those who take an interest in STEM, will pass this article around. Sydney On Dec 12, 2014 5:35 PM, Andrew Lih andrew@gmail.com wrote: It's a good piece, but honestly I think only the dedicated tech reader will make it through the entire story. There's a lot of jargon and insider intrigue, such that I could imagine most people never making it past the typewriter barf of BLP, AGF, NOR :) On Fri, Dec 12, 2014 at 5:26 PM, Dariusz Jemielniak dar...@alk.edu.pl wrote: While I agree that the article is overly negative (likely because of the individual experience), I think it still points to an important problem. I don't perceive this article as really problematic in terms of image. Maybe naively, I imagine that people will not stop donating because the community is not ideal.
pundit On Fri, Dec 12, 2014 at 11:16 PM, Kerry Raymond kerry.raym...@gmail.com wrote: There’s a saying that everyone likes to eat sausages but nobody likes to know how they are made. It is not good to have negative publicity like that during the annual donation campaign (irrespective of the motivations of the journalist and/or the rights/wrongs of the issue being reported, neither of which I intend to debate here). As a donation-funded organisation, public perception matters a lot. Kerry -- *From:* Jonathan Morgan [mailto:jmor...@wikimedia.org] *Sent:* Saturday, 13 December 2014 6:43 AM *To:* Research into Wikimedia content and communities *Cc:* Kerry Raymond *Subject:* Re: [Wiki-research-l] commentary on Wikipedia's community behaviour (Aaron gets a quote) I mostly agree. On one hand, it's always nice to see a detailed description of how wiki-sausage gets made in a major venue. On the other, this journalist clearly has a personal axe to grind, and used his bully pulpit to grind it in public. - J On Fri, Dec 12, 2014 at 1:39 AM, Federico Leva (Nemo) nemow...@gmail.com wrote: 1000th addition to the inconsequential rant genre. Nemo ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Jonathan T. Morgan Community Research Lead Wikimedia Foundation User:Jmorgan (WMF) https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF) jmor...@wikimedia.org ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- __ prof. dr hab.
Dariusz Jemielniak, head of the Department of International Management and of the CROW research centre, Kozminski University http://www.crow.alk.edu.pl; member of the Academy of Young Scholars of the Polish Academy of Sciences; member of the Science Policy Committee of the MNiSW. The world's first ethnography of Wikipedia, Common Knowledge? An Ethnography of Wikipedia (2014, Stanford University Press), which I wrote, is out: http://www.sup.org/book.cgi?id=24010 Reviews - Forbes: http://www.forbes.com/fdc/welcome_mjx.shtml Pacific Standard: http://www.psmag.com/navigation/books-and-culture/killed-wikipedia-93777/ Motherboard: http://motherboard.vice.com/read/an-ethnography-of-wikipedia The Wikipedian: http://thewikipedian.net/2014/10/10/dariusz-jemielniak-common-knowledge
Re: [Wiki-research-l] How to track all the diffs in real time?
Oh dear god, that would be incredible. The non-streaming API has a wonderful bug: if you request a series of diffs, and there is more than one uncached diff in that series, only the first uncached diff will be returned. For the rest it returns...an error? No. Some kind of special value? No. It returns an empty string. You know: that thing it also returns if there is no difference. So instead you stream edits and compute the diffs yourself and everything goes a bit Pete Tong. Having this service around would be a lifesaver. On 13 December 2014 at 10:14, Scott Hale computermacgy...@gmail.com wrote: Great idea, Yuvi. Speaking as someone who just downloaded diffs for a month of data from the streaming API for a research project, I certainly could see an 'augmented stream' with diffs included being very useful for research and also for bots. On Sat, Dec 13, 2014 at 10:52 PM, Yuvi Panda yuvipa...@gmail.com wrote: On Sat, Dec 13, 2014 at 2:34 PM, Yuvi Panda yuvipa...@gmail.com wrote: If a lot of people are doing this, then perhaps it makes sense to have an 'augmented real time streaming' interface that is an exact replica of the streaming interface but with diffs added. Or rather, if I were to build such a thing, would people be interested in using it? -- Yuvi Panda T http://yuvi.in/blog -- Scott Hale Oxford Internet Institute University of Oxford http://www.scotthale.net/ scott.h...@oii.ox.ac.uk -- Oliver Keyes Research Analyst Wikimedia Foundation
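For anyone bitten by the same ambiguity: if you fall back to streaming edits and computing diffs yourself, an empty result becomes unambiguous. This is only an illustrative sketch (compute_diff is a hypothetical helper; real code would fetch revision text from the API or a stream), showing how a locally computed diff distinguishes "no change" from "no data":

```python
import difflib

def compute_diff(old_text, new_text):
    """Diff two revision texts locally, sidestepping the API's
    ambiguous empty-string result for uncached diffs."""
    if old_text is None or new_text is None:
        # Missing revision text: fail loudly instead of
        # returning an empty string that looks like "no change".
        raise ValueError("revision text unavailable")
    diff = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile="old", tofile="new", lineterm="")
    # An empty result here genuinely means "no difference".
    return "\n".join(diff)

# Identical revisions -> unambiguously empty diff.
assert compute_diff("foo\nbar", "foo\nbar") == ""
```

The point of the sketch is simply that once you control the diffing, an empty string can only mean one thing.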
[Wiki-research-l] Pageviews, mobile versus desktop
A graph I just generated while messing around with the high-granularity data we used in the monthly metrics readership report: http://ironholds.org/misc/pageviews_trends.png The thing I find really interesting about this is not the trend (mobile up, desktop down. As Lehrer said, this we know from nothing!) but the patterns. Mobile clusters far more tightly than desktop does. I'm not sure what this means (desktop users are weird? There's a lot of bot traffic we're not catching? That's my guess) but I thought it was pretty and might provoke some hypothesising. So, here you go! -- Oliver Keyes Research Analyst Wikimedia Foundation
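"Clusters far more tightly" can be made concrete with a dispersion measure such as the coefficient of variation. A sketch on invented daily counts (not real data; the function name is mine):

```python
from statistics import mean, pstdev

def coefficient_of_variation(series):
    """Relative spread: population stdev / mean.
    Lower values = tighter clustering around the mean."""
    return pstdev(series) / mean(series)

# Toy daily pageview counts, illustrative only:
desktop = [510, 430, 620, 380, 700, 450, 590]
mobile = [300, 310, 295, 305, 298, 302, 307]

# Mobile's relative spread is smaller, i.e. it clusters more tightly.
assert coefficient_of_variation(mobile) < coefficient_of_variation(desktop)
```

Comparing relative rather than absolute spread matters here, since desktop and mobile volumes differ.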
Re: [Wiki-research-l] Pageviews, mobile versus desktop
Bah, you're right! Will reupload. Pageviews are bucketed by UTC day, although the axis is by months to avoid making it essentially unreadable. It's generated in ggplot2 using theme_bw() (one of my favourite combinations) On 13 December 2014 at 12:33, Ed Summers e...@pobox.com wrote: On Dec 13, 2014, at 12:18 PM, Oliver Keyes oke...@wikimedia.org wrote: I'm not sure what this means (desktop users are weird? There's a lot of bot traffic we're not catching? That's my guess) but I thought it was pretty and might provoke some hypothesising. So, here you go! I think the axis labels are flipped? How are the page views bucketed: day, week, month, something else? It is a pretty clean looking graph, what did you use to generate it? //Ed ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Pageviews, mobile versus desktop
Ooh, that's a really good point. In fact, we know there's different behaviour - mobile rises on weekends, desktop falls, but the desktop fall outweighs the mobile rise. I'm knee-deep in adjusted R2 values right now but I'll visualise that way and see what happens :) On 13 December 2014 at 13:17, Ed Summers e...@pobox.com wrote: It might be interesting to bucket by week to see if you still see the difference in clustering between desktop and mobile. I wonder if it’s a result of different behavior on desktop/mobile on weekdays/weekends? //Ed On Dec 13, 2014, at 12:37 PM, Oliver Keyes oke...@wikimedia.org wrote: Bah, you're right! Will reupload. Pageviews are bucketed by UTC day, although the axis is by months to avoid making it essentially unreadable. It's generated in ggplot2 using theme_bw() (one of my favourite combinations) On 13 December 2014 at 12:33, Ed Summers e...@pobox.com wrote: On Dec 13, 2014, at 12:18 PM, Oliver Keyes oke...@wikimedia.org wrote: I'm not sure what this means (desktop users are weird? There's a lot of bot traffic we're not catching? That's my guess) but I thought it was pretty and might provoke some hypothesising. So, here you go! I think the axis labels are flipped? How are the page views bucketed: day, week, month, something else? It is a pretty clean looking graph, what did you use to generate it? //Ed -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] Pageviews, mobile versus desktop
http://ironholds.org/misc/pageviews_year_and_week.png - fascinating! It reveals a lot of seasonality in the desktop views - again, not replicated on mobile (at least, not so strongly) On 13 December 2014 at 13:49, Oliver Keyes oke...@wikimedia.org wrote: Ooh, that's a really good point. In fact, we know there's different behaviour - mobile rises on weekends, desktop falls, but the desktop fall the mobile rise. I'm knee-deep in adjusted R2 values right now but I'll visualise that way and see what happens :) On 13 December 2014 at 13:17, Ed Summers e...@pobox.com wrote: It might be interesting to bucket by week to see if you still see the difference in clustering between desktop and mobile. I wonder if it’s a result of different behavior on desktop/mobile on weekdays/weekends? //Ed On Dec 13, 2014, at 12:37 PM, Oliver Keyes oke...@wikimedia.org wrote: Bah, you're right! Will reupload. Pageviews are bucketed by UTC day, although the axis is by months to avoid making it essentially unreadable. It's generated in ggplot2 using theme_bw() (one of my favourite combinations) On 13 December 2014 at 12:33, Ed Summers e...@pobox.com wrote: On Dec 13, 2014, at 12:18 PM, Oliver Keyes oke...@wikimedia.org wrote: I'm not sure what this means (desktop users are weird? There's a lot of bot traffic we're not catching? That's my guess) but I thought it was pretty and might provoke some hypothesising. So, here you go! I think the axis labels are flipped? How are the page views bucketed: day, week, month, something else? It is a pretty clean looking graph, what did you use to generate it? 
//Ed -- Oliver Keyes Research Analyst Wikimedia Foundation
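Ed's week-bucketing suggestion is mechanical to implement once you have per-day counts; aggregating into ISO weeks smooths out the weekday/weekend cycle. A sketch with toy data (bucket_by_week is a hypothetical helper, not anyone's production code):

```python
from collections import defaultdict
from datetime import date, timedelta

def bucket_by_week(daily_counts):
    """Aggregate {date: pageviews} into ISO (year, week) buckets,
    smoothing out the weekday/weekend cycle."""
    weekly = defaultdict(int)
    for day, views in daily_counts.items():
        iso = day.isocalendar()  # (ISO year, ISO week, weekday)
        weekly[(iso[0], iso[1])] += views
    return dict(weekly)

# Two full Mon-Sun weeks of toy data: 100 views per weekday, 60 per weekend day.
start = date(2014, 12, 1)  # a Monday
daily = {start + timedelta(d): (60 if (start + timedelta(d)).weekday() >= 5 else 100)
         for d in range(14)}
weekly = bucket_by_week(daily)
assert list(weekly.values()) == [620, 620]  # 5*100 + 2*60 per week
```

Using the ISO calendar (rather than, say, "every 7 days from the start of the data") keeps week boundaries aligned across years.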
Re: [Wiki-research-l] [Wikimedia-l] wikipedia access traces ?
-- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] What works for increasing editor engagement?
On 13 September 2014 20:52, James Salsman jsals...@gmail.com wrote: Pine wrote: I agree that the shift to mobile is a big deal; I do not agree: Active editor attrition began on its present trend in 2007, far before any mobile use was significant. I'm not seeing how that means it's not a big deal. Mobile now makes up 30% of our page views and its users display divergent behavioural patterns; you don't think a group that makes up 30% of pageviews is a user group that is a 'big deal' for engagement? I remain concerned that tech-centric approaches to editor engagement like VE and Flow, while perhaps having a modest positive impact, do little to fix the incivility problem that is so frequently cited as a reason for people to leave. I agree that VE has already proven that it is ineffective in significantly increasing editor engagement. And I agree that Flow has no hope of achieving any substantial improvements. There are good reasons to believe that Flow will make things worse. For example, using wikitext on talk pages acts as a pervasive sandbox substitute for practicing the use of wikitext in article editing. And I do not agree that civility issues have any substantial correlation with editor attrition. There have been huge civility problems affecting most editors on controversial subjects since 2002, and I do not see any evidence that they have become any worse or better on a per-editor basis since. My opinion is that the transition from the need to create new articles to maintaining the accuracy and quality of existing articles has been the primary cause of editor attrition, and my studies of Short Popular Vital Articles (WP:SPVA) have supported this hypothesis. 
Therefore, I strongly urge implementation of accuracy review systems: https://strategy.wikimedia.org/wiki/Proposal:Develop_systems_for_accuracy_review -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] Joining derp?
Uhm. If you don't think there's any distinction in nature or terminology between 'the contents of a form field intentionally filled out and submitted by a user' and every other kind of data, there's a disconnect here somewhere. On Thursday, 4 September 2014, Ed Summers e...@pobox.com wrote: On Sep 4, 2014, at 11:20 AM, aaron shaw aarons...@northwestern.edu wrote: Sorry Ed, I don't think we all know that. In fact, I'm unaware of any way in which Wikimedia makes money based on data collected from its users. To my knowledge, the Foundation is supported almost entirely through private donations[1]. Ok, try this on for size: An edit to a Wikipedia article is data collected from its users. WMF receives millions of dollars of donations a year because of this data, and its accessibility. //Ed -- Sent from a portable device of Lovecraftian complexity.
Re: [Wiki-research-l] [Wikimedia-l] Catching copy and pasting early
I'm pretty sure that responding to well-intended and politely phrased criticism with sarcasm is probably also not something that will help us in avoiding losing contributors :p I agree that this is not an immediately understandable thing about contributions, although I think it should be more understandable by researchers than it might be by the man on the Clapham omnibus (an analogy would be 'not publishing the same paper in multiple journals'), but my concern is that information exists on an axis. At one end we have the point at which the mass of information presented scares people off before they even hit save. At the other is the point at which the lack of information leads to somebody stumbling into a spiked pit. Our goal is to find a point in the middle, and I'm pretty cautious about attempts to add more documentation given that that's the direction we've historically trended in. On Wednesday, 23 July 2014, Kerry Raymond kerry.raym...@gmail.com wrote: Well, I’m glad it’s that simple (sarcasm intended!). Do we really expect new/occasional contributors to figure this out? Having been on Wikipedia for 9 years, it’s all news to me. I always thought that clicking SAVE with By clicking the Save page button, you agree to the Terms of Use https://wikimediafoundation.org/wiki/Terms_of_Use and you irrevocably agree to release your contribution under the CC BY-SA 3.0 License https://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License and the GFDL https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GNU_Free_Documentation_License with the understanding that a hyperlink or URL is sufficient for CC BY-SA 3.0 attribution. that I was releasing **my** contribution, full stop, end of story. 
If we expect people to do more than this, shouldn’t it say something at this point like “If your contribution has previously been published elsewhere, please click here” and take people to a form where they can supply more details and then hit SAVE. Let’s make it easier for people to do the right thing instead of reverting them and losing them as contributors. Kerry -- *From:* wiki-research-l-boun...@lists.wikimedia.org [mailto:wiki-research-l-boun...@lists.wikimedia.org] *On Behalf Of *Maggie Dennis *Sent:* Thursday, 24 July 2014 12:42 AM *To:* Research into Wikimedia content and communities *Subject:* Re: [Wiki-research-l] [Wikimedia-l] Catching copy and pasting early Just a few points inline. :) On Tue, Jul 22, 2014 at 5:50 AM, James Heilman jmh...@gmail.com wrote: To clarify the proposal is: 1) only looking at new edits that add blocks of text over a certain size 2) only tagging those edits on a workspace page for further follow-up by an experienced human editor 3) only running on articles of WikiProjects that want it and are willing to follow-up (thus only WPMED for starters) What it is NOT is: a tool to add notices to article space, a tool to warn users on their talk pages, or a tool to look at old edits. It is also NOT many other things. This is a very narrow proposal. With respect to users who are adding content they own which they have previously had published: what you do is get them to agree in an email to release it under a CC BY-SA license and then send that email to OTRS. Alternatively, they can skip this step if they are reproducing materials from their own website by adding a release to that website. https://en.wikipedia.org/wiki/Wikipedia:DCM talks about how. I speak to that based on my volunteer experience, not my work experience. 
:) One further point - if they are the *sole* copyright holder contributing their own text work to Wikipedia, it must be co-licensed under the GFDL according to our terms of use https://wikimediafoundation.org/wiki/Terms_of_Use#7._Licensing_of_Content . Maggie With respect to the number of edits, WPMED gets about 1000 a day. If we say about 10% are of a significant size (a rather high estimate), and if we say copy-and-paste issues occur in 10% with a similar number of false positives, we are looking at 20 edits to review a day. Those within the project are able to handle this volume in a timely manner. -- James Heilman MD, CCFP-EM, Wikipedian The Wikipedia Open Textbook of Medicine www.opentextbookofmedicine.com -- Maggie Dennis Senior Community Advocate Wikimedia Foundation, Inc. -- Sent from a portable device of Lovecraftian
[Wiki-research-l] this month's research newsletter
Both of these suggestions sound great to me! I'm not sure who the best person is to move them forward (I encourage anyone who wants to volunteer to speak up!) but whatever happens, I'm really grateful that we could turn this into a 'how do we fix this in the long-term?' conversation and not get bogged down - it's one of the most productive mailing list threads I've seen in a while :) On Thursday, 3 July 2014, Heather Ford hfor...@gmail.com wrote: Thanks so much for this, Kerry. And thanks, Aaron, for (as always) great, productive suggestions. I think there are two issues that need to be dealt with separately here. The first is about disparaging remarks made about researchers' contributions that kicked off this discussion. One idea that I had when I saw a similar problem earlier this year was to at least have reviewers add their names to reviews so that we are making a clear distinction between the opinion of a single reviewer and the community/organisation as a whole. Some reviewers have added their names to reviews (thank you!) but I think that needs to be a standard for the newsletter. This probably won't solve the problem completely but hopefully reviewers will be more thoughtful about their critique in the future. The second is to encourage research about Wikipedia that engages with the Wikimedia community. And yes, I, too, think that awards and acknowledgements are great ideas. I'd say that, when evaluating, engagement is even more important than impact because we want to encourage students and researchers at various stages of their careers (many of whom would not win awards for impact) to engage with the community when working on these projects. Of course, this kind of work is necessarily going to have more impact because Wikimedians themselves are going to be a part of it somehow. 
For this, I definitely agree with some kind of acknowledgement of research done - beyond, perhaps, just one or two star researchers winning a few awards. This can be done together e.g. awards for best papers in different categories but also acknowledgements for work with the community on particular projects as suggested by Kerry. Best, Heather. Heather Ford Oxford Internet Institute http://www.oii.ox.ac.uk Doctoral Programme EthnographyMatters http://ethnographymatters.net | Oxford Digital Ethnography Group http://www.oii.ox.ac.uk/research/projects/?id=115 http://hblog.org | @hfordsa http://www.twitter.com/hfordsa On 3 July 2014 02:56, Kerry Raymond kerry.raym...@gmail.com wrote: Having had a work role oversighting many university researchers including PHD and other research students, I think many start out with intentions to engage fully with stakeholders and contribute back into the real world in some way, but it's fair to say that deadline pressures tend to force them to focus their energies into the academically valued outcomes, e.g. published papers, theses, etc. This is just as true for Wikipedia-related research as for, say, aquaculture. Of course, some never intended to contribute back, but are solely motivated by climbing the greasy pole of academia. Because data gathering can be a time-consuming or expensive stumbling block in a research plan, organisations that freely publish detailed data (as WMF does) are natural magnets to researchers who can use that data to study various phenomena which may have broader relevance than just Wikipedia or where the Wikipedia data serves as a ground truth for other experiments or as proxy for other unavailable data. For example, you can use Wikipedia to study categorisation or named entity extraction without having real interest in Wikipedia itself. So I think it is for those who are passionate about Wikipedia itself to see how such research findings may be used to improve Wikipedia. 
As for releasing source code, it has to be recognised that software in research projects is often very quick-and-dirty and probably not designed to be integrated into the MediaWiki code base. Effective solutions to Wikipedia issues often require a mix of technology and change to community process/culture (which is often far harder to get right). This is not to say that we should not encourage researchers to give back, but I think we do need to understand that the reasons people don't give back aren't always attributable solely to bad faith. In addition to suggestions already made re awards, just having a letter of commendation on WMF letterhead acknowledging the research and its potential to improve Wikipedia would be a useful thing, especially for junior researchers seeking to establish themselves; this kind of external validation is helpful to their CVs. This could be sent to any researchers whose research was deemed to have merit, with different wording for those who made (according to some appropriately-appointed group) greater or lesser contributions to real
Re: [Wiki-research-l] this month's research newsletter
We need to remember that researchers are at very different stages of their careers, they have very different motivations, and different levels of engagement with the Wikipedia community, but that *all* research on Wikipedia contributes to our understanding (even if as a catalyst for improvements). We want to encourage more research on Wikipedia, not attack the motivations of people we know little about - particularly when they're just students and particularly when this newsletter is housed on the Wikimedia Foundation's domain. Best, Heather. [1] https://meta.wikimedia.org/wiki/Research:Newsletter/2014/June [2] https://meta.wikimedia.org/wiki/Research:Newsletter/2014/June#.22Recommending_reference_materials_in_context_to_facilitate_editing_Wikipedia.22 Heather Ford Oxford Internet Institute http://www.oii.ox.ac.uk/ Doctoral Programme EthnographyMatters http://ethnographymatters.net/ | Oxford Digital Ethnography Group http://www.oii.ox.ac.uk/research/projects/?id=115 http://hblog.org | @hfordsa http://www.twitter.com/hfordsa -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] Is 'Random article' statistically robust over what population?
I don't know if anyone's looked into this, I'm afraid. I'd be interested to see what our replication lag on production is. I imagine it's pretty small, and so the impact would be negligible, but... On 27 June 2014 23:24, stuart yeates syea...@gmail.com wrote: I'm designing an experiment and want a random sample of wiki articles. The 'Random article' link seems like a convenient way of generating these without having to compile a list of the population of articles myself. My hunch (based on clicking it lots and very little else) is that 'Random article' is a uniform sampling of pages in the article namespace, excluding redirects but including disambiguation pages. As implemented on en.wiki (which is the wiki I'm starting on) it probably has a slight bias against very recently created pages (due to cross-server synchronization). Has anyone looked into this? cheers stuart -- Oliver Keyes Research Analyst Wikimedia Foundation
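One cheap way to probe the uniformity hunch empirically: collect a sample of page IDs (e.g. via the API's list=random generator), bin them, and compare against the uniform expectation with a chi-square statistic. A sketch on toy data (the function name is mine, and a real analysis would also need a proper significance test):

```python
from collections import Counter

def chi_square_uniform(samples, n_bins, id_range):
    """Chi-square statistic of sampled page IDs against a uniform
    distribution over [0, id_range). Large values suggest bias."""
    bin_width = id_range / n_bins
    counts = Counter(min(int(s // bin_width), n_bins - 1) for s in samples)
    expected = len(samples) / n_bins
    return sum((counts.get(b, 0) - expected) ** 2 / expected
               for b in range(n_bins))

# A perfectly even toy sample scores 0; a lopsided one scores high.
even = list(range(1000))       # one ID per slot
assert chi_square_uniform(even, 10, 1000) == 0.0
skewed = [5] * 1000            # everything lands in bin 0
assert chi_square_uniform(skewed, 10, 1000) > 1000
```

In practice you would bin on actual page IDs (which are sparse, so binning on ID rank rather than raw ID may be fairer) and compare the statistic against a chi-square distribution with n_bins - 1 degrees of freedom.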
Re: [Wiki-research-l] Social Media and Learning - survey - Please help!
Well, it seems to be a general social media survey, not something specific to Wikimedia, so ;p. On 9 June 2014 17:59, Piotr Konieczny pio...@post.pl wrote: Is it just me or would anyone else be more motivated if instead of a drawing for an ipad the researchers would promise to write a Good Article or teach a course in the Wikipedia Education Program framework or such? -- Piotr Konieczny, PhD http://hanyang.academia.edu/PiotrKonieczny http://scholar.google.com/citations?user=gdV8_AEJ http://en.wikipedia.org/wiki/User:Piotrus On 6/6/2014 06:53, Anatoliy Gruzd wrote: Dear Wiki-Research-List instructors, teachers, faculty ... If you use social media for one or more of your classes, we would like to invite you to participate in an online survey. The survey should take you no longer than 35 minutes to complete. This survey is being conducted as part of a study on Social Media and Learning, supported by the Social Sciences and Humanities Research Council (SSHRC) of Canada. As a way to thank you for your participation in the survey, after completion, you will be given the option to enter your name and email address to enroll you in a random drawing to win one of three *Apple iPad minis*! The random drawing will take place on October 1, 2014 and the winner will be notified on the same day via email. Any optional contact information provided cannot be connected to your survey responses. 
If you would like to participate, please go to http://tinyurl.com/SMlearningsurvey PIs: Anatoliy Gruzd, Dalhousie University and Caroline Haythornthwaite, University of British Columbia *This survey has passed ethical review by both the Dalhousie University and the University of British Columbia ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Kill the bots
Okay. Methodology:
* take the last 5 days of request logs;
* filter them down to text/html requests as a heuristic for non-API requests;
* run them through the UA parser we use;
* exclude spiders and things which reported valid browsers;
* aggregate the user agents left;
* ???
* profit.
It looks like there are a relatively small number of bots that browse/interact via the web - ones I can identify include WPCleaner[0], which is semi-automated, something I can't find through WP or Google called DigitalsmithsBot (could be internal, could be external), and Hoo Bot (run by User:Hoo man). My biggest concern is DotNetWikiBot, which is a general framework that could be masking multiple underlying bots and has ~7.4m requests through the web interface in that time period. Obvious caveat is obvious; the edits from these tools may actually come through the API, but they're choosing to request content through the web interface for some weird reason. I don't know enough about the software behind each bot to comment on that. I can try explicitly looking for web-based edit attempts, but there would be far fewer observations in which the bots might appear, because the underlying dataset is sampled at a 1:1000 rate. [0] https://en.wikipedia.org/wiki/User:NicoV/Wikipedia_Cleaner/Documentation On 20 May 2014 07:50, Oliver Keyes oke...@wikimedia.org wrote: Actually, belay that, I have a pretty good idea. I'll fire the log parser up now. On 20 May 2014 01:21, Oliver Keyes oke...@wikimedia.org wrote: I think a *lot* of them use the API, but I don't know off the top of my head if it's *all* of them. If only we knew somebody who has spent the last 3 months staring into the cthulian nightmare of our request logs and could look this up... More seriously; drop me a note off-list so that I can try to work out precisely what you need me to find out, and I'll write a quick-and-dirty parser of our sampled logs to drag the answer kicking and screaming into the light. (sorry, it's annual review season. 
That always gets me blithe.) On 19 May 2014 13:03, Scott Hale computermacgy...@gmail.com wrote: Thanks all for the comments on my paper, and even more thanks to everyone sharing these super helpful ideas on filtering bots: this is why I love the Wikipedia research committee. I think Oliver is definitely right that this would be a useful topic for some piece of method-comparing research, if anyone is looking for paper ideas. Citation goldmine as one friend called it, I think. This won't address edit logs to date, but do we know if most bots and automated tools use the API to make edits? If so, would it be feasibility to add a flag to each edit as to whether it came through the API or not. This won't stop determined users, but might be a nice way to identify cyborg edits from those made manually by the same user for many of the standard tools going forward. The closest thing I found in the bug tracker is [1], but it doesn't address the issue of 'what is a bot' which this thread has clearly shown is quite complex. An API-edit vs. non-API edit might be a way forward unless there are automated tools/bots that don't use the API. 1. https://bugzilla.wikimedia.org/show_bug.cgi?id=11181 Cheers, Scott ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation -- Oliver Keyes Research Analyst Wikimedia Foundation -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
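The filter pipeline Oliver describes can be approximated in a few lines. This toy version substitutes naive substring matching for the UA parser the Foundation actually uses, so the constants, sample data, and function name are all illustrative:

```python
from collections import Counter

KNOWN_BROWSERS = ("Firefox", "Chrome", "Safari", "MSIE")
KNOWN_SPIDERS = ("Googlebot", "bingbot", "YandexBot")

def aggregate_suspect_agents(log_lines):
    """Filter (content_type, user_agent) request-log records down to
    text/html requests whose agent is neither a known browser nor a
    known spider, then tally the agents left over."""
    agents = Counter()
    for content_type, ua in log_lines:
        if content_type != "text/html":
            continue  # heuristic: skip API/asset traffic
        if any(browser in ua for browser in KNOWN_BROWSERS):
            continue  # reported a valid browser
        if any(spider in ua for spider in KNOWN_SPIDERS):
            continue  # declared spider
        agents[ua] += 1
    return agents

logs = [
    ("text/html", "Mozilla/5.0 Firefox/31.0"),
    ("text/html", "DotNetWikiBot/3.0"),
    ("application/json", "DotNetWikiBot/3.0"),
    ("text/html", "Googlebot/2.1"),
    ("text/html", "WPCleaner/1.33"),
]
assert aggregate_suspect_agents(logs) == {"DotNetWikiBot/3.0": 1, "WPCleaner/1.33": 1}
```

As in the thread, the leftovers after browser and spider filtering are exactly the automated web-interface traffic worth eyeballing.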
Re: [Wiki-research-l] Kill the bots
I think a *lot* of them use the API, but I don't know off the top of my head if it's *all* of them. If only we knew somebody who has spent the last 3 months staring into the cthulian nightmare of our request logs and could look this up... More seriously; drop me a note off-list so that I can try to work out precisely what you need me to find out, and I'll write a quick-and-dirty parser of our sampled logs to drag the answer kicking and screaming into the light. (sorry, it's annual review season. That always gets me blithe.) On 19 May 2014 13:03, Scott Hale computermacgy...@gmail.com wrote: Thanks all for the comments on my paper, and even more thanks to everyone sharing these super helpful ideas on filtering bots: this is why I love the Wikipedia research committee. I think Oliver is definitely right that this would be a useful topic for some piece of method-comparing research, if anyone is looking for paper ideas. Citation goldmine, as one friend called it, I think. This won't address edit logs to date, but do we know if most bots and automated tools use the API to make edits? If so, would it be feasible to add a flag to each edit as to whether it came through the API or not? This won't stop determined users, but might be a nice way to identify cyborg edits from those made manually by the same user for many of the standard tools going forward. The closest thing I found in the bug tracker is [1], but it doesn't address the issue of 'what is a bot', which this thread has clearly shown is quite complex. An API-edit vs. non-API edit might be a way forward unless there are automated tools/bots that don't use the API. 1. 
https://bugzilla.wikimedia.org/show_bug.cgi?id=11181 Cheers, Scott ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
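The quick-and-dirty sampled-log parser Oliver offers to write never appears on-list, but a minimal sketch of the idea might look like the following. Everything here is an assumption for illustration: the real sampled request logs are internal to the WMF and formatted differently, and the tab-separated layout, field index, and function names are invented. The heuristic itself is grounded: API edits hit `api.php` with `action=edit`, while browser edits go through `index.php` with `action=submit`.

```python
# Illustrative sketch only: the log format below (tab-separated, URL in the
# third field) is hypothetical, not the actual WMF sampled-log format.

def is_api_edit(url):
    """Heuristically flag an edit made through the MediaWiki API.

    API edits hit api.php with action=edit; regular browser edits are
    submitted through index.php with action=submit.
    """
    return "api.php" in url and "action=edit" in url


def tally(lines, url_field=2):
    """Count API vs. non-API edit requests in an iterable of log lines."""
    counts = {"api": 0, "other": 0}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= url_field:
            continue  # malformed line; skip rather than crash
        url = fields[url_field]
        # Only edit-like requests count; plain pageviews are ignored.
        if "action=edit" in url or "action=submit" in url:
            counts["api" if is_api_edit(url) else "other"] += 1
    return counts
```

Run over a sample of log lines, `tally` would give a rough API-vs-manual edit split, which is all the "cyborg edit" question needs as a first pass.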
Re: [Wiki-research-l] Wikipedia traffic: selected language versions
Could you give an example of what we could do better than CLDR or the relevant ISO standards?

On 18 May 2014 10:06, h hant...@gmail.com wrote:
> Dear Nemo,
>
> As I am waiting for a more complete response, I am not sure that I
> understand what your last No, as in "No, we definitely can't", means. To
> clarify, take the CLDR supplement Language-Territory information for
> example:
> http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html
>
> One can suggest additions to the data by submitting sourced numbers for a
> geo-linguistic population like this:
> http://unicode.org/cldr/trac/newticket?description=%3Cterritory%2c%20speaker%20population%20in%20territory%2c%20and%20references%3Esummary=Add%20territory%20to%20Traditional%20Chinese%20(zh_Hant)
>
> In Wikipedia articles and Wikidata pages, there are many attempts to
> provide more up-to-date and better-sourced data points. I see the potential
> in exchanging such data and curating it better in Wikidata projects, as a
> more detailed and dynamic source than the CLDR. These data points would
> have extra benefits for curating traffic data. For one, these
> geo-linguistic population figures would be useful to normalize traffic data
> for further analysis, such as geographic normalization. For another, they
> provide important reference data for the development strategies and
> policies of the Wikipedia projects.
>
> Best,
> han-teng liao
>
> 2014-05-18 16:23 GMT+08:00 Federico Leva (Nemo) nemow...@gmail.com:
>
>> Thanks for your suggestions. Just some quick pointers below.
>>
>> h, 18/05/2014 08:26:
>>> (I-A). Tabulate the data points in absolute numbers first, not
>>> percentage numbers [...]
>>> (I-B). Include all language versions for the *editing traffic* report
>>> as well. [...]
>>> (I-C). Provide static data objects in a more accessible format (i.e.
>>> csv and/or json). [...]
>>> (II-A). Put viewing traffic and editing traffic reports on the same
>>> page. [...]
>>> (II-B). Organize and archive the traffic reports for historical
>>> comparison. [...]
>>> (I-C). Provide dynamic data objects in a more accessible format (i.e.
>>> csv and/or json).
>>
>> At least the first four are just changes in the WikiStats report
>> formatting; personally I encourage you to submit patches:
>> https://git.wikimedia.org/summary/analytics%2Fwikistats.git (it should be
>> the squids directory, but there is some ongoing refactoring of the
>> repos). On archives and history rewriting/report regeneration, see also
>> https://bugzilla.wikimedia.org/show_bug.cgi?id=46198
>>
>>> [...] (III-B). Smaller (i.e. more specific) geographic aggregate units.
>>> The country (geographic) information is often based on geo-IP
>>> databases, and sometimes provincial and city-level data would be
>>> available.
>>> http://lists.wikimedia.org/pipermail/wikitech-l/2014-April/075964.html
>>> [...] (I know that the Unicode Common Locale Data Repository (CLDR
>>> Version 25, http://cldr.unicode.org/index/downloads/cldr-25) provides
>>> "language-territory"
>>> http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html
>>> or "territory-language"
>>> http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html
>>> unit-based charts, but I believe that the Wikimedia projects can use
>>> and build a better one.) [...]
>>
>> No, we definitely can't, not alone. I've asked for help; please
>> contribute:
>> https://www.mediawiki.org/wiki/Universal_Language_Selector/FAQ#How_does_Universal_Language_Selector_determine_which_languages_I_may_understand
>>
>> Nemo

--
Oliver Keyes
Research Analyst
Wikimedia Foundation

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
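The geographic normalization han-teng liao describes (dividing a language edition's traffic by its speaker population, so editions of very different sizes become comparable) can be sketched in a few lines. The function name and the sample figures below are purely illustrative; real inputs would come from WikiStats traffic reports and CLDR/Wikidata population data.

```python
# Illustrative sketch: normalize per-language pageviews by speaker
# population. Both input dicts are hypothetical stand-ins for real
# WikiStats and CLDR/Wikidata data.

def views_per_speaker(pageviews, speakers):
    """Return pageviews per speaker for every language present in both
    dicts, skipping languages with no (or zero) population data."""
    return {
        lang: pageviews[lang] / speakers[lang]
        for lang in pageviews
        if lang in speakers and speakers[lang] > 0
    }
```

A made-up usage example: with `pageviews = {"en": 8_000_000, "ca": 120_000}` and `speakers = {"en": 1_500_000_000, "ca": 10_000_000}`, the Catalan edition comes out at 0.012 views per speaker, a figure directly comparable with English despite the raw-traffic gap.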
[Wiki-research-l] Fwd: [Wmfall] Next research data showcase: tomorrow at 11.30
Beginning in 10 minutes :) Public stream link: https://www.youtube.com/watch?v=bozyc1z25aQ

-- Forwarded message --
From: Dario Taraborelli dtarabore...@wikimedia.org
Date: 18 March 2014 20:42
Subject: [Wmfall] Next research data showcase: tomorrow at 11.30
To: wmf...@lists.wikimedia.org Staff wmf...@lists.wikimedia.org

The next Research Data showcase (https://www.mediawiki.org/wiki/Analytics/Research_and_Data/Showcase) will be live-streamed tomorrow at 11.30 PT. (The streaming link will be posted on the list a few minutes before the showcase starts. Those of you who are in the SF office can join us in Yongle.) This month's program is below; we look forward to seeing you.

Dario

*Metrics standardization* (Dario)
In this talk I'll present the most recent updates on our work on participation metrics and discuss the goals of the Editor Engagement Vital Signs project.

*Wikipedia's rise and decline* (Aaron)
In Halfaker et al. (2013) we present data showing that several changes the Wikipedia community made to manage quality and consistency in the face of a massive growth in participation have ironically crippled the very growth they were designed to manage. Specifically, the restrictiveness of the encyclopedia's primary quality control mechanism and the algorithmic tools used to reject contributions are implicated as key causes of decreased newcomer retention.

___
Wmfall mailing list
wmf...@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wmfall

--
Oliver Keyes
Product Analyst
Wikimedia Foundation
Re: [Wiki-research-l] Wikinews original reporting value as a measure of news events
Questions:
- What are those 22 variables?
- How many datapoints did you get, distributed between how many categories?
- How are you measuring correlation? Are we talking Pearson's?

On Sat, Sep 7, 2013 at 1:30 PM, Laura Hale la...@fanhistory.com wrote:
> https://meta.wikimedia.org/wiki/Research:Wikinews_original_reporting_value_as_a_measure_of_news_events
>
> This is the first in a series of research pieces I am doing as part of
> program design efforts for The Wikinewsie Group. Any feedback would be
> appreciated. :)
>
> Sincerely,
> Laura Hale
>
> --
> twitter: purplepopple
> blog: ozziesport.com

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
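For readers who don't know the Pearson's r that Oliver asks about: it is the product-moment correlation coefficient, ranging from -1 (perfect negative linear relationship) to +1 (perfect positive). A minimal self-contained implementation, with made-up data in the test usage only:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences.

    Assumes neither sequence is constant (a constant sequence has zero
    standard deviation, making the coefficient undefined).
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Covariance term (unnormalized) and the two standard-deviation terms.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

For example, `pearson_r([1, 2, 3], [2, 4, 6])` is 1.0, since the second series is an exact linear function of the first.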
Re: [Wiki-research-l] Wikinews original reporting value as a measure of news events
The only list I see there has 18.

On Sun, Sep 8, 2013 at 3:49 AM, Laura Hale la...@fanhistory.com wrote:
> https://en.wikipedia.org/wiki/News_values
>
> The 22 items in the list there.
>
> Sincerely,
> Laura Hale
>
> On Sunday, September 8, 2013, Oliver Keyes wrote:
>> Questions: What are those 22 variables? How many datapoints did you get,
>> distributed between how many categories? How are you measuring
>> correlation? Are we talking Pearson's?
>>
>> On Sat, Sep 7, 2013 at 1:30 PM, Laura Hale la...@fanhistory.com wrote:
>>> https://meta.wikimedia.org/wiki/Research:Wikinews_original_reporting_value_as_a_measure_of_news_events
>>> This is the first in a series of research pieces I am doing as part of
>>> program design efforts for The Wikinewsie Group. Any feedback would be
>>> appreciated. :)
>>> Sincerely, Laura Hale
>
> --
> mobile: 635209416
> twitter: purplepopple
> blog: ozziesport.com

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Why are users blocked on Wikipedia?
On Sun, May 5, 2013 at 7:39 AM, Laura Hale la...@fanhistory.com wrote:

> On Sat, May 4, 2013 at 2:43 PM, Federico Leva (Nemo) nemow...@gmail.com
> wrote:
>
>> Pine, 04/05/2013 08:36:
>>> Ironholds, would you be interested in investigating how stewards,
>>> global sysops, and global rollbackers might be helpful in dealing with
>>> the spam problem, especially for small wikis, and what new steps would
>>> be useful?
>>
>> I doubt they need suggestions, they need tools:
>> https://www.mediawiki.org/wiki/Admin_tools_development
>>
>> The question is rather how much they are already helping: botspam,
>> obvious crosswiki vandalism and NOP are mostly handled globally,[1] so
>> local logs can only help assess what is consuming the local communities'
>> time, not what the true menaces are. In the worst case, of course, you
>> may even be measuring the excuses to block rather than the most
>> important problems users were creating (similarly to Al Capone ;) ).
>
> The following is an analysis of the entire block log on English Wikinews.
> It is currently at
> https://en.wikinews.org/wiki/User:LauraHale/Blocks_on_English_Wikinews
>
> *Ironholds wrote a summary of problems on English Wikipedia viewed through
> block logs (http://blog.ironholds.org/?p=31) in late April. This is
> nominally based on that research, to the extent that it is inspired by it,
> in terms of understanding blocking on English Wikinews.*
>
> As referenced on the WMF research list, the issue of blocking is
> potentially a very big deal for smaller projects. Problems can easily
> overwhelm a small community if there is not an active community patrolling
> recent changes in addition to the content work it is engaged in. For
> English Wikinews, there were 22 active reporters in January 2013
> (http://stats.wikimedia.org/wikinews/EN/TablesWikipediansEditsGt5.htm).
> (This is tiny: with SUL, English Wikinews has 720,753 total registered
> users, of which 0.00069% were active in January.) At the same time, there
> were 64 blocks made that month. 39 of these blocks were for spam.
>
> English Wikinews is one of the fortunate smaller projects: we have two
> local Check Users (https://en.wikinews.org/wiki/Wikinews:CU) who respond
> quickly to problems. We generally have at least one admin awake and
> monitoring recent changes at any given time. We have global CUs who can,
> and sometimes do, come in and block the big problems. Thus, we can deal
> with the automated problem quite easily.
>
> Between English Wikinews's opening and 27 April 2013, there were 15,105
> un/blocks. The following is based on the complete block log. On the
> project, 4 types of block action exist on English Wikinews: block,
> unblock, log action removed, and changed block settings for. They all
> appear in the same block log, so the 15,105 figure is not total blocks but
> total block-related actions. (If you harass a user, get blocked for a week
> for it, get unblocked after promising to behave, then get reblocked and
> have your block extended, those are three distinct types of action. If you
> do that with an offensive user name, the log action may be hidden, which
> is a fourth type of action.)
>
> Since 2005, 99 different people have taken an administrative blocking
> action on English Wikinews. While there are currently only 36 admins on
> English Wikinews, the number used to be higher, and local policy is that
> if you do not use your admin privileges, you lose them. This is to prevent
> potential abuse and to make sure all admins are aware of current policy,
> in order to prevent wheel-warring and other things potentially damaging to
> the community.
>
> Amgine has blocked the most users on the project, with 7,162 block-related
> actions. Brian McNeil is second with 1,522; he has been less active in the
> past 18 months or so. Cirt is third with 723; Cirt is our most active
> local Check User. Pi zero is fourth with 503 block-related administrative
> actions. Tempodivalse, who is no longer involved with the project, rounds
> out the top five with 487. Amongst the next five administrators with
> blocks, only one is actively involved on the administrator side:
> Cspurrier, also a Check User, who is ranked seventh for total
> administrator block-related actions with 343.
>
> Who is getting blocked? There are 14,815 total entries tied to specific
> accounts where an administrative blocking action was taken on the account,
> of which 12,643 are unique accounts. This means that there are 1,100
> accounts with 2 or more block-related administrative actions taken
> regarding them. This is somewhat ambiguous; 2 or more block-related
> actions could mean either multiple blocks or a block and a single unblock.
> There were a number of people with large numbers of block-related
> administrator actions taken in relation to the account, including
> Neutralizer with 71,