Re: [Wiki-research-l] [Analytics] question about Pageviews dumps
Aye, as Joseph says, the time-on-page or time-leaving is not collected,
except as an extension of session reconstruction work. If you want a
concrete time, you're not gonna get it. While PC-based data is more
reliable than mobile, that does not necessarily mean "reliable".

I'm sort of confused, I guess, as to why the datasets I linked (unless
I'm misremembering them?) don't help: you would have to do the
calculation yourself, but they should contain all the data necessary to
make that calculation (unless you want to have the pageID or title
associated with the time-on-page, in which case... yeah, that's an
issue).

On Wed, Jun 29, 2016 at 3:16 AM, Marc Miquel <marcmiq...@gmail.com> wrote:
> Thanks for the answer, Oliver. But I am not sure it answers my
> questions. I'd like to study aspects like how much time is spent in
> certain pages, as a proxy of how content is approached/read/understood.
> I'd be happy with time of entering the page, time of leaving. This is
> not entirely centered on 'user activity', but I said that because I
> imagined data would be stored in a similar way to editor sessions, or
> in a database and I would need to do the time calculations.
>
> Cheers,
>
> Marc
>
> On Wed, 29 June 2016 at 03:11, Oliver Keyes <ironho...@gmail.com> wrote:
>
>> If historic data is okay, there's already a dataset released
>> (https://figshare.com/articles/Activity_Sessions_datasets/1291033)
>> that was designed specifically to answer questions around how to best
>> calculate session length with regards to Wikipedia
>> (http://arxiv.org/abs/1411.2878)
>>
>> On Tue, Jun 28, 2016 at 3:42 PM, Marc Miquel <marcmiq...@gmail.com> wrote:
>>
>>> Hello!
>>>
>>> I was thinking about user sessions, yes, so this would mean to
>>> aggregate pageviews visited by a user during a short amount of time
>>> (I should check the cutoff, but it could be around an hour or less).
>>>
>>> I am particularly interested in understanding the order in which
>>> pages are seen (start, end), duration, etc.
>>> I wouldn't need data from a long period either, but I think data
>>> from multiple languages would be helpful.
>>>
>>> I imagined reader data could be sensitive to privacy, but would an
>>> NDA with my university and some sort of data encoding help with
>>> this? As I said, it is for a scientific purpose.
>>>
>>> Thanks,
>>>
>>> Marc
>>>
>>> On Tue, 28 June 2016 at 21:09, Nuria Ruiz (<nu...@wikimedia.org>) wrote:
>>>
>>>> Hello!
>>>>
>>>> > I am considering studying reader engagement for different article
>>>> > topics in different languages. Because of this, I would like to
>>>> > know if there is any plan to make available pageviews dumps
>>>> > detailing activity log at session level per user - in a similar
>>>> > way to editor sessions.
>>>>
>>>> Are you thinking of "all-pageviews-visited-by-a-certain-user"? If
>>>> so, no, we do not have any projects to provide that data as, due to
>>>> privacy concerns, we neither have nor keep that information.
>>>>
>>>> Thanks,
>>>>
>>>> Nuria
>>>>
>>>> On Tue, Jun 28, 2016 at 6:55 PM, Leila Zia <le...@wikimedia.org> wrote:
>>>>
>>>>> + Analytics
>>>>>
>>>>> On Tue, Jun 28, 2016 at 6:36 AM, Marc Miquel <marcmiq...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a question for you regarding pageviews datadumps.
>>>>>>
>>>>>> I am considering studying reader engagement for different article
>>>>>> topics in different languages. Because of this, I would like to
>>>>>> know if there is any plan to make available pageviews dumps
>>>>>> detailing activity log at session level per user - in a similar
>>>>>> way to editor sessions.
>>>>>>
>>>>>> Since this would be for a research project I might ask funding
>>>>>> for it, I would like to know if I could count on that, what is
>>>>>> the nature of the available data, and what would be the procedure
>>>>>> to obtain this data, and if there would be any implication
>>>>>> because of privacy concerns.
>>>>>>
>>>>>> Thank you very much!
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Marc Miquel
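The calculation Oliver describes, deriving time-on-page from the timestamps in a session dataset yourself, can be sketched roughly as follows. This is a minimal illustration only: the `(timestamp, page)` event format, the one-hour inactivity cutoff Marc mentions, and all function names here are assumptions, not the schema of the figshare dataset linked in the thread.

```python
from datetime import datetime, timedelta

# Inactivity gap that closes a session; Marc suggests "an hour or less".
CUTOFF = timedelta(hours=1)

def sessionize(events):
    """Split (timestamp, page) pairs, sorted by time, into sessions
    wherever the gap between consecutive views reaches CUTOFF."""
    sessions, current = [], []
    for ts, page in events:
        if current and ts - current[-1][0] >= CUTOFF:
            sessions.append(current)
            current = []
        current.append((ts, page))
    if current:
        sessions.append(current)
    return sessions

def time_on_page(session):
    """Dwell time per view: the gap to the next view in the same session.
    The last page of a session has no measurable dwell time, which is
    exactly the limitation discussed above."""
    return [(page, (t1 - t0).total_seconds())
            for (t0, page), (t1, _) in zip(session, session[1:])]

# Toy example: two views ten minutes apart, then one two hours later.
events = [(datetime(2016, 6, 28, 12, 0), "Helix"),
          (datetime(2016, 6, 28, 12, 10), "DNA"),
          (datetime(2016, 6, 28, 14, 30), "RNA")]
sessions = sessionize(events)
dwell = time_on_page(sessions[0])
```

Note that this only yields time-on-page for pages that were followed by another view in the same session; final pages, and anything involving back-button or multi-tab behaviour, remain invisible to request logs.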
Re: [Wiki-research-l] Training for administrators
Hey Pine, That's good to hear; I am glad to hear people will be doing outreach on this. I'd argue that this is one of those areas where experts are needed in structuring the nature of the proposal, not just the materials used, and so in the future it would be nice if they were involved in the design of the grant request itself. On Fri, Jun 10, 2016 at 3:06 PM, Pine W <wiki.p...@gmail.com> wrote: > Hi Oliver, > > In terms of concrete plans to involve at least one expert, I've made a > couple of initial queries to see if there is a professor at the University > of Washington's Department of Psychology who would be qualified and > interested to work on this proposal. Outreach and screening of potential > consultants will take on greater importance if this idea gets traction on > the Wikimedia side. At this point I think it's unlikely that I will be the > project lead, so whoever does become the project lead on the Wikimedia side > will likely need to do further work on outreach and screening to select the > final expert(s). > > Pine > > On Tue, Jun 7, 2016 at 8:58 AM, Oliver Keyes <ironho...@gmail.com> wrote: >> >> Well, my feedback would be that for this to be useful I would expect >> researchers *from* that subject-matter background to be involved. I >> don't see this (nor a concrete plan for their involvement). >> >> On Mon, Jun 6, 2016 at 11:02 PM, Pine W <wiki.p...@gmail.com> wrote: >> > Hi folks, >> > >> > Related to discussions that some of us have had previously about (1) >> > developing training for Wikipedia administrators, (2) increasing the >> > community's capacity to address incivility and harassment, and (3) >> > improving >> > community health, I've proposed >> > >> > https://meta.wikimedia.org/wiki/Grants:IdeaLab/Training_for_administrators >> > as a part of the current Inspire campaign. >> > >> > This project could become an extension of my current video project, or >> > it >> > might be handled completely independently by a different project leader. 
>> > >> > Regardless of who eventually leads the project, I would appreciate your >> > comments about the proposal, whether positive, negative, or indifferent. >> > Please discuss on the IdeaLab pages. (: >> > >> > Thanks! >> > >> > Pine >> > >> > ___ >> > Wiki-research-l mailing list >> > Wiki-research-l@lists.wikimedia.org >> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l >> > >> >> ___ >> Wiki-research-l mailing list >> Wiki-research-l@lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > > > > ___ > Wiki-research-l mailing list > Wiki-research-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] [Wikitech-l] Wired article about machine learning
+100. Last I checked the Dartmouth Conference's premise still hadn't
been satisfied, so calling anything ML can do "AI" is just clickbait
froth. But I'm agreed that the non-AI "AI" stuff is both the power and
the danger here, and this kind of overselling is... risky.

As an example - this weekend ProPublica published the results of a
study on automated model generation used for determining prisoner
reoffence risk. To the surprise of nobody, they found the models trend
towards automated racism little better than a coinflip.[0] It's never
going to write your code for you, or have a conversation about the
weather, or Codsworth it up,[1] but it's here nonetheless, lurking in
the background, determining the course of human lives, with more ink
spent 'explaining' how Robots Are Going To Eliminate Programming than
Robots Are Going To Automate Bigotry.

Agreed on "scary" :|

[0] https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
[1] https://www.youtube.com/watch?v=3kacrYB8Li0

On Mon, May 23, 2016 at 10:33 AM, Aaron Halfaker wrote:
> Just a quick thought that I shared in IRC earlier.
>
>> AI isn't magical. It's pretty cool, but you're not going to have a
>> conversation with ORES.
>
> It's not false that we are closer to strong "conversational" AI than
> ever before. Still, in practical terms, we're pretty far away from not
> needing to program anymore. I find that articles like this are more
> fantastical than informative. I guess it is interesting to think about
> where we'll be when we can have an abstract conversation with a
> computer system rather than the rigid specifics of programming, but
> I'm with Brian -- this seems to be a cycle. Though, I'd say the media
> does boom and bust, but the research carries on relatively
> consistently, since AI researchers are usually less interested in the
> hype.
>
> In the ORES project, we're using the most simplistic "AIs" available
> -- classifiers. Still, these dumb AIs can help us to do amazing things
> (e.g. review all of RecentChanges 50x faster, or augment article
> histories with information about the *type of change* made). IMO, it's
> these amazing and powerful things that dumb, non-conversational AIs
> can do that is very powerful and a little scary. We're hardly taking
> advantage of that at all. I think that's where the next big revolution
> with AI is taking place right now. It's going to change a lot of
> things and infect many aspects of our life (and in many ways it
> already has).
>
> -Aaron
>
> On Fri, May 20, 2016 at 2:43 PM, Purodha Blissenbach wrote:
>>
>> I see only an ad to support Wired.
>> Purodha
>>
>> On 20.05.2016 20:11, Pine W wrote:
>>>
>>> Seems like a good summary: http://www.wired.com/2016/05/the-end-of-code/
>>>
>>> Comments welcome, especially from Wikimedia AI experts who are
>>> working on ORES.
>>>
>>> Pine

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
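The "dumb classifier" pattern Aaron describes, scoring edits so patrollers can triage RecentChanges by risk rather than reviewing everything, can be illustrated with a toy model. To be clear about assumptions: the features, labels, and numbers below are fabricated for illustration, and scikit-learn stands in for the real stack; production ORES models are built with the revscoring library on far richer feature sets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy features per edit:
# [chars added, chars removed, is anonymous, badword count]
X = np.array([
    [120,   5, 0, 0],   # ordinary good-faith addition
    [  0, 900, 1, 0],   # large anonymous blanking
    [ 15,   2, 1, 3],   # short edit full of profanity
    [300,  10, 0, 0],
    [  2, 450, 1, 1],
    [ 80,  20, 0, 0],
])
y = np.array([0, 1, 1, 0, 1, 0])  # 1 = hand-labeled as damaging

model = LogisticRegression().fit(X, y)

# A probability (rather than a hard yes/no) is what enables the triage
# speedup: sort incoming edits by score and review the riskiest first.
incoming = np.array([[5, 600, 1, 2]])
risk = model.predict_proba(incoming)[0, 1]
```

The design point is the probabilistic output: a patroller who only reviews edits above some risk threshold trades a small recall loss for a large reduction in review volume, which is the "50x faster" effect described above.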
Re: [Wiki-research-l] Wikipedia and SMEs; another article about the Stanford Encyclopedia of Philosophy
It did, yes, but that wasn't its primary focus - AFT is an example of
expert engagement in the same way it's an example of PHP: sure, it uses
it, but that's not necessarily what comes to mind when you think of it.
(I appreciate I've left myself open to quite a lot of comments about
precisely what does come to mind for people when they think of AFT.
Mostly obscenities, I suspect.)

I quite like the GLAM+STEM idea - is it being discussed on a list
somewhere? (Absent here, which may not be the right location.)

On Mon, May 23, 2016 at 2:30 PM, Pine W wrote:
> AFT did try to engage readers, but if I recall correctly it had a
> checkbox saying something like "I am an expert on this subject and I
> want to provide feedback." This is reaching far back in my hazy
> memory, but I think that similar features were present in both AFT3
> and AFT5.
>
> That's an interesting idea about getting GLAM to focus on review in
> addition to content creation. FloNight and I have also been talking
> about expanding the GLAM concept to what I'm calling GLAM+STEM,
> meaning that we're interested in engaging STEM institutions as well as
> GLAM institutions in content creation (and potentially content quality
> review.)
>
> Pine
>
> On Mon, May 23, 2016 at 11:17 AM, WereSpielChequers wrote:
>>
>> I thought AFT was an attempt to engage readers, not Subject Matter
>> Experts.
>>
>> In my experience two of our most effective ways to outreach to those
>> experts who are not already in the community are the GLAM program and
>> potentially the education program.
>>
>> This was one of the areas that Johnbod explored in his time as
>> Wikimedian in Residence at Cancer Research UK. You might want to talk
>> to him as to how that went and the extent to which it could be
>> replicated. The focus of a lot of residents has been more on getting
>> openly licensed digital material, but I don't see why we couldn't
>> have more residencies focussed on expert review, providing of course
>> that the articles in that area are already at a stage worthy of
>> review.
>>
>> On 23 May 2016 at 18:34, Pine W wrote:
>>>
>>> Another article on the Stanford Encyclopedia of Philosophy. [1] I
>>> wonder, could any of the practices described here be implemented on
>>> Wikipedia in a way that would be helpful? WMF tried to engage SMEs
>>> through the now mothballed AFT, and I believe that there is an
>>> ongoing effort to get SME comments with the assistance of a bot
>>> facilitating communications from SMEs to article talk pages (Aaron,
>>> do you remember the name of that project, and if so could we get an
>>> update about it?)
>>>
>>> Thanks,
>>> Pine
>>>
>>> [1] http://qz.com/480741/this-free-online-encyclopedia-has-achieved-what-wikipedia-can-only-dream-of/

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] design patterns for peer learning and peer production, with wikimedia case study - preprint
On 28 December 2015 at 10:03, Joe Corneli <holtzerman...@gmail.com> wrote: > On Mon, Dec 28 2015, Oliver Keyes wrote: > >> My big question is how these pedagogic maps factor in the negatives of >> peer production communities - harassment, toxicity - and route around >> or solve for them. > > Hi Oliver, > > Thanks for the speedy and thought-provoking reply! > > The question above is a good one. We did have a basic collection of > "antipatterns", but didn't develop them in this paper, because thinking > about antipatterns adds some complexity and we wanted to get the > "positive" vision more firmly in mind first. With that accomplished, > I'd love to write a sequel sometime about "Antipatterns of Peeragogy"! > Cool! This makes sense and is one of the concerns I've heard about including antipatterns and patterns together; that it leads to claims of a work "lacking focus". I would argue (just for myself, and editorial boards probably feel very very differently) that not including antipatterns makes a design pattern or template of limited applicability and so said editorial boards should be approving of it - but that's, again, just for me ;p. > Still, the current catalog should definitely help surface and do > something about concerns. The strategy would be something like: start > with the Scrapbook pattern and existing critiques, develop a short list > of criticisms into A specific project, and build a Roadmap that involves > others in addressing the issue that was identified. 
>
> A recent thread kicked off by Pine seems to be an example along those
> lines:
> https://lists.wikimedia.org/pipermail/wiki-research-l/2015-December/004927.html
>
>> I do wonder about the generalisability of some of the examples; in
>> particular while Wikiprojects are _ideally_ a good starting point for
>> a lot of newcomers I don't have the data to hand about whether, in
>> practice, it is the starting point for a large proportion of users,
>> and I don't see citations to that effect in your paper (although I do
>> see the claim). It would be good if someone more informed about this
>> particular question than I could chip in with what they've
>> measured/observed in detail (I know some people have been studying
>> Wikiprojects specifically, particularly James Hare)
>
> I've been impressed with some of my own earlier common-sensical
> guesswork that turned out not to hold water, and accordingly have
> tried to be careful to cite or footnote the Wikimedia evidence, but
> indeed that is one of the intuitive claims that is ^[citation needed].
> Even though there are "many" users involved with Wikiprojects, the
> population might be oldtimers rather than new users. I'll look around
> a bit more, and/or adjust the claim to focus on the current population
> of Wikiproject contributors rather than on the hypothesis that the
> projects are used for wiki onramping.

Yeah; from my own subjective experiences it's more oldtimers than
newtimers, but this may also be common-sensical-but-not-holding-water!

> Joe

--
Oliver Keyes
Count Logula
Wikimedia Foundation

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] design patterns for peer learning and peer production, with wikimedia case study - preprint
Hey Joe,

My big question is how these pedagogic maps factor in the negatives of
peer production communities - harassment, toxicity - and route around
or solve for them. The inclusion of carrying capacity, and explicit
recognition of the costs of labour overall, is great to see. But I
would love to see roadmaps that factor in the "dark side" here, and the
specific emotional labour costs of dealing with that dark side.

Without factoring those things in, the practical utility of the
roadmaps - outside of publishing - is likely to be somewhat constrained
and difficult to scale. And in a year where we have learned more and
more about the costs around a lot of collaborative and communicative
environments, from Wikipedia to Twitter, including these things (or
recognising them) is really not optional. I don't see it discussed in
your work (I admit that I may have just missed it, and please let me
know if so!)

The patterns themselves are excellent, however, and I really like the
structure of the work. I do wonder about the generalisability of some
of the examples; in particular, while Wikiprojects are _ideally_ a good
starting point for a lot of newcomers, I don't have the data to hand
about whether, in practice, they are the starting point for a large
proportion of users, and I don't see citations to that effect in your
paper (although I do see the claim). It would be good if someone more
informed about this particular question than I could chip in with what
they've measured/observed in detail (I know some people have been
studying Wikiprojects specifically, particularly James Hare)

On 28 December 2015 at 09:17, Joe Corneli <holtzerman...@gmail.com> wrote:
>
> http://metameso.org/~joe/docs/peeragogy_pattern_catalog_proceedings.pdf
>
> is a preprint of the paper "Patterns of Peeragogy" to appear in
> Proceedings of Pattern Languages of Programs 2015.
> > Abstract: We describe nine design patterns that we have developed in our > work on the Peeragogy project, in which we aim to help design the future > of learning, inside and outside of institutions. We use these patterns > to build an “emergent roadmap” for the project. > > This paper may be of interest to people here, particularly since we > trace through the ways in which the patterns manifest in Wikimedia > projects. > > The final revision is due January 15th so comments before then still > have a chance to improve the final document. > > When it appears, the bibtex citation will be: > > @inproceedings{patterns-of-peeragogy, > title={Patterns of {P}eeragogy}, > author={Corneli, Joseph and Danoff, Charles Jeffrey and Pierce, Charlotte and > Ricuarte, Paola and Snow MacDonald, Lisa}, > booktitle={Pattern {L}anguages of {P}rograms {C}onference 2015 ({PLoP'15}), > {P}ittsburgh, {PA}, {USA}, {O}ctober 24-26, 2015}, > editor={Correia, Filipe}, > year={2015}, > publisher={ACM}} > > > ___ > Wiki-research-l mailing list > Wiki-research-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Count Logula Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] What Wikimedia Research is up to in the next quarter
Awesome; thanks! On 21 December 2015 at 12:12, Leila Zia <le...@wikimedia.org> wrote: > Hi Oliver, > > On Sat, Dec 19, 2015 at 12:01 AM, Oliver Keyes <oke...@wikimedia.org> wrote: >> >> So what's happening with the link recommendation system? Is that >> rolled into article-creation recommendations, or was the paper the >> final product? > > > The short answer is: the paper is ideally not the final product (although > I'm thrilled with the reviews we received for it) as we like to see this > system used. > > From the research perspective, we want to have a tool where we can collect > data to learn more about and improve the link recommendation system. We've > had extensive conversations with Pau about this. The model we're working > towards is a landing page where the user can get different types of > recommendations: link recommendations* and article-creation recommendations, > and maybe more forms of recommendations in the future. In the past three > months, Ashwin has worked closely with Pau, Nirzar, and Ed to bring us > closer to having such a tool. The tool is not ready to be used yet (really! > you'll see it as soon as you click Add Links), but if you're curious to see > where we are with it, please check > http://tools.wmflabs.org/navlink-recommendation/ > > From the product perspective, the Editing team has set their goal to test > the tool (via the same tool on wmflabs) and if successful, figuring out the > next steps for it. Please reach out to the team directly if you like to know > more. > > Best, > Leila > > * Initially, the tool was going to have link recommendations for two cases: > where the anchor text existed in the article text, and where it didn't. We > learned that when the anchor text does not exist, the task becomes a much > harder one from the user's perspective, since the decision is no longer > whether the link should be added or not but where it should be added and in > what context. 
The tool that you see now will only have link recommendations > where the anchor text exists. > >> >> >> On 19 December 2015 at 01:53, Dario Taraborelli >> <dtarabore...@wikimedia.org> wrote: >> > >> > On Dec 18, 2015, at 10:13 PM, Gerard Meijssen >> > <gerard.meijs...@gmail.com> >> > wrote: >> > >> > Hoi, >> > Where does it say what languages are covered >> > >> > >> > >> > https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service#Support_table >> > >> > and, what languages are planned for support? >> > >> > >> > >> > https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service#Progress_report:_2015-11-28 >> > >> > although what gets in production will depend on many factors, such as >> > community support to generate labeled data, performance of the model >> > etc. >> > >> > Dario >> > >> > Thanks, >> > GerardM >> > >> > On 19 December 2015 at 05:16, Dario Taraborelli >> > <dtarabore...@wikimedia.org> >> > wrote: >> >> >> >> Hey all, >> >> >> >> I’m glad to announce that the Wikimedia Research team’s goals for the >> >> next >> >> quarter (January - March 2016) are up on wiki. >> >> >> >> The Research and Data team will continue to work with our volunteers >> >> and >> >> collaborators on revision scoring as a service adding support for 5 new >> >> languages and prototyping new models (including an edit type >> >> classifier). We >> >> will also continue to iterate on the design of article creation >> >> recommendations, running a dedicated campaign in coordination with >> >> existing >> >> editathons to improve the quality of these recommendations. Finally, we >> >> will >> >> extend a research project we started in November aimed at understanding >> >> the >> >> behavior of Wikipedia readers, by combining qualitative survey data >> >> with >> >> behavioral analysis from our HTTP request logs. >> >> >> >> The Design Research team will conduct an in-depth study of user needs >> >> (particularly readers) on the ground in February. 
>> >> We will continue to work with other Wikimedia Engineering teams
>> >> throughout the quarter to ensure the adoption of human-centered
>> >> design principles and pragmatic personas in our product
>> >> development cycle.
Re: [Wiki-research-l] What Wikimedia Research is up to in the next quarter
So what's happening with the link recommendation system? Is that rolled into article-creation recommendations, or was the paper the final product? On 19 December 2015 at 01:53, Dario Taraborelli <dtarabore...@wikimedia.org> wrote: > > On Dec 18, 2015, at 10:13 PM, Gerard Meijssen <gerard.meijs...@gmail.com> > wrote: > > Hoi, > Where does it say what languages are covered > > > https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service#Support_table > > and, what languages are planned for support? > > > https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service#Progress_report:_2015-11-28 > > although what gets in production will depend on many factors, such as > community support to generate labeled data, performance of the model etc. > > Dario > > Thanks, > GerardM > > On 19 December 2015 at 05:16, Dario Taraborelli <dtarabore...@wikimedia.org> > wrote: >> >> Hey all, >> >> I’m glad to announce that the Wikimedia Research team’s goals for the next >> quarter (January - March 2016) are up on wiki. >> >> The Research and Data team will continue to work with our volunteers and >> collaborators on revision scoring as a service adding support for 5 new >> languages and prototyping new models (including an edit type classifier). We >> will also continue to iterate on the design of article creation >> recommendations, running a dedicated campaign in coordination with existing >> editathons to improve the quality of these recommendations. Finally, we will >> extend a research project we started in November aimed at understanding the >> behavior of Wikipedia readers, by combining qualitative survey data with >> behavioral analysis from our HTTP request logs. >> >> The Design Research team will conduct an in-depth study of user needs >> (particularly readers) on the ground in February. 
We will continue to work >> with other Wikimedia Engineering teams throughout the quarter to ensure the >> adoption of human-centered design principles and pragmatic personas in our >> product development cycle. We’re also excited to start a collaboration with >> students at the University of Washington to understand what free online >> information resources (including, but not limited to, Wikimedia projects) >> students use. >> >> I am also glad to report that two papers on link and article >> recommendations (the result of a formal collaboration with a team at >> Stanford) were accepted for presentation at WSDM '16 and WWW ’16 (preprints >> will be made available shortly). An overview on revision scoring as a >> service was published a few weeks ago on the Wikimedia blog, and got some >> good media coverage. >> >> We're constantly looking for contributors and as usual we welcome feedback >> on these projects via the corresponding talk pages on Meta. You can contact >> us for any question on IRC via the #wikimedia-research channel and follow >> @WikiResearch on Twitter for the latest Wikipedia and Wikimedia research >> updates hot off the press. >> >> Wishing you all happy holidays, >> >> Dario and Abbey on behalf of the team >> >> >> Dario Taraborelli Head of Research, Wikimedia Foundation >> wikimediafoundation.org • nitens.org • @readermeter >> >> >> ___ >> Wiki-research-l mailing list >> Wiki-research-l@lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l >> > > ___ > Wiki-research-l mailing list > Wiki-research-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > > > > ___ > Wiki-research-l mailing list > Wiki-research-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > -- Oliver Keyes Count Logula Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
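Leila's footnote above distinguishes link recommendations where the anchor text already exists in the article from those where it doesn't, noting that the existing-anchor case is the easier one: the user only decides yes/no per occurrence, rather than where and in what context to add text. That easier case can be sketched roughly as below; the function name, the `(anchor, target)` pair format, and the simple bracket-based "already linked" check are all illustrative assumptions, not the research team's actual implementation.

```python
import re

def find_link_candidates(text, recommendations):
    """Return (anchor, target, offset) for each occurrence of a
    recommended anchor that is not already wrapped in [[...]] wikilink
    brackets. Each hit is a yes/no decision for the editor."""
    hits = []
    for anchor, target in recommendations:
        # Negative lookbehind/lookahead skip text already inside a link.
        pattern = r"(?<!\[)\b" + re.escape(anchor) + r"\b(?!\])"
        for m in re.finditer(pattern, text):
            hits.append((anchor, target, m.start()))
    return hits

# Toy example: "France" is recommended, but one mention is already linked.
text = "Paris is the capital of France. [[France]] borders Spain."
recs = [("Paris", "Paris"), ("France", "France")]
candidates = find_link_candidates(text, recs)
```

A real system would of course work on parsed wikitext rather than regexes, and would rank candidates by model confidence; the sketch only shows why the existing-anchor case reduces to a per-occurrence accept/reject decision.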
Re: [Wiki-research-l] Community policing, New Page Patrol, Articles for Creation, and editor retention
We can probably talk about the nature of new page patrol without resorting to comparisons to violent, real-world overreactions with multiple serious injuries. To be perfectly honest as a new page patroller the biggest issue I've seen is toxic senior members of the community making the prospect of patrolling particularly unpleasant. It doesn't do much for patroller numbers. On 15 December 2015 at 18:28, Pine W <wiki.p...@gmail.com> wrote: > Yesterday I gave a presentation about community policing at the Cascadia > Wikimedians' end of year event with Seattle TA3M [1][2][3]. An issue that > came up for discussion is the extent to which, on English Wikipedia, > experienced Wikipedians conducting New Page Patrol create collateral damage > during their well-intentioned efforts to protect Wikipedia. Another subject > that came up is the need for more human resources for mentoring of newbies > who create articles using the Articles for Creation system [4]; one comment > I've heard previously is that the length of time between submission and > review may be long enough for the newbie to give up and disappear, and > another comment that I've heard is that newbies may not understand the > instructions that they're given when their article is reviewed. These > comments correlate with the community SWOT analysis that was done at > WikiConference USA this year, in which "biting the newbies", NPP, and > "onboarding/training" were identified as weaknesses [5] > > Personally, I would like the interaction of experienced editors with the > newbies in places like NPP and AFC to look more like this and less like > this. Granted, it's hard for a relatively small number of experienced > Wikipedians to keep all the junk and vandals out while also mentoring the > newbies and avoiding collateral damage, so one strategy could be to increase > the quantity of skilled human resources that are devoted to these domains. > Any thoughts on how to make that happen? 
> > I am currently especially interested in this topic because of my IEG project > which officially starts this week. [6] It would be very helpful to retain > the new editors that are trained through these videos, so improving editor > retention via improved newbie experiences at NPP and/or AFC would be most > welcome. > > Pine > > [1] https://en.wikipedia.org/wiki/Community_policing > [2] https://en.wikipedia.org/wiki/Police_reform_in_the_United_States > [3] > https://commons.wikimedia.org/wiki/File:Presentations_at_Cascadia_Wikimedians_and_Seattle_TA3M_meetup,_December_2015.jpg > [4] https://en.wikipedia.org/wiki/Wikipedia:Articles_for_creation > [5] > https://commons.wikimedia.org/wiki/File:SWOT_analysis_of_Wikipedia_in_2015.jpg > [6] > https://meta.wikimedia.org/wiki/Grants:IEG/Motivational_and_educational_video_to_introduce_Wikimedia > > > _______ > Wiki-research-l mailing list > Wiki-research-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > -- Oliver Keyes Count Logula Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Community policing, New Page Patrol, Articles for Creation, and editor retention
Well, we don't really have a judicial approach either; judges get booted when they're biased or refusing to apply the law ;). I would agree that it is a small circle of people, and I would agree that they have a far larger impact than numbers would suggest. Community Advocacy is currently running a harassment consultation at https://meta.wikimedia.org/wiki/Harassment_consultation_2015 - I suggest looking at the proposals there. On 15 December 2015 at 19:00, Pine W <wiki.p...@gmail.com> wrote: > Maybe it's just the circles that I happen to circulate in, but it seems to > me that a very small percentage of Wikipedians tend to be consistently harsh > or toxic, and that small number of people tends to have disproportionately > negative influence on the atmosphere in the community. Aligned with Jimbo's > comments at Wikimania 2014 in London, I do wonder if their caustic nature > rises to the level where they should be excluded from the community, and if > so, on what grounds we would make that exclusion. Being a relentless critic > doesn't necessarily rise to the level of harassment if it's done broadly > rather than directed at a particular individual or group, but looking at the > problem from an HR perspective rather than a judicial one, I agree that > maybe more should be done to exclude toxic personalities. I wonder, though, > how we can do that; our process for excluding people from the community is > more like a judicial process than like an HR process. Maybe we need more of > an HR approach? > > Pine > > On Tue, Dec 15, 2015 at 3:51 PM, Oliver Keyes <oke...@wikimedia.org> wrote: >> >> We can probably talk about the nature of new page patrol without >> resorting to comparisons to violent, real-world overreactions with >> multiple serious injuries. >> >> To be perfectly honest as a new page patroller the biggest issue I've >> seen is toxic senior members of the community making the prospect of >> patrolling particularly unpleasant. 
It doesn't do much for patroller >> numbers. >> >> On 15 December 2015 at 18:28, Pine W <wiki.p...@gmail.com> wrote: >> > Yesterday I gave a presentation about community policing at the Cascadia >> > Wikimedians' end of year event with Seattle TA3M [1][2][3]. An issue >> > that >> > came up for discussion is the extent to which, on English Wikipedia, >> > experienced Wikipedians conducting New Page Patrol create collateral >> > damage >> > during their well-intentioned efforts to protect Wikipedia. Another >> > subject >> > that came up is the need for more human resources for mentoring of >> > newbies >> > who create articles using the Articles for Creation system [4]; one >> > comment >> > I've heard previously is that the length of time between submission and >> > review may be long enough for the newbie to give up and disappear, and >> > another comment that I've heard is that newbies may not understand the >> > instructions that they're given when their article is reviewed. These >> > comments correlate with the community SWOT analysis that was done at >> > WikiConference USA this year, in which "biting the newbies", NPP, and >> > "onboarding/training" were identified as weaknesses [5] >> > >> > Personally, I would like the interaction of experienced editors with the >> > newbies in places like NPP and AFC to look more like this and less like >> > this. Granted, it's hard for a relatively small number of experienced >> > Wikipedians to keep all the junk and vandals out while also mentoring >> > the >> > newbies and avoiding collateral damage, so one strategy could be to >> > increase >> > the quantity of skilled human resources that are devoted to these >> > domains. >> > Any thoughts on how to make that happen? >> > >> > I am currently especially interested in this topic because of my IEG >> > project >> > which officially starts this week. 
[6] It would be very helpful to >> > retain >> > the new editors that are trained through these videos, so improving >> > editor >> > retention via improved newbie experiences at NPP and/or AFC would be >> > most >> > welcome. >> > >> > Pine >> > >> > [1] https://en.wikipedia.org/wiki/Community_policing >> > [2] https://en.wikipedia.org/wiki/Police_reform_in_the_United_States >> > [3] >> > >> > https://commons.wikimedia.org/wiki/File:Presentations_at_Cascadia_Wikimedians_and_Seattle_TA3M_meetup,_December_2015.jpg >> > [4] https://en.wikipedia.o
Re: [Wiki-research-l] Community policing, New Page Patrol, Articles for Creation, and editor retention
Well, what is and isn't a reliable source is discussed at various noticeboards and set into stone, so it's more like saying "you published this in a journal on Beall's list" On 15 December 2015 at 20:35, Kerry Raymond <kerry.raym...@gmail.com> wrote: > I agree with Pine. It’s often patterns of behaviour that are more > significant than some individual incident. The drip-drip-drip of constant > criticism from a colleague can wear out most people. And if it’s done with > AWB or other tool, it’s very easy to grind down other people down, > especially as most people don’t know what ways they have to complain about > such behaviour and, in any case, most complaints have to lodged on-wiki > (which presumably discourages most people from doing it). Why do we allow > the bullies to write the rules of this playground? > > > > For example, there is a user account that removes the word “comprises”, a > word their user page says they don’t like for various reasons (but none of > which appear to relate to Wikipedia policy) . Why is this one user through > their persistence allowed to decide what words are used in Wikipedia > articles? Another bully (and I can see no other way to describe their > behaviour) has a long edit history full of reversions with the edit summary > “no source provided” or “not a reliable source” (which seems to be something > you can say about just about anything – rather like the way you can > criticise most research with “but, with a larger longer study, it might show > different results?”). 
> > > > Kerry > > > > > > From: Wiki-research-l [mailto:wiki-research-l-boun...@lists.wikimedia.org] > On Behalf Of Pine W > Sent: Wednesday, 16 December 2015 10:11 AM > To: Research into Wikimedia content and communities > <wiki-research-l@lists.wikimedia.org> > Subject: Re: [Wiki-research-l] Community policing, New Page Patrol, Articles > for Creation, and editor retention > > > > The problems that I'm contemplating here are, for better and for worse, > outside the scope of what I would consider harassment. I think that they > could be described as toxic interactions in general, and/or a shortage of or > long-delayed positive interactions at places like NPP and AFC. > > Pine > > > > On Tue, Dec 15, 2015 at 4:02 PM, Oliver Keyes <oke...@wikimedia.org> wrote: > > Well, we don't really have a judicial approach either; judges get > booted when they're biased or refusing to apply the law ;). I would > agree that it is a small circle of people, and I would agree that they > have a far larger impact than numbers would suggest. Community > Advocacy is currently running a harassment consultation at > https://meta.wikimedia.org/wiki/Harassment_consultation_2015 - I > suggest looking at the proposals there. > > > On 15 December 2015 at 19:00, Pine W <wiki.p...@gmail.com> wrote: >> Maybe it's just the circles that I happen to circulate in, but it seems to >> me that a very small percentage of Wikipedians tend to be consistently >> harsh >> or toxic, and that small number of people tends to have disproportionately >> negative influence on the atmosphere in the community. Aligned with >> Jimbo's >> comments at Wikimania 2014 in London, I do wonder if their caustic nature >> rises to the level where they should be excluded from the community, and >> if >> so, on what grounds we would make that exclusion. 
Being a relentless >> critic >> doesn't necessarily rise to the level of harassment if it's done broadly >> rather than directed at a particular individual or group, but looking at >> the >> problem from an HR perspective rather than a judicial one, I agree that >> maybe more should be done to exclude toxic personalities. I wonder, >> though, >> how we can do that; our process for excluding people from the community is >> more like a judicial process than like an HR process. Maybe we need more >> of >> an HR approach? >> >> Pine >> >> On Tue, Dec 15, 2015 at 3:51 PM, Oliver Keyes <oke...@wikimedia.org> >> wrote: >>> >>> We can probably talk about the nature of new page patrol without >>> resorting to comparisons to violent, real-world overreactions with >>> multiple serious injuries. >>> >>> To be perfectly honest as a new page patroller the biggest issue I've >>> seen is toxic senior members of the community making the prospect of >>> patrolling particularly unpleasant. It doesn't do much for patroller >>> numbers. >>> >>> On 15 December 2015 at 18:28, Pine W <wiki.p...@gmail.com> wrote: >>> > Yesterday I gave a presentation
[Wiki-research-l] R client for the new Pageviews API
Hey! As y'all may have seen, we have a new pageviews API, with much finer granularity and better recall than the existing data. Since I had advance notice of the release, I was able to put together an R client already - you can get it at https://github.com/Ironholds/pageviews if R is your language of choice, and it'll be up on CRAN shortly. Thanks, -- Oliver Keyes Count Logula Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
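For anyone not working in R, the same data is reachable over plain HTTP. A minimal Python sketch of building a per-article request URL (the URL template follows the public Wikimedia Pageviews REST API; the project, article, and date values are just examples):

```python
# Build a request URL for the Wikimedia Pageviews REST API
# (per-article endpoint). Parameter values below are illustrative.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageviews_url(project, article, start, end,
                  access="all-access", agent="user", granularity="daily"):
    """Return the per-article pageviews URL for the given parameters."""
    return (f"{BASE}/{project}/{access}/{agent}/{article}/"
            f"{granularity}/{start}/{end}")

url = pageviews_url("en.wikipedia", "R_(programming_language)",
                    "20151101", "20151117")
# Fetch with e.g. urllib.request.urlopen(url); the response is JSON with
# one record per day in the requested range.
```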
Re: [Wiki-research-l] [Analytics] R client for the new Pageviews API
Huh, did Dan not send it to the research list? Curses! See https://lists.wikimedia.org/pipermail/analytics/2015-November/004529.html On 17 November 2015 at 22:12, Giovanni Luca Ciampaglia <gciam...@indiana.edu> wrote: > Interesting! I didn't know a new API had been released :-) > > When can I find more documentation about it? > > Cheers, > > G > > > Giovanni Luca Ciampaglia ∙ Assistant Research Scientist, Indiana University > > > On Tue, Nov 17, 2015 at 9:55 PM, Oliver Keyes <oke...@wikimedia.org> wrote: >> >> Shall do! I'm already linking in the internal documentation :) >> >> On 17 November 2015 at 21:11, Madhumitha Viswanathan >> <mviswanat...@wikimedia.org> wrote: >> > Woot! Nice :) Would be cool to link to the API docs from your README >> > too. >> > >> > On Tue, Nov 17, 2015 at 5:54 PM, Oliver Keyes <oke...@wikimedia.org> >> > wrote: >> >> >> >> Hey! >> >> >> >> As y'all may have seen, we have a new pageviews API, with much finer >> >> granularity and better recall than the existing data. Since I had >> >> advance notice of the release, I was able to put together an R client >> >> already - you can get it at https://github.com/Ironholds/pageviews if >> >> R is your language of choice, and it'll be up on CRAN shortly. 
>> >> >> >> Thanks, >> >> >> >> -- >> >> Oliver Keyes >> >> Count Logula >> >> Wikimedia Foundation >> >> >> >> ___ >> >> Analytics mailing list >> >> analyt...@lists.wikimedia.org >> >> https://lists.wikimedia.org/mailman/listinfo/analytics >> > >> > >> > >> > >> > -- >> > --Madhu :) >> > >> > ___ >> > Analytics mailing list >> > analyt...@lists.wikimedia.org >> > https://lists.wikimedia.org/mailman/listinfo/analytics >> > >> >> >> >> -- >> Oliver Keyes >> Count Logula >> Wikimedia Foundation >> >> ___ >> Wiki-research-l mailing list >> Wiki-research-l@lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > > > > ___ > Analytics mailing list > analyt...@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/analytics > -- Oliver Keyes Count Logula Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] [Analytics] R client for the new Pageviews API
Shall do! I'm already linking in the internal documentation :) On 17 November 2015 at 21:11, Madhumitha Viswanathan <mviswanat...@wikimedia.org> wrote: > Woot! Nice :) Would be cool to link to the API docs from your README too. > > On Tue, Nov 17, 2015 at 5:54 PM, Oliver Keyes <oke...@wikimedia.org> wrote: >> >> Hey! >> >> As y'all may have seen, we have a new pageviews API, with much finer >> granularity and better recall than the existing data. Since I had >> advance notice of the release, I was able to put together an R client >> already - you can get it at https://github.com/Ironholds/pageviews if >> R is your language of choice, and it'll be up on CRAN shortly. >> >> Thanks, >> >> -- >> Oliver Keyes >> Count Logula >> Wikimedia Foundation >> >> ___ >> Analytics mailing list >> analyt...@lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/analytics > > > > > -- > --Madhu :) > > _______ > Analytics mailing list > analyt...@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/analytics > -- Oliver Keyes Count Logula Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Verifying claims about ENWP project size
Re: [Wiki-research-l] Has the recent increase in English wikipedia's core community gone beyond a statistical blip?
"Until we can prove it is good data we should treat it as good data" is not how data works. Absent exactly that analysis it is almost certainly a bad idea for us to declare this to be good news; validate, /then/ celebrate. On 24 August 2015 at 12:26, WereSpielChequers werespielchequ...@gmail.com wrote: 100 edits a month does indeed have the disadvantage that all edits are not equal; there may be some people for whom that represents 100 hours contributed, others a single hour. So an individual month could be inflated by something as trivial as a vandal-fighting bot going down for a couple of days and a bunch of oldtimers responding to a call on IRC by coming back and running Huggle for an hour. But 7 months in a row where the total is higher than the same month the previous year looks to me like a pattern. Across the 3,000 or so editors on English Wikipedia who contribute over a hundred edits per month there could be a hidden pattern of an increase in Huggle, STiki and AWB users more than offsetting a decline in manual editing, but unless anyone analyses that and reruns those stats on some metric such as unique calendar hours in which someone saves an edit, I think it best to treat this as an imperfect indicator of community health. I'm not suggesting that we are out of the woods - there are other indicators that are still looking bad, and I would love to see a better proxy for active editors. But this is good news. On 23 August 2015 at 19:31, Mark J. Nelson m...@anadrome.org wrote: WereSpielChequers werespielchequ...@gmail.com writes: Could you be more specific re "In general I'm not sure the 100+ count is among the most reliable"? What in particular do you think is unreliable about that metric? The main thing I have questions about with that metric is whether it's a good proxy for editing activity in general, or is dominated by fluctuations in bookkeeping contributions, i.e. 
people doing mass-moves of categories and that kind of thing (which makes it quite easy to get to 100 edits). This has long been a complaint about edit counts as a metric, which have never really been solidly validated. Looking through my own personal editing history, it looks like there's an anti-correlation between hitting the 100-edit threshold and making more substantial edits. In months when I work on article-writing I typically have only 20-30 edits, because each edit takes a lot of library research, so I can't make more than one or two a day. In months where I do more bookkeeping-type edits I can easily have 500 or 1000 edits. But that's just for me; it's certainly possible that Wikipedia-wide, there's a good correlation between raw edit count and other kinds of desirable activity measures. But is there evidence of that? -- Mark J. Nelson Anadrome Research http://www.kmjn.org ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Count Logula Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
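The "unique calendar hours in which someone saves an edit" metric WereSpielChequers proposes is cheap to compute from revision timestamps. A minimal Python sketch (the sample timestamps are invented for illustration; in practice they would come from the revision history in a dump or the API):

```python
from datetime import datetime

def unique_edit_hours(timestamps):
    """Count the distinct calendar hours in which at least one edit was saved.

    `timestamps` is an iterable of ISO-8601 timestamp strings.
    """
    hours = {datetime.fromisoformat(ts).strftime("%Y-%m-%d %H")
             for ts in timestamps}
    return len(hours)

# Three edits, but only two distinct hours of activity:
edits = ["2015-08-01T10:05:00", "2015-08-01T10:59:00", "2015-08-01T11:02:00"]
print(unique_edit_hours(edits))  # → 2
```

Unlike a raw edit count, this would score a 500-edit AWB run in one sitting the same as a single hand-written edit in that hour, which is exactly the deflation of bookkeeping activity Mark is asking about.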
Re: [Wiki-research-l] How to read blobs in text table?
If we're talking Wikimedia MediaWiki instances, yes, the API is your only way forward - for performance reasons the text content is stored in a totally different set of servers that (to my knowledge) even paid researchers don't get to mess around with. Alternatively, you could take a look at https://dumps.wikimedia.org if slightly outdated information is okay for you. On 29 July 2015 at 18:58, Srijan Kumar srijanke...@gmail.com wrote: Hi! I want to read the text stored in the text tables[1], but the old_text field stores what seems to be the path to the blob. How can I get the content of the blob? Alternatively, is there any other way to access all text content (including deleted content) without requiring global rights to the API? Thanks! Srijan [1] https://www.mediawiki.org/wiki/Manual:Text_table ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
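To make the API route concrete: current (non-deleted) revision text is retrievable through the standard MediaWiki action API without any special rights, which sidesteps the blob storage entirely. A minimal Python sketch, assuming the public en.wikipedia endpoint (parameters per `action=query` with `prop=revisions`; fetching *deleted* text would still require elevated rights, as noted above):

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def revision_text_query(title):
    """Build a MediaWiki API URL for the latest revision text of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    return API + "?" + urlencode(params)

url = revision_text_query("Text table")
# Fetch with e.g. urllib.request.urlopen(url); the JSON response carries
# the wikitext, so you never touch the old_text blob pointers directly.
```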
[Wiki-research-l] Trying to find a paper..
If anyone has a copy of Rehurek & Kolkus's "Language Identification on the Web: Extending the Dictionary Method" from 2009, could they send it to me? Thanks! -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Trying to find a paper..
Woah, now resolved! This community is awesome :D On 28 July 2015 at 15:01, Oliver Keyes oke...@wikimedia.org wrote: If anyone has a copy of Rehurek & Kolkus's "Language Identification on the Web: Extending the Dictionary Method" from 2009, could they send it to me? Thanks! -- Oliver Keyes Research Analyst Wikimedia Foundation -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Aidez à améliorer l'exhaustivité de Wikipédia en français [Help improve the completeness of French Wikipedia]
have an account with the same username in the source language, have made at least one edit in both the source and target Wikipedias, have made at least one edit in either language within the last year and have matching email addresses for the two accounts. Based on the feedback from the test, it is clear that we need to raise the bar on the contributions to source/destination languages for the future steps. We initially had a 100 byte limit in each of the source and destination language in the past year as a bar, but that one somehow didn't get to the code (code issue) and we didn't realize this until we received the feedback. Based on the feedback, we may want to consider even higher bars for choosing editors, one thing we do not want to do is to ignore those with few edits completely. Those may be people who have contributed few times and recommendations can encourage them to contribute more and come back. Any feedback on how we can improve this aspect further is appreciated. On Fri, Jun 26, 2015 at 6:58 AM, Samuel Klein meta...@gmail.com wrote: Interesting, I figured I received the mail because of joining translation projects. It seems that it's enough to have made a single edit in both language wikipedias in the last year. we changed the wording of the page to make it clearer. I think there was a confusion caused by our wording. please read here. I hope you will do this in both directions for each language pair (both suggestions from FR -- EN and from EN -- FR.) the way the algorithm makes the final recommendations is language agnostic so we can easily expand them to other language pairs. the goal is to have them for the top 30 languages (to and from), the top 50 if we have enough data to make good enough recommendations. We do hope that the engineering aspect of receiving these recommendations can also move as fast so we can offer the editors the recommendations in a way that works smoothly with their workflow. 
On Fri, Jun 26, 2015 at 8:32 AM, Jim tro...@gmail.com wrote: I strongly disagree that this is spamming. Like others have mentioned, I was not offended by the email (though I wasn't delighted) by it either, I think it is a reasonable attempt to encourage editors to put some efforts into languages other than English. Plus it is easy to unsubscribe from the research mailing list. Thanks for sharing your point of view and happy to hear we did not bother you by it. As mentioned earlier, I hope that we (all parties involved, not just research) can resolve the email conversation in a way that more people are happier with the outcome. On Fri, Jun 26, 2015 at 9:38 AM, Ziko van Dijk zvand...@gmail.com wrote: Spamming - a question what the e-mail function of WP is ment for. I was very surprised to get the request though my French is limited, I hardly ever edited on fr.WP, The feedback about limited French language knowledge is a great feedback that we have heard clearly. Thank you for sharing that and sorry that you were chosen. This is something we have already changed in our code to increase the threshold on the way we choose future participants. and the suggested topics have totally nothing to do with what I do on Wikipedia. So I do think that the mail was not quite appropriate, and it gives me a not so favorable impression about the people or initiative behind. I'm sorry if the recommendation has disappointed you. As mentioned in the recommendation email, you will be in one of the two groups: those who receive random but still important (with the algorithm's definition of importance) recommendations or those who receive personalized and important recommendations. Since we have not finalized the analysis of the test I cannot look to see which group you were in since that may have impact on the results. I hope this helps us build more trust, and hopefully we can learn much more when the results are out. Thank you for your time. Thanks again everyone. 
I will continue monitoring this list. We are also busy with the talk page so you may experience some delay. Apologize in advance if that happens. Just be sure that we will get back to you. :-) Best, Leila ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
[Wiki-research-l] Anti-harassment policies within the R community
Hey all, This isn't directly WP-related, except in the sense that it relates to a toolset I think a lot of researchers in HCI tend to use (heck, a lot of researchers, full stop) As any of you who've had the misfortune to spend more than 5 minutes talking to me in the last few years will know, I'm kind of fanatical about the R programming language. It's commonly used within statistical analysis and even Python users wander over to R land for graphing ;). While the Python Software Foundation has an anti-harassment policy for conferences they run or sponsor, the R Foundation does not - and so we've started an open letter asking for one to be instituted. The letter lives at https://docs.google.com/document/d/1C1oPhup72lPHJXbpyJZNIo1BdCzfxZ_VoiWJWzmufjU/edit If you're an R programmer or someone with an interest in that field, and you're interested in signing, simply: 1. Put your name in as a suggested edit, in the format existing signers are using; 2. Email me so I can confirm that the person signing as Foo Bar is genuinely Foo Bar; 3. Done! Many thanks, -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Community health (retitled thread)
We should concentrate on factual data for research in a long email about how everything is ruined forever because a moderator couldn't find anything of value in an uncited claim that Jan-Bart actively drove people away? This must be what people mean by mixed methods ;) On 4 June 2015 at 18:29, Juergen Fenn jf...@gmx.net wrote: Am 04.06.2015 um 19:11 schrieb Federico Leva (Nemo) nemow...@gmail.com: Reduced traffic on Wikimedia-l is mostly due to list moderation. That's plausible. Most people on wikimedia-l are moderated by now; I and others unsubscribed due to tyrannical moderation, too. Well, not exactly tyrannical, as there obviously is a plan behind it not to let anything critical sound on a list that no longer serves the community, but that is just another tool for corporate communication. Of course, I also unsubscribed from the list, and I will read it from the archive only because I don't subscribe to corporate communication lists. I agree to Aaron that we should concentrate on factual data for research. However, I'd like to give you an idea of what nowadays is no longer possible on Wikimedia-l because this also serves as an indicator for the health of Wikipedia. The text of my censored email read: Jan-Bart, you might be aware that it was you that drove many talented candidates out of the movement last year. So, no more comment. And the moderator gave this comment for his decision: I could not find anything positive in your message. How would it help the goals of the Wikimedia movement? Regards, Richard. That's the way Wikipedia works in 2015. So, no more comments about community health or whoever's health. I'm off. Best, Jürgen. ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Community health (retitled thread)
Anecdata, but: as someone who no longer posts to wikimedia-l, I stopped posting because I find it a fundamentally toxic place to be. On 4 June 2015 at 13:55, Aaron Halfaker aaron.halfa...@gmail.com wrote: Hi Claudia, which of Juergen's statements do you mean? All of them, but mostly the explanation for the drop in traffic. do you have any evidence for the contrary? Please don't assume that my call for evidence suggests my disagreement. I'm an empiricist and this is the research mailing list. IMO, claims need evidence or should be carefully framed as speculation or hypothesis. In this context, it is good practice to request that those making statements of fact produce justification. It seems like it would be helpful to move this conversation forward if someone were to find the dates of the policy change and compare it to the rate of moderated messages. I also suggest that any further discussion about WMF policies or board decisions (outside of their measurable effects, theoretical implications, etc.) be taken to a more appropriate forum. -Aaron On Thu, Jun 4, 2015 at 10:05 AM, koltzenb...@w4w.net wrote: Hi Aaron, which of Juergen's statements do you mean? my question is: do you have any evidence for the contrary? best, Claudia -- Original Message --- From:Aaron Halfaker aaron.halfa...@gmail.com To:Research into Wikimedia content and communities wiki-research- l...@lists.wikimedia.org Sent:Thu, 4 Jun 2015 09:55:02 -0500 Subject:Re: [Wiki-research-l] Community health (retitled thread) Hi Juergen, That's an interesting hypothesis. Do you have any evidence to support it? On Thu, Jun 4, 2015 at 9:50 AM, Juergen Fenn jf...@gmx.net wrote: Am 04.06.2015 um 16:33 schrieb Samuel Klein meta...@gmail.com: Context: reduced traffic on wikimedia-l. Is this a sign of poor community health? Reduced traffic on Wikimedia-l is mostly due to list moderation. All critical content has been filtered for a while. 
I became aware of it only recently when I posted a critical remark which was rejected. The Foundation has not only introduced Superprotect and Superban, it has also got a firm grip on all communication channels whatsoever. This will intensify as, we have just learned, new staff will be hired for communication. The WMF no longer needs the community, it does all the traffic itself (staff, chapters, etc.). Of course this shift from crowdsourcing to staff has to be paid for, hence the interest in an ever-increasing flow of donations and hence the interest in the Alexa ranking of Wikipedia. Best, Jürgen. ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l --- End of Original Message --- ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Two articles of interest regarding Wikipedia research
"But the most interesting finding, at least for Greenstein, was that Wikipedia articles with more revision in them had less bias and were less likely to lean Democratic." IOW, articles that are genuinely crowdsourced are more neutral. Good job, clickbait headline writer. On 3 June 2015 at 16:48, Pine W wiki.p...@gmail.com wrote: Can Wikipedia Be Trusted? http://insight.kellogg.northwestern.edu/article/can-wikipedia-be-trusted The Unknown Perils of Mining Wikipedia https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-wikipedia/ Pine ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] Wiki-research-l Digest, Vol 117, Issue 14
I can happily check the sampled logs for hits to those pages prior to and on those dates, if that'd help? On 11 May 2015 at 23:08, R. Stuart Geiger sgei...@gmail.com wrote: Going from 86,000,000 a month to 31,000 a month is quite a drop, and the shift is pretty dramatic. It goes from 1.7 million one day to 715 the next and stays flat (http://stats.grok.se/en/201410/Special:Random). I was also thinking there could be a bot or something that is scraping Special:Random, but the drop also happens for Special:Random/Talk -- which hardly anybody uses, but it still drops flat the same day (http://stats.grok.se/en/201410/Special:Random/Talk). It doesn't happen for Special:Upload or Special:Log though. October 16th, 2014 is the day it changes. Anybody know of something that might have changed that day with logging? Also, there have to be way more than ~1,000 hits a day to Special:Random. Perhaps pageviews started to be counted for the page that it got redirected to, rather than the Special:Random page itself. But then why wouldn't it go to 0? What are those ~1,000 hits a day? ~~ it is a mystery ~~ On Mon, May 11, 2015 at 7:44 PM, Oliver Keyes oke...@wikimedia.org wrote: A reduction or alteration in automata activity, possibly? Erik's dumps contain literally no filtering for scammers or crawlers, and we're a hot locale for spammer activity. 
On 11 May 2015 at 08:09, Alex Druk alex.d...@gmail.com wrote: I just grep the monthly totals from Erik Zachte's dumps, http://dumps.wikimedia.org/other/pagecounts-ez/merged/ (grep "^en.z Special:Random"). On Mon, May 11, 2015 at 2:00 PM, wiki-research-l-requ...@lists.wikimedia.org wrote: Message: 1 Date: Sun, 10 May 2015 08:30:37 -0400 From: Oliver Keyes oke...@wikimedia.org To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org Subject: Re: [Wiki-research-l] How to explain drop in random searches Using what data? On 10 May 2015 at 05:29, Alex Druk alex.d...@gmail.com wrote: Hi everyone, I am trying to study the dynamics of random searches (Special:Random) on English Wikipedia. From 01/2012 to 10/2014 the average number of random searches per month was about 86 million, or about 30% of Main_Page pageviews, but from November 2014 it dropped to 31,000 per month (or 0.008% of Main_Page). How can such a dramatic drop be explained? Any ideas? -- Thank you. 
Alex Druk, PhD wikipediatrends.com alex.d...@gmail.com (775) 237-8550 Google voice ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation -- ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l End of Wiki-research-l Digest, Vol 117, Issue 14 -- Thank you. Alex Druk alex.d...@gmail.com (775) 237-8550 Google voice ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
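The grep-and-sum approach Alex describes against the pagecounts-ez merged dumps can be sketched in a few lines of Python. This is a sketch only: it assumes each merged-file line has the shape `project title monthly_total hourly_encoding`, which should be checked against the format notes shipped alongside the dumps before relying on it.

```python
import re

def monthly_total(lines, project="en.z", title="Special:Random"):
    """Sum the monthly view totals for one page in one project from
    pagecounts-ez 'merged' dump lines.

    Assumed line layout (verify against the dump's own format notes):
        <project> <title> <monthly_total> <hourly_encoding>
    """
    # Anchor on project and exact title, then capture the count field.
    pattern = re.compile(r"^%s %s (\d+)" % (re.escape(project), re.escape(title)))
    total = 0
    for line in lines:
        m = pattern.match(line)
        if m:
            total += int(m.group(1))
    return total

# Toy input standing in for a real (decompressed) dump file:
sample = [
    "en.z Main_Page 86000000 AxBy...",
    "en.z Special:Random 715 Cz...",
    "en.z Special:Random/Talk 12 Dq...",
]
```

Note that anchoring on the full title plus a following space keeps `Special:Random` from also matching `Special:Random/Talk`, the distinction Stuart's stats.grok.se links turn on.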
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
Makes sense! I actually hadn't factored in that sort of action (although it does happen), more: the order of the main page links on the root www.wikipedia.org page. On 7 May 2015 at 03:51, Scott Hale computermacgy...@gmail.com wrote: The accept-language header is the obvious place to start, but there is ample scope to combine multiple approaches. In addition to accept-language and geolocation data, any logged-in user will have view/edit history related to multiple editions. If the user is requesting a specific article (e.g., https://www.wikipedia.org/wiki/普天間飛行場), we can also take account of which editions actually have the article --- the vast majority of content on Wikipedia exists in only one language or a few languages. (I.e., the above link redirects me to create the article on en-wiki, although it exists on ja-wiki, Japanese is my second preferred language by my accept-language header, and ja-wiki is an edition I edit, as captured in my edit history.) This isn't an either-or question of which to use, but rather a question of how all these indicators can be used together to create the best experience. I would venture that most users don't change their accept-language header (not even possible on some mobile browsers!) and hence probably list only one language. If so, geography and edit history can be signals for possible second languages beyond the one language in the accept-language header when hitting the homepage without a specific article. Cheers, Scott P.S. It looks like the Universal Language Selector already uses the accept-language header for its preference screen. On Thu, May 7, 2015 at 5:58 AM, Oliver Keyes oke...@wikimedia.org wrote: As I've now said...4 times, I don't think we'd be using geolocation. We'd be using the accept-language header. 
See https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Accept-Language On 7 May 2015 at 00:52, WereSpielChequers werespielchequ...@gmail.com wrote: When a reader comes to Wikipedia from the web we can detect their IP address and that usually geolocates them to a country. More often than not that then tells you the dominant language of that country. If we were to default to official or dominant languages then I predict endless arguments as to which language(s) should be the default in which countries. The large expat community in some parts of the Arab world might prefer English over Arabic. India would want to do things by state, and a whole new front would emerge in the Israeli Palestine debate. Regards Jonathan Cardy ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
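Scott's several-signals idea can be sketched as a toy scoring function. Everything here is illustrative and invented for the example — the function name, the inputs, and especially the +1.0 availability bonus are not any existing WMF heuristic:

```python
def rank_editions(accept_language, editions_with_article):
    """Rank language editions for a request by combining Accept-Language
    q-values with whether the requested article exists in each edition.

    accept_language: list of (language_tag, q) pairs from the header.
    editions_with_article: set of base language codes where the article
    already exists.

    The +1.0 bonus for availability is an arbitrary illustrative weight:
    it makes "the article exists here" outrank any q-value difference.
    """
    scores = {}
    for tag, q in accept_language:
        base = tag.split("-")[0].lower()   # 'en-GB' counts toward 'en'
        score = q + (1.0 if base in editions_with_article else 0.0)
        scores[base] = max(scores.get(base, 0.0), score)
    # Highest-scoring editions first.
    return sorted(scores, key=scores.get, reverse=True)
```

With header preferences `[("en", 1.0), ("ja", 0.9)]` and an article that exists only on ja-wiki, ja outranks en (1.9 vs 1.0) — the shape of Scott's Futenma example, where the existing ja-wiki article should beat a red link on en-wiki.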
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
Thanks for the bugs, Nemo! (search team: should we take those over?) On 7 May 2015 at 03:08, Federico Leva (Nemo) nemow...@gmail.com wrote: Thanks for looking into www.wikipedia.org traffic from India; I've been complaining about it for a while. :) See also: * https://phabricator.wikimedia.org/T26767 * https://phabricator.wikimedia.org/T5665 Mark J. Nelson, 07/05/2015 04:24: But for the average Copenhagener, the following order is far more likely: * Danish, English, Norwegian Bokmål, ... This is something you can help fix. Please do! https://www.mediawiki.org/wiki/ULS/FAQ#language-territory Nemo ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
Interesting! This I didn't know; I'll factor it in :). On 7 May 2015 at 04:48, Stuart A. Yeates syea...@gmail.com wrote: Accept-language is systematically broken for minority languages within dominant language communities. In New Zealand, a country with three official languages and a textbook case of language revivalism, I've never met anyone without a degree in computer science who sets accept-language, and I've never seen a computer system which ships with all three official languages selectable. Most computer systems ship with en or en-us as the default. If there were silver bullets in this area, the solution would be obvious and we wouldn't even be thinking about having this conversation. cheers stuart On Thursday, May 7, 2015, Oliver Keyes oke...@wikimedia.org wrote: As I've now said...4 times, I don't think we'd be using geolocation. We'd be using the accept-language header. See https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Accept-Language On 7 May 2015 at 00:52, WereSpielChequers werespielchequ...@gmail.com wrote: When a reader comes to Wikipedia from the web we can detect their IP address and that usually geolocates them to a country. More often than not that then tells you the dominant language of that country. If we were to default to official or dominant languages then I predict endless arguments as to which language(s) should be the default in which countries. The large expat community in some parts of the Arab world might prefer English over Arabic. India would want to do things by state, and a whole new front would emerge in the Israeli Palestine debate. Regards Jonathan Cardy On 7 May 2015, at 05:06, Sam Katz smk...@gmail.com wrote: hey guys, you can't guess geolocation, because occasionally you'd be wrong. this happens to me all the time. I want to read a site in spanish... and then it thinks I'm in Latin America, when I'm not. --Sam On Wed, May 6, 2015 at 10:07 PM, Oliver Keyes oke...@wikimedia.org wrote: Possibly. 
But that sounds potentially wooly and sometimes inaccurate. When a browser makes a web request, it sends a header called the accept_language header (https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Accept-Language) which indicates what languages the browser finds ideal - i.e., what languages the user and system are using. If we're going to make modifications here (I hope we will. But again; early days) I don't see a good argument for using geolocation, which is, as you've noted, flawed without substantial time and energy being applied to map those countries to probable languages. The data the browser already sends to the server contains the /certain/ languages. We can just use that. On 6 May 2015 at 22:50, Stuart A. Yeates syea...@gmail.com wrote: This seems like a great place to use analytics data, for each division in the geo-location classification, rank each of the languages by usage and present the top N as likely candidates (+ browser settings) when we need the user to pick a language. cheers stuart -- ...let us be heard from red core to black sky On Thu, May 7, 2015 at 2:24 PM, Mark J. Nelson m...@anadrome.org wrote: Stuart A. Yeates syea...@gmail.com writes: Reading that excellent presentation, the thought that struck me was: If I wanted to subvert the assumption that Wikipedia == en.wiki, linking to http://www.wikipedia.org/ is what I'd do. A smarter http://www.wikipedia.org/ might guess geo-location and thus local languages. I'd also like to see something smarter done at the main page, but the and thus bit here is notoriously tricky. For example most geolocation-based things, like Wikidata by default, tend to produce funny results in Denmark. A Copenhagener is offered something like this choice, in order: * Danish, Greelandic, Faroese, Swedish, German, ... 
The reasoning here is that Danish, Greenlandic, and Faroese are official languages of the Danish Realm, which includes both Denmark proper and two autonomous territories, Greenland and the Faroe Islands. And then Sweden and Germany are the two neighboring countries. But for the average Copenhagener, the following order is far more likely: * Danish, English, Norwegian Bokmål, ... The reason here is that Norwegian Bokmål is very close to Danish in written form (more than Swedish is, and especially more than Faroese is), while English is a widely used semi-official language in business, government, and education (for example, about half of university theses are now written in English, and several major companies use it as their official workplace language). I think it's possible to come up with something that better aligns with readers' actual preferences, but it's not easy! -Mark -- Mark J. Nelson Anadrome Research http://www.kmjn.org
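For reference, the header Oliver keeps pointing at is a weighted list. A minimal sketch of parsing it into (language, q) pairs, following the basic q-value syntax; real code should also validate the tags and handle wildcards and malformed input more defensively:

```python
def parse_accept_language(header):
    """Parse an Accept-Language header value into (language_tag, q)
    pairs, sorted by descending q (ties keep header order).

    Follows the basic shape 'da, en-gb;q=0.8, en;q=0.7'; tag
    validation and the '*' wildcard are deliberately omitted.
    """
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        if ";" in item:
            lang, _, params = item.partition(";")
            try:
                q = float(params.split("=", 1)[1])
            except (IndexError, ValueError):
                q = 1.0   # malformed q-value: fall back to the default
            prefs.append((lang.strip(), q))
        else:
            prefs.append((item, 1.0))   # no q-value means q=1.0
    prefs.sort(key=lambda p: p[1], reverse=True)
    return prefs
```

So the Copenhagener in Mark's example, sending "da, en-gb;q=0.8, en;q=0.7", declares exactly the Danish-then-English ordering that geolocation-based guessing gets wrong.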
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
Traffic through Wikipedia zero; apologies for not being clear. On 6 May 2015 at 19:56, Sam Katz smk...@gmail.com wrote: hey oliver, I don't mean to be a help vampire... but what is zero traffic? you think the traffic is being proxied? perhaps even reverse proxied? --Sam On Wed, May 6, 2015 at 1:40 PM, Oliver Keyes oke...@wikimedia.org wrote: Cross-posting to research and analytics, too! -- Forwarded message -- From: Oliver Keyes oke...@wikimedia.org Date: 6 May 2015 at 13:11 Subject: Traffic to the portal from Zero providers To: wikimedia-sea...@lists.wikimedia.org Hey all, (Throwing this to the public list, because transparency is Good) I recently did a presentation on a traffic analysis to the Wikipedia home page - www.wikipedia.org.[1] One of the biggest visualisations, in impact terms, showed that a lot of portal traffic - far more, proportionately, than traffic to Wikipedia overall - is coming from India and Brazil.[2] One of the hypotheses was that this could be Zero traffic. I've done a basic analysis of the traffic, looking specifically at the zero headers,[3] and this hypothesis turns out to be incorrect - almost no zero traffic is hitting the portal. The traffic we're seeing from Brazil and India is not zero-based. This makes a lot of sense (the reason mobile traffic redirects to the enwiki home page from the portal is the Zero extension, so presumably this happens specifically to Zero traffic) but it does mean that our null hypothesis - that this traffic is down to ISP-level or device-level design choices and links - is more likely to be correct. 
[1] http://ironholds.org/misc/homepage_presentation.html [2] http://ironholds.org/misc/homepage_presentation.html#/11 [3] https://phabricator.wikimedia.org/T98076 -- Oliver Keyes Research Analyst Wikimedia Foundation -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
[Wiki-research-l] Fwd: Traffic to the portal from Zero providers
Cross-posting to research and analytics, too! -- Forwarded message -- From: Oliver Keyes oke...@wikimedia.org Date: 6 May 2015 at 13:11 Subject: Traffic to the portal from Zero providers To: wikimedia-sea...@lists.wikimedia.org Hey all, (Throwing this to the public list, because transparency is Good) I recently did a presentation on a traffic analysis to the Wikipedia home page - www.wikipedia.org.[1] One of the biggest visualisations, in impact terms, showed that a lot of portal traffic - far more, proportionately, than traffic to Wikipedia overall - is coming from India and Brazil.[2] One of the hypotheses was that this could be Zero traffic. I've done a basic analysis of the traffic, looking specifically at the zero headers,[3] and this hypothesis turns out to be incorrect - almost no zero traffic is hitting the portal. The traffic we're seeing from Brazil and India is not zero-based. This makes a lot of sense (the reason mobile traffic redirects to the enwiki home page from the portal is the Zero extension, so presumably this happens specifically to Zero traffic) but it does mean that our null hypothesis - that this traffic is down to ISP-level or device-level design choices and links - is more likely to be correct. [1] http://ironholds.org/misc/homepage_presentation.html [2] http://ironholds.org/misc/homepage_presentation.html#/11 [3] https://phabricator.wikimedia.org/T98076 -- Oliver Keyes Research Analyst Wikimedia Foundation -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
Agreed! That's one of the changes I'd really like to push ahead with, although we're going to do some more in-depth data collection before any redesign :). On 6 May 2015 at 20:27, Stuart A. Yeates syea...@gmail.com wrote: Reading that excellent presentation, the thought that struck me was: If I wanted to subvert the assumption that Wikipedia == en.wiki, linking to http://www.wikipedia.org/ is what I'd do. A smarter http://www.wikipedia.org/ might guess geo-location and thus local languages. cheers stuart -- ...let us be heard from red core to black sky On Thu, May 7, 2015 at 6:40 AM, Oliver Keyes oke...@wikimedia.org wrote: Cross-posting to research and analytics, too! -- Forwarded message -- From: Oliver Keyes oke...@wikimedia.org Date: 6 May 2015 at 13:11 Subject: Traffic to the portal from Zero providers To: wikimedia-sea...@lists.wikimedia.org Hey all, (Throwing this to the public list, because transparency is Good) I recently did a presentation on a traffic analysis to the Wikipedia home page - www.wikipedia.org.[1] One of the biggest visualisations, in impact terms, showed that a lot of portal traffic - far more, proportionately, than traffic to Wikipedia overall - is coming from India and Brazil.[2] One of the hypotheses was that this could be Zero traffic. I've done a basic analysis of the traffic, looking specifically at the zero headers,[3] and this hypothesis turns out to be incorrect - almost no zero traffic is hitting the portal. The traffic we're seeing from Brazil and India is not zero-based. This makes a lot of sense (the reason mobile traffic redirects to the enwiki home page from the portal is the Zero extension, so presumably this happens specifically to Zero traffic) but it does mean that our null hypothesis - that this traffic is down to ISP-level or device-level design choices and links - is more likely to be correct. 
[1] http://ironholds.org/misc/homepage_presentation.html [2] http://ironholds.org/misc/homepage_presentation.html#/11 [3] https://phabricator.wikimedia.org/T98076 -- Oliver Keyes Research Analyst Wikimedia Foundation -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
Totally! As said, I think accept-language is a better variable to operate from. But these are early days; we're just beginning to understand the space. Realistically, software changes will come a lot later :) On 6 May 2015 at 22:24, Mark J. Nelson m...@anadrome.org wrote: Stuart A. Yeates syea...@gmail.com writes: Reading that excellent presentation, the thought that struck me was: If I wanted to subvert the assumption that Wikipedia == en.wiki, linking to http://www.wikipedia.org/ is what I'd do. A smarter http://www.wikipedia.org/ might guess geo-location and thus local languages. I'd also like to see something smarter done at the main page, but the and thus bit here is notoriously tricky. For example most geolocation-based things, like Wikidata by default, tend to produce funny results in Denmark. A Copenhagener is offered something like this choice, in order: * Danish, Greelandic, Faroese, Swedish, German, ... The reasoning here is that Danish, Greenlandic, and Faroese are official languages of the Danish Realm, which includes both Denmark proper, and two autonomous territories, Greeland and the Faroe Islands. And then Sweden and Germany are the two neighboring countries. But for the average Copenhagener, the following order is far more likely: * Danish, English, Norwegian Bokmål, ... The reason here is that Norwegian Bokmål is very close to Danish in written form (more than Swedish is, and especially more than Faroese is) while English is a widely used semi-official language in business, government, and education (for example about half of university theses are now written in English, and several major companies use it as their official workplace language). I think it's possible to come up with something that better aligns with readers' actual preferences, but it's not easy! -Mark -- Mark J. 
Nelson Anadrome Research http://www.kmjn.org ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
Possibly. But that sounds potentially wooly and sometimes inaccurate. When a browser makes a web request, it sends a header called the accept_language header (https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Accept-Language) which indicates what languages the browser finds ideal - i.e., what languages the user and system are using. If we're going to make modifications here (I hope we will. But again; early days) I don't see a good argument for using geolocation, which is, as you've noted, flawed without substantial time and energy being applied to map those countries to probable languages. The data the browser already sends to the server contains the /certain/ languages. We can just use that. On 6 May 2015 at 22:50, Stuart A. Yeates syea...@gmail.com wrote: This seems like a great place to use analytics data, for each division in the geo-location classification, rank each of the languages by usage and present the top N as likely candidates (+ browser settings) when we need the user to pick a language. cheers stuart -- ...let us be heard from red core to black sky On Thu, May 7, 2015 at 2:24 PM, Mark J. Nelson m...@anadrome.org wrote: Stuart A. Yeates syea...@gmail.com writes: Reading that excellent presentation, the thought that struck me was: If I wanted to subvert the assumption that Wikipedia == en.wiki, linking to http://www.wikipedia.org/ is what I'd do. A smarter http://www.wikipedia.org/ might guess geo-location and thus local languages. I'd also like to see something smarter done at the main page, but the and thus bit here is notoriously tricky. For example most geolocation-based things, like Wikidata by default, tend to produce funny results in Denmark. A Copenhagener is offered something like this choice, in order: * Danish, Greelandic, Faroese, Swedish, German, ... 
The reasoning here is that Danish, Greenlandic, and Faroese are official languages of the Danish Realm, which includes both Denmark proper and two autonomous territories, Greenland and the Faroe Islands. And then Sweden and Germany are the two neighboring countries. But for the average Copenhagener, the following order is far more likely: * Danish, English, Norwegian Bokmål, ... The reason here is that Norwegian Bokmål is very close to Danish in written form (more than Swedish is, and especially more than Faroese is), while English is a widely used semi-official language in business, government, and education (for example, about half of university theses are now written in English, and several major companies use it as their official workplace language). I think it's possible to come up with something that better aligns with readers' actual preferences, but it's not easy! -Mark -- Mark J. Nelson Anadrome Research http://www.kmjn.org ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
One thing we could also do is check the accept_language header and prioritise around that; that way we'd be prioritising specifically the language the user's browser thinks they want. On 6 May 2015 at 21:28, Stuart A. Yeates syea...@gmail.com wrote: Probably also an excellent time to consider whether we can do anything for those languages which don't have wikis yet. For example, I'm in .nz, which has en, mi and nzs as official languages, but we're a long way from an nzs.wiki, given that ase.wiki is still in incubator. With the release of Unicode 8 with Sutton SignWriting in June, these may or may not kick off in a big way. cheers stuart -- ...let us be heard from red core to black sky On Thu, May 7, 2015 at 12:34 PM, Oliver Keyes oke...@wikimedia.org wrote: Agreed! That's one of the changes I'd really like to push ahead with, although we're going to do some more in-depth data collection before any redesign :). On 6 May 2015 at 20:27, Stuart A. Yeates syea...@gmail.com wrote: Reading that excellent presentation, the thought that struck me was: If I wanted to subvert the assumption that Wikipedia == en.wiki, linking to http://www.wikipedia.org/ is what I'd do. A smarter http://www.wikipedia.org/ might guess geo-location and thus local languages. cheers stuart -- ...let us be heard from red core to black sky On Thu, May 7, 2015 at 6:40 AM, Oliver Keyes oke...@wikimedia.org wrote: Cross-posting to research and analytics, too! 
-- Forwarded message -- From: Oliver Keyes oke...@wikimedia.org Date: 6 May 2015 at 13:11 Subject: Traffic to the portal from Zero providers To: wikimedia-sea...@lists.wikimedia.org Hey all, (Throwing this to the public list, because transparency is Good) I recently did a presentation on a traffic analysis to the Wikipedia home page - www.wikipedia.org.[1] One of the biggest visualisations, in impact terms, showed that a lot of portal traffic - far more, proportionately, than traffic to Wikipedia overall - is coming from India and Brazil.[2] One of the hypotheses was that this could be Zero traffic. I've done a basic analysis of the traffic, looking specifically at the zero headers,[3] and this hypothesis turns out to be incorrect - almost no zero traffic is hitting the portal. The traffic we're seeing from Brazil and India is not zero-based. This makes a lot of sense (the reason mobile traffic redirects to the enwiki home page from the portal is the Zero extension, so presumably this happens specifically to Zero traffic) but it does mean that our null hypothesis - that this traffic is down to ISP-level or device-level design choices and links - is more likely to be correct. 
[1] http://ironholds.org/misc/homepage_presentation.html [2] http://ironholds.org/misc/homepage_presentation.html#/11 [3] https://phabricator.wikimedia.org/T98076 -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Fwd: Traffic to the portal from Zero providers
As I've now said...4 times, I don't think we'd be using geolocation. We'd be using the accept-language header. See https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Accept-Language On 7 May 2015 at 00:52, WereSpielChequers werespielchequ...@gmail.com wrote: When a reader comes to Wikipedia from the web we can detect their IP address and that usually geolocates them to a country. More often than not that then tells you the dominant language of that country. If we were to default to official or dominant languages then I predict endless arguments as to which language(s) should be the default in which countries. The large expat community in some parts of the Arab world might prefer English over Arabic. India would want to do things by state, and a whole new front would emerge in the Israeli Palestine debate. Regards Jonathan Cardy On 7 May 2015, at 05:06, Sam Katz smk...@gmail.com wrote: hey guys, you can't guess geolocation, because occasionally you'd be wrong. this happens to me all the time. I want to read a site in spanish... and then it thinks I'm in Latin America, when I'm not. --Sam On Wed, May 6, 2015 at 10:07 PM, Oliver Keyes oke...@wikimedia.org wrote: Possibly. But that sounds potentially wooly and sometimes inaccurate. When a browser makes a web request, it sends a header called the accept_language header (https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Accept-Language) which indicates what languages the browser finds ideal - i.e., what languages the user and system are using. If we're going to make modifications here (I hope we will. But again; early days) I don't see a good argument for using geolocation, which is, as you've noted, flawed without substantial time and energy being applied to map those countries to probable languages. The data the browser already sends to the server contains the /certain/ languages. We can just use that. On 6 May 2015 at 22:50, Stuart A. 
Yeates syea...@gmail.com wrote: This seems like a great place to use analytics data, for each division in the geo-location classification, rank each of the languages by usage and present the top N as likely candidates (+ browser settings) when we need the user to pick a language. cheers stuart -- ...let us be heard from red core to black sky On Thu, May 7, 2015 at 2:24 PM, Mark J. Nelson m...@anadrome.org wrote: Stuart A. Yeates syea...@gmail.com writes: Reading that excellent presentation, the thought that struck me was: If I wanted to subvert the assumption that Wikipedia == en.wiki, linking to http://www.wikipedia.org/ is what I'd do. A smarter http://www.wikipedia.org/ might guess geo-location and thus local languages. I'd also like to see something smarter done at the main page, but the "and thus" bit here is notoriously tricky. For example most geolocation-based things, like Wikidata by default, tend to produce funny results in Denmark. A Copenhagener is offered something like this choice, in order: * Danish, Greenlandic, Faroese, Swedish, German, ... The reasoning here is that Danish, Greenlandic, and Faroese are official languages of the Danish Realm, which includes both Denmark proper, and two autonomous territories, Greenland and the Faroe Islands. And then Sweden and Germany are the two neighboring countries. But for the average Copenhagener, the following order is far more likely: * Danish, English, Norwegian Bokmål, ... The reason here is that Norwegian Bokmål is very close to Danish in written form (more than Swedish is, and especially more than Faroese is) while English is a widely used semi-official language in business, government, and education (for example about half of university theses are now written in English, and several major companies use it as their official workplace language). I think it's possible to come up with something that better aligns with readers' actual preferences, but it's not easy! -Mark -- Mark J.
Nelson Anadrome Research http://www.kmjn.org -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
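Stuart's suggestion of combining analytics data with browser settings could look something like the sketch below. The function name and data shapes are invented for illustration; nothing here reflects a real Wikimedia API.

```python
def candidate_languages(country_usage, browser_langs, n=3):
    """Rank language candidates for a language picker: browser-declared
    languages first (the most reliable signal), then the top n languages
    by observed usage for the geolocated country. Purely illustrative."""
    by_usage = sorted(country_usage, key=country_usage.get, reverse=True)
    ranked = list(browser_langs)
    for lang in by_usage:
        if lang not in ranked:
            ranked.append(lang)
        if len(ranked) >= len(browser_langs) + n:
            break
    return ranked
```

With made-up usage figures for Denmark, a browser declaring ["da", "en"] would see those first, followed by the next-most-used languages for the country, which sidesteps the official-languages-only ordering Mark describes.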
[Wiki-research-l] [Announce] a new release of Pageviews data
Hey all, We've just released a count of pageviews to the English-language Wikipedia from 2015-03-16T00:00:00 to 2015-04-25T15:59:59, grouped by timestamp (down to a one-second resolution level) and site (mobile or desktop). The smallest number of events in a group is 645; because of this, we are confident there should not be privacy implications of releasing this data. We checked with legal first ;p. If you're interested in getting your mitts on it, you can find it at DataHub (http://datahub.io/dataset/english-wikipedia-pageviews-by-second) or FigShare (http://figshare.com/articles/English_Wikipedia_pageviews_by_second/1394684) -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
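For anyone wanting to work with the per-second dump, a short tallying sketch follows. The column names here (timestamp, site, pageviews) and the excerpt values are assumptions for illustration; check the actual file's header before relying on them.

```python
import csv
import io

# A hypothetical excerpt of the per-second dump; the layout is assumed,
# not taken from the real file.
sample = (
    "timestamp\tsite\tpageviews\n"
    "2015-03-16T00:00:01\tdesktop\t4500\n"
    "2015-03-16T00:00:01\tmobile\t1200\n"
    "2015-03-16T00:00:02\tdesktop\t4700\n"
)

def totals_by_site(tsv_text):
    """Sum pageviews per site (desktop/mobile) from a TSV excerpt."""
    totals = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        totals[row["site"]] = totals.get(row["site"], 0) + int(row["pageviews"])
    return totals
```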
Re: [Wiki-research-l] Wiki-research-l Digest, Vol 116, Issue 16
Message: 2 Date: Wed, 8 Apr 2015 11:19:10 + From: Flöck, Fabian fabian.flo...@gesis.org To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org Subject: Re: [Wiki-research-l] Research on Wikidata's content coverage Hi Oliver, from the top of my head, two on gender coverage: the one Max just sent around: http://ijoc.org/index.php/ijoc/article/view/777/631 and another one, with a different approach, but a similar goal: http://arxiv.org/abs/1501.06307 We had one on diversity that also has a small section about representativeness of the editor base, although it might not be exactly what you are looking for: http://journal.webscience.org/432/1/112_paper.pdf Gruß, Fabian On 07.04.2015, at 21:50, Oliver Keyes oke...@wikimedia.org wrote: Hey all, Is anyone aware of research on the completeness of Wikidata, in terms of coverage and systemic bias? This seems like the sort of thing Max Klein might know ;). Papers, blog posts, anything. -- Oliver Keyes Research Analyst Wikimedia Foundation Cheers, Fabian -- Fabian Flöck Research Associate Computational Social Science department @GESIS Unter Sachsenhausen 6-8, 50667 Cologne, Germany Tel: +49 (0) 221-47694-208 fabian.flo...@gesis.org www.gesis.org www.facebook.com/gesis.org
End of Wiki-research-l Digest, Vol 116, Issue 16 -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Research on Wikidata's content coverage
Thanks both! I'm specifically looking at Wikidata's coverage, rather than Wikipedia's - in other words, work done on deficiencies in the mapping of wikimedia content onto wikidata content. On 8 April 2015 at 07:19, Flöck, Fabian fabian.flo...@gesis.org wrote: Hi Oliver, from the top of my head, two on gender coverage: the one Max just sent around: http://ijoc.org/index.php/ijoc/article/view/777/631 and another one, with a different approach, but a similar goal: http://arxiv.org/abs/1501.06307 We had one on diversity that also has a small section about representativeness of the editor base, although it might not be exactly what you are looking for: http://journal.webscience.org/432/1/112_paper.pdf Gruß, Fabian On 07.04.2015, at 21:50, Oliver Keyes oke...@wikimedia.org wrote: Hey all, Is anyone aware of research on the completeness of Wikidata, in terms of coverage and systemic bias? This seems like the sort of thing Max Klein might know ;). Papers, blog posts, anything. -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l Cheers, Fabian -- Fabian Flöck Research Associate Computational Social Science department @GESIS Unter Sachsenhausen 6-8, 50667 Cologne, Germany Tel: + 49 (0) 221-47694-208 fabian.flo...@gesis.org www.gesis.org www.facebook.com/gesis.org ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Research on Wikidata's content coverage
Perfect; thank you! On 8 April 2015 at 09:53, Finn Årup Nielsen f...@imm.dtu.dk wrote: Dear Oliver, On 04/08/2015 03:38 PM, Oliver Keyes wrote: Thanks both! I'm specifically looking at Wikidata's coverage, rather than Wikipedia's - in other words, work done on deficiencies in the mapping of wikimedia content onto wikidata content. Oh, I didn't see it was Wikidata instead of Wikipedia. Wikipedia research and tools: Review and comments. http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/6012/pdf/imm6012.pdf contains pointers to the Max Klein/Piotr Konieczny studies and Magnus Manske's Mix’n’match (presently page 11). Magnus Manske has a blog post recently: http://magnusmanske.de/wordpress/?p=278 Sex and artists If I remember correctly wikidata-l had some discussion about that. Probably you know that already. best Finn Årup Nielsen On 8 April 2015 at 07:19, Flöck, Fabian fabian.flo...@gesis.org wrote: Hi Oliver, from the top of my head, two on gender coverage: the one Max just sent around: http://ijoc.org/index.php/ijoc/article/view/777/631 and another one, with a different approach, but a similar goal: http://arxiv.org/abs/1501.06307 We had one on diversity that also has a small section about representativeness of the editor base, although it might not be exactly what you are looking for: http://journal.webscience.org/432/1/112_paper.pdf Gruß, Fabian On 07.04.2015, at 21:50, Oliver Keyes oke...@wikimedia.org wrote: Hey all, Is anyone aware of research on the completeness of Wikidata, in terms of coverage and systemic bias? This seems like the sort of thing Max Klein might know ;). Papers, blog posts, anything.
-- Oliver Keyes Research Analyst Wikimedia Foundation Cheers, Fabian -- Fabian Flöck Research Associate Computational Social Science department @GESIS Unter Sachsenhausen 6-8, 50667 Cologne, Germany Tel: +49 (0) 221-47694-208 fabian.flo...@gesis.org www.gesis.org www.facebook.com/gesis.org -- Finn Årup Nielsen http://people.compute.dtu.dk/faan/ ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
[Wiki-research-l] Research on Wikidata's content coverage
Hey all, Is anyone aware of research on the completeness of Wikidata, in terms of coverage and systemic bias? This seems like the sort of thing Max Klein might know ;). Papers, blog posts, anything. -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] rc stream
Well, if you got all your learning out of the way in the first email, I'm really confused as to what you thought a backhanded "none of you are helping" would do 3 hours later. You asked an honest question, you got a very reasonable and perfectly friendly reply, and then decided, I guess, that the thread really wouldn't be complete without denigrating the people who were trying to help. I'd like to expect more from list subscribers than that. On 7 April 2015 at 16:50, Ed Summers e...@pobox.com wrote: Ok, my apologies if this is coming out garbled. Here’s a list of things I think I’ve learned as part of this discussion: 1) currently there is no plan to do away with the IRC stream //Ed On Apr 7, 2015, at 4:45 PM, Aaron Halfaker ahalfa...@wikimedia.org wrote: Oh! Well if you understood Yuvi right away, it seems that you *did* get a clear answer out of us all. On Tue, Apr 7, 2015 at 3:42 PM, Ed Summers e...@pobox.com wrote: On Apr 7, 2015, at 4:07 PM, Aaron Halfaker ahalfa...@wikimedia.org wrote: Really, RCStream is what the IRC feed ought to have been -- and probably would have been if those standards were available at the time of its construction. RCStream solves the same problem better. Actually, I understood the first time and I agree with this assessment. I still don’t find it to be a compelling reason to go and do the work. But it probably would’ve taken about as long as it has to try to get a clear answer out of you all :-) //Ed ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] rc stream
Thanks; that's most appreciated. For what it's worth, I'll try not to instinctively go at anyone who's mean to Yuvi.[0] [0] Only I get to be mean to Yuvi. Me and whoever was cruel enough to make him run Labs ;p On 7 April 2015 at 16:58, Ed Summers e...@pobox.com wrote: On Apr 7, 2015, at 4:54 PM, Oliver Keyes oke...@wikimedia.org wrote: Well, if you got all your learning out of the way in the first email, I'm really confused as to what you thought a backhanded none of you are helping would do 3 hours later. You asked an honest question, you got a very reasonable and perfectly friendly reply, and then decided, I guess, that the thread really wouldn't be complete without denigrating the people who were trying to help. I'd like to expect more from list subscribers than that. I apologize if what I said was denigrating. I can see how it could’ve been seen that way, and I regret it. I sincerely appreciate the help that has been offered. And you’re right to expect more of list subscribers. I’ll do better. //Ed ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Wikipedia search logs needed
+1. Valerio, I assume you're a researcher familiar with anonymisation; you should cast your eye over the AOL search log debacle. The only way to completely sanitise the logs is to remove all the query strings. Simon, it sounds like performing this kind of sanitisation would undermine the work you're doing, which is unfortunate :(. However, if you can make a strong pitch for this being of value to the Wikimedia communit(y|ies) I would encourage you to make that pitch; maybe we can look into an NDA! You might want to check out https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_edits for an example of an anonymisation proposal Shilad submitted to accompany a request for dataset releases (mind you, I generally think everyone should read everything Shilad writes, but that's beside the point ;p) On 3 April 2015 at 11:03, Aaron Halfaker aaron.halfa...@gmail.com wrote: It turns out that anonymization is hard (see [1,2,3]). A quick web search would have made that clear. We do sometimes provide researchers with NDAs for the purposes of anonymizing data. Again, we have limited time and energy, so such NDAs have been (1) limited to work that is immediately relevant to our own and (2) exist for the purpose of anonymizing and making the data public -- so that everyone can benefit. For example, see a project aimed at releasing anonymized view logs[4]. That proposal has been in process for more than a year though because legal agreements with national research labs are Hard. It seems like search logs are a candidate for that process, but we'd need to see an anonymization proposal before moving forward. 1. https://en.wikipedia.org/wiki/K-anonymity 2. https://en.wikipedia.org/wiki/L-diversity 3. https://en.wikipedia.org/wiki/T-closeness 4.
https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pageviews -Aaron On Fri, Apr 3, 2015 at 9:47 AM, Valerio Schiavoni valerio.schiav...@gmail.com wrote: Those logs could have been cleaned up further and re-released, especially since the privacy issues had an impact only on a small percentage of queries. Frankly, it's a pity that after the initial announcement they had to quickly retract. Nonetheless Wikimedia could release them for research purposes, asking interested users to sign an NDA or such. I would be very surprised to discover that in 2015 there are no means to properly anonymize datasets and release them to the public. best, Valerio On Fri, Apr 3, 2015 at 4:38 PM, Aaron Halfaker aaron.halfa...@gmail.com wrote: Maybe someone managed to grab those logs before they took them offline. If they did, I hope they won't share. They were taken offline due to privacy issues. From the blog post: We’ve temporarily taken down this data to make additional improvements to the anonymization protocol related to the search queries. -Aaron On Fri, Apr 3, 2015 at 9:32 AM, Valerio Schiavoni valerio.schiav...@gmail.com wrote: There has been at least one attempt to release such data: http://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia-search-data-now-available/ Maybe someone managed to grab those logs before they took them offline. Similar but older logs are available here: http://www.wikibench.eu/ best, valerio On Fri, Apr 3, 2015 at 4:09 PM, Pine W wiki.p...@gmail.com wrote: Hi Oliver, Do we even record search logs? It might be a good idea if we didn't. Pine On Apr 3, 2015 6:16 AM, Simon Givoli givo...@gmail.com wrote: Thanks Oliver, Sorry if I wasn't clear enough. My dissertation will involve consented participants. Their search logs will be recorded while searching Wikipedia. The search logs will then be analyzed in order to find recurrent search patterns across participants.
Before beginning the experiment, I want to check that I can indeed find patterns in search logs, using several different algorithms. The idea is to check these algorithms on Wikipedia search logs already available. Hence my request. Simon Message: 5 Date: Fri, 3 Apr 2015 09:37:33 +0300 From: Simon Givoli givo...@gmail.com To: wiki-research-l@lists.wikimedia.org Subject: [Wiki-research-l] Wikipedia search logs needed Hi, I'm looking for a dump or db of Wikipedia users' search logs. I would like it to be with recent data, but it doesn't have to be extensive; even a small sample size would be sufficient. I aim to use this db to test a new research tool I'm developing for my dissertation. Can anyone point me to a relevant source? Thanks, Simon -- Message: 6 Date: Fri, 3 Apr 2015 02:47:54 -0400 From: Oliver Keyes
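For readers unfamiliar with the k-anonymity property Aaron links above, a toy check illustrates the idea. This is a sketch of the definition only, not an anonymisation pipeline, and the field names are invented.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at
    least k records -- the core of the k-anonymity property. A toy check
    for intuition; real anonymisation also needs generalisation and
    suppression steps, plus properties like l-diversity."""
    combos = Counter(
        tuple(record[qi] for qi in quasi_identifiers) for record in records
    )
    return all(count >= k for count in combos.values())
```

A single record with a unique (country, language) pair is enough to break 2-anonymity, which is exactly why raw query strings are so hard to release safely.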
Re: [Wiki-research-l] Anyone have access to this article?
I have to say that a WMF staffer using their official WMF account to imply they're doing legitimate work to study, understand and improve community dynamics is not a good look Cheers, Oliver On 1 April 2015 at 18:02, Jonathan Morgan jmor...@wikimedia.org wrote: Well, I was going to print out a bunch of copies and then sell them down on the corner, but I guess now I'll just use it to inform the development of a coding scheme for rating civility in Wikipedia talkpage comments. - J On Wed, Apr 1, 2015 at 3:00 PM, Stuart A. Yeates syea...@gmail.com wrote: I think you mean might have been permissible, if the original request had included the intended use. cheers stuart -- ...let us be heard from red core to black sky On Thu, Apr 2, 2015 at 10:57 AM, Nicole Askin nask...@alumni.uwo.ca wrote: Stuart, this is permissible per Wiley's terms of use - Authorized Users may also transmit such material to a third-party colleague in hard copy or electronically for personal use or scholarly, educational, or scientific research or professional use. Nicole On Wed, Apr 1, 2015 at 2:52 PM, Stuart A. Yeates syea...@gmail.com wrote: I have to say that a WMF staffer using their official WMF account to ask community members to commit copyright infringement is not a good look. cheers stuart -- ...let us be heard from red core to black sky On Thu, Apr 2, 2015 at 10:48 AM, Jonathan Morgan jmor...@wikimedia.org wrote: http://onlinelibrary.wiley.com/doi/10./jcom.12123/abstract What Creates Interactivity in Online News Discussions? An Exploratory Analysis of Discussion Factors in User Comments on News Items If you have access, and can send me a PDF offline, I would be very grateful :) Cheers, Jonathan -- Jonathan T. 
Morgan Community Research Lead Wikimedia Foundation User:Jmorgan (WMF) jmor...@wikimedia.org -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] [Technical][Request for Comment] A new format for the pageview dumps
Thanks all for the awesome comments :). Will get to it tomorrow morning![1] [1] East coast time. On 19 March 2015 at 20:37, aaron shaw aarons...@northwestern.edu wrote: Adding to Giovanni's points (all of which I agree with 100%): - This would be awesome! The pageviews are super useful for many of us and cleaning them up a bit would save a lot of redundant work for many of us down the road. - If you don't have to collapse page views incoming from mobile and zero, I would recommend keeping them separate. That said, I haven't spent any time looking into it, and so I confess complete ignorance on this front. - I agree with you that page ids are better than titles. Great idea. - I don't think the byte information is/was useful in this dataset, so I agree with dumping that. - Backfill would be totally great. Happy to chat more if it seems helpful... a On Thu, Mar 19, 2015 at 7:13 PM, Giovanni Luca Ciampaglia gciam...@indiana.edu wrote: Hi Oliver, Tab-separation would be welcomed. Title normalisation would be *very* useful too. Another thing that could potentially save a lot of space would be to throw out all malformed requests, pieces of javascript, and similar junk. Not sure how difficult that would be though, without doing an actual query on the DB for the page id.
For example, an excerpt from 20140101-00.gz (with only the title and views fields): 'اÙ�ØاÙ�Â_Ù�شباب'_Â_Ù�Ù�اطعÂ_Ù�ضØÙ�ة 1 '/javascript:document.location.href='/'_encodeURIComponent(document.getElementById('txt_input_text').value) 9 '03_Bonnie__Clyde 18 A_Night_at_the_Opera_(Queen_album) 57 '40s_on_4 2 '50s_on_5 1 '71_(film) 4 '74_Jailbreak 3 '77 1 '79-00_é�å�¶å�©åºÃ¯Â¿Â½å_±é��vol.8_ACå�¬å�±åºÃ¯Â¿Â½å��æ©Ã¯Â¿Â½æ§Ã¯Â¿Â½ 1 Cheers, G Giovanni Luca Ciampaglia ✎ 919 E 10th ∙ Bloomington 47408 IN ∙ USA ☞ http://www.glciampaglia.com/ ✆ +1 812 855-7261 ✉ gciam...@indiana.edu 2015-03-13 12:06 GMT-07:00 Oliver Keyes oke...@wikimedia.org: So, we've got a new pageviews definition; it's nicely integrated and spitting out TRUE/FALSE values on each row with the best of em. But what does that mean for third-party researchers? Well...not much, at the moment, because the data isn't being released somewhere. But one resource we do have that third-parties use a heck of a lot, is the per-page pageviews dumps on dumps.wikimedia.org. Due to historical size constrains and decision-making (and by historical I mean: last decade) these have a number of weirdnesses in formatting terms; project identification is done using a notation style not really used anywhere else, mobile/zero/desktop appear on different lines, and the files are space-separated. I'd like to put some volunteer time into spitting out dumps in an easier-to-work-with format, using the new definition, to run in /parallel/ with the existing logs. *The new format* At the moment we have the format: project_notation - encoded_title - pageviews - bytes This puts zero and mobile requests to pageX in a different place to desktop requests, requires some reconstruction of project_notation, and contains (for some use cases) extraneous information - that being the byte-count. The files are also headerless, unquoted and space-separated, which saves space but is sometimes...I think the term is h-inducing. 
What I'd like to use as a new format is: full_project_url - encoded_title - desktop_pageviews - mobile_and_zero_pageviews This file would: 1. Include a header row; 2. Be formatted as a tab-separated, rather than space-separated, file; 3. Exclude bytecounts; 4. Include desktop and mobile pageview counts on the same line; 5. Use the full project URL (en.wikivoyage.org) instead of the pagecounts-specific notation (en.v) So, as a made-up example, instead of: de.m.v Florence 32 9024 de.v Florence 920 7570 we'd end up with: de.wikivoyage.org Florence 920 32 In the future we could also work to /normalise/ the title - replacing it with the page title that refers to the actual pageID. This won't impact legacy files, and is currently blocked on the Apps team, but should be viable as soon as that blocker goes away. I've written a script capable of parsing and reformatting the legacy files, so we should be able to backfill in this new format too, if that's wanted (see below). *The size constraints* There really aren't any. Like I said, the historical rationale for a lot of these decisions seems to have been keeping the files small. But by putting requests to the same title from different site versions on the same line, and dropping byte-count, we save enough space that the resulting files are approximately the same size as the old ones - or in many cases, actually smaller. *What I'm asking for* Feedback! What do people think of the new format? What would they like to see that they don't? What don't they need, here? How
Re: [Wiki-research-l] (no subject)
Awesome work! It's interesting to see Finnish as the outlier here. Do we have any fi-users on the list who can comment on this and might know what's going on? (And, in the absence of Finns: Jan, heard anything from across the border? :p) The only caution I'd raise is that these numbers don't include spider filtering. Why is this important? Well, a lot of traffic is driven by crawlers and spiders and automata, particularly on smaller projects, and it can lead to weirdness as a result. With the granular pagecount files there's some work that can be done to detect this (for example, using burst detection and a few heuristics around concentration measures to eliminate pages that are clearly driven by automated traffic - see the recent analytics mailing list thread) but only some. I appreciate this is a flaw in the data we are releasing, not in your work, which is an excellent read and highly interesting :). I agree that understanding the lack of development in the PRC and ROK is crucial - we keep talking about the next billion readers but only talking :( On 16 March 2015 at 02:21, h hant...@gmail.com wrote: Dear all, I have some findings to show that the page views per Internet user measurement may help in comparing different language editions of Wikipedia. Criticism and suggestions are welcome. - http://people.oii.ox.ac.uk/hanteng/2015/03/15/comparing-language-development-in-wikipedia-in-terms-of-page-views-per-internet-users/ Which language version of Wikipedia enjoys more page views per language Internet user than expected? It is Finnish. In terms of absolute positive and negative gap, English has the widest positive gap whereas Chinese has the largest negative gap. .. In particular, it is known that Wikipedia (and Google, which often favours Wikipedia) faces local competition in the People's Republic of China and South Korea.
Therefore it is understandable that the page views may be lower in Chinese and Korean Wikipedia language projects, simply because some users' need to read user-generated encyclopedias is satisfied by other websites. However, it remains an important question to examine why these particular Latin and Asian languages are under-developed for Wikipedia projects. -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
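The page-views-per-Internet-user comparison hanteng describes comes down to a ratio and its gap from the cross-language average. A few lines show the arithmetic; the figures used below are invented, purely for illustration.

```python
def pageview_gap(pageviews, internet_users):
    """For each language, pageviews per Internet user, plus that rate's
    gap from the all-language average rate. Positive gaps mean more
    views per user than expected; negative gaps mean fewer."""
    average = sum(pageviews.values()) / sum(internet_users.values())
    return {
        lang: (views / internet_users[lang],
               views / internet_users[lang] - average)
        for lang, views in pageviews.items()
    }
```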
[Wiki-research-l] [Technical][Request for Comment] A new format for the pageview dumps
So, we've got a new pageviews definition; it's nicely integrated and spitting out TRUE/FALSE values on each row with the best of 'em. But what does that mean for third-party researchers? Well...not much, at the moment, because the data isn't being released anywhere. But one resource we do have that third parties use a heck of a lot is the per-page pageview dumps on dumps.wikimedia.org. Due to historical size constraints and decision-making (and by historical I mean: last decade) these have a number of weirdnesses in formatting terms: project identification is done using a notation style not really used anywhere else, mobile/zero/desktop appear on different lines, and the files are space-separated. I'd like to put some volunteer time into spitting out dumps in an easier-to-work-with format, using the new definition, to run in /parallel/ with the existing logs.

*The new format*

At the moment we have the format: project_notation - encoded_title - pageviews - bytes. This puts zero and mobile requests to pageX in a different place to desktop requests, requires some reconstruction of project_notation, and contains (for some use cases) extraneous information - that being the byte count. The files are also headerless, unquoted and space-separated, which saves space but is sometimes...I think the term is headache-inducing. What I'd like to use as a new format is: full_project_url - encoded_title - desktop_pageviews - mobile_and_zero_pageviews. This file would: 1. Include a header row; 2. Be formatted as a tab-separated, rather than space-separated, file; 3. Exclude byte counts; 4. Include desktop and mobile pageview counts on the same line; 5.
Use the full project URL (en.wikivoyage.org) instead of the pagecounts-specific notation (en.v). So, as a made-up example, instead of:

de.m.v Florence 32 9024
de.v Florence 920 7570

we'd end up with:

de.wikivoyage.org Florence 920 32

In the future we could also work to /normalise/ the title - replacing it with the page title that refers to the actual pageID. This won't impact legacy files, and is currently blocked on the Apps team, but should be viable as soon as that blocker goes away. I've written a script capable of parsing and reformatting the legacy files, so we should be able to backfill in this new format too, if that's wanted (see below).

*The size constraints*

There really aren't any. Like I said, the historical rationale for a lot of these decisions seems to have been keeping the files small. But by putting requests to the same title from different site versions on the same line, and dropping the byte count, we save enough space that the resulting files are approximately the same size as the old ones - or in many cases, actually smaller.

*What I'm asking for*

Feedback! What do people think of the new format? What would they like to see that they don't? What don't they need here? How useful would normalisation be? How useful would backfilling be?

*What I'm not asking for*

WMF time! Like I said, this is a spare-time project; I've also got volunteers for code review and checking (Yuvi and Otto). The replacement of the old files! Too many people depend on that format and that definition, and I don't want to make them sad.

Thoughts? -- Oliver Keyes Research Analyst Wikimedia Foundation
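A sketch of what such a reformatting pass might look like, reimplemented from the format description above (this is not Oliver's actual script, and the notation map covers only the suffixes used in the example):

```python
# Illustrative converter from the legacy pagecounts format
# ("notation title views bytes", space-separated) to the proposed
# "url<TAB>title<TAB>desktop<TAB>mobile_and_zero" format. The real
# pagecounts notation has more project suffixes than mapped here.
from collections import defaultdict

SUFFIX_TO_SITE = {"v": "wikivoyage.org", "": "wikipedia.org"}

def parse_notation(notation):
    """Split legacy notation (e.g. 'de.m.v') into (full URL, is_mobile)."""
    parts = notation.split(".")
    mobile = "m" in parts[1:] or "zero" in parts[1:]
    suffix = next((p for p in parts[1:] if p not in ("m", "zero")), "")
    return "%s.%s" % (parts[0], SUFFIX_TO_SITE[suffix]), mobile

def reformat(legacy_lines):
    """Merge desktop and mobile/zero rows for the same title into one row."""
    counts = defaultdict(lambda: [0, 0])  # (url, title) -> [desktop, mobile]
    for line in legacy_lines:
        notation, title, views, _bytes = line.split(" ")
        url, mobile = parse_notation(notation)
        counts[(url, title)][1 if mobile else 0] += int(views)
    out = ["full_project_url\tencoded_title\tdesktop_pageviews\tmobile_and_zero_pageviews"]
    for (url, title), (desktop, mobile) in sorted(counts.items()):
        out.append("%s\t%s\t%d\t%d" % (url, title, desktop, mobile))
    return out

# The two example rows collapse into one tab-separated line:
# de.wikivoyage.org<TAB>Florence<TAB>920<TAB>32
lines = reformat(["de.m.v Florence 32 9024", "de.v Florence 920 7570"])
```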
Re: [Wiki-research-l] [Release]
That is the question, and I agree with your conclusion. I'm hoping to do more research into this; getting buy-in internally has been tough, but I'm confident of making progress on that front over the next few weeks and months. On 4 March 2015 at 04:13, Cristian Consonni kikkocrist...@gmail.com wrote: 2015-03-04 8:44 GMT+01:00 Dario Taraborelli dtarabore...@wikimedia.org: yay, shiny! The map is a pretty compelling way to show how dominant traffic from the US is, even for very minor languages (say bi.wikipedia.org), I wonder how many requests from US-based bots/automata we’re still failing to detect. Still, the question could be: are we fulfilling the mission? (hint: probably not) Cristian -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] [Release]
On 4 March 2015 at 04:28, Pine W wiki.p...@gmail.com wrote: I'm not sure how much influence I have, but I would be happy to make whispers in appropriate places to try to get more support, if that's helpful. I think I'm probably good, but thank you. Perhaps you could show your work at the next Research and Data showcase? I for one would be interested in seeing a presentation. That's in 3 weeks; I'm not convinced that a piece of substantive, useful research about global reach could be done in that time period even if I could drop everything I currently have (which I can't). This problem is too big and too important to be scheduled around meetings; things should work the other way around. Scott Hale and I have been working on a paper looking at global reach and how it tracks with internet access growth, in the context of editing, particularly looking at the mobile web. That, we should be done with by then; presenting it could be highly useful (Scott? ;p) Pine This is an Encyclopedia One gateway to the wide garden of knowledge, where lies The deep rock of our past, in which we must delve The well of our future, The clear water we must leave untainted for those who come after us, The fertile earth, in which truth may grow in bright places, tended by many hands, And the broad fall of sunshine, warming our first steps toward knowing how much we do not know. —Catherine Munro -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] [Release]
'Lots, but that's not currently anyone's job' On Wednesday, 4 March 2015, Dario Taraborelli dtarabore...@wikimedia.org wrote: yay, shiny! The map is a pretty compelling way to show how dominant traffic from the US is, even for very minor languages (say bi.wikipedia.org), I wonder how many requests from US-based bots/automata we’re still failing to detect.
Re: [Wiki-research-l] [Release]
Update: the original Shiny instance went down due to server load soon after release. It's now up again at http://datavis.wmflabs.org/where/ on a dedicated Labs machine, where we hope to put...many more visualisations. It also now has mapping, largely thanks to Sarah Laplante (http://sarahlaplante.com/), and soon it will hopefully be /non-hideous/ mapping (the current mass of blue and grey is because my aesthetic tastes are...I don't actually have any aesthetic tastes) On 2 March 2015 at 22:36, Oliver Keyes oke...@wikimedia.org wrote: Indeed! Orienting it that way (pivoting on language rather than project) is something several people have asked for; I plan to spend a chunk of my spare time (that is, recreational time) trying to make it work. Should be fairly trivial. -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] [Release]
Indeed! Orienting it that way (pivoting on language rather than project) is something several people have asked for; I plan to spend a chunk of my spare time (that is, recreational time) trying to make it work. Should be fairly trivial. On 2 March 2015 at 09:55, h hant...@gmail.com wrote: Hello Finn, I do not have a specific answer to your question. However, it might be worthwhile to add Finnish into the comparison, as according to the CLDR 26 territory-language information http://www.unicode.org/cldr/charts/26/supplemental/territory_language_information.html there is a sizable number of Finnish speakers in Sweden: Swedish {O} sv 95.0% 99.0%; Finnish {OR} fi 2.2%. So if a similar query is executed on Finnish, and the results also show some undue proportion of visits from Sweden, then what you observed as an anomaly is not that unique. We probably need many iterations of comparative outcomes and normalization of data (Sweden does have a higher population). Also, it might be handy to have some statistics on immigration or residence, since this is the EU. I would not be surprised if, for example, the visits to the Wikipedia website from Oxford included sizable German-language requests. I am still a bit bothered by the number 1 in the current dataset. It does not feel right, since a difference between 1.4% and 0.6% is notable in this regard. Perhaps we need some high-precision universal percentage number for each territory-language pair. It would also be great to do another set of aggregation: i.e. given a territory, which language versions of Wikipedia are accessed. Best, han-teng liao 2015-03-02 13:54 GMT+01:00 Finn Årup Nielsen f...@imm.dtu.dk: Hi Oliver, Interesting dataset! I am curious about why the Danish Wikipedia is so highly accessed from Sweden. Could it be an error, e.g., with Telia IP-numbers?
In Python:

import pandas as pd
df = pd.read_csv('http://files.figshare.com/1923822/language_pageviews_per_country.tsv', sep='\t')
df.ix[df.project == 'da.wikipedia.org', ['country', 'pageviews_percentage']].set_index('country')

                pageviews_percentage
country
Austria         1
China           1
Denmark         61
Estonia         1
France          1
Germany         2
Netherlands     2
Norway          1
Sweden          18
United Kingdom  3
United States   3
Other           5

MaxMind has some numbers on their own accuracy: https://www.maxmind.com/en/geoip2-city-database-accuracy For Denmark 85% is Correctly Resolved, for Sweden only 68%. I wonder if this really could bias the result so much. If the numbers are correct, why would the Swedes read the Danish Wikipedia so much? Bots? It does not apply the other way around: only 2% of the traffic to Swedish Wikipedia comes from Denmark. best regards Finn On 02/25/2015 10:06 PM, Oliver Keyes wrote: Hey all! We've released a highly-aggregated dataset of readership data - specifically, data about where, geographically, traffic to each of our projects (and all of our projects) comes from. The data can be found at http://dx.doi.org/10.6084/m9.figshare.1317408 - additionally, I've put together an exploration tool for it at https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/ Hope it's useful to people! -- Finn Årup Nielsen http://people.compute.dtu.dk/faan/ -- Oliver Keyes Research Analyst Wikimedia Foundation
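The inverse aggregation han-teng asks for (given a territory, which language versions are accessed) can be sketched against the same dataset. The toy frame below mirrors the columns used in Finn's snippet (project, country, pageviews_percentage) with only the percentages quoted in this thread; it is an illustration, not the full released data:

```python
import pandas as pd

# Toy frame with the columns of the released TSV, using only numbers
# quoted in the thread: 61% of da.wp traffic from Denmark, 18% from
# Sweden, and 2% of sv.wp traffic from Denmark.
df = pd.DataFrame({
    "project": ["da.wikipedia.org", "da.wikipedia.org", "sv.wikipedia.org"],
    "country": ["Denmark", "Sweden", "Denmark"],
    "pageviews_percentage": [61, 18, 2],
})

def languages_for_country(df, country):
    """Given a territory, rank the language versions its traffic goes to."""
    subset = df[df["country"] == country]
    return (subset.set_index("project")["pageviews_percentage"]
                  .sort_values(ascending=False))

sweden = languages_for_country(df, "Sweden")
```

Note that the released percentages are computed per project, not per country, so a full per-territory view would need the underlying counts rather than these row percentages.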
[Wiki-research-l] [Release]
Hey all! We've released a highly-aggregated dataset of readership data - specifically, data about where, geographically, traffic to each of our projects (and all of our projects) comes from. The data can be found at http://dx.doi.org/10.6084/m9.figshare.1317408 - additionally, I've put together an exploration tool for it at https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/ Hope it's useful to people! -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] [Release]
The one major caveat, I think, is that the danger of proportionate data is that it makes small projects very vulnerable to artificial traffic spikes. I'd go out on a limb and say that some of the massive bumps in popularity we see in particular combinations are likely due to either undetected automata or simply a project having so little traffic that a small number of people can sway the results outlandishly. On 25 February 2015 at 16:32, Andrew Lih andrew@gmail.com wrote: Great job. Who knew Esperanto was big in Japan and China at #2 and #3? -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] [Analytics] [Release]
Totally! I'm also going to get together with some NEU hackers tomorrow and work on actually visualising the data on *drumroll* maps, which'd probably be more interesting eye candy than infinite bar plots :) On 25 February 2015 at 16:19, Pine W wiki.p...@gmail.com wrote: Very nice. Do you think that you could pick out a few of your favorite graphs and add them to this week's Recent Research report in a gallery? Thanks! Pine -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] [Analytics] [Release]
Yours is looking at just December, while mine is looking at the entire year, for starters. Also, what's the apps/mobile web inclusion for that report? On 25 February 2015 at 17:34, Erik Zachte ezac...@wikimedia.org wrote: I am surprised that the new data, with crawlers excluded, show more wp:en traffic from US (43%) than the old data (36.4% for 2014), which contained much crawler traffic, presumably most of that from US. Compare https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/ and http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerLanguageBreakdown.htm Any thoughts? Erik -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] Scholarly citations by DOI in Wikipedia
Sweet! Can I ask that we make the 2% explicitly available to wiki gnomes? :) On Monday, 9 February 2015, Aaron Halfaker ahalfa...@wikimedia.org wrote: Hey folks, Dario and I just updated the scholarly citations dataset to include Digital Object Identifiers. We found 742k citations (524k unique DOIs) in 172k articles. Our spot checking suggests that 98% of these DOIs resolve. The remaining 2% were extracted correctly, but they appear to be typos. http://dx.doi.org/10.6084/m9.figshare.1299540 Like the dataset that we released for PubMed Identifiers, this dataset includes the first known occurrence of a DOI citation in an English Wikipedia article and the associated revision metadata, based on the most recent complete content dump of English Wikipedia. Feel free to share this with anyone interested via: https://twitter.com/WikiResearch/status/564908585008627712 We'll be organizing our own work and analysis of these citations here: https://meta.wikimedia.org/wiki/Research:Scholarly_article_citations_in_Wikipedia -Aaron -- Sent from my mobile computing device of Lovecraftian complexity and horror.
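The kind of extraction Aaron describes (finding DOI citations in article wikitext) can be sketched with a simple regex; this pattern is an illustration of the general approach, not the dataset's actual extractor:

```python
import re

# Illustrative DOI pattern: a "10." prefix, a 4-9 digit registrant code,
# a slash, then a suffix that stops at whitespace or wiki/HTML markup.
# Trailing sentence punctuation is trimmed afterwards.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s|}<>"]+')

def extract_dois(wikitext):
    """Return the unique DOIs found in a chunk of wikitext, in order."""
    found = []
    for match in DOI_RE.findall(wikitext):
        doi = match.rstrip(".,;")
        if doi not in found:
            found.append(doi)
    return found

sample = "{{cite journal | doi = 10.6084/m9.figshare.1299540 }}"
dois = extract_dois(sample)  # -> ["10.6084/m9.figshare.1299540"]
```

A real pipeline would also need to resolve each candidate (e.g. against doi.org) to separate valid identifiers from the ~2% of typos Aaron mentions.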
Re: [Wiki-research-l] Altmetric.com now tracks Wikipedia citations
We should be! We can sync templatelinks or externallinks entry timestamps with revision table timestamps. Sounds like a fun project! On 6 February 2015 at 16:15, Kerry Raymond kerry.raym...@gmail.com wrote: I agree it’s a good thing overall. I’m just alerting us to the potential problem it might create. I note it might not just be the academics themselves. In Australia at least, institutional research rankings are heavily based on citation counts. Our “Excellence in Research Assessment” (ERA) process creates massive institutional pressure to track down every possible citation, much of which is done by the library and admin teams. And with Wikipedia, it’s so easy to create a new citation … it’s hard to believe some people won’t be tempted … Are we able to extract the user names associated with adding links to academic papers? Some time downstream analysis of that data might be interesting, especially if there do appear to be clusters of cited papers with common author names added by the same user name or IP address. There’s almost certainly a publication in that! J But as I say, so long as the papers are actually relevant where they are cited in Wikipedia, this is not a bad thing for Wikipedia if academics do decide to promote their work that way. Kerry From: wiki-research-l-boun...@lists.wikimedia.org [mailto:wiki-research-l-boun...@lists.wikimedia.org] On Behalf Of Aaron Halfaker Sent: Saturday, 7 February 2015 1:25 AM To: Research into Wikimedia content and communities Subject: Re: [Wiki-research-l] Altmetric.com now tracks Wikipedia citations I agree that we could be doing something interesting with the social dynamics of Wikipedia editing by releasing this dataset -- and that some new problems may result. However, I think that it's much better to have too much academic interest than not enough. With a little AGF and diligence, we ought to be able to deal with this problem like we've dealt with quality control concerns in the past.
Academics have to be very careful about their reputation, and it's hard to cite your own work unnecessarily without giving up who you are, since your name's going to be on the paper. Either way, this is a useful dataset for library sciences work and it's public anyway. We're just making it easier to work with. Honestly, that's how I got started working in this space -- helping someone get data for their own research. -Aaron On Fri, Feb 6, 2015 at 5:50 AM, mjn m...@anadrome.org wrote: I agree it's not a new worry, but it might change the nature of the problem a bit, and is worth at least being vigilant about. I did have a similar idea some years ago, to compute an impact factor for being-cited-on-Wikipedia, but after discussing it with some colleagues, didn't do so specifically because of the worry that it would encourage more gaming of Wikipedia citations. Of course it's inevitable that someone would eventually do it, but I still think it was probably right on balance to not push that date forward. Regarding the SEO analogy, the external links on Wikipedia are on average not the best part of Wikipedia, so it's not a very heartening comparison. The citations for now are not nearly as spammy as the external links are, and I hope it stays that way! It's of course not new that there is an incentive to spam citations. Even without explicit Wikipedia-citation-tracking, there are incentives to spam marginally relevant citations in order to increase perceived prominence. Maybe being in a Wikipedia article will get your paper in front of more grad students who will end up citing it for real after encountering it on Wikipedia, etc. A direct citation count feels like it's likely to exacerbate that, since now removing an irrelevant citation to someone's article is a direct attack on their metrics! Though it's possible the actual effect on editing patterns will be small.
From a research perspective, the new datasets of citations might be interesting to track over time, and correlate back to editors, to see if there are any interesting patterns. -Mark -- mjn | http://www.anadrome.org
Re: [Wiki-research-l] Altmetric.com now tracks Wikipedia citations
It also requires SEO people to demonstrate a modicum of logical reasoning skills. Sadly, from my work on understanding our traffic trends, this appears to be beyond at least some of them. On Friday, 6 February 2015, Laura Hale la...@fanhistory.com wrote: That's actually a Wikipedia thing: external links are emitted with rel="nofollow" class="external text" in the article source code, while internal links in contrast carry class="internal". That's not a Google goodwill thing. Sincerely, Laura Hale On Fri, Feb 6, 2015 at 9:44 PM, Kerry Raymond kerry.raym...@gmail.com wrote: I thought that Wikipedia addressed the SEO problem by getting Google to not follow the off-wiki links when crawling, so that Wikipedia's PageRank would not flow through to off-Wikipedia links. But I cannot (using Google) find the page where I read that. While that doesn't prevent people from spamming Wikipedia with external links to catch people's eyeballs while reading Wikipedia, it should address the SEO problem somewhat. Kerry ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- twitter: purplepopple -- Sent from my mobile computing device of Lovecraftian complexity and horror. ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
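Laura's point about the markup can be checked mechanically. A minimal illustration (the sample HTML below is invented, only shaped like the two link styles she describes — it is not MediaWiki's exact output):

```python
from html.parser import HTMLParser

# Simplified versions of the two link markups described above: external
# links carry rel="nofollow" so crawlers do not pass PageRank through
# them; internal links carry no such hint.
SAMPLE = (
    '<a rel="nofollow" class="external text" href="https://example.org">ext</a>'
    '<a href="/wiki/Example" class="internal">int</a>'
)

class LinkAudit(HTMLParser):
    """Collect (href, has-nofollow) pairs from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            d = dict(attrs)
            self.links.append((d.get("href"), "nofollow" in d.get("rel", "")))

parser = LinkAudit()
parser.feed(SAMPLE)
for href, nofollow in parser.links:
    print(href, nofollow)
```

Run against real article HTML, a scan like this would show every off-wiki link flagged nofollow, which is the mechanism that blunts the SEO incentive Kerry asks about.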
Re: [Wiki-research-l] Altmetric.com now tracks Wikipedia citations
And SEO spammers will add themselves, too! This is not a new problem. On Thursday, 5 February 2015, Kerry Raymond kerry.raym...@gmail.com wrote: Do I understand this correctly? That Wikipedia articles that cite academic publications will be included in citation counts now (at least for altmetrics). While that’s great recognition for Wikipedia as a corpus of scholarly work, does that mean Wikipedia will be overrun with academic authors adding citations to their academic papers in any Wikipedia article they can get away with in order to improve their citation counts for their CVs? I note that generally we can spot self-citation because the two papers will have an author name in common, but the ability to edit Wikipedia anonymously and pseudonymously means that we cannot spot self-citation. While judging research purely on citation counts is a deeply flawed method of assessment, nonetheless it is a reality, and the pressure on folks to “game” the system is tremendous given the role it can play in appointment, tenure, promotion and grant applications. On the positive side, we might be able to get rid of a lot of citation-needed tags.
Kerry -- *From:* wiki-research-l-boun...@lists.wikimedia.org [mailto:wiki-research-l-boun...@lists.wikimedia.org] *On Behalf Of *Pine W *Sent:* Friday, 6 February 2015 8:13 AM *To:* Wiki Research-l; Raymond Leonard; Wikimedia GLAM collaboration [Public]; North American Cultural Partnerships *Subject:* [Wiki-research-l] Altmetric.com now tracks Wikipedia citations FYI: http://www.altmetric.com/blog/new-source-alert-wikipedia/ Pine This is an Encyclopedia https://www.wikipedia.org/ * One gateway to the wide garden of knowledge, where lies The deep rock of our past, in which we must delve The well of our future, The clear water we must leave untainted for those who come after us, The fertile earth, in which truth may grow in bright places, tended by many hands, And the broad fall of sunshine, warming our first steps toward knowing how much we do not know. —Catherine Munro * -- Sent from my mobile computing device of Lovecraftian complexity and horror. ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] R library for URL handling
Also, version 1.2.0 of the R MW API client library - https://github.com/Ironholds/WikipediR (what can I say, being semi-bedridden makes me a productive little gnome) On 21 January 2015 at 18:47, Oliver Keyes oke...@wikimedia.org wrote: Possibly of interest to any researchers who work with our pageview/requests data: I've just released v1.0.0 of urltools,[0] a library that provides very, very fast vectorised URL decoding and parsing. Might be useful for the useRs in our community! See the associated vignette for functionality.[1] [0] https://github.com/Ironholds/urltools [1] https://github.com/Ironholds/urltools/blob/master/vignettes/urltools.Rmd -- Oliver Keyes Research Analyst Wikimedia Foundation -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
[Wiki-research-l] R library for URL handling
Possibly of interest to any researchers who work with our pageview/requests data: I've just released v1.0.0 of urltools,[0] a library that provides very, very fast vectorised URL decoding and parsing. Might be useful for the useRs in our community! See the associated vignette for functionality.[1] [0] https://github.com/Ironholds/urltools [1] https://github.com/Ironholds/urltools/blob/master/vignettes/urltools.Rmd -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
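urltools itself is an R package, but for readers unfamiliar with the task it performs, here is a rough Python analogue of "URL decoding and parsing" using only the standard library (an illustration of the operation, not of urltools' actual API):

```python
from urllib.parse import unquote, urlsplit

def parse_request_url(url: str) -> dict:
    """Decode a percent-encoded URL and split it into components,
    the way one would when cleaning request-log data. Note: decoding
    before splitting is naive if delimiters themselves are encoded."""
    parts = urlsplit(unquote(url))
    return {
        "scheme": parts.scheme,
        "domain": parts.netloc,
        "path": parts.path,
        "query": parts.query,
    }

example = "https://en.wikipedia.org/wiki/Glass%20frog?action=info"
print(parse_request_url(example))
```

urltools does the same kind of thing vectorised over whole columns of URLs, which is what makes it fast enough for request-log-scale data.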
Re: [Wiki-research-l] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal
On Wed, Jan 14, 2015 at 3:39 AM, John Mark Vandenberg jay...@gmail.com wrote: On Wed, Jan 14, 2015 at 2:25 PM, Oliver Keyes ironho...@gmail.com wrote: I'm confused; John, could you point to the element of the collected data that isn't collected already by default in any Nginx or Apache setup? I agree that there might be a lack of user expectation, but 'silently capturing behavioral data' seems somewhat hyperbolic to describe what's actually going on. The proposed element to be added is geolocation below country level. Default Nginx and Apache log formats do not include geolocation. Which is why this research proposal exists and is being discussed, and rightly so. Gotcha: I thought you were referring to the information we already have. FWIW, the Nginx geoip module is not even included, by default, when compiling the source code. As the paper explicitly describes, and as is a common theme in research proposals, Wikimedia access log information is user reading behaviour being captured. The old privacy and data retention policies gave users the expectation that access log data was destroyed after a set period, assumed to be only three months as that was the limit of Checkuser visibility. The current policies are more like 'yes, we collect a lot of data about users, using tracking technology, and please trust us. And sorry we don't honour Do Not Track, as we presumed that you trust us and the researchers that we allow to access our analytics.' We should be planning for what the effect will be when the WMF servers are hacked and _all_ of the analytics data is suddenly in the hands of a repressive government or similar. Or, imagine the WMF sends the analytics data across an insecure link which is tapped and the data reconstructed, either due to not using secure links at all, or an accidental routing problem. https://lists.wikimedia.org/pipermail/wikimedia-l/2013-December/129357.html The geolocation proposal is to perform it over IP addresses...which are already stored.
So, the only major difference between hacking now and hacking later is that doing it later means you don't have to spend 99 bucks on a geolocation hashtable. If/when that day comes, hopefully they don't have much data to make inferences from, and what data they obtain can be well justified. Having a quick peek, I thought it was odd that browsing Wikimedia sites now causes impressions to be sent back to the WMF servers with the country of the user included. This is a workaround to simplify analytics. https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FCentralNotice/8ee8775a5df9f68857a337efadbb2b5d36811f1a/special%2FSpecialRecordImpression.php CentralNotice and the fundraising banners have done this for absolutely years, yes; that's the code you're looking at. The more you collect, especially using multiple systems to collect similar data, the more likely it is that, if subpoenaed, WMF's various datasets could be used to infer a pretty reliable answer to 'which days in 2013 was John Vandenberg in Indonesia?', or 'when did John Vandenberg first read the Wikipedia article about [bomb-making ingredient]?' The more you publish, even aggregated, the more likely these types of questions can be inferred without a subpoena, at least for users with large enough lists of public contributions, by scientists like yourself with lots of computation power and plenty of time on their hands rifling through the data to *infer* the identity of editors; and if it is a government body, they also have lots of other datasets which can be used to assist in the task. Yep, and that's why we're discussing this. Adding fine-grained geolocation information to published page views is an example of the latter, and the paper wisely suggests not including logged-in users as a possible solution to some of the privacy issues. There is also the problem that many IPs can be easily inferred to be a single cohort of people in some situations, e.g.
in regions where the only large collection of computers is a single facility, e.g. a school. In a repressive regime especially, that could lead to official questions being asked, like: why were so many students at this school reading about [blah] on [date]? And teachers being identified as responsible, etc. The paper considers IP users vs logged-in users to be a binary set. However, there are tools built which exploit the fact that logged-in users sometimes make a logged-out edit which identifies their IP. Add geolocation of pageviews and we can infer the probability that other IPs in their smallest geolocation block are also likely to be edits by the same person, as the algorithm in the paper leaks 'number of active editors in each region each day'. No, it doesn't: the proposal is to aggregate. Where there are few observations (or little variation in observations) within a geographic region, the data will be moved up one level and aggregated, and so on until a sufficient degree
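The roll-up Oliver describes, where sparse regions are merged into their parent until counts are safe to publish, can be sketched as follows. The region tree, the names, and the threshold `K` here are all hypothetical, not taken from the actual proposal:

```python
from collections import Counter

K = 10  # minimum views per published bucket (assumed threshold)

# child -> parent; the top level (country) has no parent. Hypothetical tree.
PARENT = {"Bandung": "West Java", "West Java": "Indonesia"}

def aggregate(views: dict, k: int = K) -> Counter:
    """Roll any region with fewer than k observations up into its
    parent, repeating until every published bucket meets the threshold."""
    counts = Counter(views)

    def depth(region):
        d = 0
        while region in PARENT:
            region, d = PARENT[region], d + 1
        return d

    # Visit deepest regions first, so children merge before parents are judged.
    for region in sorted(counts, key=depth, reverse=True):
        if counts[region] < k and region in PARENT:
            counts[PARENT[region]] += counts.pop(region)
    return counts

print(aggregate({"Bandung": 3, "West Java": 4, "Indonesia": 50}))
```

Here both sub-national buckets fall below the threshold, so everything is published at the country level only; a city with enough traffic would survive at its own granularity. This is only the aggregation step — the actual proposal layers further protections (e.g. excluding logged-in users) on top.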
Re: [Wiki-research-l] How many links did TWL account recipients add to Wikipedia with their access?
Actually, the API has a 'grab all of the external links from [page]' query: I'm not sure if it can be applied to historical revisions, but we can see! On Wed, Jan 14, 2015 at 2:23 AM, Gerard Meijssen gerard.meijs...@gmail.com wrote: Hoi, These same people may have added content to Wikidata ... Obviously it has not been considered. However, you can query for these people there. You can also query how many external references were added by bot. It may provide the groundwork for going to the Wikipedias and finding who did it .. references to external sources are added all the time. Thanks, GerardM On 14 January 2015 at 00:23, Aaron Halfaker ahalfa...@wikimedia.org wrote: I don't think that you can do this with Quarry, since you'll need to parse wiki content in order to extract external links. I don't think they are stored in a table anywhere. However, I think that we can do it fairly easily with a process on the XML dumps. If you can give me the list of editors and partner websites, I could put a script together and talk to you about the bits. -Aaron On Tue, Jan 13, 2015 at 5:01 PM, Jake Orlowitz jorlow...@gmail.com wrote: Hi all, There are 2000 editors who have received access to 20 different online databases. We know the usernames of these editors and the url prefixes of the websites they were given access to. We need to know: - from July 18th 2014 to January 11th 2015 - on English Wikipedia - for the cohort of 2000 TWL editors - ...how many times did they add links to any of the 20 partner websites I have my fingers crossed that Quarry can solve this but I need some help to write a query. Bonus queries: 1) In that date range, how many links did these editors add using partner websites on *all Wikipedias* (any language) 2) What is the baseline change in all external links on English Wikipedia in that date range 3) What is the baseline change in all external links on *all Wikipedias* since July 18th 2014 Thanks so much for any guidance on this!
Jake (Ocaasi) The Wikipedia Library ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
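The query Oliver mentions is the MediaWiki Action API's prop=extlinks. A small sketch of building such a request and pulling the URLs out of the response; note it targets the current revision only, so (per Oliver's caveat) it doesn't by itself answer the historical who-added-what question:

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def extlinks_url(title: str, limit: int = 500) -> str:
    """Build an action=query&prop=extlinks request URL for one page.
    Fetch it with urllib.request.urlopen (not done here) to get JSON."""
    params = {
        "action": "query",
        "prop": "extlinks",
        "titles": title,
        "ellimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)

def parse_extlinks(payload: dict) -> list:
    """Extract bare URLs from an extlinks response. The link key is
    '*' in the legacy JSON format and 'url' with formatversion=2."""
    links = []
    for page in payload.get("query", {}).get("pages", {}).values():
        for el in page.get("extlinks", []):
            links.append(el.get("*") or el.get("url"))
    return links

print(extlinks_url("Glass frog"))
```

For Jake's actual cohort question, one would still need per-revision data (the XML dumps Aaron suggests), since extlinks reflects only what is on the page now.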
Re: [Wiki-research-l] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal
I'm confused; John, could you point to the element of the collected data that isn't collected already by default in any Nginx or Apache setup? I agree that there might be a lack of user expectation, but 'silently capturing behavioral data' seems somewhat hyperbolic to describe what's actually going on. On Tuesday, 13 January 2015, John Mark Vandenberg jay...@gmail.com wrote: On Wed, Jan 14, 2015 at 9:22 AM, Andrew Gray andrew.g...@dunelm.org.uk wrote: Fair enough - I don't use it, and I think I'd got entirely the wrong end of the stick on what it's for! If it's intended to stop tracking by third-party sites then it certainly seems to be of little relevance here. I think you're right to be concerned about this. It is about expectations; people do not expect an NGO providing an encyclopedia to be silently capturing reading behaviour data. If the data is provided to other entities, even for noble research objectives, people expect Do Not Track to cover this. https://cyberlaw.stanford.edu/node/6573 -- John Vandenberg ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Sent from my mobile computing device of Lovecraftian complexity and horror. ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Endowment perpetuity
Speaking of questions that open with an agenda, have you tried asking the fundraising team? On 12 January 2015 at 20:07, James Salsman jsals...@gmail.com wrote: Speaking of fundraising far over budget, did the question about an endowment perpetuity make it on to the last donor survey? If so, what was the result? I seem to remember a favorable response, from somewhere, but can't find anything either way. Was it a Board poll? ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Fwd: $55 million raised in 2014
Bah; dropped a digit when reading the y-axis. My bad. My concerns about straight extrapolation for this model remain, however. On 2 January 2015 at 11:31, Oliver Keyes oke...@wikimedia.org wrote: 3 billion being...above the upper bound of the extrapolation you've made? Uh-huh. Extrapolation is not a particularly useful method to use for the budget, because it assumes endless exponential growth. I can see the budget increasing due to us increasingly taking on the responsibilities we've previously been unable to do anything about, but I can't see what we'd actually /do/ with 3 billion dollars (although if we want to expand the Hadoop cluster with most of that I would, of course, be most grateful ;p) On 2 January 2015 at 04:23, Gerard Meijssen gerard.meijs...@gmail.com wrote: Hoi, It is known that education is a great way to eradicate poverty. We know that Wikipedia brings information and is educational. When the effect of your 3 billion dollar brings education and effectively helps to eradicate poverty it is well worth it. No irony intended. Thanks, GerardM On 2 January 2015 at 09:11, James Salsman jsals...@gmail.com wrote: In ten years time, I predict the Foundation will raise $3 billion: http://i.imgur.com/hdoAIan.jpg -- Forwarded message -- From: James Salsman jsals...@gmail.com Date: Thu, Jan 1, 2015 at 9:01 PM Subject: $55 million raised in 2014 To: Wikimedia Mailing List wikimedi...@lists.wikimedia.org Happy new year: http://i.imgur.com/faPsI9J.jpg Source: http://frdata.wikimedia.org/yeardata-day-vs-ytdsum.csv I don't mind the banners, although I am still saddened that several hundred editor-submitted banners remain untested from six years ago, when the observed variance in the performance of those that were tested indicates that there are likely at least 15 which would do better than any of those which were tested. Why the heck is the fundraising team still ignoring all those untested submissions? 
But as to the intrusiveness of the banners, I would rather have fade-in popups with fuchsia blink/marquee text on an epileptic-seizure-inducing background and auto-play audio than have the fundraising director claim that donations are decreasing to help justify narrowing scope. Best regards, James Salsman ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Fwd: $55 million raised in 2014
On 2 January 2015 at 15:08, James Salsman jsals...@gmail.com wrote: Oliver Keyes wrote: ... Extrapolation is not a particularly useful method to use for the budget, because it assumes endless exponential growth. I agree. Formal budgeting usually shouldn't extend further than three to five years in the nonprofit sector (long-term budgeting is unavoidable in government and some industries.) However, here are a couple of illustrations of some reasons I believe a ten year extrapolation of Foundation fundraising is completely reasonable: http://imgur.com/a/mV72T Words tend to be more useful than contextless images. ... I can't see what we'd actually /do/ with 3 billion dollars I used to be in favor of establishing an endowment with a sufficient perpetuity, and then halting fundraising forever, but I have changed my mind. I think the Foundation should continue to raise money indefinitely to pay people for this task: https://meta.wikimedia.org/wiki/Grants:IEG/Revision_scoring_as_a_service That is equivalent to a general computer-aided instruction system, with the side effects of both improving the encyclopedia and making counter-vandalism bots more accurate. As an anonymous crowdsourced review system based on consensus voting instead of editorial judgement, it leaves the Foundation immunized, with their safe harbor provisions regarding content control intact. It's also not worth 3 billion dollars (no offence, Aaron!) as evidenced by the fact that it can be established with $20k. This is not a discussion for research-l, this is a discussion for (at best) Wikimedia-l - and I have to say that I don't feel it's at all useful even /there/, but it is at least in context.
Spending time discussing pie-in-the-sky what would we do if we had 3 billion dollars ideas is all well and nice, but I prefer to think that time is better spent doing research with the resources we have now, and editing with the resources we have now, and making pitches for additional resources as and when they become available. So on that note: I'm going to go off and do that. -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Fwd: $55 million raised in 2014
On 2 January 2015 at 18:14, James Salsman jsals...@gmail.com wrote: ... This is not a discussion for research-l On the contrary, please see e.g. http://www.wikisym.org/os2014-files/proceedings/p609.pdf this Foundation-sponsored IEG effort can serve as a confirmatory replication of that prior work. Let me rephrase, because I evidently wasn't clear: fanciful and unrealistic discussions of what we'd do with a pot of money that wouldn't be available for a decade, and which only exists in the first place if you assume exponential growth, are not for research-l ... time is better spent doing research with the resources we have now I wish someone would please replicate my measurement of the variance in the distribution of fundraising results using the editor-submitted banners from 2008-9, and explain to the fundraising team that that distribution implies they can do a whole lot better than sticking with the spiel which degrades Foundation employees by implying they typically spend $3 or £3 on coffee. (Although I wouldn't discount the possibility that some donors feel good about sending Foundation employees to boutique coffee shops.) We know donor message- and banner-fatigue exists as a strong effect which limits the useful life of fundraising approaches in some cases, so they have to keep trying to keep up. When are they going to test the remainder of the editors' submissions? Given that you've been asking for that analysis for four years, and it's never been done, and you've been repeatedly told that it's not going to happen, could you take those hints? And by hints, I mean explicit statements. I appreciate that you're operating in good faith, but there comes a point when http://wondermark.com/1k62/ starts proving that life imitates art. Repeatedly having this same conversation is a colossal, ever-draining waste of everyone's time. Please stop bringing it up.
-- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Looking for reader's click log data for Wikipedia
Afraid not. First, we do not have some of those datapoints; we do not currently have unique user IDs. And, second, it would be a tremendous ethical violation for us to release the data that we /do/ have (IP addresses, for example). On 28 December 2014 at 21:00, Ditty Mathew ditty...@gmail.com wrote: Hi, Is reader click-log data (user id/IP, article title, timestamp) available for Wikipedia? with regards Ditty ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Looking for reader's click log data for Wikipedia
I'm not exactly sure how one provides an anonymised dataset that contains IP addresses. But: we don't have those navigation paths and so can't provide them. Sure, we could provide the {referer, URL} tuples associated with specific IP addresses, and replace the IP with some kind of randomly-generated value (or just a salted hash), but this falls apart very quickly with the modern structure of the internet and the scale Wikimedia properties operate on: you can have a lot of distinct people at one IP address, particularly through cellular networks, and so multiple sessions and trails can get inaccurately grouped together. More importantly, the HTTPS protocol involves either sanitising or completely stripping referers, rendering those chains impossible to reconstruct. I believe Leila Zia and Bob West (who will hopefully see this message; I know Leila is on this list!) are currently working on a project that looks at search paths, and they may have additional commentary. But generally speaking: we do not generate this data as a matter of course, we would not be comfortable releasing it (unless exceedingly sanitised), and as the person who deals with our request logs on a day-to-day basis I can think of a half-dozen ways in which it would produce false results (ways whose probability of occurring we have no real way of checking). On 28 December 2014 at 22:53, Ditty Mathew ditty...@gmail.com wrote: The exact user information is not needed. The anonymized data is enough. What exactly we need is the navigation path of Wikipedia readers. with regards Ditty On Sun, Dec 28, 2014 at 9:46 PM, Oliver Keyes oke...@wikimedia.org wrote: Afraid not. First, we do not have some of those datapoints; we do not currently have unique user IDs. And, second, it would be a tremendous ethical violation for us to release that data that we /do/ have (IP addresses, for example).
On 28 December 2014 at 21:00, Ditty Mathew ditty...@gmail.com wrote: Hi, Is the reader's click log data(should contain user id/ip, article title, timestamp) is available for Wikipedia. with regards Ditty ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
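Oliver mentions replacing IPs with "some kind of randomly-generated value (or just a salted hash)". A sketch of why the salted-hash variant is weak as anonymisation: the IPv4 space is small enough that anyone holding the salt can simply enumerate candidates until the published token matches. The salt and IPs below are of course made up:

```python
import hashlib

SALT = b"not-a-real-secret"  # hypothetical salt; in a leak, the attacker has this

def pseudonymise(ip: str) -> str:
    """Replace an IP with a truncated salted SHA-256 digest."""
    return hashlib.sha256(SALT + ip.encode()).hexdigest()[:16]

token = pseudonymise("192.0.2.7")  # this is what a "sanitised" dataset publishes

# With the salt in hand, re-identification is a brute-force loop. Here we
# only sweep one /24 for brevity; the full IPv4 space (~4.3e9 values) is
# still cheap to enumerate on commodity hardware.
for last_octet in range(256):
    candidate = f"192.0.2.{last_octet}"
    if pseudonymise(candidate) == token:
        print("re-identified:", candidate)
        break
```

This is one reason "just hash the IPs" does not make a click-log dataset releasable, quite apart from the session-grouping and referer problems described above.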
[Wiki-research-l] Editor sessions and related metrics
Hey all, Not sure if this would be interesting to researchers or community members, but: you might remember a paper Stuart and Aaron did a while ago about measuring edit sessions - http://www-users.cs.umn.edu/~halfak/publications/Using_Edit_Sessions_to_Measure_Participation_in_Wikipedia/geiger13using-preprint.pdf To me it's really interesting, because it's (as much as anything else) a new metric for measuring participation, and a metric we can extract additional metrics from (e.g., session length). As part of some related work on /reader/ sessions, I wrote a pile of code to handle session reconstruction. I've generalised it (it doesn't care if you've got reader timestamps, editor timestamps, or best buy receipt timestamps) and thrown it up at https://github.com/Ironholds/reconstructr . I figure it could be useful to any researchers or community members looking into sessions. Thanks, -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
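The core idea behind session reconstruction of this kind is simple: sort a user's event timestamps and start a new session whenever the gap between consecutive events exceeds an inactivity cutoff. A minimal Python sketch (the one-hour cutoff is an assumption for illustration; reconstructr itself is the R implementation linked above):

```python
CUTOFF = 3600  # inactivity cutoff in seconds (assumed value)

def reconstruct_sessions(timestamps, cutoff=CUTOFF):
    """Group Unix timestamps into sessions: a gap longer than `cutoff`
    between consecutive events starts a new session. Returns a list of
    sessions, each a sorted list of timestamps."""
    if not timestamps:
        return []
    ordered = sorted(timestamps)
    sessions = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > cutoff:
            sessions.append([cur])      # gap too long: new session
        else:
            sessions[-1].append(cur)    # same session continues
    return sessions

events = [0, 120, 300, 9000, 9050]
print(reconstruct_sessions(events))            # two sessions
print([s[-1] - s[0] for s in reconstruct_sessions(events)])  # session lengths
```

As the email notes, the method doesn't care whether the timestamps are edits, pageviews, or receipts; only the cutoff choice is domain-specific, which is exactly what the linked paper investigates.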
Re: [Wiki-research-l] Editor sessions and related metrics
Totally; already threw it at the internal research list :) On 16 December 2014 at 14:37, Toby Negrin tneg...@wikimedia.org wrote: Awesome work! Can we distribute in the foundation? On Tue, Dec 16, 2014 at 10:40 AM, Oliver Keyes oke...@wikimedia.org wrote: Hey all, Not sure if this would be interesting to researchers or community members, but: you might remember a paper Stuart and Aaron did a while ago about measuring edit sessions - http://www-users.cs.umn.edu/~halfak/publications/Using_Edit_Sessions_to_Measure_Participation_in_Wikipedia/geiger13using-preprint.pdf To me it's really interesting, because it's (as much as anything else) a new metric for measuring participation, and a metric we can extract additional metrics from (e.g., session length). As part of some related work on /reader/ sessions, I wrote a pile of code to handle session reconstruction. I've generalised it (it doesn't care if you've got reader timestamps, editor timestamps, or best buy receipt timestamps) and thrown it up at https://github.com/Ironholds/reconstructr . I figure it could be useful to any researchers or community members looking into sessions. Thanks, -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Pageviews, mobile versus desktop
Yep; same timeframe. On 15 December 2014 at 12:50, Federico Leva (Nemo) nemow...@gmail.com wrote: Oliver Keyes, 13/12/2014 21:15: http://ironholds.org/misc/pageviews_year_and_week.png - fascinating! It reveals a lot of seasonality in the desktop views - again, not replicated on mobile (at least, not so strongly) Does this graph also go from 2013-02-01 to 2014-12-01? Nemo ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] commentary on Wikipedia's community behaviour (Aaron gets a quote)
Communities are who they choose to offer plaudits to. The people getting off largely scot-free in this case are people a highly vocal subgroup has put on a pedestal for years. I don't know if the community as a whole is that ugly and bitter, but I can understand where people would get the impression, from looking at the sort of person we celebrate. On Monday, 15 December 2014, WereSpielChequers werespielchequ...@gmail.com wrote: We have problems, I don't dispute that. But as ugly and bitter as 4chan? That has to be an exaggeration. Regards Jonathan Cardy On 13 Dec 2014, at 01:03, Andrew Lih andrew@gmail.com wrote: I certainly hope you're right Sydney. What a horrible mess. On Fri, Dec 12, 2014 at 5:53 PM, Sydney Poore sydney.po...@gmail.com wrote: I think feminists, especially those who take an interest in STEM, will pass this article around. Sydney On Dec 12, 2014 5:35 PM, Andrew Lih andrew@gmail.com wrote: It's a good piece, but honestly I think only the dedicated tech reader will make it through the entire story. There's a lot of jargon and insider intrigue, such that I could imagine most people never making it past the typewriter barf of BLP, AGF, NOR :) On Fri, Dec 12, 2014 at 5:26 PM, Dariusz Jemielniak dar...@alk.edu.pl wrote: While I agree that the article is overly negative (likely because of the individual experience), I think it still points to an important problem. I don't perceive this article as really problematic in terms of image. Maybe naively, I imagine that people will not stop donating because the community is not ideal.
pundit On Fri, Dec 12, 2014 at 11:16 PM, Kerry Raymond kerry.raym...@gmail.com wrote: There’s a saying that everyone likes to eat sausages but nobody likes to know how they are made. It is not good to have negative publicity like that during the annual donation campaign (irrespective of the motivations of the journalist and/or the rights/wrongs of the issue being reported, neither of which I intend to debate here). As a donation-funded organisation, public perception matters a lot. Kerry -- *From:* Jonathan Morgan [mailto:jmor...@wikimedia.org] *Sent:* Saturday, 13 December 2014 6:43 AM *To:* Research into Wikimedia content and communities *Cc:* Kerry Raymond *Subject:* Re: [Wiki-research-l] commentary on Wikipedia's community behaviour (Aaron gets a quote) I mostly agree. On one hand, it's always nice to see a detailed description of how wiki-sausage gets made in a major venue. On the other, this journalist clearly has a personal axe to grind, and used his bully pulpit to grind it in public. - J On Fri, Dec 12, 2014 at 1:39 AM, Federico Leva (Nemo) nemow...@gmail.com wrote: 1000th addition to the inconsequential rant genre. Nemo ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Jonathan T. Morgan Community Research Lead Wikimedia Foundation User:Jmorgan (WMF) https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF) jmor...@wikimedia.org ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- __ prof. dr hab.
Dariusz Jemielniak, head of the Department of International Management and of the CROW research centre, Kozminski University http://www.crow.alk.edu.pl; member of the Academy of Young Scholars of the Polish Academy of Sciences; member of the Science Policy Committee of the MNiSW. The world's first ethnography of Wikipedia, Common Knowledge? An Ethnography of Wikipedia (2014, Stanford University Press), which I wrote, is out: http://www.sup.org/book.cgi?id=24010 Reviews - Forbes: http://www.forbes.com/fdc/welcome_mjx.shtml Pacific Standard: http://www.psmag.com/navigation/books-and-culture/killed-wikipedia-93777/ Motherboard: http://motherboard.vice.com/read/an-ethnography-of-wikipedia The Wikipedian: http://thewikipedian.net/2014/10/10/dariusz-jemielniak-common-knowledge
Re: [Wiki-research-l] How to track all the diffs in real time?
Oh dear god, that would be incredible. The non-streaming API has a wonderful bug: if you request a series of diffs, and there is more than one uncached diff in that series, only the first uncached diff will be returned. For the rest it returns...an error? No. Some kind of special value? No. It returns an empty string. You know: that thing it also returns if there is no difference. So instead you stream edits and compute the diffs yourself and everything goes a bit Pete Tong. Having this service around would be a lifesaver. On 13 December 2014 at 10:14, Scott Hale computermacgy...@gmail.com wrote: Great idea, Yuvi. Speaking as someone who just downloaded diffs for a month of data from the streaming API for a research project, I certainly could see an 'augmented stream' with diffs included being very useful for research and also for bots. On Sat, Dec 13, 2014 at 10:52 PM, Yuvi Panda yuvipa...@gmail.com wrote: On Sat, Dec 13, 2014 at 2:34 PM, Yuvi Panda yuvipa...@gmail.com wrote: If a lot of people are doing this, then perhaps it makes sense to have an 'augmented real time streaming' interface that is an exact replica of the streaming interface but with diffs added. Or rather, if I were to build such a thing, would people be interested in using it? -- Yuvi Panda T http://yuvi.in/blog -- Scott Hale Oxford Internet Institute University of Oxford http://www.scotthale.net/ scott.h...@oii.ox.ac.uk -- Oliver Keyes Research Analyst Wikimedia Foundation
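For anyone bitten by the same ambiguity: if you fall back to streaming edits and computing diffs yourself, an empty result becomes unambiguous. This is only an illustrative sketch (compute_diff is a hypothetical helper; real code would fetch revision text from the API or a stream), showing how a locally computed diff distinguishes "no change" from "no data":

```python
import difflib

def compute_diff(old_text, new_text):
    """Diff two revision texts locally, sidestepping the API's
    ambiguous empty-string result for uncached diffs."""
    if old_text is None or new_text is None:
        # Missing revision text: fail loudly instead of
        # returning an empty string that looks like "no change".
        raise ValueError("revision text unavailable")
    diff = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile="old", tofile="new", lineterm="")
    # An empty result here genuinely means "no difference".
    return "\n".join(diff)

# Identical revisions -> unambiguously empty diff.
assert compute_diff("foo\nbar", "foo\nbar") == ""
```

The point of the sketch is simply that once you control the diffing, an empty string can only mean one thing.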
[Wiki-research-l] Pageviews, mobile versus desktop
A graph I just generated while messing around with the high-granularity data we used in the monthly metrics readership report: http://ironholds.org/misc/pageviews_trends.png The thing I find really interesting about this is not the trend (mobile up, desktop down. As Lehrer said, this we know from nothing!) but the patterns. Mobile clusters far more tightly than desktop does. I'm not sure what this means (desktop users are weird? There's a lot of bot traffic we're not catching? That's my guess) but I thought it was pretty and might provoke some hypothesising. So, here you go! -- Oliver Keyes Research Analyst Wikimedia Foundation
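"Clusters far more tightly" can be made concrete with a dispersion measure such as the coefficient of variation. A sketch on invented daily counts (not real data; the function name is mine):

```python
from statistics import mean, pstdev

def coefficient_of_variation(series):
    """Relative spread: population stdev / mean.
    Lower values = tighter clustering around the mean."""
    return pstdev(series) / mean(series)

# Toy daily pageview counts, illustrative only:
desktop = [510, 430, 620, 380, 700, 450, 590]
mobile = [300, 310, 295, 305, 298, 302, 307]

# Mobile's relative spread is smaller, i.e. it clusters more tightly.
assert coefficient_of_variation(mobile) < coefficient_of_variation(desktop)
```

Comparing relative rather than absolute spread matters here, since desktop and mobile volumes differ.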
Re: [Wiki-research-l] Pageviews, mobile versus desktop
Bah, you're right! Will reupload. Pageviews are bucketed by UTC day, although the axis is by months to avoid making it essentially unreadable. It's generated in ggplot2 using theme_bw() (one of my favourite combinations) On 13 December 2014 at 12:33, Ed Summers e...@pobox.com wrote: On Dec 13, 2014, at 12:18 PM, Oliver Keyes oke...@wikimedia.org wrote: I'm not sure what this means (desktop users are weird? There's a lot of bot traffic we're not catching? That's my guess) but I thought it was pretty and might provoke some hypothesising. So, here you go! I think the axis labels are flipped? How are the page views bucketed: day, week, month, something else? It is a pretty clean looking graph, what did you use to generate it? //Ed ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Pageviews, mobile versus desktop
Ooh, that's a really good point. In fact, we know there's different behaviour - mobile rises on weekends, desktop falls, but the desktop fall outweighs the mobile rise. I'm knee-deep in adjusted R2 values right now but I'll visualise that way and see what happens :) On 13 December 2014 at 13:17, Ed Summers e...@pobox.com wrote: It might be interesting to bucket by week to see if you still see the difference in clustering between desktop and mobile. I wonder if it’s a result of different behavior on desktop/mobile on weekdays/weekends? //Ed On Dec 13, 2014, at 12:37 PM, Oliver Keyes oke...@wikimedia.org wrote: Bah, you're right! Will reupload. Pageviews are bucketed by UTC day, although the axis is by months to avoid making it essentially unreadable. It's generated in ggplot2 using theme_bw() (one of my favourite combinations) On 13 December 2014 at 12:33, Ed Summers e...@pobox.com wrote: On Dec 13, 2014, at 12:18 PM, Oliver Keyes oke...@wikimedia.org wrote: I'm not sure what this means (desktop users are weird? There's a lot of bot traffic we're not catching? That's my guess) but I thought it was pretty and might provoke some hypothesising. So, here you go! I think the axis labels are flipped? How are the page views bucketed: day, week, month, something else? It is a pretty clean looking graph, what did you use to generate it? //Ed -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] Pageviews, mobile versus desktop
http://ironholds.org/misc/pageviews_year_and_week.png - fascinating! It reveals a lot of seasonality in the desktop views - again, not replicated on mobile (at least, not so strongly) On 13 December 2014 at 13:49, Oliver Keyes oke...@wikimedia.org wrote: Ooh, that's a really good point. In fact, we know there's different behaviour - mobile rises on weekends, desktop falls, but the desktop fall the mobile rise. I'm knee-deep in adjusted R2 values right now but I'll visualise that way and see what happens :) On 13 December 2014 at 13:17, Ed Summers e...@pobox.com wrote: It might be interesting to bucket by week to see if you still see the difference in clustering between desktop and mobile. I wonder if it’s a result of different behavior on desktop/mobile on weekdays/weekends? //Ed On Dec 13, 2014, at 12:37 PM, Oliver Keyes oke...@wikimedia.org wrote: Bah, you're right! Will reupload. Pageviews are bucketed by UTC day, although the axis is by months to avoid making it essentially unreadable. It's generated in ggplot2 using theme_bw() (one of my favourite combinations) On 13 December 2014 at 12:33, Ed Summers e...@pobox.com wrote: On Dec 13, 2014, at 12:18 PM, Oliver Keyes oke...@wikimedia.org wrote: I'm not sure what this means (desktop users are weird? There's a lot of bot traffic we're not catching? That's my guess) but I thought it was pretty and might provoke some hypothesising. So, here you go! I think the axis labels are flipped? How are the page views bucketed: day, week, month, something else? It is a pretty clean looking graph, what did you use to generate it? 
//Ed -- Oliver Keyes Research Analyst Wikimedia Foundation
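Ed's week-bucketing suggestion is mechanical to implement once you have per-day counts; aggregating into ISO weeks smooths out the weekday/weekend cycle. A sketch with toy data (bucket_by_week is a hypothetical helper, not anyone's production code):

```python
from collections import defaultdict
from datetime import date, timedelta

def bucket_by_week(daily_counts):
    """Aggregate {date: pageviews} into ISO (year, week) buckets,
    smoothing out the weekday/weekend cycle."""
    weekly = defaultdict(int)
    for day, views in daily_counts.items():
        iso = day.isocalendar()  # (ISO year, ISO week, weekday)
        weekly[(iso[0], iso[1])] += views
    return dict(weekly)

# Two full Mon-Sun weeks of toy data: 100 views per weekday, 60 per weekend day.
start = date(2014, 12, 1)  # a Monday
daily = {start + timedelta(d): (60 if (start + timedelta(d)).weekday() >= 5 else 100)
         for d in range(14)}
weekly = bucket_by_week(daily)
assert list(weekly.values()) == [620, 620]  # 5*100 + 2*60 per week
```

Using the ISO calendar (rather than, say, "every 7 days from the start of the data") keeps week boundaries aligned across years.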
Re: [Wiki-research-l] [Wikimedia-l] wikipedia access traces ?
-- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] What works for increasing editor engagement?
On 13 September 2014 20:52, James Salsman jsals...@gmail.com wrote: Pine wrote: I agree that the shift to mobile is a big deal; I do not agree: Active editor attrition began on its present trend in 2007, far before any mobile use was significant. I'm not seeing how that means it's not a big deal. Mobile now makes up 30% of our page views and its users display divergent behavioural patterns; you don't think a group that makes up 30% of pageviews is a user group that is a 'big deal' for engagement? I remain concerned that tech-centric approaches to editor engagement like VE and Flow, while perhaps having a modest positive impact, do little to fix the incivility problem that is so frequently cited as a reason for people to leave. I agree that VE has already proven that it is ineffective in significantly increasing editor engagement. And I agree that Flow has no hope of achieving any substantial improvements. There are good reasons to believe that Flow will make things worse. For example, using wikitext on talk pages acts as a pervasive sandbox substitute for practicing the use of wikitext in article editing. And I do not agree that civility issues have any substantial correlation with editor attrition. There have been huge civility problems affecting most editors on controversial subjects since 2002, and I do not see any evidence that they have become any worse or better on a per-editor basis since. My opinion is that the transition from the need to create new articles to maintaining the accuracy and quality of existing articles has been the primary cause of editor attrition, and my studies of Short Popular Vital Articles (WP:SPVA) have supported this hypothesis. 
Therefore, I strongly urge implementation of accuracy review systems: https://strategy.wikimedia.org/wiki/Proposal:Develop_systems_for_accuracy_review -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] Joining derp?
Uhm. If you don't think there's any distinction in nature or terminology between 'the contents of a form field intentionally filled out and submitted by a user' and every other kind of data, there's a disconnect here somewhere. On Thursday, 4 September 2014, Ed Summers e...@pobox.com wrote: On Sep 4, 2014, at 11:20 AM, aaron shaw aarons...@northwestern.edu wrote: Sorry Ed, I don't think we all know that. In fact, I'm unaware of any way in which Wikimedia makes money based on data collected from its users. To my knowledge, the Foundation is supported almost entirely through private donations[1]. Ok, try this on for size: An edit to a Wikipedia article is data collected from its users. WMF receives millions of dollars of donations a year because of this data, and its accessibility. //Ed -- Sent from a portable device of Lovecraftian complexity.
Re: [Wiki-research-l] [Wikimedia-l] Catching copy and pasting early
I'm pretty sure that responding to well-intended and politely phrased criticism with sarcasm is probably also not something that will help us in avoiding losing contributors :p I agree that this is not an immediately understandable thing about contributions, although I think it should be more understandable by researchers than it might be by the man on the Clapham omnibus (an analogy would be 'not publishing the same paper in multiple journals'), but my concern is that information exists on an axis. At one end we have the point at which the mass of information presented scares people off before they even hit save. At the other is the point at which the lack of information leads to somebody stumbling into a spiked pit. Our goal is to find a point in the middle, and I'm pretty cautious about attempts to add more documentation given that that's the direction we've historically trended in. On Wednesday, 23 July 2014, Kerry Raymond kerry.raym...@gmail.com wrote: Well, I’m glad it’s that simple (sarcasm intended!). Do we really expect new/occasional contributors to figure this out? Having been on Wikipedia for 9 years, it’s all news to me. I always thought that clicking SAVE with By clicking the Save page button, you agree to the Terms of Use https://wikimediafoundation.org/wiki/Terms_of_Use and you irrevocably agree to release your contribution under the CC BY-SA 3.0 License https://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License and the GFDL https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GNU_Free_Documentation_License with the understanding that a hyperlink or URL is sufficient for CC BY-SA 3.0 attribution. that I was releasing **my** contribution, full stop, end of story. 
If we expect people to do more than this, shouldn’t it say something at this point like “If your contribution has previously been published elsewhere, please click here” and take people to a form where they can supply more details and then hit SAVE. Let’s make it easier for people to do the right thing instead of reverting them and losing them as contributors. Kerry -- *From:* wiki-research-l-boun...@lists.wikimedia.org [mailto:wiki-research-l-boun...@lists.wikimedia.org] *On Behalf Of *Maggie Dennis *Sent:* Thursday, 24 July 2014 12:42 AM *To:* Research into Wikimedia content and communities *Subject:* Re: [Wiki-research-l] [Wikimedia-l] Catching copy and pasting early Just a few points inline. :) On Tue, Jul 22, 2014 at 5:50 AM, James Heilman jmh...@gmail.com wrote: To clarify the proposal is: 1) only looking at new edits that add blocks of text over a certain size 2) only tagging those edits on a workspace page for further follow-up by an experienced human editor 3) only running on articles of WikiProjects that want it and are willing to follow-up (thus only WPMED for starters) What it is NOT is: a tool to add notices to article space, a tool to warn users on their talk pages, or a tool to look at old edits. It is also NOT many other things. This is a very narrow proposal. With respect to users who are adding content they own which they have previously had published: what you do is get them to agree in an email to release it under a CC BY-SA license and then send that email to OTRS. Alternatively, they can skip this step if they are reproducing materials from their own website by adding a release to that website. https://en.wikipedia.org/wiki/Wikipedia:DCM talks about how. I speak to that based on my volunteer experience, not my work experience. 
:) One further point - if they are the *sole* copyright holder contributing their own text work to Wikipedia, it must be co-licensed under the GFDL according to our terms of use https://wikimediafoundation.org/wiki/Terms_of_Use#7._Licensing_of_Content . Maggie With respect to the number of edits, WPMED gets about 1000 a day. If we say about 10% are of a significant size (a rather high estimate), and if we say copy-and-paste issues occur in 10% with a similar number of false positives, we are looking at 20 edits to review a day. Those within the project are able to handle this volume in a timely manner. -- James Heilman MD, CCFP-EM, Wikipedian The Wikipedia Open Textbook of Medicine www.opentextbookofmedicine.com -- Maggie Dennis Senior Community Advocate Wikimedia Foundation, Inc. -- Sent from a portable device of Lovecraftian
[Wiki-research-l] this month's research newsletter
Both of these suggestions sound great to me! I'm not sure who the best person is to move them forward (I encourage anyone who wants to volunteer to speak up!) but whatever happens, I'm really grateful that we could turn this into a 'how do we fix this in the long-term?' conversation and not get bogged down - it's one of the most productive mailing list threads I've seen in a while :) On Thursday, 3 July 2014, Heather Ford hfor...@gmail.com wrote: Thanks so much for this, Kerry. And thanks, Aaron, for (as always) great, productive suggestions. I think there are two issues that need to be dealt with separately here. The first is about disparaging remarks made about researchers' contributions that kicked off this discussion. One idea that I had when I saw a similar problem earlier this year was to at least have reviewers add their names to reviews so that we are making a clear distinction between the opinion of a single reviewer and the community/organisation as a whole. Some reviewers have added their names to reviews (thank you!) but I think that needs to be a standard for the newsletter. This probably won't solve the problem completely but hopefully reviewers will be more thoughtful about their critique in the future. The second is to encourage research about Wikipedia that engages with the Wikimedia community. And yes, I, too, think that awards and acknowledgements are great ideas. I'd say that, when evaluating, engagement is even more important than impact because we want to encourage students and researchers at various stages of their careers (many of whom would not win awards for impact) to engage with the community when working on these projects. Of course, this kind of work is necessarily going to have more impact because Wikimedians themselves are going to be a part of it somehow. 
For this, I definitely agree with some kind of acknowledgement of research done - beyond, perhaps, just one or two star researchers winning a few awards. This can be done together e.g. awards for best papers in different categories but also acknowledgements for work with the community on particular projects as suggested by Kerry. Best, Heather. Heather Ford Oxford Internet Institute http://www.oii.ox.ac.uk Doctoral Programme EthnographyMatters http://ethnographymatters.net | Oxford Digital Ethnography Group http://www.oii.ox.ac.uk/research/projects/?id=115 http://hblog.org | @hfordsa http://www.twitter.com/hfordsa On 3 July 2014 02:56, Kerry Raymond kerry.raym...@gmail.com wrote: Having had a work role oversighting many university researchers including PHD and other research students, I think many start out with intentions to engage fully with stakeholders and contribute back into the real world in some way, but it's fair to say that deadline pressures tend to force them to focus their energies into the academically valued outcomes, e.g. published papers, theses, etc. This is just as true for Wikipedia-related research as for, say, aquaculture. Of course, some never intended to contribute back, but are solely motivated by climbing the greasy pole of academia. Because data gathering can be a time-consuming or expensive stumbling block in a research plan, organisations that freely publish detailed data (as WMF does) are natural magnets to researchers who can use that data to study various phenomena which may have broader relevance than just Wikipedia or where the Wikipedia data serves as a ground truth for other experiments or as proxy for other unavailable data. For example, you can use Wikipedia to study categorisation or named entity extraction without having real interest in Wikipedia itself. So I think it is for those who are passionate about Wikipedia itself to see how such research findings may be used to improve Wikipedia. 
As for releasing source code, it has to be recognised that software in research projects is often very quick-and-dirty and probably not designed to be integrated into the MediaWiki code base. Effective solutions to Wikipedia issues often require a mix of technology and change to community process/culture (which is often far harder to get right). This is not to say that we should not encourage researchers to give back, but I think we do need to understand that the reasons people don't give back aren't always attributable solely to bad faith. In addition to suggestions already made re awards, just having a letter of commendation on WMF letterhead acknowledging the research and its potential to improve Wikipedia would be a useful thing, especially for junior researchers seeking to establish themselves; this kind of external validation is helpful to their CVs. This could be sent to any researchers whose research was deemed to have merit, with different wording for those who made (according to some appropriately-appointed group) greater or lesser contributions to real
Re: [Wiki-research-l] this month's research newsletter
We need to remember that researchers are at very different stages of their careers, they have very different motivations, and different levels of engagement with the Wikipedia community, but that *all* research on Wikipedia contributes to our understanding (even if as a catalyst for improvements). We want to encourage more research on Wikipedia, not attack the motivations of people we know little about - particularly when they're just students and particularly when this newsletter is housed on the Wikimedia Foundation's domain. Best, Heather. [1] https://meta.wikimedia.org/wiki/Research:Newsletter/2014/June [2] https://meta.wikimedia.org/wiki/Research:Newsletter/2014/June#.22Recommending_reference_materials_in_context_to_facilitate_editing_Wikipedia.22 Heather Ford Oxford Internet Institute http://www.oii.ox.ac.uk/ Doctoral Programme EthnographyMatters http://ethnographymatters.net/ | Oxford Digital Ethnography Group http://www.oii.ox.ac.uk/research/projects/?id=115 http://hblog.org | @hfordsa http://www.twitter.com/hfordsa -- Oliver Keyes Research Analyst Wikimedia Foundation
Re: [Wiki-research-l] Is 'Random article' statistically robust over what population?
I don't know if anyone's looked into this, I'm afraid. I'd be interested to see what our replication lag on production is. I imagine it's pretty small, and so the impact would be negligible, but... On 27 June 2014 23:24, stuart yeates syea...@gmail.com wrote: I'm designing an experiment and want a random sample of wiki articles. The 'Random article' link seems like a convenient way of generating these without having to compile a list of the population of articles myself. My hunch (based on clicking it lots and very little else) is that 'Random article' is a uniform sampling of pages in the article namespace, excluding redirects but including disambiguation pages. As implemented on en.wiki (which is the wiki I'm starting on) it probably has a slight bias against very recently created pages (due to cross-server synchronization). Has anyone looked into this? cheers stuart -- Oliver Keyes Research Analyst Wikimedia Foundation
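One cheap way to probe the uniformity hunch empirically: collect a sample of page IDs (e.g. via the API's list=random generator), bin them, and compare against the uniform expectation with a chi-square statistic. A sketch on toy data (the function name is mine, and a real analysis would also need a proper significance test):

```python
from collections import Counter

def chi_square_uniform(samples, n_bins, id_range):
    """Chi-square statistic of sampled page IDs against a uniform
    distribution over [0, id_range). Large values suggest bias."""
    bin_width = id_range / n_bins
    counts = Counter(min(int(s // bin_width), n_bins - 1) for s in samples)
    expected = len(samples) / n_bins
    return sum((counts.get(b, 0) - expected) ** 2 / expected
               for b in range(n_bins))

# A perfectly even toy sample scores 0; a lopsided one scores high.
even = list(range(1000))       # one ID per slot
assert chi_square_uniform(even, 10, 1000) == 0.0
skewed = [5] * 1000            # everything lands in bin 0
assert chi_square_uniform(skewed, 10, 1000) > 1000
```

In practice you would bin on actual page IDs (which are sparse, so binning on ID rank rather than raw ID may be fairer) and compare the statistic against a chi-square distribution with n_bins - 1 degrees of freedom.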
Re: [Wiki-research-l] Social Media and Learning - survey - Please help!
Well, it seems to be a general social media survey, not something specific to Wikimedia, so ;p. On 9 June 2014 17:59, Piotr Konieczny pio...@post.pl wrote: Is it just me or would anyone else be more motivated if instead of a drawing for an ipad the researchers would promise to write a Good Article or teach a course in the Wikipedia Education Program framework or such? -- Piotr Konieczny, PhD http://hanyang.academia.edu/PiotrKonieczny http://scholar.google.com/citations?user=gdV8_AEJ http://en.wikipedia.org/wiki/User:Piotrus On 6/6/2014 06:53, Anatoliy Gruzd wrote: Dear Wiki-Research-List instructors, teachers, faculty ... If you use social media for one or more of your classes, we would like to invite you to participate in an online survey. The survey should take you no longer than 35 minutes to complete. This survey is being conducted as part of a study on Social Media and Learning, supported by the Social Sciences and Humanities Research Council (SSHRC) of Canada. As a way to thank you for your participation in the survey, after completion, you will be given the option to enter your name and email address to enroll you in a random drawing to win one of three *Apple iPad minis*! The random drawing will take place on October 1, 2014 and the winner will be notified on the same day via email. Any optional contact information provided cannot be connected to your survey responses. 
If you would like to participate, please go to http://tinyurl.com/SMlearningsurvey PIs: Anatoliy Gruzd, Dalhousie University and Caroline Haythornthwaite, University of British Columbia *This survey has passed ethical review by both the Dalhousie University and the University of British Columbia ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Kill the bots
Okay. Methodology:
* take the last 5 days of request logs;
* filter them down to text/html requests as a heuristic for non-API requests;
* run them through the UA parser we use;
* exclude spiders and things which reported valid browsers;
* aggregate the user agents left;
* ???
* profit.
It looks like there are a relatively small number of bots that browse/interact via the web - ones I can identify include WPCleaner[0], which is semi-automated, something I can't find through WP or Google called DigitalsmithsBot (could be internal, could be external), and Hoo Bot (run by User:Hoo man). My biggest concern is DotNetWikiBot, which is a general framework that could be masking multiple underlying bots and has ~7.4m requests through the web interface in that time period. Obvious caveat is obvious; the edits from these tools may actually come through the API, but they're choosing to request content through the web interface for some weird reason. I don't know enough about the software behind each bot to comment on that. I can try explicitly looking for web-based edit attempts, but there would be far fewer observations in which the bots might appear, because the underlying dataset is sampled at a 1:1000 rate. [0] https://en.wikipedia.org/wiki/User:NicoV/Wikipedia_Cleaner/Documentation On 20 May 2014 07:50, Oliver Keyes oke...@wikimedia.org wrote: Actually, belay that, I have a pretty good idea. I'll fire the log parser up now. On 20 May 2014 01:21, Oliver Keyes oke...@wikimedia.org wrote: I think a *lot* of them use the API, but I don't know off the top of my head if it's *all* of them. If only we knew somebody who has spent the last 3 months staring into the cthulian nightmare of our request logs and could look this up... More seriously; drop me a note off-list so that I can try to work out precisely what you need me to find out, and I'll write a quick-and-dirty parser of our sampled logs to drag the answer kicking and screaming into the light. (sorry, it's annual review season. 
That always gets me blithe.) On 19 May 2014 13:03, Scott Hale computermacgy...@gmail.com wrote: Thanks all for the comments on my paper, and even more thanks to everyone sharing these super helpful ideas on filtering bots: this is why I love the Wikipedia research committee. I think Oliver is definitely right that this would be a useful topic for some piece of method-comparing research, if anyone is looking for paper ideas. Citation goldmine as one friend called it, I think. This won't address edit logs to date, but do we know if most bots and automated tools use the API to make edits? If so, would it be feasibility to add a flag to each edit as to whether it came through the API or not. This won't stop determined users, but might be a nice way to identify cyborg edits from those made manually by the same user for many of the standard tools going forward. The closest thing I found in the bug tracker is [1], but it doesn't address the issue of 'what is a bot' which this thread has clearly shown is quite complex. An API-edit vs. non-API edit might be a way forward unless there are automated tools/bots that don't use the API. 1. https://bugzilla.wikimedia.org/show_bug.cgi?id=11181 Cheers, Scott ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation -- Oliver Keyes Research Analyst Wikimedia Foundation -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
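The filter pipeline Oliver describes can be approximated in a few lines. This toy version substitutes naive substring matching for the UA parser the Foundation actually uses, so the constants, sample data, and function name are all illustrative:

```python
from collections import Counter

KNOWN_BROWSERS = ("Firefox", "Chrome", "Safari", "MSIE")
KNOWN_SPIDERS = ("Googlebot", "bingbot", "YandexBot")

def aggregate_suspect_agents(log_lines):
    """Filter (content_type, user_agent) request-log records down to
    text/html requests whose agent is neither a known browser nor a
    known spider, then tally the agents left over."""
    agents = Counter()
    for content_type, ua in log_lines:
        if content_type != "text/html":
            continue  # heuristic: skip API/asset traffic
        if any(browser in ua for browser in KNOWN_BROWSERS):
            continue  # reported a valid browser
        if any(spider in ua for spider in KNOWN_SPIDERS):
            continue  # declared spider
        agents[ua] += 1
    return agents

logs = [
    ("text/html", "Mozilla/5.0 Firefox/31.0"),
    ("text/html", "DotNetWikiBot/3.0"),
    ("application/json", "DotNetWikiBot/3.0"),
    ("text/html", "Googlebot/2.1"),
    ("text/html", "WPCleaner/1.33"),
]
assert aggregate_suspect_agents(logs) == {"DotNetWikiBot/3.0": 1, "WPCleaner/1.33": 1}
```

As in the thread, the leftovers after browser and spider filtering are exactly the automated web-interface traffic worth eyeballing.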
Re: [Wiki-research-l] Kill the bots
I think a *lot* of them use the API, but I don't know off the top of my head if it's *all* of them. If only we knew somebody who has spent the last 3 months staring into the cthulian nightmare of our request logs and could look this up... More seriously; drop me a note off-list so that I can try to work out precisely what you need me to find out, and I'll write a quick-and-dirty parser of our sampled logs to drag the answer kicking and screaming into the light. (sorry, it's annual review season. That always gets me blithe.) On 19 May 2014 13:03, Scott Hale computermacgy...@gmail.com wrote: Thanks all for the comments on my paper, and even more thanks to everyone sharing these super helpful ideas on filtering bots: this is why I love the Wikipedia research committee. I think Oliver is definitely right that this would be a useful topic for some piece of method-comparing research, if anyone is looking for paper ideas. Citation goldmine, as one friend called it, I think. This won't address edit logs to date, but do we know if most bots and automated tools use the API to make edits? If so, would it be feasible to add a flag to each edit as to whether it came through the API or not? This won't stop determined users, but might be a nice way to identify cyborg edits from those made manually by the same user for many of the standard tools going forward. The closest thing I found in the bug tracker is [1], but it doesn't address the issue of 'what is a bot', which this thread has clearly shown is quite complex. An API-edit vs. non-API edit might be a way forward unless there are automated tools/bots that don't use the API. 1. 
https://bugzilla.wikimedia.org/show_bug.cgi?id=11181 Cheers, Scott ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- Oliver Keyes Research Analyst Wikimedia Foundation ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
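The quick-and-dirty sampled-log parser Oliver offers to write never appears on-list, but a minimal sketch of the idea might look like the following. Everything here is an assumption for illustration: the real sampled request logs are internal to the WMF and formatted differently, and the tab-separated layout, field index, and function names are invented. The heuristic itself is grounded: API edits hit `api.php` with `action=edit`, while browser edits go through `index.php` with `action=submit`.

```python
# Illustrative sketch only: the log format below (tab-separated, URL in the
# third field) is hypothetical, not the actual WMF sampled-log format.

def is_api_edit(url):
    """Heuristically flag an edit made through the MediaWiki API.

    API edits hit api.php with action=edit; regular browser edits are
    submitted through index.php with action=submit.
    """
    return "api.php" in url and "action=edit" in url


def tally(lines, url_field=2):
    """Count API vs. non-API edit requests in an iterable of log lines."""
    counts = {"api": 0, "other": 0}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= url_field:
            continue  # malformed line; skip rather than crash
        url = fields[url_field]
        # Only edit-like requests count; plain pageviews are ignored.
        if "action=edit" in url or "action=submit" in url:
            counts["api" if is_api_edit(url) else "other"] += 1
    return counts
```

Run over a sample of log lines, `tally` would give a rough API-vs-manual edit split, which is all the "cyborg edit" question needs as a first pass.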
Re: [Wiki-research-l] Wikipedia traffic: selected language versions
Could you give an example of what we could do better than CLDR or the relevant ISO standards?

On 18 May 2014 10:06, h hant...@gmail.com wrote:
> Dear Nemo,
>
> As I am waiting for a more complete response, I am not sure that I
> understand what your last No, as in "No, we definitely can't", means. To
> clarify, take the CLDR supplement Language-Territory information for
> example:
> http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html
>
> One can suggest additions to the data by submitting sourced numbers for a
> geo-linguistic population like this:
> http://unicode.org/cldr/trac/newticket?description=%3Cterritory%2c%20speaker%20population%20in%20territory%2c%20and%20references%3Esummary=Add%20territory%20to%20Traditional%20Chinese%20(zh_Hant)
>
> In Wikipedia articles and Wikidata pages, there are many attempts to
> provide more up-to-date and better-sourced data points. I see the potential
> in exchanging such data and curating it better in Wikidata projects, as a
> more detailed and dynamic source than the CLDR. These data points would
> have extra benefits for curating traffic data. For one, these
> geo-linguistic population figures would be useful to normalize traffic data
> for further analysis, such as geographic normalization. For another, they
> provide important reference data for the development strategies and
> policies of the Wikipedia projects.
>
> Best,
> han-teng liao
>
> 2014-05-18 16:23 GMT+08:00 Federico Leva (Nemo) nemow...@gmail.com:
>
>> Thanks for your suggestions. Just some quick pointers below.
>>
>> h, 18/05/2014 08:26:
>>> (I-A). Tabulate the data points in absolute numbers first, not
>>> percentage numbers [...]
>>> (I-B). Include all language versions for the *editing traffic* report
>>> as well. [...]
>>> (I-C). Provide static data objects in a more accessible format (i.e.
>>> csv and/or json). [...]
>>> (II-A). Put viewing traffic and editing traffic reports on the same
>>> page. [...]
>>> (II-B). Organize and archive the traffic reports for historical
>>> comparison. [...]
>>> (I-C). Provide dynamic data objects in a more accessible format (i.e.
>>> csv and/or json).
>>
>> At least the first four are just changes in the WikiStats report
>> formatting; personally I encourage you to submit patches:
>> https://git.wikimedia.org/summary/analytics%2Fwikistats.git (it should be
>> the squids directory, but there is some ongoing refactoring of the
>> repos). On archives and history rewriting/report regeneration, see also
>> https://bugzilla.wikimedia.org/show_bug.cgi?id=46198
>>
>>> [...] (III-B). Smaller (i.e. more specific) geographic aggregate units.
>>> The country (geographic) information is often based on geo-IP
>>> databases, and sometimes provincial and city-level data would be
>>> available.
>>> http://lists.wikimedia.org/pipermail/wikitech-l/2014-April/075964.html
>>> [...] (I know that the Unicode Common Locale Data Repository (CLDR
>>> Version 25, http://cldr.unicode.org/index/downloads/cldr-25) provides
>>> "language-territory"
>>> http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html
>>> or "territory-language"
>>> http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html
>>> unit-based charts, but I believe that the Wikimedia projects can use
>>> and build a better one.) [...]
>>
>> No, we definitely can't, not alone. I've asked for help; please
>> contribute:
>> https://www.mediawiki.org/wiki/Universal_Language_Selector/FAQ#How_does_Universal_Language_Selector_determine_which_languages_I_may_understand
>>
>> Nemo

--
Oliver Keyes
Research Analyst
Wikimedia Foundation

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
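The geographic normalization han-teng liao describes (dividing a language edition's traffic by its speaker population, so editions of very different sizes become comparable) can be sketched in a few lines. The function name and the sample figures below are purely illustrative; real inputs would come from WikiStats traffic reports and CLDR/Wikidata population data.

```python
# Illustrative sketch: normalize per-language pageviews by speaker
# population. Both input dicts are hypothetical stand-ins for real
# WikiStats and CLDR/Wikidata data.

def views_per_speaker(pageviews, speakers):
    """Return pageviews per speaker for every language present in both
    dicts, skipping languages with no (or zero) population data."""
    return {
        lang: pageviews[lang] / speakers[lang]
        for lang in pageviews
        if lang in speakers and speakers[lang] > 0
    }
```

A made-up usage example: with `pageviews = {"en": 8_000_000, "ca": 120_000}` and `speakers = {"en": 1_500_000_000, "ca": 10_000_000}`, the Catalan edition comes out at 0.012 views per speaker, a figure directly comparable with English despite the raw-traffic gap.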
[Wiki-research-l] Fwd: [Wmfall] Next research data showcase: tomorrow at 11.30
Beginning in 10 minutes :) Public stream link: https://www.youtube.com/watch?v=bozyc1z25aQ

-- Forwarded message --
From: Dario Taraborelli dtarabore...@wikimedia.org
Date: 18 March 2014 20:42
Subject: [Wmfall] Next research data showcase: tomorrow at 11.30
To: wmf...@lists.wikimedia.org Staff wmf...@lists.wikimedia.org

The next Research Data showcase (https://www.mediawiki.org/wiki/Analytics/Research_and_Data/Showcase) will be live-streamed tomorrow at 11.30 PT. (The streaming link will be posted on the list a few minutes before the showcase starts. Those of you who are in the SF office can join us in Yongle.) This month's program is below; we look forward to seeing you.

Dario

*Metrics standardization* (Dario)
In this talk I'll present the most recent updates on our work on participation metrics and discuss the goals of the Editor Engagement Vital Signs project.

*Wikipedia's rise and decline* (Aaron)
In Halfaker et al. (2013) we present data showing that several changes the Wikipedia community made to manage quality and consistency in the face of a massive growth in participation have ironically crippled the very growth they were designed to manage. Specifically, the restrictiveness of the encyclopedia's primary quality control mechanism and the algorithmic tools used to reject contributions are implicated as key causes of decreased newcomer retention.

___
Wmfall mailing list
wmf...@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wmfall

--
Oliver Keyes
Product Analyst
Wikimedia Foundation
Re: [Wiki-research-l] Wikinews original reporting value as a measure of news events
Questions:
- What are those 22 variables?
- How many datapoints did you get, distributed between how many categories?
- How are you measuring correlation? Are we talking Pearson's?

On Sat, Sep 7, 2013 at 1:30 PM, Laura Hale la...@fanhistory.com wrote:
> https://meta.wikimedia.org/wiki/Research:Wikinews_original_reporting_value_as_a_measure_of_news_events
>
> This is the first in a series of research pieces I am doing as part of
> program design efforts for The Wikinewsie Group. Any feedback would be
> appreciated. :)
>
> Sincerely,
> Laura Hale
>
> --
> twitter: purplepopple
> blog: ozziesport.com

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
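For readers who don't know the Pearson's r that Oliver asks about: it is the product-moment correlation coefficient, ranging from -1 (perfect negative linear relationship) to +1 (perfect positive). A minimal self-contained implementation, with made-up data in the test usage only:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences.

    Assumes neither sequence is constant (a constant sequence has zero
    standard deviation, making the coefficient undefined).
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Covariance term (unnormalized) and the two standard-deviation terms.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

For example, `pearson_r([1, 2, 3], [2, 4, 6])` is 1.0, since the second series is an exact linear function of the first.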
Re: [Wiki-research-l] Wikinews original reporting value as a measure of news events
The only list I see there has 18.

On Sun, Sep 8, 2013 at 3:49 AM, Laura Hale la...@fanhistory.com wrote:
> https://en.wikipedia.org/wiki/News_values
>
> The 22 items in the list there.
>
> Sincerely,
> Laura Hale
>
> On Sunday, September 8, 2013, Oliver Keyes wrote:
>> Questions: What are those 22 variables? How many datapoints did you get,
>> distributed between how many categories? How are you measuring
>> correlation? Are we talking Pearson's?
>>
>> On Sat, Sep 7, 2013 at 1:30 PM, Laura Hale la...@fanhistory.com wrote:
>>> https://meta.wikimedia.org/wiki/Research:Wikinews_original_reporting_value_as_a_measure_of_news_events
>>> This is the first in a series of research pieces I am doing as part of
>>> program design efforts for The Wikinewsie Group. Any feedback would be
>>> appreciated. :)
>>> Sincerely, Laura Hale
>
> --
> mobile: 635209416
> twitter: purplepopple
> blog: ozziesport.com

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Why are users blocked on Wikipedia?
On Sun, May 5, 2013 at 7:39 AM, Laura Hale la...@fanhistory.com wrote:

> On Sat, May 4, 2013 at 2:43 PM, Federico Leva (Nemo) nemow...@gmail.com
> wrote:
>
>> Pine, 04/05/2013 08:36:
>>> Ironholds, would you be interested in investigating how stewards,
>>> global sysops, and global rollbackers might be helpful in dealing with
>>> the spam problem, especially for small wikis, and what new steps would
>>> be useful?
>>
>> I doubt they need suggestions, they need tools:
>> https://www.mediawiki.org/wiki/Admin_tools_development
>>
>> The question is rather how much they are already helping: botspam,
>> obvious crosswiki vandalism and NOP are mostly handled globally,[1] so
>> local logs can only help assess what is consuming the local communities'
>> time, not what the true menaces are. In the worst case, of course, you
>> may even be measuring the excuses to block rather than the most
>> important problems users were creating (similarly to Al Capone ;) ).
>
> The following is an analysis of the entire block log on English Wikinews.
> It is currently at
> https://en.wikinews.org/wiki/User:LauraHale/Blocks_on_English_Wikinews
>
> *Ironholds wrote a summary of problems on English Wikipedia viewed through
> block logs (http://blog.ironholds.org/?p=31) in late April. This is
> nominally based on that research, to the extent that it is inspired by it,
> in terms of understanding blocking on English Wikinews.*
>
> As referenced on the WMF research list, the issue of blocking is
> potentially a very big deal for smaller projects. Problems can easily
> overwhelm a small community if there is not an active community patrolling
> recent changes in addition to the content work it is engaged in. For
> English Wikinews, there were 22 active reporters in January 2013
> (http://stats.wikimedia.org/wikinews/EN/TablesWikipediansEditsGt5.htm).
> (This is tiny: with SUL, English Wikinews has 720,753 total registered
> users, of which 0.00069% were active in January.) At the same time, there
> were 64 blocks made that month. 39 of these blocks were for spam.
>
> English Wikinews is one of the fortunate smaller projects: we have two
> local Check Users (https://en.wikinews.org/wiki/Wikinews:CU) who respond
> quickly to problems. We generally have at least one admin awake and
> monitoring recent changes at any given time. We have global CUs who can,
> and sometimes do, come in and block the big problems. Thus, we can deal
> with the automated problem quite easily.
>
> Between English Wikinews's opening and 27 April 2013, there were 15,105
> un/blocks. The following is based on the complete block log. On the
> project, 4 types of block action exist on English Wikinews: block,
> unblock, log action removed, and changed block settings for. They all
> appear in the same block log, so the 15,105 figure is not total blocks but
> total block-related actions. (If you harass a user, get blocked for a week
> for it, get unblocked after promising to behave, then get reblocked and
> have your block extended, those are three distinct types of action. If you
> do that with an offensive user name, the log action may be hidden, which
> is a fourth type of action.)
>
> Since 2005, 99 different people have taken an administrative blocking
> action on English Wikinews. While there are currently only 36 admins on
> English Wikinews, the number used to be higher, and local policy is that
> if you do not use your admin privileges, you lose them. This is to prevent
> potential abuse and to make sure all admins are aware of current policy,
> in order to prevent wheel-warring and other things potentially damaging to
> the community.
>
> Amgine has blocked the most users on the project, with 7,162 block-related
> actions. Brian McNeil is second with 1,522; he has been less active in the
> past 18 months or so. Cirt is third with 723; Cirt is our most active
> local Check User. Pi zero is fourth with 503 block-related administrative
> actions. Tempodivalse, who is no longer involved with the project, rounds
> out the top five with 487. Amongst the next five administrators with
> blocks, only one is actively involved on the administrator side:
> Cspurrier, also a Check User, who is ranked seventh for total
> administrator block-related actions with 343.
>
> Who is getting blocked? There are 14,815 total entries tied to specific
> accounts where an administrative blocking action was taken on the account,
> of which 12,643 are unique accounts. This means that there are 1,100
> accounts with 2 or more block-related administrative actions taken
> regarding them. This is somewhat ambiguous; 2 or more block-related
> actions could mean either multiple blocks or a block and a single unblock.
> There were a number of people with large numbers of block-related
> administrator actions taken in relation to the account, including
> Neutralizer with 71,