Re: [Wiki-research-l] Upcoming research newsletter: new papers open for review
Great idea Heather, I will add my name to my review. Do you know of any other review sites that aggregate in a wiki way that we could emulate?

Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023

From: wiki-research-l-boun...@lists.wikimedia.org on behalf of Tilman Bayer <tba...@wikimedia.org>
Sent: Tuesday, February 25, 2014 9:16 AM
To: Research into Wikimedia content and communities
Cc: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Wiki-research-l] Upcoming research newsletter: new papers open for review

Hi Heather,

that's a cool idea, and we have actually been considering something like this already. While the names of the reviewers are prominently displayed in the byline on top (and many readers of the Signpost and the newsletter are of course experienced in reading version histories), showing them next to each review might make attribution easier. We just haven't found the time to implement it yet, as with many other things for the newsletter. You are welcome to figure out a suitable format and add these attributions in the upcoming issue; let's follow up off-list if more information is needed.

On Tue, Feb 25, 2014 at 1:11 AM, Heather Ford <hfor...@gmail.com> wrote:

Thanks, Dario, Tilman! I was wondering whether it would be helpful to add reviewer names/usernames to individual Signpost reviews. Reading a review of a paper in the Signpost recently, I was struck that the reviewer was inserting some very opinionated statements about the article rather than the usual summaries. While I don't think that this is necessarily a problem (although I wish they had been a bit more informed about the topic and social science research in general), I do think it can be problematic to have these comments unattributed. Would be interested to hear what others think...

Best,
Heather.

Heather Ford
Oxford Internet Institute Doctoral Programme
EthnographyMatters | Oxford Digital Ethnography Group
http://hblog.org | @hfordsa

On 25 February 2014 05:26, Tilman Bayer <tba...@wikimedia.org> wrote:

Hi Max,

yes, we're co-publishing with the Signpost, so the ultimate deadline is the Signpost's actual publication time. Its formal publication date is this Wednesday (the 26th) UTC, although actual publication might take place several hours or even a few days later. Thanks for signing up to review the "Editor's Biases" paper, I'm looking forward to reading your summary!

On Mon, Feb 24, 2014 at 3:39 PM, Klein,Max <kle...@oclc.org> wrote:

Dario, what's the timeframe for writing reviews so they can get into the Signpost in time? The 25th?

Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023

From: wiki-research-l-boun...@lists.wikimedia.org on behalf of Dario Taraborelli <dtarabore...@wikimedia.org>
Sent: Monday, February 24, 2014 8:11 AM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.; Research into Wikimedia content and communities
Subject: [Wiki-research-l] Upcoming research newsletter: new papers open for review

Hi everybody,

with CSCW just concluded and conferences like CHI and WWW coming up, we have a good set of papers to review for the February issue of the Research Newsletter [1]. Please take a look at https://etherpad.wikimedia.org/p/WRN201402 and add your name next to any paper you are interested in reviewing.
As usual, short notes and one-paragraph reviews are most welcome. Instead of contacting past contributors only, this month we're experimenting with a public call for reviews cross-posted to analytics-l and wiki-research-l. If you have any questions about the format or process, feel free to get in touch off-list.

Dario Taraborelli and Tilman Bayer

[1] http://meta.wikimedia.org/wiki/Research:Newsletter

--
Tilman Bayer
Senior Operations Analyst (Movement Communications)
Wikimedia Foundation
IRC (Freenode): HaeB
Re: [Wiki-research-l] Upcoming research newsletter: new papers open for review
Dario, what's the timeframe for writing reviews so they can get into the Signpost in time? The 25th?

Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023

From: wiki-research-l-boun...@lists.wikimedia.org on behalf of Dario Taraborelli <dtarabore...@wikimedia.org>
Sent: Monday, February 24, 2014 8:11 AM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.; Research into Wikimedia content and communities
Subject: [Wiki-research-l] Upcoming research newsletter: new papers open for review

Hi everybody,

with CSCW just concluded and conferences like CHI and WWW coming up, we have a good set of papers to review for the February issue of the Research Newsletter [1]. Please take a look at https://etherpad.wikimedia.org/p/WRN201402 and add your name next to any paper you are interested in reviewing. As usual, short notes and one-paragraph reviews are most welcome. Instead of contacting past contributors only, this month we're experimenting with a public call for reviews cross-posted to analytics-l and wiki-research-l. If you have any questions about the format or process, feel free to get in touch off-list.

Dario Taraborelli and Tilman Bayer

[1] http://meta.wikimedia.org/wiki/Research:Newsletter
Re: [Wiki-research-l] Preexisting Researchers on Metrics for Users?
Thanks Nemo, I'll re-read that discussion. I think that conversation is where I became tentative about using byte or edit counts.

Aaron, in my own search I also noticed the paper you wrote with Geiger about counting edit hours and edit sessions. [1] Calculating content persistence is a bit too heavyweight for me right now, since I am trying to submit to ACM Web Science in 2 weeks (whose CFP was just posted to this list). The technique looks great though, and I would like to help support making a WMF Labs tool that can return this measure. It seems like I could calculate approximate edit-hours just from looking at Special:Contributions timestamps. Is that correct? Would you suggest this route?

[1] http://www-users.cs.umn.edu/~halfak/publications/Using_Edit_Sessions_to_Measure_Participation_in_Wikipedia/geiger13using-preprint.pdf

Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023

From: wiki-research-l-boun...@lists.wikimedia.org on behalf of Aaron Halfaker <aaron.halfa...@gmail.com>
Sent: Friday, February 07, 2014 7:12 AM
To: Research into Wikimedia content and communities
Subject: Re: [Wiki-research-l] Preexisting Researchers on Metrics for Users?

Hey Max,

There's a class of metrics that might be relevant to your purposes. I refer to them as content persistence metrics and wrote up some docs about how they work, including an example. See https://meta.wikimedia.org/wiki/Research:Content_persistence.

I gathered a list of papers below to provide a starting point. I've included links to open access versions where I could. These metrics are a little bit painful to compute due to the computational complexity of diffs, but I have some hardware to throw at the problem and another project that's bringing me in this direction, so I'd be interested in collaborating.

Priedhorsky, Reid, et al. Creating, destroying, and restoring value in Wikipedia. Proceedings of the 2007 international ACM conference on Supporting group work. ACM, 2007. http://reidster.net/pubs/group282-priedhorsky.pdf
* Describes "persistent word views", which is a measure of value added per editor. (IMO, value actualized.)

B. Thomas Adler, Krishnendu Chatterjee, Luca de Alfaro, Marco Faella, Ian Pye, and Vishwanath Raman. 2008. Assigning trust to Wikipedia content. In Proceedings of the 4th International Symposium on Wikis (WikiSym '08). ACM, New York, NY, USA, Article 26, 12 pages. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.141.2047&rep=rep1&type=pdf
* Describes a complex strategy for assigning trustworthiness to content based on implicit review. See http://wikitrust.soe.ucsc.edu/

Halfaker, A., Kittur, A., Kraut, R., Riedl, J. (2009, October). A jury of your peers: quality, experience and ownership in Wikipedia. In Proceedings of the 5th International Symposium on Wikis and Open Collaboration (p. 15). ACM. http://www-users.cs.umn.edu/~halfak/publications/A_Jury_of_Your_Peers/halfaker09jury-personal.pdf
* Describes the use of persistent word revisions per word as a measure of article contribution quality.

Halfaker, A., Kittur, A., Riedl, J. (2011, October). Don't bite the newbies: how reverts affect the quantity and quality of Wikipedia work. In Proceedings of the 7th International Symposium on Wikis and Open Collaboration (pp. 163-172). ACM. http://www-users.cs.umn.edu/~halfak/publications/Don't_Bite_the_Newbies/halfaker11bite-personal.pdf
* Describes the use of raw persistent word revisions as a measure of editor productivity.
* Looking back on the study, I think I'd rather use log(# of revisions a word persists) * words.

-Aaron

On Fri, Feb 7, 2014 at 1:48 AM, Federico Leva (Nemo) <nemow...@gmail.com> wrote:

Sort of related: an ongoing education@ discussion of student evaluation criteria. http://thread.gmane.org/gmane.org.wikimedia.education/854

Nemo
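A minimal sketch of the approximate edit-hours idea in Max's question above: group contribution timestamps from the API's list=usercontribs module (the machine-readable counterpart of Special:Contributions) into edit sessions and sum the session durations. The one-hour cutoff, the 500-edit cap, and the username are assumptions and placeholders, not values taken from the thread.

import requests
from datetime import datetime

API = "https://en.wikipedia.org/w/api.php"
CUTOFF_SECONDS = 3600  # assumed session-gap cutoff; not a value stated in the thread

def contribution_timestamps(username, limit=500):
    # Fetch up to `limit` contribution timestamps via list=usercontribs,
    # the API counterpart of Special:Contributions.
    params = {
        "action": "query", "format": "json",
        "list": "usercontribs", "ucuser": username,
        "ucprop": "timestamp", "uclimit": min(limit, 500),
    }
    data = requests.get(API, params=params).json()
    stamps = [c["timestamp"] for c in data["query"]["usercontribs"]]
    return sorted(datetime.strptime(t, "%Y-%m-%dT%H:%M:%SZ") for t in stamps)

def edit_sessions(timestamps, cutoff=CUTOFF_SECONDS):
    # Split a chronologically sorted list of timestamps into sessions
    # wherever the gap between consecutive edits exceeds the cutoff.
    sessions = []
    for t in timestamps:
        if sessions and (t - sessions[-1][-1]).total_seconds() <= cutoff:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

def approximate_edit_hours(sessions):
    # Sum within-session durations (last edit minus first edit).
    # Single-edit sessions count as zero here, so this underestimates
    # total editing time.
    total = sum((s[-1] - s[0]).total_seconds() for s in sessions)
    return total / 3600.0

timestamps = contribution_timestamps("ExampleUser")  # placeholder username
sessions = edit_sessions(timestamps)
print(len(sessions), "sessions,",
      round(approximate_edit_hours(sessions), 1), "approximate edit-hours")

This only covers the most recent 500 contributions; paging with uccontinue would be needed to cover a full edit history.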
[Wiki-research-l] Preexisting Researchers on Metrics for Users?
Hello All,

Can you point me to research on, or do you have ideas about, metrics of user performance? I know edit count and total bytes have their limitations. Right now I am counting the occurrences of "thank", "appreciate", and "barnstar" for a user in the User talk namespace (and recursive subpages). What else is there?

Let me explain more about my current project. I am trying to develop some new techniques to measure user and article performance. I am repurposing the bipartite economics trade model of countries-products, but instead using editors-articles. This means that I arrive at a new metric for users and for articles. Now I am calibrating some of the variables in this model by comparing my results to exogenous variables. For pages, I use the metrics that this list pointed me to last time, like the actionable metrics from GroupLens and the cleanup tags from Stein. (Thanks, list!) When I rank articles in a category using my economics method versus the article-text methods, I achieve a 0.7 Spearman correlation. Using my count-thanks-on-user-talk method for users in the user domain, I achieve a 0.50 Spearman rank correlation, which is still quite good, but I want to make sure there aren't better baselines to compare against.

Thanks,

Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023
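A minimal sketch of the count-thanks-on-user-talk baseline described above, using the API's action=parse to fetch the top-level user talk page only (the recursive subpages mentioned above are omitted) and scipy for the rank correlation. The usernames and the model scores in the comparison are placeholders.

import re
import requests
from scipy.stats import spearmanr

API = "https://en.wikipedia.org/w/api.php"
TERMS = ("thank", "appreciate", "barnstar")

def gratitude_score(username):
    # Fetch the wikitext of the user's top-level talk page and count
    # case-insensitive occurrences of the gratitude terms.
    params = {
        "action": "parse", "format": "json",
        "page": "User talk:" + username, "prop": "wikitext",
    }
    resp = requests.get(API, params=params).json()
    text = resp["parse"]["wikitext"]["*"].lower()
    return sum(len(re.findall(term, text)) for term in TERMS)

# Compare a model's user ranking against the gratitude baseline
# (placeholder usernames and placeholder model scores).
editors = ["ExampleUserA", "ExampleUserB", "ExampleUserC"]
baseline = [gratitude_score(u) for u in editors]
model_scores = [0.9, 0.4, 0.7]
rho, p_value = spearmanr(model_scores, baseline)
print("Spearman rank correlation:", rho)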
[Wiki-research-l] Polling the watchers of a page. Possible?
Hello Research,

Is it possible to query for the watchers of a page? It does not seem to be in the API, nor is the watchers or wl_user table in the database replicas (where I thought MediaWiki stores it). I imagine this is for privacy reasons, correct? If so, how would one gain access?

I have been talking with an econophysicist who thinks that we could apply a contagion algorithm to see which edits are contagious. (I met this econophysicist at the Berkeley Data Science Faire at which Wikimedia Analytics presented, so it was worth it in the end.)

Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023
Re: [Wiki-research-l] How to collect all the admin-specific edits for a subset of Wp admins
Hello Jerome,

I'm not sure this is the best way, but pywikipediabot [1] has a library called pagegenerators.py, and there is a function def UserContributionsGenerator(username) (around line 706). That would allow you to iterate through these usernames, and I bet there will be a special marking for deletions/undeletions. If not, worst comes to worst, you can use a regular expression for those words.

[1] https://meta.wikimedia.org/wiki/pywikipediabot

When you have a pywikibot-hammer, everything looks like a pywikibot-nail!

Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023

From: wiki-research-l-boun...@lists.wikimedia.org on behalf of Jérôme Hergueux <jerome.hergu...@gmail.com>
Sent: Thursday, October 10, 2013 3:11 AM
To: wiki-research-l@lists.wikimedia.org
Subject: [Wiki-research-l] How to collect all the admin-specific edits for a subset of Wp admins

Dear all,

I am starting this thread in the hope that some of the great Wiki researchers on this list could advise me on a data collection problem. Here is the question: for each of 120 Wikipedia admins (for whom I have the usernames and unique numeric ids), I would like to reliably count the number of times they (i) deleted a page, (ii) undeleted (i.e. restored) a page, (iii) protected a page, (iv) blocked a user and (v) unblocked a user.

Those types of edits all correspond to a specific action in the Wikipedia API documentation page (http://en.wikipedia.org/w/api.php): action=delete, action=undelete, action=protect, action=block and action=unblock. I don't know, however, what would be the best strategy to go about collecting those edits. Does anyone have an idea about which data collection strategy I should adopt in this case? Is there a way to query the Wikipedia API directly, or should I look for some specific markers in the edit summaries? I would be very grateful for any advice or feedback! Thanks much for your attention and time. :)

Best,
Jérôme.
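One hedged alternative to regexing edit summaries: the admin actions listed above (delete, restore, protect, block, unblock) are recorded in the public logs, which the standard MediaWiki API exposes through list=logevents. A minimal sketch that tallies log actions per admin follows; the username is a placeholder, and paging and error handling are omitted.

from collections import Counter
import requests

API = "https://en.wikipedia.org/w/api.php"

def admin_action_counts(username, log_types=("delete", "protect", "block")):
    # Tally public log entries performed by `username`, keyed by the log
    # entry's "action" field, which distinguishes e.g. delete vs. restore
    # and block vs. unblock within a single log type. Only the most recent
    # 500 entries per type are counted; add lecontinue paging for full counts.
    counts = Counter()
    for log_type in log_types:
        params = {
            "action": "query", "format": "json",
            "list": "logevents", "letype": log_type,
            "leuser": username, "lelimit": 500,
        }
        data = requests.get(API, params=params).json()
        for entry in data["query"]["logevents"]:
            counts[entry["action"]] += 1
    return counts

print(admin_action_counts("ExampleAdmin"))  # placeholder username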
Re: [Wiki-research-l] diffdb formatted Wikipedia dump
Susan,

Hmm, it seems like that is a funny middle ground, where it's too long to fetch live, although it's probably less than 158 days. I once read and edited 400,000 pages with pywikibot (3 network I/O calls per page: read, external API, write) in about 20 days. You would have to make two I/O calls (read, getHistory) per user page. I don't know how many user pages there are, but that might give you enough numbers to work out the feasibility calculation you need.

If you are dead set on using Hadoop, maybe you could use the Wikimedia Labs XGrid (https://wikitech.wikimedia.org/wiki/Main_Page). They have some monster power, and it's free for bot operators and other tool runners. Maybe it's also worth asking on there if someone already has wikihadoop set up.

Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023

From: wiki-research-l-boun...@lists.wikimedia.org on behalf of Susan Biancani <inacn...@gmail.com>
Sent: Tuesday, October 08, 2013 3:28 PM
To: Research into Wikimedia content and communities
Subject: Re: [Wiki-research-l] diffdb formatted Wikipedia dump

Right now, I want all the edits to user pages and user talk pages, 2010-2013. But as I keep going with this project, I may want to expand a bit, so I figured if I was going to run the wikihadoop software, I might as well only do it once. I'm hesitant to do this via web scraping, because I think it'll take much longer than working with the dump files. However, if you have suggestions on how to get the diffs (or a similar format) efficiently from the dump files, I would definitely love to hear them. I appreciate the help and advice!

On Mon, Oct 7, 2013 at 10:44 AM, Pierre-Carl Langlais <pierrecarl.langl...@gmail.com> wrote:

I agree with Klein. If you do not need to exploit the entire Wikipedia database, requests through a Python scraping library (like Beautiful Soup) are certainly sufficient and easy to set up. With a random algorithm to select the IDs, you can create a fine sample.

PCL

On 07/10/13 19:31, Klein,Max wrote:

Hi Susan,

Do you need the entire database diff'd, i.e. all edits ever? Or are you interested in a particular subset of the diffs? It would help to know your purpose. For instance, I am interested in diffs around specific articles for specific dates to study news events, so I calculate the diffs myself using Python on page histories rather than the entire database.

Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023

From: wiki-research-l-boun...@lists.wikimedia.org on behalf of Susan Biancani <inacn...@gmail.com>
Sent: Thursday, October 03, 2013 10:06 PM
To: wiki-research-l@lists.wikimedia.org
Subject: [Wiki-research-l] diffdb formatted Wikipedia dump

I'm looking for a dump from English Wikipedia in diff format (i.e. each entry is the text that was added/deleted since the last edit, rather than each entry is the current state of the page). The Summer of Research folks provided a handy guide to how to create such a dataset from the standard complete dumps here: http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff

But the time estimate they give is prohibitive for me (20-24 hours for each dump file, of which there are currently 158, running on 24 cores). I'm a grad student in a social science department, and don't have access to extensive computing power. I've been paying out of pocket for AWS, but this would get expensive.

There is a diff-format dataset available, but only through April 2011 (here: http://dumps.wikimedia.org/other/diffdb/). I'd like to get a diff-format dataset for January 2010 - March 2013 (or for everything up to March 2013). Does anyone know if such a dataset exists somewhere? Any leads or suggestions would be much appreciated!

Susan
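A minimal sketch of the calculate-the-diffs-yourself route Max mentions above: pull the most recent revisions of a single article with prop=revisions and diff consecutive texts with Python's difflib. The article title and revision limit are placeholders; continuation and error handling are omitted.

import difflib
import requests

API = "https://en.wikipedia.org/w/api.php"

def recent_revision_texts(title, limit=5):
    # Fetch the ids and full wikitext of the most recent `limit` revisions.
    params = {
        "action": "query", "format": "json",
        "prop": "revisions", "titles": title,
        "rvprop": "ids|content", "rvlimit": limit,
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    page = next(iter(pages.values()))
    # The API returns newest first; reverse into chronological order.
    return list(reversed([(r["revid"], r["*"]) for r in page["revisions"]]))

def revision_diffs(title, limit=5):
    # Produce a unified diff for each pair of consecutive revisions.
    revs = recent_revision_texts(title, limit)
    diffs = []
    for (old_id, old_text), (new_id, new_text) in zip(revs, revs[1:]):
        delta = difflib.unified_diff(
            old_text.splitlines(), new_text.splitlines(),
            fromfile=str(old_id), tofile=str(new_id), lineterm="")
        diffs.append("\n".join(delta))
    return diffs

for diff in revision_diffs("Example article"):  # placeholder title
    print(diff[:500])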