Hi Morten,

Thanks a lot for your advice and code!!!

Shiyue

2016-06-16 0:08 GMT+08:00 Morten Wang <[email protected]>:

> Hi Shiyue,
>
> Whether you choose to use a set time period (e.g. 6 months like Kittur &
> Kraut) or use assessment changes as your criteria, there are additional
> factors you'll have to consider. If you use a set time period there are at
> least three issues you'll need to consider: 1) do articles change quality
> at the same pace? 2) how long before the start of your time period did an
> article get its assessment? 3) what happened to your article between its
> assessment and the start of your time period?
>
> If you instead choose to use rating changes, you have the issue that those
> happen at different times, so you'll have to control for the time elapsed
> between them if you're comparing articles to each other, as well as perhaps
> trying to figure out if an article has an inherent probability for change.
> As long as you consider these types of related issues and control for them,
> your approach should be sane.
>
> I've put the code up on GitHub:
> https://github.com/nettrom/assessments/blob/master/clean-training-set.py
> It uses a few support files that are all in the same repository
> (https://github.com/nettrom/assessments): assessment.py, db.py, and
> revisions.py
>
> Since I have a Tool Labs[1] account and Pywikibot[2] already set up, the
> code is written to use the replicated databases for fetching revisions and
> such, and Pywikibot as the library to interact with Wikipedia's API.
> Neither of those is a hard requirement: you can use the API instead of the
> database access, and switch Pywikibot out with your favourite way of
> accessing the API :) It also uses mwparserfromhell[3] to parse the
> wikitext. I don't know of a better parser to use, but if you have one feel
> free to use that instead.
>
> References:
> 1: https://tools.wmflabs.org
> 2: https://www.mediawiki.org/wiki/Manual:Pywikibot
> 3: http://mwparserfromhell.readthedocs.io/en/latest/
>
> Cheers,
> Morten
>
>
> On 13 June 2016 at 09:03, Shiyue Zhang <[email protected]> wrote:
>
>> Hi Morten,
>>
>> Thanks a lot for your reply!!! I have read your paper: Tell me more: An
>> actionable quality model for Wikipedia. Thanks for introducing me to your
>> other work in CSCW 2015; I will read it later.
>>
>> I saw your data. As you mentioned, it only has the revisions when the
>> assessment changed. But I would prefer to get all of the revisions between
>> two assessment changes, since I want to study what makes the quality change
>> and to predict the quality change. Previously, I considered adopting Kittur
>> et al.'s formalization of quality changes in 6 months [1]. The problem is
>> that I cannot get the precise quality at the start and end points of the
>> 6-month period. Now I think I can take the period between two assessment
>> changes, though it is also not a perfect answer if articles are not
>> regularly assessed, as Kerry and Andrew mentioned.
>>
>> I know you have a lot of experience in Wikipedia quality research. Could
>> you give me some advice or references about the quality change study? And
>> it would be great if you could give me your Python code to get the
>> data. I can modify it to get the data I need. Thanks a lot!
>>
>> References:
>> [1] Kittur A, Kraut R E. Harnessing the wisdom of crowds in wikipedia:
>> quality through coordination[C]// ACM Conference on Computer Supported
>> Cooperative Work. ACM, 2008:37-46.
>>
>> Cheers,
>> Shiyue
>>
>>
>> 2016-06-10 23:20 GMT+08:00 Morten Wang <[email protected]>:
>>
>>> Hi Shiyue,
>>>
>>> The issues around assessments that have been brought up are valid and
>>> useful to keep in mind when trying to build machine learners that do
>>> quality predictions.
>>> That being said, the ORES quality classifier[1] is (AFAIK)
>>> trained on a dataset[2] that I've gathered based on the method I used to
>>> get a dataset to train the classifier used in our CSCW 2015 paper[3]. The
>>> revisions that are in that dataset were gathered by taking a snapshot of
>>> the quality assessment classes and then walking backwards through the talk
>>> page revision history to find the time when the assessment changed, and
>>> then grabbing the revision of the article at that timestamp. If you want
>>> Python code instead of the dataset, let me know.
>>>
>>> The team behind ORES has also been working on writing scripts that'll do
>>> assessment extractions (see for instance [4]), in case you want to process
>>> a dump and get all of them. So far our experience with that is that it
>>> leads to slightly lower performance. Although we're uncertain as to why, my
>>> guess is that the dataset is noisier, perhaps due to changing quality
>>> criteria as Andrew points to.
>>>
>>> Please do get in touch if you have any questions!
>>>
>>> References:
>>> 1: https://meta.wikimedia.org/wiki/ORES/wp10
>>> 2: https://figshare.com/articles/English_Wikipedia_Quality_Asssessment_Dataset/1375406
>>> 3: http://www-users.cs.umn.edu/~morten/publications/cscw2015-improvementprojects.pdf,
>>> see Appendix A for info on the classifier
>>> 4: https://github.com/wiki-ai/wikiclass/blob/master/wikiclass/extractors/enwiki.py
>>>
>>> Cheers,
>>> Morten
>>>
>>>
>>> On 10 June 2016 at 00:59, Andrew Gray <[email protected]> wrote:
>>>
>>>> Hi Shiyue,
>>>>
>>>> I agree with Kerry - these ratings probably won't do what you need, in
>>>> that case. Sorry!
>>>>
>>>> We simply don't have the people (or the enthusiasm) required to do
>>>> regular updates and I'd guess many are well over five years 'stale' since
>>>> last rating - and most will only ever have been rated once.
>>>>
>>>> There's a second complicating factor for old ratings - not only are
>>>> they stale, but the general standards for that rating might have changed.
>>>> (See eg
>>>> http://www.generalist.org.uk/blog/2010/quality-versus-age-of-wikipedias-featured-articles/
>>>> for a demonstration of that last point - it would be interesting to use
>>>> ORES to do a bigger sample)
>>>>
>>>> Andrew.
>>>> On 10 Jun 2016 07:13, "Shiyue Zhang" <[email protected]> wrote:
>>>>
>>>>> Hi Kerry,
>>>>>
>>>>> Thanks a lot for your reply! Honestly, I was not aware of the problem
>>>>> you mentioned, that many WikiProjects don't do regular quality assessment.
>>>>> This problem really matters to me, because I want to get the relatively
>>>>> true quality of a revision of an article. I know Aaron's automated quality
>>>>> assessment tool, but it is also based on a machine learning classifier,
>>>>> and automatically predicting quality, especially quality change, is also
>>>>> my goal. So I can't take the results of this tool as my ground truth.
>>>>>
>>>>> 2016-06-10 12:16 GMT+08:00 Kerry Raymond <[email protected]>:
>>>>>
>>>>>> If you are not aware of it, many wikiprojects don’t do any kind of
>>>>>> regular quality assessment. Often an article is project-tagged and
>>>>>> assessed when it’s new (which generally means the quality is assessed
>>>>>> stub/start/C) and then it’s never re-assessed unless someone working
>>>>>> on it is trying to get it to GA or similar and hence actively requests
>>>>>> assessment.
>>>>>>
>>>>>> So it’s easy for an article to be much better quality (or even much
>>>>>> worse quality, although that’s probably less likely) than its current
>>>>>> assessment.
>>>>>>
>>>>>> I think you might do better to use Aaron’s automated quality
>>>>>> assessment tool and apply it to different versions of a set of articles
>>>>>> and see how that changes over time.
>>>>>> Whatever the deficiencies of an automated
>>>>>> tool, I suspect it’s still more reliable than the human processes that we
>>>>>> actually have. But I guess it depends on whether the focus of your study
>>>>>> is the quality of articles or the process of assessing the quality of
>>>>>> articles. My sense is that you are interested in the former rather than
>>>>>> the latter.
>>>>>>
>>>>>> Kerry
>>>>>>
>>>>>> *From:* Wiki-research-l [mailto:[email protected]]
>>>>>> *On Behalf Of* Shiyue Zhang
>>>>>> *Sent:* Friday, 10 June 2016 12:42 PM
>>>>>> *To:* Research into Wikimedia content and communities <[email protected]>
>>>>>> *Subject:* Re: [Wiki-research-l] How to get the exact date when an
>>>>>> article get a quality promotion?
>>>>>>
>>>>>> Hi Pine,
>>>>>>
>>>>>> Thanks for your reply. Yes, it is English Wikipedia. What I want is
>>>>>> exactly the timestamp of an article's quality rating change. I know
>>>>>> the particular diffs shouldn't be considered the reason why the quality
>>>>>> rating changed. I'm trying to get a prediction of quality change beyond a
>>>>>> certain time period, so I need the start and end quality of the time
>>>>>> period.
>>>>>>
>>>>>> I hope anyone who has experience with this problem can give me some
>>>>>> advice. Thanks a lot!!!
>>>>>>
>>>>>> 2016-06-10 9:47 GMT+08:00 Pine W <[email protected]>:
>>>>>>
>>>>>> Hi Zhang,
>>>>>>
>>>>>> Is this for English Wikipedia?
>>>>>>
>>>>>> You can probably use automation to find the timestamp of an article's
>>>>>> quality rating change on English Wikipedia. Other people on this list
>>>>>> probably know how to do this, and they may comment here.
>>>>>>
>>>>>> However, that does not imply that any particular diffs should be
>>>>>> considered to have a quality that is equivalent to the quality of the
>>>>>> article.
>>>>>> Measuring the quality of diffs is an inexact science, but you
>>>>>> might want to take a look at Revision Scoring. Aaron Halfaker can tell
>>>>>> you more about how useful, or not, Revision Scoring is for measuring the
>>>>>> quality of diffs. Hopefully he will respond to this email.
>>>>>>
>>>>>> Pine
>>>>>>
>>>>>> On Jun 9, 2016 18:29, "Shiyue Zhang" <[email protected]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm doing research on Wikipedia article quality, and I take advantage
>>>>>> of WikiProject Assessments. But I can only get the latest quality level
>>>>>> of an article. I wonder how to get the quality of each revision, or how
>>>>>> to get the exact date when an article gets a quality promotion, for
>>>>>> example, from A-class to FA-class.
>>>>>>
>>>>>> I really need your help! Thanks!
>>>>>>
>>>>>> Zhang Shiyue
>>>>>>
>>>>>> --
>>>>>> Zhang Shiyue
>>>>>> *Tel*: +86 18801167900
>>>>>> *E-mail*: [email protected], [email protected]
>>>>>> State Key Laboratory of Networking and Switching Technology
>>>>>> No.10 Xitucheng Road, Haidian District
>>>>>> Beijing University of Posts and Telecommunications
>>>>>> Beijing, China.
--
Zhang Shiyue
*Tel*: +86 18801167900
*E-mail*: [email protected], [email protected]
State Key Laboratory of Networking and Switching Technology
No.10 Xitucheng Road, Haidian District
Beijing University of Posts and Telecommunications
Beijing, China.
_______________________________________________
Wiki-research-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
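
For readers who want to try the approach Morten describes in the thread (walking backwards through a talk page's revision history to find when the assessment class changed), here is a minimal Python sketch. It is not the code from the linked repository: the names `extract_class` and `find_assessment_changes`, the input format of (timestamp, wikitext) pairs, and the use of a simple regular expression in place of mwparserfromhell are all assumptions made for illustration. Fetching the revisions themselves (via the API or a database replica) is left out.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Assumption: assessment banners carry a parameter such as "|class=GA".
# A real talk page can hold several banners with differing classes;
# this sketch only looks at the first one it finds.
CLASS_RE = re.compile(r"\|\s*class\s*=\s*([A-Za-z]+)", re.IGNORECASE)


def extract_class(wikitext: str) -> Optional[str]:
    """Return the first assessment class in the wikitext (lowercased), or None."""
    match = CLASS_RE.search(wikitext)
    return match.group(1).lower() if match else None


def find_assessment_changes(
    revisions: Iterable[Tuple[str, str]],
) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Walk talk page revisions from newest to oldest and report class changes.

    `revisions` is a hypothetical input: (timestamp, wikitext) pairs sorted
    newest first. Each result is (timestamp, old_class, new_class), where
    the timestamp is that of the revision in which new_class first appears.
    """
    changes = []
    newer_ts, newer_class = None, None
    for ts, text in revisions:
        current = extract_class(text)
        # A difference between this revision and the next newer one means
        # the rating changed in the newer revision.
        if newer_ts is not None and current != newer_class:
            changes.append((newer_ts, current, newer_class))
        newer_ts, newer_class = ts, current
    return changes


if __name__ == "__main__":
    talk_history = [
        ("2016-03-01T00:00:00Z", "{{WikiProject Example|class=GA}}"),
        ("2016-02-01T00:00:00Z", "{{WikiProject Example|class=B}}"),
        ("2016-01-01T00:00:00Z", "{{WikiProject Example|class=B}}"),
        ("2015-12-01T00:00:00Z", "{{WikiProject Example|class=Start}}"),
    ]
    for change in find_assessment_changes(talk_history):
        print(change)
```

Each reported timestamp can then be used to look up the article revision that was current at that time, which is the step Morten describes as "grabbing the revision of the article at that timestamp". To collect all revisions between two assessment changes, as Shiyue asks, one would fetch the article revisions whose timestamps fall between two consecutive entries in the result.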
