Re: [Wiki-research-l] Identifying Wikipedia stubs in various languages
Hi all,

I'd strongly caution against using the stub categories without *also* doing some kind of filtering on size. There's a real problem with "stub lag": articles get tagged, incrementally improve, no one thinks they've done enough to justify removing the tag (or notices the tag is there, or thinks they're allowed to remove it)... and you end up with a lot of multi-section pages with a good hundred words of text still labelled "stub". (Talk-page ratings are even worse for this, but that's another issue.)

Andrew.

On 20 September 2016 at 18:01, Morten Wang <nett...@gmail.com> wrote:
> I don't know of a clean, language-independent way of grabbing all stubs.
> Stuart's suggestion is quite sensible, at least for English Wikipedia.
> When I last checked a few years ago, the mean length of an English-language
> stub (on a log scale) was around 1 kB (including all markup), and stubs are
> much smaller than any other assessment class.
>
> I'd also see if the category system allows for some straightforward
> retrieval. English has
> https://en.wikipedia.org/wiki/Category:Stub_categories and
> https://en.wikipedia.org/wiki/Category:Stubs with quite a lot of links to
> other languages, which could be a good starting point. For some of the
> research we've done on quality, exploiting regularities in the category
> system using database access (in other words, LIKE queries) is a quick way
> to grab most articles.
>
> A combination of both approaches might be a good way. If you're looking
> for even more thorough classification, grabbing a set and training a
> classifier might be the way to go.
>
> Cheers,
> Morten
>
> On 20 September 2016 at 02:40, Stuart A. Yeates <syea...@gmail.com> wrote:
>>
>> en:WP:DYK has a measure of 1,500+ characters of prose, which is a useful
>> cutoff. There is weaponised javascript to measure that at en:WP:Did you
>> know/DYKcheck.
>>
>> This probably doesn't translate to CJK languages, which have radically
>> different information content per character.
>>
>> cheers
>> stuart
>>
>> --
>> ...let us be heard from red core to black sky
>>
>> On Tue, Sep 20, 2016 at 9:26 PM, Robert West <w...@cs.stanford.edu> wrote:
>>>
>>> Hi everyone,
>>>
>>> Does anyone know if there's a straightforward (ideally
>>> language-independent) way of identifying stub articles in Wikipedia?
>>>
>>> Whatever works is OK, whether it's publicly available data or data
>>> accessible only on the WMF cluster.
>>>
>>> I've found lists for various languages (e.g., Italian or English), but
>>> the lists are in different formats, so separate code is required for
>>> each language, which doesn't scale.
>>>
>>> I guess in the worst case, I'll have to grep for the respective stub
>>> templates in the respective wikitext dumps, but even this requires
>>> knowing, for each language, what the respective template is. So if
>>> anyone could point me to a list of stub templates in different
>>> languages, that would also be appreciated.
>>>
>>> Thanks!
>>> Bob
>>>
>>> --
>>> Up for a little language game? -- http://www.unfun.me
>>>
>>> ___
>>> Wiki-research-l mailing list
>>> Wiki-research-l@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

--
- Andrew Gray
andrew.g...@dunelm.org.uk
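[Editor's note: the category-plus-size approach discussed in this thread can be sketched as below. This is a minimal illustration, not code from the thread: it uses the standard MediaWiki API (`generator=categorymembers` with `prop=info`, which reports page length in bytes), and the category name and the 2,500-byte cutoff are illustrative assumptions -- Morten's figure puts typical English stubs around 1 kB of markup.]

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def looks_like_stub(length_bytes, max_bytes=2500):
    """Size filter to catch 'stub lag': a page still tagged as a stub
    but well past stub size has probably outgrown the tag. The cutoff
    is an illustrative assumption, not an established threshold."""
    return length_bytes <= max_bytes

def category_members_with_length(category):
    """Yield (title, length_in_bytes) for mainspace pages in a category,
    following API continuation until the category is exhausted."""
    params = {
        "action": "query",
        "generator": "categorymembers",
        "gcmtitle": category,
        "gcmnamespace": "0",
        "gcmlimit": "max",
        "prop": "info",   # includes page length in bytes
        "format": "json",
    }
    while True:
        url = API + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        for page in data.get("query", {}).get("pages", {}).values():
            yield page["title"], page.get("length", 0)
        if "continue" not in data:
            break
        params.update(data["continue"])

def probable_stubs(category, max_bytes=2500):
    """Pages that are both tagged as stubs and still stub-sized."""
    return [(title, n) for title, n in category_members_with_length(category)
            if looks_like_stub(n, max_bytes)]
```

The same size filter would apply equally to pages found by grepping stub templates out of the wikitext dumps; only the source of candidate titles changes.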
Re: [Wiki-research-l] unique visitors
On 17 March 2016 at 19:40, phoebe ayers <phoebe.w...@gmail.com> wrote:
>> One of the drawbacks is that we
>> can't report on a single total number across all our projects.
>
> Hmm. That's unfortunate for understanding reach -- if nothing else,
> the idea that "half a billion people access Wikipedia" (e.g. from
> earlier comScore reports) was a PR-friendly way of giving an idea of
> the scale of our readership. But I can see why it would be tricky to
> measure. Since this is the research list: I suspect there's still lots
> to be done in understanding just how multilingual people use different
> language editions of Wikipedia, too.

Building on this question a little: with the information we currently have, is it actively *wrong* for us to keep using the "half a billion" figure as a very rough first-order estimate? (Like Phoebe, I think I keep trotting it out when giving talks.) Do the new figures give us reason to think it's substantially higher or lower than that, or even not meaningfully answerable?

--
- Andrew Gray
andrew.g...@dunelm.org.uk
Re: [Wiki-research-l] Looking for help finding tools to measure UNESCO project
On 6 October 2015 at 14:12, Amir E. Aharoni <amir.ahar...@mail.huji.ac.il> wrote:
> Thanks for this email.
>
> This raises a wider question: what is a comfortable way to compare the
> coverage of a topic in different languages?
>
> For example, I'd love to see a report that says:
>
> Number of articles about UNESCO cultural heritage:
> English Wikipedia: 1000
> French Wikipedia: 1200
> Hebrew Wikipedia: 742
> etc.
>
> And also to track this over time, so that if somebody worked hard on
> creating articles about UNESCO cultural heritage in Hebrew, I'd see a
> trend graph.

There are two general approaches to this:

a) on Wikidata;
b) on the individual wikis.

Approach (a) would rely on having a defined set of things in Wikidata that we can identify. For example, "is a World Heritage Site" would be easy enough, since we have a property explicitly dealing with WHS identifiers (and we have 100% coverage in Wikidata). "Is of interest to UNESCO" is a trickier one - but if you can construct a suitable Wikidata query...

As Federico notes, for WHS records we can generate a report like https://tools.wmflabs.org/mix-n-match/?mode=sitestats=93 (57.4% coverage on hewiki!). No graphs, but if you were interested you could probably set one up without much work.

Approach (b) is more useful for fuzzy groups like "of relevance to UNESCO", since this is more or less perfect for a category system. However, it would require examining the category tree for each Wikipedia you're interested in to figure out exactly which categories are relevant, and then running a script to count those daily.

A.

--
- Andrew Gray
andrew.g...@dunelm.org.uk
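[Editor's note: approach (a) can be sketched against the Wikidata Query Service. This is an illustrative assumption-laden sketch, not code from the thread: it assumes the public SPARQL endpoint at query.wikidata.org and that P757 is the "World Heritage Site ID" property -- verify the property number before relying on it. The report formatter mirrors the layout Amir describes.]

```python
import json
import urllib.parse
import urllib.request

# Wikidata Query Service endpoint; P757 is assumed here to be the
# "World Heritage Site ID" property -- check before use.
ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?wiki (COUNT(DISTINCT ?article) AS ?n) WHERE {
  ?site wdt:P757 [] .
  ?article schema:about ?site ;
           schema:isPartOf ?wiki .
} GROUP BY ?wiki ORDER BY DESC(?n)
"""

def sitelink_counts():
    """Run the query and return {wiki_url: article_count}: one row per
    Wikipedia, counting articles that link to a WHS item."""
    url = ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": QUERY, "format": "json"})
    req = urllib.request.Request(url, headers={"User-Agent": "coverage-report"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return {row["wiki"]["value"]: int(row["n"]["value"])
            for row in data["results"]["bindings"]}

def format_report(counts):
    """Render counts as the kind of per-wiki report described above."""
    lines = ["Number of articles about UNESCO World Heritage Sites:"]
    for wiki, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        lines.append(f"{wiki}: {n}")
    return "\n".join(lines)
```

Running `sitelink_counts()` daily and storing the results would give the trend graph Amir asks for with very little extra machinery.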
Re: [Wiki-research-l] [Spam] Re: citations to articles cited on wikipedia?
They did; DOAJ seems to have been the method used to determine whether a journal was OA or not (which is fair enough).

Andrew.

On 21 August 2015 at 12:50, Federico Leva (Nemo) <nemow...@gmail.com> wrote:
> Andrew Gray, 20/08/2015 14:21:
>> They worked on a journal basis, classing them as OA or not OA.
>
> Weird, why didn't they just use DOAJ? https://doaj.org/
>
> Nemo

--
- Andrew Gray
andrew.g...@dunelm.org.uk
Re: [Wiki-research-l] citations to articles cited on wikipedia?
On 20 August 2015 at 06:54, Jane Darnell <jane...@gmail.com> wrote:
>> ...the odds that an open access journal is referenced on the English
>> Wikipedia are 47% higher compared to closed access
>
> Thanks for posting! That's an interesting paper, for all sorts of
> reasons. I read it because I highly doubt that the number is as low
> as that.

I've been meaning to actually go through this paper for a while, and finally did so this morning :-).

They worked on a journal basis, classing them as OA or not OA. But this is, in some ways, a very small sample. See, e.g., http://science-metrix.com/files/science-metrix/publications/d_1.8_sm_ec_dg-rtd_proportion_oa_1996-2013_v11p.pdf, which suggests that articles in gold OA titles represent less than 15% of the total amount freely available through various forms. Given this limitation, it seems quite plausible that the actual OA:citation correlation is higher on a *per-paper* basis... we just don't really have the information to be sure.

--
- Andrew Gray
andrew.g...@dunelm.org.uk
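[Editor's note: the journal-level classification discussed in this thread can be sketched as below. This is an illustration, not the paper's actual code: DOAJ publishes downloadable journal metadata, but the CSV column names used here are assumptions -- check the real header before use. The per-journal flag deliberately mirrors the limitation Andrew describes: green and hybrid OA articles in non-DOAJ journals are counted as "not OA".]

```python
import csv

def load_doaj_issns(csv_path):
    """Collect print and electronic ISSNs from a DOAJ journal metadata
    CSV. The column names below are assumptions about the file layout."""
    issns = set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for col in ("Journal ISSN (print version)",
                        "Journal EISSN (online version)"):
                if row.get(col):
                    issns.add(row[col].strip())
    return issns

def classify_journal(issn, doaj_issns):
    """Journal-level OA flag: a journal counts as OA iff it is listed
    in DOAJ. This undercounts freely available articles, since green
    and hybrid OA papers appear in journals outside DOAJ."""
    return "OA" if issn in doaj_issns else "not OA"
```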
Re: [Wiki-research-l] [Analytics] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal
Hi Dario, Reid,

This seems sensible enough, and proposal #3 is clearly the better approach. An explicit opt-in/opt-out mechanism would not be worth the effort to build, and would become yet another ignored preferences setting after a few weeks...

A couple of thoughts:

* I understand the reasoning for not using do-not-track headers (#4); however, it feels a bit odd to say they probably don't mean us and skip them... I can almost guarantee you'll have at least one person making a vocal fuss about not being able to opt out without an account. If we were to honour these headers, would it make a significant change to the amount of data available? Would it likely skew it any more than leaving off logged-in users?

* Option 3 does release one further piece of information over and above those listed: an approximate ratio of logged-in versus non-logged-in pageviews for a page. I cannot see any particular problem with doing this (and I can think of a couple of fun things to use it for), but it's probably worth being aware of.

Andrew.

On 13 January 2015 at 07:26, Dario Taraborelli <dtarabore...@wikimedia.org> wrote:
> I'm sharing a proposal that Reid Priedhorsky and his collaborators at Los
> Alamos National Laboratory recently submitted to the Wikimedia Analytics
> Team, aimed at producing privacy-preserving geo-aggregates of Wikipedia
> pageview data dumps and making them available to the public and the
> research community. [1]
>
> Reid and his team spearheaded the use of the public Wikipedia pageview
> dumps to monitor and forecast the spread of influenza and other diseases,
> using language as a proxy for location. [2] This proposal describes an
> aggregation strategy adding a geographical dimension to the existing
> dumps.
>
> Feedback on the proposal is welcome on the lists or the project talk page
> on Meta. [3]
>
> Dario
>
> [1] https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pageviews
> [2] http://dx.doi.org/10.1371/journal.pcbi.1003892
> [3] https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_pageviews
>
> ___
> Analytics mailing list
> analyt...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics

--
- Andrew Gray
andrew.g...@dunelm.org.uk
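[Editor's note: the general idea of privacy-preserving geo-aggregation can be illustrated with a toy threshold scheme, sketched below. This is only an illustration of the concept -- it is not the aggregation scheme in the actual proposal, and the threshold value is arbitrary.]

```python
from collections import Counter

def aggregate_views(events, k=100):
    """Toy geo-aggregation sketch: count pageviews per (article, country)
    cell and suppress any cell below a threshold k, so that small,
    potentially identifying groups of readers are never published.

    `events` is an iterable of (article, country) pairs, one per pageview.
    This illustrates threshold-based suppression only; the real proposal
    on Meta defines its own aggregation and privacy rules.
    """
    counts = Counter(events)
    return {cell: n for cell, n in counts.items() if n >= k}
```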
Re: [Wiki-research-l] [Analytics] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal
Fair enough - I don't use it, and I think I'd got entirely the wrong end of the stick about what it's for! If it's intended to stop tracking by third-party sites, then it certainly seems to be of little relevance here. (It might be worth clarifying this in the proposal, in case a future ethics-committee reviewer gets the same misapprehension?)

Andrew.

On 13 January 2015 at 20:24, Aaron Halfaker <ahalfa...@wikimedia.org> wrote:
> Andrew, I think it is reasonable to assume that the "Do Not Track" header
> isn't referring to this. From http://donottrack.us/, with emphasis added:
>
> "Do Not Track is a technology and policy proposal that enables users to
> opt out of tracking by websites they do not visit, [...]"
>
> Do Not Track is explicitly about third-party tracking. We are merely
> proposing to count those people who do access our sites. Note that, in
> this case, we are not interested in obtaining identifiers at all, so the
> word "track" seems not to apply. It seems like we're looking for
> something like a "Do Not Log Anything At All" header. I don't believe
> that such a thing exists -- but if it did, I think it would be good if
> we supported it.
>
> -Aaron
>
> On Tue, Jan 13, 2015 at 2:03 PM, Andrew Gray <andrew.g...@dunelm.org.uk> wrote:
>> Hi Dario, Reid,
>>
>> This seems sensible enough and proposal #3 is clearly the better
>> approach. [...]

--
- Andrew Gray
andrew.g...@dunelm.org.uk