I'd strongly caution against using the stub categories without *also*
doing some kind of filtering on size. There's a real problem with
"stub lag" - articles get tagged, incrementally improve, no-one thinks
they've done enough to justify removing the tag (or notices the tag is
there, or thinks they're allowed to remove it)... and you end up with
a lot of multi-section pages with a good few hundred words of text still tagged as stubs.
(Talkpage ratings are even worse for this, but that's another issue.)
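The size filter described above can be sketched roughly as follows. This is a minimal illustration: the 2,048-byte cutoff is an arbitrary assumption, and in practice the byte lengths would come from the MediaWiki API's `prop=info` "length" field or from a dump.

```python
# Minimal sketch: combine a stub tag with a size cutoff, so that
# long-since-improved articles that still carry the tag are excluded.
# The 2048-byte threshold is illustrative, not a recommendation.

def filter_by_length(lengths, max_bytes=2048):
    """lengths: {title: page size in bytes}, e.g. the 'length' field
    returned by the MediaWiki API (action=query&prop=info).
    Returns titles short enough to plausibly still be stubs."""
    return sorted(t for t, n in lengths.items() if n <= max_bytes)

# Hypothetical sizes for illustration:
sizes = {"Some stub": 743, "Improved article": 15482, "Another stub": 1920}
print(filter_by_length(sizes))  # ['Another stub', 'Some stub']
```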
On 20 September 2016 at 18:01, Morten Wang <nett...@gmail.com> wrote:
> I don't know of a clean, language-independent way of grabbing all stubs.
> Stuart's suggestion is quite sensible, at least for English Wikipedia. When
> I last checked a few years ago, the mean length of an English language stub
> (on a log-scale) was around 1 kB (including all markup), and they're quite
> a bit smaller than any other class.
> I'd also see if the category system allows for some straightforward
> retrieval. English has
> https://en.wikipedia.org/wiki/Category:Stub_categories and
> https://en.wikipedia.org/wiki/Category:Stubs with quite a lot of links to
> other languages, which could be a good starting point. For some of the
> research we've done on quality, exploiting regularities in the category
> system using database access (in other words, LIKE queries) is a quick way
> to grab most articles.
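To illustrate the LIKE-query idea (a sketch, not a recipe): the SQL assumes the standard MediaWiki `categorylinks` schema on the database replicas, and the `*_stubs` naming pattern is English-Wikipedia-specific.

```python
# On the replicas, stub categories can be matched against the
# categorylinks table with something like:
#
#   SELECT DISTINCT cl_from
#   FROM categorylinks
#   WHERE cl_to LIKE '%\_stubs';
#
# The same pattern, applied in pure Python to a list of category names
# (English convention only; other wikis name stub categories differently):
import fnmatch

def stub_categories(category_names, pattern="*_stubs"):
    """Keep category names matching the (English-style) stub pattern."""
    return [c for c in category_names if fnmatch.fnmatch(c, pattern)]

print(stub_categories(["Physics_stubs", "Physics", "France_geography_stubs"]))
# ['Physics_stubs', 'France_geography_stubs']
```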
> A combination of both approaches might be a good way. If you're looking for
> even more thorough classification, grabbing a set and training a classifier
> might be the way to go.
> On 20 September 2016 at 02:40, Stuart A. Yeates <syea...@gmail.com> wrote:
>> en:WP:DYK uses a threshold of 1,500+ characters of prose, which is a useful
>> benchmark. It probably doesn't translate to CJK languages, which have radically different
>> information content per character.
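A crude version of that character count might look like this. The markup stripping below is deliberately simplistic (one level of templates only); real prose-size tools are more careful, and the 1,500-character threshold is the DYK figure quoted above.

```python
# Rough prose-length check in the spirit of the DYK 1,500-character rule.
# Strips templates (one nesting level), <ref> tags, wikilink markup, and
# bold/italic quote marks before counting characters.
import re

def prose_length(wikitext):
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # templates
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)    # references
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # wikilinks
    text = re.sub(r"'{2,}", "", text)                              # ''italics''/'''bold'''
    return len(text.strip())

def meets_dyk_threshold(wikitext, threshold=1500):
    return prose_length(wikitext) >= threshold

print(prose_length("[[Paris]] is nice.{{stub}}"))  # 14
```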
>> ...let us be heard from red core to black sky
>> On Tue, Sep 20, 2016 at 9:26 PM, Robert West <w...@cs.stanford.edu> wrote:
>>> Hi everyone,
>>> Does anyone know if there's a straightforward (ideally
>>> language-independent) way of identifying stub articles in Wikipedia?
>>> Whatever works is ok, whether it's publicly available data or data
>>> accessible only on the WMF cluster.
>>> I've found lists for various languages (e.g., Italian or English), but
>>> the lists are in different formats, so separate code is required for each
>>> language, which doesn't scale.
>>> I guess in the worst case, I'll have to grep for the respective stub
>>> templates in the respective wikitext dumps, but even that requires knowing,
>>> for each language, what the respective template is. So if anyone could point
>>> me to a list of stub templates in different languages, that would also be much
>>> appreciated.
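One sketch of that grep, for English only: the `{{foo-stub}}` / `{{stub}}` naming convention is English Wikipedia's, and other languages name their templates differently, which is exactly the scaling problem described above.

```python
# Match English-style stub templates in wikitext, e.g. {{stub}},
# {{physics-stub}}, {{US-politician-stub}}, optionally with parameters.
# English-only and purely illustrative.
import re

STUB_TEMPLATE = re.compile(r"\{\{\s*([A-Za-z0-9 _-]*-)?[Ss]tub\s*(\|[^}]*)?\}\}")

def has_stub_template(wikitext):
    return bool(STUB_TEMPLATE.search(wikitext))

print(has_stub_template("Some text. {{physics-stub}}"))  # True
print(has_stub_template("{{Infobox person}}"))           # False
```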
>>> Up for a little language game? -- http://www.unfun.me
>>> Wiki-research-l mailing list
- Andrew Gray