You _really_ need to exclude markup and include only body text when
measuring stubs. It's not uncommon for mass-produced articles with only one
or two sentences of text to approach 1,000 characters once you include
maintenance templates, content templates, categories, infoboxes,
references, and so on.
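A minimal sketch of that kind of measurement, assuming Python and the
mwparserfromhell library (the 1,000-character cutoff below is illustrative,
not canonical):

import mwparserfromhell

def prose_length(wikitext):
    """Character count of the body text only, with markup stripped."""
    wikicode = mwparserfromhell.parse(wikitext)
    # strip_code() drops templates, refs, and other markup, keeping the
    # plain text (including link labels).
    return len(wikicode.strip_code().strip())

def looks_like_stub(wikitext, threshold=1000):
    # The threshold is a guess; tune it per language / per project.
    return prose_length(wikitext) < threshold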
cheers
stuart

--
...let us be heard from red core to black sky

On Wed, Sep 21, 2016 at 5:01 AM, Morten Wang <[email protected]> wrote:

> I don't know of a clean, language-independent way of grabbing all stubs.
> Stuart's suggestion is quite sensible, at least for English Wikipedia.
> When I last checked a few years ago, the mean length of an
> English-language stub (on a log scale) was around 1 kB (including all
> markup), and they're much smaller than any other class.
>
> I'd also see if the category system allows for some straightforward
> retrieval. English has
> https://en.wikipedia.org/wiki/Category:Stub_categories and
> https://en.wikipedia.org/wiki/Category:Stubs, with quite a lot of links
> to other languages, which could be a good starting point. For some of
> the research we've done on quality, exploiting regularities in the
> category system using database access (in other words, LIKE queries) is
> a quick way to grab most articles.
>
> A combination of both approaches might be a good way forward. If you're
> looking for an even more thorough classification, grabbing a set and
> training a classifier might be the way to go.
>
> Cheers,
> Morten
>
> On 20 September 2016 at 02:40, Stuart A. Yeates <[email protected]> wrote:
>
>> en:WP:DYK has a measure of 1,500+ characters of prose, which is a
>> useful cutoff. There is weaponised javascript to measure that at
>> en:WP:Did you know/DYKcheck.
>>
>> It probably doesn't translate to CJK languages, which have radically
>> different information content per character.
>>
>> cheers
>> stuart
>>
>> --
>> ...let us be heard from red core to black sky
>>
>> On Tue, Sep 20, 2016 at 9:26 PM, Robert West <[email protected]> wrote:
>>
>>> Hi everyone,
>>>
>>> Does anyone know if there's a straightforward (ideally
>>> language-independent) way of identifying stub articles in Wikipedia?
>>>
>>> Whatever works is OK, whether it's publicly available data or data
>>> accessible only on the WMF cluster.
>>>
>>> I've found lists for various languages (e.g., Italian
>>> <https://it.wikipedia.org/wiki/Categoria:Stub> or English
>>> <https://en.wikipedia.org/wiki/Category:All_stub_articles>), but the
>>> lists are in different formats, so separate code is required for each
>>> language, which doesn't scale.
>>>
>>> I guess in the worst case I'll have to grep for the respective stub
>>> templates in the respective wikitext dumps, but even that requires
>>> knowing what the stub template is in each language. So if anyone could
>>> point me to a list of stub templates in different languages, that
>>> would also be appreciated.
>>>
>>> Thanks!
>>> Bob
>>>
>>> --
>>> Up for a little language game? -- http://www.unfun.me
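For what it's worth, a rough sketch of the LIKE-query approach Morten
describes above, run against a replica of the categorylinks table. The
host, database name, and credentials file are illustrative (Labs-style),
and the "*_stubs" naming regularity holds for English but should be
verified per language:

import pymysql

conn = pymysql.connect(
    host="enwiki.labsdb",            # illustrative replica host
    db="enwiki_p",
    read_default_file="~/.my.cnf",   # Labs-style credentials file
    charset="utf8mb4",
)
with conn.cursor() as cur:
    # Underscore is a LIKE wildcard, so it is backslash-escaped here to
    # match category names that literally end in "_stubs".
    cur.execute(
        "SELECT DISTINCT cl_from FROM categorylinks WHERE cl_to LIKE %s",
        ("%\\_stubs",),
    )
    stub_page_ids = {row[0] for row in cur.fetchall()}

print(len(stub_page_ids), "pages in *_stubs categories")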
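And since Bob mentions grepping dumps as the fallback: a crude
line-oriented scan of a bzipped pages-articles dump for enwiki-style
"{{...-stub}}" templates. The regex is a heuristic, and the "-stub" suffix
convention does not carry over to other languages, where the template
names would still have to be collected:

import bz2
import re

# Heuristic: any template whose name ends in "-stub" (enwiki convention).
STUB_RE = re.compile(r"\{\{[^{}]*-stub[^{}]*\}\}", re.IGNORECASE)

def stub_titles(dump_path):
    """Yield titles of pages whose wikitext matches the stub-template regex."""
    title = None
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            if "<title>" in line:
                title = line.split("<title>")[1].split("</title>")[0]
            elif title and STUB_RE.search(line):
                yield title
                title = None   # one hit per page is enough

for t in stub_titles("enwiki-latest-pages-articles.xml.bz2"):
    print(t)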
_______________________________________________
Wiki-research-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
