I'd strongly caution against using the stub categories without *also*
doing some kind of filtering on size. There's a real problem with
"stub lag" - articles get tagged, incrementally improve, no-one thinks
they've done enough to justify removing the tag (or notices the tag is
there, or thinks they're allowed to remove it)... and you end up with
a lot of multi-section pages with a good few hundred words of text still tagged as stubs.
(Talkpage ratings are even worse for this, but that's another issue.)
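The size filter described above can be sketched roughly as follows. This is a minimal illustration: the 2,048-byte cutoff is an arbitrary assumption, and in practice the byte lengths would come from the MediaWiki API's `prop=info` "length" field or from a dump.

```python
# Minimal sketch: combine a stub tag with a size cutoff, so that
# long-since-improved articles that still carry the tag are excluded.
# The 2048-byte threshold is illustrative, not a recommendation.

def filter_by_length(lengths, max_bytes=2048):
    """lengths: {title: page size in bytes}, e.g. the 'length' field
    returned by the MediaWiki API (action=query&prop=info).
    Returns titles short enough to plausibly still be stubs."""
    return sorted(t for t, n in lengths.items() if n <= max_bytes)

# Hypothetical sizes for illustration:
sizes = {"Some stub": 743, "Improved article": 15482, "Another stub": 1920}
print(filter_by_length(sizes))  # ['Another stub', 'Some stub']
```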
On 20 September 2016 at 18:01, Morten Wang <nett...@gmail.com> wrote:
> I don't know of a clean, language-independent way of grabbing all stubs.
> Stuart's suggestion is quite sensible, at least for English Wikipedia. When
> I last checked a few years ago, the mean length of an English language stub
> (on a log-scale) was around 1 kB (including all markup), and they're quite
> a bit smaller than any other class.
> I'd also see if the category system allows for some straightforward
> retrieval. English has
> https://en.wikipedia.org/wiki/Category:Stub_categories and
> https://en.wikipedia.org/wiki/Category:Stubs with quite a lot of links to
> other languages, which could be a good starting point. For some of the
> research we've done on quality, exploiting regularities in the category
> system using database access (in other words, LIKE queries) is a quick way
> to grab most articles.
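To illustrate the LIKE-query idea (a sketch, not a recipe): the SQL assumes the standard MediaWiki `categorylinks` schema on the database replicas, and the `*_stubs` naming pattern is English-Wikipedia-specific.

```python
# On the replicas, stub categories can be matched against the
# categorylinks table with something like:
#
#   SELECT DISTINCT cl_from
#   FROM categorylinks
#   WHERE cl_to LIKE '%\_stubs';
#
# The same pattern, applied in pure Python to a list of category names
# (English convention only; other wikis name stub categories differently):
import fnmatch

def stub_categories(category_names, pattern="*_stubs"):
    """Keep category names matching the (English-style) stub pattern."""
    return [c for c in category_names if fnmatch.fnmatch(c, pattern)]

print(stub_categories(["Physics_stubs", "Physics", "France_geography_stubs"]))
# ['Physics_stubs', 'France_geography_stubs']
```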
> A combination of both approaches might be a good way. If you're looking for
> even more thorough classification, grabbing a set and training a classifier
> might be the way to go.
> On 20 September 2016 at 02:40, Stuart A. Yeates <syea...@gmail.com> wrote:
>> en:WP:DYK uses a threshold of 1,500+ characters of prose, which is a useful
>> benchmark. It probably doesn't translate to CJK languages, which have radically different
>> information content per character.
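A crude version of that character count might look like this. The markup stripping below is deliberately simplistic (one level of templates only); real prose-size tools are more careful, and the 1,500-character threshold is the DYK figure quoted above.

```python
# Rough prose-length check in the spirit of the DYK 1,500-character rule.
# Strips templates (one nesting level), <ref> tags, wikilink markup, and
# bold/italic quote marks before counting characters.
import re

def prose_length(wikitext):
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # templates
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)    # references
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # wikilinks
    text = re.sub(r"'{2,}", "", text)                              # ''italics''/'''bold'''
    return len(text.strip())

def meets_dyk_threshold(wikitext, threshold=1500):
    return prose_length(wikitext) >= threshold

print(prose_length("[[Paris]] is nice.{{stub}}"))  # 14
```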
>> ...let us be heard from red core to black sky
>> On Tue, Sep 20, 2016 at 9:26 PM, Robert West <w...@cs.stanford.edu> wrote:
>>> Hi everyone,
>>> Does anyone know if there's a straightforward (ideally
>>> language-independent) way of identifying stub articles in Wikipedia?
>>> Whatever works is ok, whether it's publicly available data or data
>>> accessible only on the WMF cluster.
>>> I've found lists for various languages (e.g., Italian or English), but
>>> the lists are in different formats, so separate code is required for each
>>> language, which doesn't scale.
>>> I guess in the worst case, I'll have to grep for the respective stub
>>> templates in the respective wikitext dumps, but even that requires knowing,
>>> for each language, what the respective template is. So if anyone could point
>>> me to a list of stub templates in different languages, that would also be much
>>> appreciated.
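One sketch of that grep, for English only: the `{{foo-stub}}` / `{{stub}}` naming convention is English Wikipedia's, and other languages name their templates differently, which is exactly the scaling problem described above.

```python
# Match English-style stub templates in wikitext, e.g. {{stub}},
# {{physics-stub}}, {{US-politician-stub}}, optionally with parameters.
# English-only and purely illustrative.
import re

STUB_TEMPLATE = re.compile(r"\{\{\s*([A-Za-z0-9 _-]*-)?[Ss]tub\s*(\|[^}]*)?\}\}")

def has_stub_template(wikitext):
    return bool(STUB_TEMPLATE.search(wikitext))

print(has_stub_template("Some text. {{physics-stub}}"))  # True
print(has_stub_template("{{Infobox person}}"))           # False
```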
>>> Up for a little language game? -- http://www.unfun.me
>>> Wiki-research-l mailing list
- Andrew Gray