You _really_ need to exclude markup and include only body text when
measuring stubs. It's not uncommon for mass-produced articles with only
one or two sentences of text to approach 1K characters once you include
maintenance templates, content templates, categories, infoboxes,
references, etc., etc.
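
For what it's worth, a rough sketch in Python (assuming the
mwparserfromhell library; its strip_code() drops templates, tags, and
link markup, so the count is only an approximation of body text):

    import mwparserfromhell

    def prose_length(wikitext):
        # Strip templates, tags, and link markup; what remains is
        # (approximately) the readable prose.
        stripped = mwparserfromhell.parse(wikitext).strip_code()
        return len(stripped.strip())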

cheers
stuart

--
...let us be heard from red core to black sky

On Wed, Sep 21, 2016 at 5:01 AM, Morten Wang <nett...@gmail.com> wrote:

> I don't know of a clean, language-independent way of grabbing all stubs.
> Stuart's suggestion is quite sensible, at least for English Wikipedia. When
> I last checked a few years ago, the mean length of an English-language
> stub (on a log scale) was around 1 kB (including all markup), and stubs
> are much smaller than any other class.
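>
> (To be concrete, by the mean on a log scale I mean the geometric mean of
> the raw lengths; a short sketch, assuming numpy:)
>
>     import numpy as np
>
>     def log_scale_mean(lengths):
>         # Geometric mean: exponentiate the mean of the log-lengths.
>         # Assumes all lengths are positive.
>         return np.exp(np.mean(np.log(lengths)))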
>
> I'd also see if the category system allows for some straightforward
> retrieval. English has
> https://en.wikipedia.org/wiki/Category:Stub_categories and
> https://en.wikipedia.org/wiki/Category:Stubs, with quite a lot of links
> to other languages, which could be a good starting point. For some of the
> research we've done on quality, exploiting regularities in the category
> system via database access (in other words, LIKE queries) is a quick way
> to grab most articles.
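>
> (A sketch of the kind of LIKE query I mean, as it would run against the
> replica databases; table and column names follow the standard MediaWiki
> schema, and the English-specific pattern is purely illustrative:)
>
>     # English-specific pattern; adjust per language.
>     STUB_QUERY = """
>         SELECT DISTINCT page_title
>         FROM page
>         JOIN categorylinks ON cl_from = page_id
>         WHERE page_namespace = 0
>           AND cl_to LIKE '%stubs'
>     """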
>
> A combination of both approaches might be a good way. If you're looking
> for even more thorough classification, grabbing a set and training a
> classifier might be the way to go.
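>
> (A toy sketch of that last idea, assuming scikit-learn and a small
> hand-labelled sample; the single length feature is just illustrative:)
>
>     from sklearn.linear_model import LogisticRegression
>
>     # X: one feature per article (e.g. prose length); y: 1 = stub.
>     X = [[312], [5480], [947], [12033]]  # hypothetical labelled data
>     y = [1, 0, 1, 0]
>     clf = LogisticRegression().fit(X, y)
>     clf.predict([[800]])  # -> array([1]): classified as a stub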
>
>
> Cheers,
> Morten
>
>
> On 20 September 2016 at 02:40, Stuart A. Yeates <syea...@gmail.com> wrote:
>
>> en:WP:DYK has a measure of 1,500+ characters of prose, which is a useful
>> cutoff. There is weaponised JavaScript to measure that at en:WP:Did you
>> know/DYKcheck
>>
>> Probably doesn't translate to CJK languages, which have radically
>> different information content per character.
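>>
>> (Roughly, in Python; the 1,500 figure is DYK's, and the markup
>> stripping here is a deliberately crude stand-in for DYKcheck's logic:)
>>
>>     import re
>>
>>     DYK_MIN_PROSE = 1500  # characters of readable prose, per en:WP:DYK
>>
>>     def passes_dyk_cutoff(wikitext):
>>         # Crude stripping: drop (non-nested) templates and ref tags
>>         # before counting characters.
>>         text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)
>>         text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)
>>         return len(text.strip()) >= DYK_MIN_PROSE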
>>
>> cheers
>> stuart
>>
>> --
>> ...let us be heard from red core to black sky
>>
>> On Tue, Sep 20, 2016 at 9:26 PM, Robert West <w...@cs.stanford.edu>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> Does anyone know if there's a straightforward (ideally
>>> language-independent) way of identifying stub articles in Wikipedia?
>>>
>>> Whatever works is ok, whether it's publicly available data or data
>>> accessible only on the WMF cluster.
>>>
>>> I've found lists for various languages (e.g., Italian
>>> <https://it.wikipedia.org/wiki/Categoria:Stub> or English
>>> <https://en.wikipedia.org/wiki/Category:All_stub_articles>), but the
>>> lists are in different formats, so separate code is required for each
>>> language, which doesn't scale.
>>>
>>> I guess in the worst case I'll have to grep for the stub templates in
>>> the respective wikitext dumps, but even that requires knowing what the
>>> stub template is called in each language. So if anyone could point me
>>> to a list of stub templates in different languages, that would also be
>>> appreciated.
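>>>
>>> (Roughly what I have in mind, as a Python sketch over the dump text;
>>> the template name would have to be supplied per language:)
>>>
>>>     import re
>>>
>>>     def has_stub_template(wikitext, template_name="stub"):
>>>         # Matches {{stub}} or {{stub|...}}, case-insensitively.
>>>         pat = r"\{\{\s*" + re.escape(template_name) + r"\s*(\||\}\})"
>>>         return re.search(pat, wikitext, re.IGNORECASE) is not None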
>>>
>>> Thanks!
>>> Bob
>>>
>>> --
>>> Up for a little language game? -- http://www.unfun.me
>>>
>>
>
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
