I'd strongly caution against using the stub categories without *also*
doing some kind of filtering on size. There's a real problem with
"stub lag" - articles get tagged, incrementally improve, no-one thinks
they've done enough to justify removing the tag (or notices the tag is
there, or thinks they're allowed to remove it)... and you end up with
a lot of multi-section pages with a good few hundred words of text still tagged as stubs.
(Talkpage ratings are even worse for this, but that's another issue.)
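The size filter described above can be sketched roughly as follows. This is a minimal illustration: the 2,048-byte cutoff is an arbitrary assumption, and in practice the byte lengths would come from the MediaWiki API's `prop=info` "length" field or from a dump.

```python
# Minimal sketch: combine a stub tag with a size cutoff, so that
# long-since-improved articles that still carry the tag are excluded.
# The 2048-byte threshold is illustrative, not a recommendation.

def filter_by_length(lengths, max_bytes=2048):
    """lengths: {title: page size in bytes}, e.g. the 'length' field
    returned by the MediaWiki API (action=query&prop=info).
    Returns titles short enough to plausibly still be stubs."""
    return sorted(t for t, n in lengths.items() if n <= max_bytes)

# Hypothetical sizes for illustration:
sizes = {"Some stub": 743, "Improved article": 15482, "Another stub": 1920}
print(filter_by_length(sizes))  # ['Another stub', 'Some stub']
```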
On 20 September 2016 at 18:01, Morten Wang <nett...@gmail.com> wrote:
> I don't know of a clean, language-independent way of grabbing all stubs.
> Stuart's suggestion is quite sensible, at least for English Wikipedia. When
> I last checked a few years ago, the mean length of an English language stub
> (on a log-scale) was around 1 kB (including all markup), and they're quite
> a bit smaller than any other class.
> I'd also see if the category system allows for some straightforward
> retrieval. English has
> https://en.wikipedia.org/wiki/Category:Stub_categories and
> https://en.wikipedia.org/wiki/Category:Stubs with quite a lot of links to
> other languages, which could be a good starting point. For some of the
> research we've done on quality, exploiting regularities in the category
> system using database access (in other words, LIKE queries) is a quick way
> to grab most articles.
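To illustrate the LIKE-query idea (a sketch, not a recipe): the SQL assumes the standard MediaWiki `categorylinks` schema on the database replicas, and the `*_stubs` naming pattern is English-Wikipedia-specific.

```python
# On the replicas, stub categories can be matched against the
# categorylinks table with something like:
#
#   SELECT DISTINCT cl_from
#   FROM categorylinks
#   WHERE cl_to LIKE '%\_stubs';
#
# The same pattern, applied in pure Python to a list of category names
# (English convention only; other wikis name stub categories differently):
import fnmatch

def stub_categories(category_names, pattern="*_stubs"):
    """Keep category names matching the (English-style) stub pattern."""
    return [c for c in category_names if fnmatch.fnmatch(c, pattern)]

print(stub_categories(["Physics_stubs", "Physics", "France_geography_stubs"]))
# ['Physics_stubs', 'France_geography_stubs']
```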
> A combination of both approaches might be a good way. If you're looking for
> even more thorough classification, grabbing a set and training a classifier
> might be the way to go.
> On 20 September 2016 at 02:40, Stuart A. Yeates <syea...@gmail.com> wrote:
>> en:WP:DYK uses a threshold of 1,500+ characters of prose, which is a useful
>> benchmark. It probably doesn't translate to CJK languages, which have radically different
>> information content per character.
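A crude version of that character count might look like this. The markup stripping below is deliberately simplistic (one level of templates only); real prose-size tools are more careful, and the 1,500-character threshold is the DYK figure quoted above.

```python
# Rough prose-length check in the spirit of the DYK 1,500-character rule.
# Strips templates (one nesting level), <ref> tags, wikilink markup, and
# bold/italic quote marks before counting characters.
import re

def prose_length(wikitext):
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # templates
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)    # references
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # wikilinks
    text = re.sub(r"'{2,}", "", text)                              # ''italics''/'''bold'''
    return len(text.strip())

def meets_dyk_threshold(wikitext, threshold=1500):
    return prose_length(wikitext) >= threshold

print(prose_length("[[Paris]] is nice.{{stub}}"))  # 14
```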
>> ...let us be heard from red core to black sky
>> On Tue, Sep 20, 2016 at 9:26 PM, Robert West <w...@cs.stanford.edu> wrote:
>>> Hi everyone,
>>> Does anyone know if there's a straightforward (ideally
>>> language-independent) way of identifying stub articles in Wikipedia?
>>> Whatever works is ok, whether it's publicly available data or data
>>> accessible only on the WMF cluster.
>>> I've found lists for various languages (e.g., Italian or English), but
>>> the lists are in different formats, so separate code is required for each
>>> language, which doesn't scale.
>>> I guess in the worst case, I'll have to grep for the respective stub
>>> templates in the respective wikitext dumps, but even that requires knowing,
>>> for each language, what the respective template is. So if anyone could point
>>> me to a list of stub templates in different languages, that would also be much
>>> appreciated.
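One sketch of that grep, for English only: the `{{foo-stub}}` / `{{stub}}` naming convention is English Wikipedia's, and other languages name their templates differently, which is exactly the scaling problem described above.

```python
# Match English-style stub templates in wikitext, e.g. {{stub}},
# {{physics-stub}}, {{US-politician-stub}}, optionally with parameters.
# English-only and purely illustrative.
import re

STUB_TEMPLATE = re.compile(r"\{\{\s*([A-Za-z0-9 _-]*-)?[Ss]tub\s*(\|[^}]*)?\}\}")

def has_stub_template(wikitext):
    return bool(STUB_TEMPLATE.search(wikitext))

print(has_stub_template("Some text. {{physics-stub}}"))  # True
print(has_stub_template("{{Infobox person}}"))           # False
```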
>>> Up for a little language game? -- http://www.unfun.me
>>> Wiki-research-l mailing list
- Andrew Gray