Hi everyone,
Does anyone know if there's a straightforward (ideally
language-independent) way of identifying stub articles in Wikipedia?
Whatever works is ok, whether it's publicly available data or data
accessible only on the WMF cluster.
I've found lists for various languages (e.g., Italian).
en:WP:DYK has a measure of 1,500+ characters of prose, which is a useful
cutoff. There is weaponised JavaScript to measure that at en:WP:Did you
know/DYKcheck.
It probably doesn't translate to CJK languages, which have radically
different information content per character.
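If you want to replicate that cutoff outside the browser, here is a
minimal Python sketch, assuming the requests and mwparserfromhell
libraries are available. Stripping wikitext markup only approximates
DYKcheck's rendered-prose count, and the article title below is just a
placeholder.

import requests
import mwparserfromhell

PROSE_CUTOFF = 1500  # characters of prose, per en:WP:DYK

def prose_length(title, lang="en"):
    """Fetch an article's wikitext and return its markup-stripped length."""
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={
            "action": "query", "prop": "revisions", "titles": title,
            "rvprop": "content", "rvslots": "main",
            "format": "json", "formatversion": "2",
        },
    )
    page = resp.json()["query"]["pages"][0]
    wikitext = page["revisions"][0]["slots"]["main"]["content"]
    # strip_code() drops templates and <ref> contents, leaving body text.
    return len(mwparserfromhell.parse(wikitext).strip_code())

print(prose_length("Ruthenium") >= PROSE_CUTOFF)  # placeholder title

For CJK wikis you would want a per-language threshold rather than the
1,500-character figure.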
cheers
stuart
I don't know of a clean, language-independent way of grabbing all stubs.
Stuart's suggestion is quite sensible, at least for English Wikipedia. When
I last checked a few years ago, the mean length of an English-language stub
(on a log scale) was around 1kB (including all markup).
You _really_ need to exclude markup and include only body text when
measuring stubs. It's not uncommon for mass-produced articles with only
one or two sentences of text to approach 1K characters once you include
maintenance templates, content templates, categories, the infobox,
references, etc., etc.
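To make that concrete, here is a small sketch with invented wikitext
(assuming mwparserfromhell again) showing how far the raw byte count and
the prose count can diverge on exactly that kind of article:

import mwparserfromhell

# Invented wikitext for a one-sentence, mass-produced-looking article.
wikitext = """{{Infobox settlement|name=Foo|population=42}}
'''Foo''' is a village in Bar.<ref>{{cite web|url=http://example.org|title=Foo}}</ref>
{{Bar-geo-stub}}
[[Category:Villages in Bar]]
"""

raw_len = len(wikitext)  # counts templates, refs, categories, the lot
# strip_code() drops templates and <ref> contents; category links can
# leak through as plain text, so filter those separately if it matters.
prose_len = len(mwparserfromhell.parse(wikitext).strip_code())
print(raw_len, prose_len)  # raw count is several times the prose count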
Hi all,
I'd strongly caution against using the stub categories without *also*
doing some kind of filtering on size. There's a real problem with
"stub lag": articles get tagged, incrementally improve, and no-one thinks
they've done enough to justify removing the tag (or even notices the tag
is there).
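One way to combine the two signals is to take category membership from
the API and keep only pages under a byte threshold. A sketch, assuming
requests; the category name is just an example, the "length" field is
raw wikitext bytes (so pair it with a prose check for borderline cases),
and continuation handling is omitted for brevity:

import requests

def small_members(category, lang="en", max_bytes=2000):
    """Titles in a category whose raw wikitext is at most max_bytes."""
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "generator": "categorymembers",
            "gcmtitle": category,
            "gcmnamespace": "0",   # articles only, skip subcategories
            "gcmlimit": "500",     # one batch; real code would paginate
            "prop": "info",        # "length" = raw wikitext bytes
            "format": "json", "formatversion": "2",
        },
    )
    pages = resp.json().get("query", {}).get("pages", [])
    return [p["title"] for p in pages if p.get("length", 0) <= max_bytes]

print(small_members("Category:Physics stubs"))  # example category name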
[forwarding my answer from the analytics mailing list; I forgot to subscribe to this list too]
Hi Robert,
One solution may be to use a query on Wikidata to retrieve the name
for the stubs category in all the different languages. Then you could
use a tool like PetScan to retrieve all the pages in such categories.
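A rough sketch of the Wikidata half of that, assuming requests and the
Wikidata Query Service. The QID below is a placeholder: look up the real
item via the "Wikidata item" link on any wiki's stub category page. Each
sitelink gives the category's local name, which you can then feed to
PetScan.

import requests

# Placeholder QID; replace with the item for the stubs category.
QUERY = """
SELECT ?sitelink ?wiki WHERE {
  ?sitelink schema:about wd:Q000000 ;
            schema:isPartOf ?wiki .
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "stub-survey-sketch/0.1"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["wiki"]["value"], row["sitelink"]["value"])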