[Wiki-research-l] Fwd: [Analytics] Identifying Wikipedia stubs in various languages

2016-09-20 Thread Giuseppe Profiti
[forwarding my answer from the Analytics mailing list; I forgot to subscribe to this list too]

Hi Robert,
one solution may be to use a query on Wikidata to retrieve the name
of the stub category in each of the different languages. Then you could
use a tool like PetScan to retrieve all the pages in those categories,
or write your own tool using either a database query or the
MediaWiki API.
You can find a sample solution here:
http://paws-public.wmflabs.org/paws-public/3270/Stub%20categories.ipynb

I wrote that thing while on a train, so it may be messy and/or sub-optimal.
I would like to thank Alex Monk and Yuvi Panda for their help with SQL
on PAWS today.
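
For the Wikidata step, a minimal sketch in Python (my reconstruction,
not the notebook's exact code; it assumes enwiki's Category:Stubs is a
suitable anchor item):

import requests

API = "https://www.wikidata.org/w/api.php"

# Resolve the Wikidata item that enwiki's Category:Stubs is linked to,
# then list all of its sitelinks (one per language edition).
resp = requests.get(API, params={
    "action": "wbgetentities",
    "sites": "enwiki",
    "titles": "Category:Stubs",
    "props": "sitelinks",
    "format": "json",
}).json()

for entity in resp["entities"].values():
    for link in entity.get("sitelinks", {}).values():
        print(link["site"], link["title"])  # wiki id, local category title

The printed titles are exactly what you would feed into PetScan or a
per-wiki category query.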

Best,
Giuseppe



Re: [Wiki-research-l] Identifying Wikipedia stubs in various languages

2016-09-20 Thread Andrew Gray
Hi all,

I'd strongly caution against using the stub categories without *also*
doing some kind of filtering on size. There's a real problem with
"stub lag": articles get tagged, incrementally improve, and no one thinks
they've done enough to justify removing the tag (or notices the tag is
there, or thinks they're allowed to remove it)... and you end up with
a lot of multi-section pages with a good hundred words of text still
labelled "stub".

(Talkpage ratings are even worse for this, but that's another issue.)
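
As a rough illustration of that extra filter, in Python (the byte
threshold is an arbitrary placeholder, and note that prop=info reports
wikitext length in bytes, not prose length):

import requests

API = "https://en.wikipedia.org/w/api.php"
MAX_BYTES = 2500  # illustrative only, not a recommended cutoff

def still_small(titles):
    """Keep only pages whose wikitext length is under MAX_BYTES."""
    resp = requests.get(API, params={
        "action": "query",
        "prop": "info",
        "titles": "|".join(titles),  # up to 50 titles per request
        "format": "json",
    }).json()
    return [page["title"]
            for page in resp["query"]["pages"].values()
            if page.get("length", 0) < MAX_BYTES]

Run over the members of each stub category, something like this drops
the pages that have quietly outgrown the tag.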

Andrew.




-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk



Re: [Wiki-research-l] Identifying Wikipedia stubs in various languages

2016-09-20 Thread Stuart A. Yeates
You _really_ need to exclude markup and include only body text when
measuring stubs. It's not uncommon for mass-produced articles with only
one or two sentences of text to approach 1K characters once you include
maintenance templates, content templates, categories, infoboxes,
references, etc., etc.
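
A minimal sketch of doing that in Python, assuming the third-party
mwparserfromhell library (its strip_code() is an approximation of the
rendered body text, not an exact rendering):

import requests
import mwparserfromhell  # pip install mwparserfromhell

API = "https://en.wikipedia.org/w/api.php"

def prose_length(title):
    """Character count of an article's wikitext with markup stripped."""
    resp = requests.get(API, params={
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }).json()
    page = next(iter(resp["query"]["pages"].values()))
    wikitext = page["revisions"][0]["*"]
    # strip_code() drops templates, refs, categories and other markup,
    # keeping only the visible text.
    return len(mwparserfromhell.parse(wikitext).strip_code())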

cheers
stuart

--
...let us be heard from red core to black sky



Re: [Wiki-research-l] Identifying Wikipedia stubs in various languages

2016-09-20 Thread Morten Wang
I don't know of a clean, language-independent way of grabbing all stubs.
Stuart's suggestion is quite sensible, at least for English Wikipedia. When
I last checked a few years ago, the mean length of an English-language stub
(on a log scale) was around 1 kB (including all markup), and stubs are much
smaller than any other class.

I'd also see if the category system allows for some straightforward
retrieval. English has
https://en.wikipedia.org/wiki/Category:Stub_categories and
https://en.wikipedia.org/wiki/Category:Stubs with quite a lot of links to
other languages, which could be a good starting point. For some of the
research we've done on quality, exploiting regularities in the category
system using database access (in other words, LIKE queries) is a quick way
to grab most articles.
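
For the curious, a sketch of such a LIKE query from PAWS in Python (the
replica hostname and credentials file follow the Labs conventions of the
time and may differ; the pattern relies on enwiki's "..._stubs" category
naming):

import pymysql

conn = pymysql.connect(
    host="enwiki.labsdb",           # replica hostname (assumption)
    db="enwiki_p",
    read_default_file="~/.my.cnf",  # replica credentials as provisioned
    charset="utf8",
)

query = r"""
SELECT DISTINCT page_title
FROM page
JOIN categorylinks ON cl_from = page_id
WHERE page_namespace = 0          -- articles only
  AND cl_to LIKE '%\_stubs'       -- backslash makes the underscore literal
"""

with conn.cursor() as cur:
    cur.execute(query)
    for (page_title,) in cur:
        print(page_title.decode("utf-8"))  # page_title is stored as binary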

A combination of both approaches might be a good way to go. If you're
looking for an even more thorough classification, grabbing a labelled set
and training a classifier might be the way forward.


Cheers,
Morten




Re: [Wiki-research-l] Identifying Wikipedia stubs in various languages

2016-09-20 Thread Stuart A. Yeates
en:WP:DYK has a measure of 1,500+ characters of prose, which is a useful
cutoff. There is weaponised JavaScript to measure that at en:WP:Did you
know/DYKcheck.

It probably doesn't translate to CJK languages, which have radically
different information content per character.
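
For anyone who'd rather apply that cutoff outside the browser, a sketch
against the TextExtracts API in Python (TextExtracts' idea of prose
differs a little from DYKcheck's, so treat the counts as approximate):

import requests

API = "https://en.wikipedia.org/w/api.php"

def is_stub_by_prose(title, cutoff=1500):
    """True if the article's plain-text extract is under the DYK cutoff."""
    resp = requests.get(API, params={
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,  # plain text rather than HTML
        "titles": title,
        "format": "json",
    }).json()
    page = next(iter(resp["query"]["pages"].values()))
    return len(page.get("extract", "")) < cutoff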

cheers
stuart

--
...let us be heard from red core to black sky



[Wiki-research-l] Identifying Wikipedia stubs in various languages

2016-09-20 Thread Robert West
Hi everyone,

Does anyone know if there's a straightforward (ideally
language-independent) way of identifying stub articles in Wikipedia?

Whatever works is ok, whether it's publicly available data or data
accessible only on the WMF cluster.

I've found lists for various languages (e.g., Italian or English), but the
lists are in different formats, so separate code is required for each
language, which doesn't scale.

I guess in the worst case, I'll have to grep for the respective stub
templates in the respective wikitext dumps, but even this requires knowing,
for each language, what the respective template is. So if anyone could point
me to a list of stub templates in different languages, that would also be
appreciated.
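
A rough sketch of that worst-case grep in Python (the "-stub" pattern is
enwiki's template-naming convention, and the dump filename is
illustrative):

import bz2
import re

# enwiki stub templates end in "-stub", e.g. {{physics-stub}}.
STUB_RE = re.compile(r"\{\{[^{}|]*-stub\s*(\||\}\})", re.IGNORECASE)
TITLE_RE = re.compile(r"<title>(.*?)</title>")

title = None
with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rt",
              encoding="utf-8") as dump:
    for line in dump:
        m = TITLE_RE.search(line)
        if m:
            title = m.group(1)
        elif title and STUB_RE.search(line):
            print(title)
            title = None  # report each page at most once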

Thanks!
Bob

-- 
Up for a little language game? -- http://www.unfun.me