Re: [Wiki-research-l] Identifying Wikipedia stubs in various languages

2016-09-20 Thread Andrew Gray
Hi all,

I'd strongly caution against using the stub categories without *also*
doing some kind of filtering on size. There's a real problem with
"stub lag" - articles get tagged, incrementally improve, no-one thinks
they've done enough to justify removing the tag (or notices the tag is
there, or thinks they're allowed to remove it)... and you end up with
a lot of multi-section pages with a good hundred words of text still
labelled "stub".

(Talkpage ratings are even worse for this, but that's another issue.)
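
As a very rough sketch of the kind of filtering I mean (this uses the public
API; the 2,500-byte cutoff and the example category are arbitrary placeholders
rather than values I'd defend):

import requests

API = "https://en.wikipedia.org/w/api.php"
MAX_BYTES = 2500  # arbitrary cutoff - would need tuning per wiki/language

def small_stubs(category):
    """Yield titles in `category` whose current wikitext is under MAX_BYTES."""
    params = {
        "action": "query",
        "generator": "categorymembers",
        "gcmtitle": category,
        "gcmnamespace": 0,
        "gcmlimit": "max",
        "prop": "info",   # "info" includes the page length in bytes
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params).json()
        for page in data.get("query", {}).get("pages", {}).values():
            if page.get("length", 0) <= MAX_BYTES:
                yield page["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])

for title in small_stubs("Category:Physics stubs"):  # example category only
    print(title)

Anything tagged as a stub but well over the cutoff is exactly the "stub lag"
population you would want to treat separately.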

Andrew.

On 20 September 2016 at 18:01, Morten Wang <nett...@gmail.com> wrote:
> I don't know of a clean, language-independent way of grabbing all stubs.
> Stuart's suggestion is quite sensible, at least for English Wikipedia. When
> I last checked a few years ago, the mean length of an English-language stub
> (on a log scale) was around 1 kB (including all markup), and stubs were much
> smaller than any other class.
>
> I'd also see if the category system allows for some straightforward
> retrieval. English has
> https://en.wikipedia.org/wiki/Category:Stub_categories and
> https://en.wikipedia.org/wiki/Category:Stubs with quite a lot of links to
> other languages, which could be a good starting point. For some of the
> research we've done on quality, exploiting regularities in the category
> system using database access (in other words, LIKE-queries) is a quick way
> to grab most articles.
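>
> For example, something along these lines against a Labs database replica is
> roughly what I mean (a rough sketch only: the replica host and credentials
> file are placeholders, and the "_stubs" category naming is an English
> Wikipedia convention):
>
> import os
> import pymysql
>
> conn = pymysql.connect(
>     host="enwiki.labsdb",  # placeholder replica host
>     db="enwiki_p",
>     read_default_file=os.path.expanduser("~/replica.my.cnf"),
>     charset="utf8mb4",
> )
> sql = """
>     SELECT p.page_title, p.page_len
>     FROM page p
>     JOIN categorylinks c ON c.cl_from = p.page_id
>     WHERE p.page_namespace = 0
>       AND c.cl_to LIKE '%\\_stubs'  -- e.g. 'Physics_stubs'
> """
> with conn.cursor() as cur:
>     cur.execute(sql)
>     stubs = {row[0] for row in cur.fetchall()}
> print(len(stubs), "articles in stub categories")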
>
> A combination of the two approaches might work well. If you're looking for
> even more thorough classification, grabbing a set and training a classifier
> might be the way to go.
>
>
> Cheers,
> Morten
>
>
> On 20 September 2016 at 02:40, Stuart A. Yeates <syea...@gmail.com> wrote:
>>
>> en:WP:DYK has a measure of 1,500+ characters of prose, which is a useful
>> cutoff. There is weaponised JavaScript to measure that at en:WP:Did you
>> know/DYKcheck.
>>
>> Probably doesn't translate to CJK languages, which have radically different
>> information content per character.
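>>
>> If you wanted a rough wikitext-side approximation of that measure (DYKcheck
>> itself works on the rendered page), a sketch like this gets most of the way
>> there, with the usual caveats about templates and markup:
>>
>> import re
>>
>> def prose_length(wikitext):
>>     """Very rough 'readable prose' character count from raw wikitext."""
>>     text = re.sub(r"<ref[^>]*>.*?</ref>", "", wikitext, flags=re.S)  # footnotes
>>     text = re.sub(r"\{\{[^{}]*\}\}", "", text)                       # simple templates
>>     text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)    # keep link labels
>>     text = re.sub(r"^[=*#:;].*$", "", text, flags=re.M)              # headings, lists
>>     text = re.sub(r"'{2,}", "", text)                                # bold/italic quotes
>>     return len(re.sub(r"\s+", " ", text).strip())
>>
>> def below_dyk_cutoff(wikitext, threshold=1500):
>>     return prose_length(wikitext) < threshold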
>>
>> cheers
>> stuart
>>
>> --
>> ...let us be heard from red core to black sky
>>
>> On Tue, Sep 20, 2016 at 9:26 PM, Robert West <w...@cs.stanford.edu> wrote:
>>>
>>> Hi everyone,
>>>
>>> Does anyone know if there's a straightforward (ideally
>>> language-independent) way of identifying stub articles in Wikipedia?
>>>
>>> Whatever works is ok, whether it's publicly available data or data
>>> accessible only on the WMF cluster.
>>>
>>> I've found lists for various languages (e.g., Italian or English), but
>>> the lists are in different formats, so separate code is required for each
>>> language, which doesn't scale.
>>>
>>> I guess in the worst case, I'll have to grep for the respective stub
>>> templates in the respective wikitext dumps, but even this requires knowing,
>>> for each language, what the respective template is. So if anyone could point
>>> me to a list of stub templates in different languages, that would also be
>>> appreciated.
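>>>
>>> For what it's worth, the worst case I have in mind would look roughly like
>>> this (a sketch using the mwxml package to stream a pages-articles dump; the
>>> "-stub" pattern is the English naming convention, which is exactly the
>>> per-language knowledge I'm missing):
>>>
>>> import bz2
>>> import re
>>> import mwxml  # pip install mwxml
>>>
>>> STUB_RE = re.compile(r"\{\{[^{}|]*-stub\s*(\||\}\})", re.IGNORECASE)
>>>
>>> def stub_titles(dump_path):
>>>     """Yield main-namespace titles whose wikitext carries a *-stub template."""
>>>     dump = mwxml.Dump.from_file(bz2.open(dump_path, "rb"))
>>>     for page in dump:
>>>         if page.namespace != 0:
>>>             continue
>>>         for revision in page:  # pages-articles dumps have one revision per page
>>>             if revision.text and STUB_RE.search(revision.text):
>>>                 yield page.title
>>>             break
>>>
>>> for title in stub_titles("enwiki-latest-pages-articles.xml.bz2"):
>>>     print(title)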
>>>
>>> Thanks!
>>> Bob
>>>
>>> --
>>> Up for a little language game? -- http://www.unfun.me
>>>
>>
>>
>
>



-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk



Re: [Wiki-research-l] unique visitors

2016-03-19 Thread Andrew Gray
On 17 March 2016 at 19:40, phoebe ayers <phoebe.w...@gmail.com> wrote:

>> One of the drawbacks is that we
>> can't report on a single total number across all our projects.
>
> Hmm. That's unfortunate for understanding reach -- if nothing else,
> the idea that "half a billion people access Wikipedia" (e.g. from
> earlier comScore reports) was a PR-friendly way of giving an idea of
> the scale of our readership. But I can see why it would be tricky to
> measure. Since this is the research list: I suspect there's still lots
> to be done in understanding just how multilingual people use different
> language editions of Wikipedia, too.

Building on this question a little: with the information we currently
have, is it actively *wrong* for us to keep using the "half a billion"
figure as a very rough first-order estimate? (Like Phoebe, I think I
keep trotting it out when giving talks.) Do the new figures give us
reason to think the true number is substantially higher or lower than that,
or is the question simply not meaningfully answerable?

-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk



Re: [Wiki-research-l] Looking for help finding tools to measure UNESCO project

2015-10-06 Thread Andrew Gray
On 6 October 2015 at 14:12, Amir E. Aharoni
<amir.ahar...@mail.huji.ac.il> wrote:
> Thanks for this email.
>
> This raises a wider question: What is a convenient way to compare the
> coverage of a topic in different languages?
>
> For example, I'd love to see a report that says:
>
> Number of articles about UNESCO cultural heritage:
> English Wikipedia: 1000
> French Wikipedia: 1200
> Hebrew Wikipedia: 742
> etc.
>
> And also to track this over time, so that if somebody worked hard on creating
> articles about UNESCO cultural heritage in Hebrew, I'd see a trend graph.

There are two general approaches to this:

a) On Wikidata
b) On the individual wikis

Approach (a) would rely on having a defined set of things in Wikidata
that we can identify. For example, "is a World Heritage Site" would be
easy enough, since we have a property explicitly dealing with WHS
identifiers (and we have 100% coverage in Wikidata). "Is of interest
to UNESCO" is a trickier one - but if you can construct a suitable
Wikidata query...

As Federico notes, for WHS records, we can generate a report like
https://tools.wmflabs.org/mix-n-match/?mode=sitestats=93
(57.4% coverage on hewiki!). There are no graphs, but if you were interested
you could probably set one up without much work.
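
For the WHS case, a sketch of the kind of query I mean (assuming the property
is still P757, the World Heritage Site ID, and using the public query
service):

import requests

QUERY = """
SELECT ?lang (COUNT(DISTINCT ?item) AS ?articles) WHERE {
  ?item wdt:P757 ?whs_id .                  # has a World Heritage Site ID
  ?article schema:about ?item ;
           schema:inLanguage ?lang ;
           schema:isPartOf ?site .
  FILTER(CONTAINS(STR(?site), "wikipedia.org"))
}
GROUP BY ?lang
ORDER BY DESC(?articles)
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
)
for row in response.json()["results"]["bindings"]:
    print(row["lang"]["value"], row["articles"]["value"])

Run on a schedule, something like that would also give Amir his trend graph
for the WHS subset.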

Approach (b) is more useful for fuzzy groups like "of relevance to UNESCO",
since this is more or less perfect for a category system. However, it
would require examining the category tree for each WP you're
interested in to figure out exactly which categories are relevant, and
then running a script to count those daily.
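
A minimal version of that counting script might look like the following - the
category name is only an example, and the real input would be whatever list
the category-tree examination produces:

import requests
from datetime import date

def count_members(wiki, category):
    """Count main-namespace pages directly in `category` on `wiki` (e.g. 'he')."""
    api = "https://{}.wikipedia.org/w/api.php".format(wiki)
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmnamespace": 0,
        "cmlimit": "max",
        "format": "json",
    }
    total = 0
    while True:
        data = requests.get(api, params=params).json()
        total += len(data["query"]["categorymembers"])
        if "continue" not in data:
            return total
        params.update(data["continue"])

# run daily (e.g. from cron), appending one line per wiki/category to a log
print(date.today(), "en", count_members("en", "Category:World Heritage Sites"))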

A.
-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk



Re: [Wiki-research-l] [Spam] Re: citations to articles cited on wikipedia?

2015-08-21 Thread Andrew Gray
They did; DOAJ seems to have been the method used to determine whether
a journal was OA or not (which is fair enough).

Andrew.

On 21 August 2015 at 12:50, Federico Leva (Nemo) <nemow...@gmail.com> wrote:
> Andrew Gray, 20/08/2015 14:21:
>> They worked on a journal basis, classing them as OA or not OA.
>
> Weird, why didn't they just use DOAJ? https://doaj.org/
>
> Nemo





-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk



Re: [Wiki-research-l] citations to articles cited on wikipedia?

2015-08-20 Thread Andrew Gray
On 20 August 2015 at 06:54, Jane Darnell <jane...@gmail.com> wrote:
>> ...the odds that an open access journal is referenced on
>> the English Wikipedia are 47% higher compared to closed access
>
> Thanks for posting! That's an interesting paper, for all sorts of reasons. I
> read it because I highly doubt that the number is as low as that. There is

I've been meaning to actually go through this paper for a while, and
finally did so this morning :-).

They worked on a journal basis, classing them as OA or not OA. But
this is, in some ways, a very small sample. See, e.g.,
http://science-metrix.com/files/science-metrix/publications/d_1.8_sm_ec_dg-rtd_proportion_oa_1996-2013_v11p.pdf,
which suggests that articles in gold OA titles represent less than 15%
of the total amount freely available through various forms.

Given this limitation, it seems quite plausible that the actual
OA:citation correlation is higher on a *per-paper* basis... we just
don't really have the information to be sure.

-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk



Re: [Wiki-research-l] [Analytics] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal

2015-01-13 Thread Andrew Gray
Hi Dario, Reid,

This seems sensible enough and proposal #3 is clearly the better
approach. An explicit opt-in/opt-out mechanism would not be worth the
effort to build and would become yet another ignored preferences
setting after a few weeks...

A couple of thoughts:

* I understand the reasoning for not using do-not-track headers (#4);
however, it feels a bit odd to say "they probably don't mean us" and
skip them... I can almost guarantee you'll have at least one person
making a vocal fuss about not being able to opt out without an
account. If we were to honour these headers, would it make a
significant change to the amount of data available? Would it likely
skew it any more than leaving off logged-in users?

* Option 3 does release one further piece of information over and
above those listed - an approximate ratio of logged-in versus
non-logged-in pageviews for a page. I cannot see any particular
problem with doing this (and I can think of a couple of fun things to
use it for), but it's probably worth being aware of.

Andrew.

On 13 January 2015 at 07:26, Dario Taraborelli
<dtarabore...@wikimedia.org> wrote:
> I’m sharing a proposal that Reid Priedhorsky and his collaborators at Los
> Alamos National Laboratory recently submitted to the Wikimedia Analytics Team
> aimed at producing privacy-preserving geo-aggregates of Wikipedia pageview
> data dumps and making them available to the public and the research
> community. [1]
>
> Reid and his team spearheaded the use of the public Wikipedia pageview dumps
> to monitor and forecast the spread of influenza and other diseases, using
> language as a proxy for location. This proposal describes an aggregation
> strategy adding a geographical dimension to the existing dumps.
>
> Feedback on the proposal is welcome on the lists or the project talk page on
> Meta [3]
>
> Dario
>
> [1] https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pageviews
> [2] http://dx.doi.org/10.1371/journal.pcbi.1003892
> [3] https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_pageviews



-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk



Re: [Wiki-research-l] [Analytics] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal

2015-01-13 Thread Andrew Gray
Fair enough - I don't use it, and I think I'd got entirely the wrong
end of the stick on what it's for! If it's intended to stop tracking
by third-party sites then it certainly seems to be of little relevance
here.

(It might be worth clarifying this in the proposal, in case a future
ethics-committee reviewer gets the same misapprehension?)

Andrew.

On 13 January 2015 at 20:24, Aaron Halfaker <ahalfa...@wikimedia.org> wrote:
> Andrew,
>
> I think it is reasonable to assume that the "Do not track" header isn't
> referring to this.
>
> From http://donottrack.us/, with emphasis added:
>
> "Do Not Track is a technology and policy proposal that enables users to opt
> out of tracking by websites they do not visit, [...]"
>
> Do not track is explicitly for third-party tracking.  We are merely
> proposing to count those people who do access our sites.  Note that, in this
> case, we are not interested in obtaining identifiers at all, so the word
> "track" seems not to apply.
>
> It seems like we're looking for something like a "Do Not Log Anything At
> All" header.  I don't believe that such a thing exists -- but if it did, I
> think it would be good if we supported it.
>
> -Aaron

> On Tue, Jan 13, 2015 at 2:03 PM, Andrew Gray <andrew.g...@dunelm.org.uk>
> wrote:
>
>> Hi Dario, Reid,
>>
>> This seems sensible enough and proposal #3 is clearly the better
>> approach. An explicit opt-in/opt-out mechanism would not be worth the
>> effort to build and would become yet another ignored preferences
>> setting after a few weeks...
>>
>> A couple of thoughts:
>>
>> * I understand the reasoning for not using do-not-track headers (#4);
>> however, it feels a bit odd to say "they probably don't mean us" and
>> skip them... I can almost guarantee you'll have at least one person
>> making a vocal fuss about not being able to opt out without an
>> account. If we were to honour these headers, would it make a
>> significant change to the amount of data available? Would it likely
>> skew it any more than leaving off logged-in users?
>>
>> * Option 3 does release one further piece of information over and
>> above those listed - an approximate ratio of logged-in versus
>> non-logged-in pageviews for a page. I cannot see any particular
>> problem with doing this (and I can think of a couple of fun things to
>> use it for), but it's probably worth being aware of.
>>
>> Andrew.

>> On 13 January 2015 at 07:26, Dario Taraborelli
>> <dtarabore...@wikimedia.org> wrote:
>>> I’m sharing a proposal that Reid Priedhorsky and his collaborators at
>>> Los Alamos National Laboratory recently submitted to the Wikimedia Analytics
>>> Team aimed at producing privacy-preserving geo-aggregates of Wikipedia
>>> pageview data dumps and making them available to the public and the research
>>> community. [1]
>>>
>>> Reid and his team spearheaded the use of the public Wikipedia pageview
>>> dumps to monitor and forecast the spread of influenza and other diseases,
>>> using language as a proxy for location. This proposal describes an
>>> aggregation strategy adding a geographical dimension to the existing dumps.
>>>
>>> Feedback on the proposal is welcome on the lists or the project talk
>>> page on Meta [3]
>>>
>>> Dario
>>>
>>> [1] https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pageviews
>>> [2] http://dx.doi.org/10.1371/journal.pcbi.1003892
>>> [3] https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_pageviews



>> --
>> - Andrew Gray
>>   andrew.g...@dunelm.org.uk








-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk
