Re: [Wikimedia-l] article bytes more meaningful than users or revisions (was Re: Updates on VE data analysis)

2013-07-27 Thread Mark

On 7/27/13 10:29 AM, Denny Vrandečić wrote:

> I still would worry, though: our content is increasing linearly, as you
> say, but the number of active contributors is not. If we take for granted
> that active contributors are the ones who provide quality control for the
> articles, this means that since 2006 or so the ratio of content per
> contributor is linearly declining, which would mean that our quality would
> suffer.



One useful bit of information is what *kind* of editors there are, not
just the raw numbers.


For example, here is a hypothetical situation, which I think James and 
John are contemplating, which would result in a numerical decline in 
editors-per-article with no real change in actual editorial attention to 
the article:


* Article in 2007, with 19 editors: Initial content written by 1 person, 
moderate expansions from 3 people, copyediting from 5 people, 
vandalism-rollback from 10 people


* Similar article in 2013, with 12 editors: Initial content written by 1 
person, moderate expansions from 3 people, copyediting from 3 people and 
1 typo-fixing bot, vandalism-rollback from 2 people and 2 anti-vandal bots


Basically all that happened in this hypothetical is that two of the 
typo-fixers were replaced by a typo-fixing bot, and 8 rollbacks that 
would've once been done by recent-changes patrollers were instead done 
by a smaller number of anti-vandal bots. Maybe that's not what the 
change looks like, but I don't think the raw edit-count data can tell us 
either way.


I think this is also a potential issue with the definition of active
users: 5 edits/month for "active" and 100 edits/month for "very
active". The latter in particular much more heavily favors people who
make many smaller edits over those who make fewer large edits. And are
there editors contributing substantial amounts of content to Wikipedia
who don't even hit the lower threshold? One possible group is people
whose main contribution is writing new articles, and who do little to
no other editing. Some people write offline and then contribute a new,
well-referenced article in a single edit. If that's their only
involvement in Wikipedia, they wouldn't be counted as active
Wikipedians in the numbers, even if they're sending us a steady stream
of 1-2 new articles/month.


I'm not sure how to best answer those questions automatically. Bytes, as 
James suggests, could be one possible proxy, but in addition to total 
bytes, we could look at the editor level. Has there been a decline in 
"active editors" if we define active editing as changing more than N 
bytes in the encyclopedia in a month, not counting rollbacks? That would 
count people who wrote substantial new articles as active, even if they 
did it in only 1 or 2 edits/month (although on the other hand, it 
wouldn't count people who made 100 rollbacks and no other edits).
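
To make the difference between the two definitions concrete, here is a
minimal sketch in Python. The edit log, the editor names, and the
threshold N = 1000 bytes are all invented for illustration; a real
analysis would read revision data from the dumps and would need its own
way of flagging rollbacks.

```python
from collections import defaultdict

# Hypothetical edit log for one month: (editor, month, bytes_changed, is_rollback).
# All names and numbers are invented for illustration.
EDITS = (
    [("offline_writer", "2013-06", 12000, False)]    # one new article in one edit
    + [("copyeditor", "2013-06", 30, False)] * 6     # six small fixes
    + [("patroller", "2013-06", 500, True)] * 100    # rollbacks only
)


def active_by_edit_count(edits, min_edits=5):
    """Editors with at least min_edits edits in a month (the edit-count definition)."""
    counts = defaultdict(int)
    for editor, month, _nbytes, _is_rollback in edits:
        counts[(editor, month)] += 1
    return {key for key, n in counts.items() if n >= min_edits}


def active_by_bytes(edits, min_bytes):
    """Editors who changed more than min_bytes in a month, ignoring rollbacks."""
    totals = defaultdict(int)
    for editor, month, nbytes, is_rollback in edits:
        if not is_rollback:
            totals[(editor, month)] += abs(nbytes)
    return {key for key, total in totals.items() if total > min_bytes}


by_count = active_by_edit_count(EDITS)
by_bytes = active_by_bytes(EDITS, min_bytes=1000)  # N = 1000 is an arbitrary choice
print("active by edit count:", sorted(by_count))
print("active by bytes:     ", sorted(by_bytes))
```

In this toy log the edit-count definition counts the copyeditor and the
patroller but misses the offline writer, while the byte-based definition
does the reverse, which is the trade-off described above.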


Another possibility could be to sample a subset of either articles or
editors, and manually annotate what kind of editing is going on. That
would be more tedious and would of necessity cover only a small subset
of the encyclopedia, but it might avoid papering over things that are
obvious when you look at them but tend to get lost in big-data
analyses.


-Mark

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] article bytes more meaningful than users or revisions (was Re: Updates on VE data analysis)

2013-07-27 Thread James Salsman
Denny Vrandečić wrote:
>...
> Is the graph  based on actual data?

Yes, the precise sizes for the
dumps.wikimedia.org/enwiki/MMDD/enwiki-MMDD-pages-articles-multistream.xml.bz2
files are:

2012-07-02 9524994664
2012-08-02 9824345489
2012-09-02 9929910893
2012-10-01 10015876877
2012-11-01 10124555675
2012-12-01 10220499338
2013-01-02 10315766966
2013-02-04 10425240648
2013-03-04 10430830645
2013-04-03 10433658645
2013-05-03 10525475953
2013-06-04 10617572833
2013-07-08 10721955835

The byte count approximations from multiplying columns 'E' and 'I'
from http://stats.wikimedia.org/EN/TablesWikipediaEN.htm are at the
end of this message. Again, that data best fits two linear trends,
with a cusp around 2006.
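
For anyone who wants to check how linear the dump sizes listed above
are, the following is a rough sketch of an ordinary least-squares fit in
plain Python. The function name `linear_fit` is my own, and note that
these are sizes of bz2-compressed files, so they are only a proxy for
the uncompressed article text.

```python
from datetime import date

# Dump dates and file sizes in bytes, copied from the list above.
DUMPS = [
    (date(2012, 7, 2), 9524994664),
    (date(2012, 8, 2), 9824345489),
    (date(2012, 9, 2), 9929910893),
    (date(2012, 10, 1), 10015876877),
    (date(2012, 11, 1), 10124555675),
    (date(2012, 12, 1), 10220499338),
    (date(2013, 1, 2), 10315766966),
    (date(2013, 2, 4), 10425240648),
    (date(2013, 3, 4), 10430830645),
    (date(2013, 4, 3), 10433658645),
    (date(2013, 5, 3), 10525475953),
    (date(2013, 6, 4), 10617572833),
    (date(2013, 7, 8), 10721955835),
]


def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Use days since the first dump as the x axis, since dumps are not evenly spaced.
days = [(d - DUMPS[0][0]).days for d, _ in DUMPS]
sizes = [size for _, size in DUMPS]
slope, intercept, r2 = linear_fit(days, sizes)
print(f"growth: {slope / 1e6:.2f} MB/day, R^2 = {r2:.3f}")
```

A high R^2 over thirteen points says the compressed dump grew at a
roughly steady rate over this one year; it does not by itself say
anything about the longer trend.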

> our content is increasing... but the number of active
> contributors is not.

I'm becoming increasingly convinced that as contributors become more
experienced, they choose to do most of their work logged out. What are
the advantages of using a registered account? Theoretically you can
prove that you made contributions, but as far as I know only one
person so far has ever obtained professional credit for their
contributions (there is a recent thread on wiki-research-l about
this). What are the disadvantages of using a registered account to
edit? Anyone who opposes an edit politically is likely to examine the
entirety of the editor's contribution history and will all too often
stalk, punish by reverting old edits, or dispute the contributor's
work. Anonymous IP editors rarely face such time-wasting scrutiny and
hassles. For anyone whose primary goal is to build an encyclopedia as
opposed to socializing, amassing administrative power, or obtaining a
job with the Foundation, the choice is obvious.  Those who wish their
contributions to be remembered for posterity are more likely to become
serial puppeteers than registered editors, unless they want to spend
most of their time being hassled in article space.

John Vandenberg wrote:
>...
> I would love to see stats about quality rather than quantity

It would be a mistake to rely on volunteer or Foundation assessments
of quality, because the likelihood that they would be biased is far too
great. We should rely only on third-party assessments of article
quality, such as those in
http://en.wikipedia.org/wiki/Reliability_of_Wikipedia#Assessments
nearly all of which show continuous ongoing improvement.

Automatic measures of quality proposed so far have not really
impressed me, but I think http://arxiv.org/pdf/1206.2517.pdf has huge
potential and I am confident that the ideas it promotes will be easily
automated by bots after it is proven through peer review.

> Does anyone have stats for the number of blocked users per month

Yes, but it's almost meaningless because the vast majority of blocks
are for persistent vandalism, often at schools or libraries where we
really have no way to determine whether the editors involved ever
returned to do productive work.

---

Products of columns 'E' and 'I' from
http://stats.wikimedia.org/EN/TablesWikipediaEN.htm :

Jan-10 1133050
Dec-09 1126230
Nov-09 1120650
Oct-09 1078800
Sep-09 1072500
Aug-09 1065300
Jul-09 1026310
Jun-09 1021380
May-09 979160
Apr-09 971880
Mar-09 932850
Feb-09 930150
Jan-09 925020
Dec-08 885560
Nov-08 880620
Oct-08 841500
Sep-08 837500
Aug-08 831750
Jul-08 796080
Jun-08 794160
May-08 755780
Apr-08 749800
Mar-08 711260
Feb-08 706860
Jan-08 673890
Dec-07 669900
Nov-07 631800
Oct-07 625600
Sep-07 585960
Aug-07 582350
Jul-07 549900
Jun-07 518160
May-07 514080
Apr-07 479360
Mar-07 472480
Feb-07 466240
Jan-07 432000
Dec-06 425700
Nov-06 391720
Oct-06 387100
Sep-06 355160
Aug-06 351000
Jul-06 319560
Jun-06 289630
May-06 285670
Apr-06 255700
Mar-06 2476177000
Feb-06 2312907000
Jan-06 2170049000
Dec-05 201360
Nov-05 1869076000
Oct-05 174696
Sep-05 1627864000
Aug-05 1526784000
Jul-05 1407976000
Jun-05 1300334000
May-05 1209984000
Apr-05 1002925000
Mar-05 92463
Feb-05 87232
Jan-05 838272000
Dec-04 861724000
Nov-04 806195000
Oct-04 743904000
Sep-04 689924000
Aug-04 644502000
Jul-04 595665000
Jun-04 55290
May-04 511038000
Apr-04 47675
Mar-04 440286000
Feb-04 40301
Jan-04 375536000
Dec-03 350336000
Nov-03 329219000
Oct-03 310616000
Sep-03 294689000
Aug-03 27863
Jul-03 261555000
Jun-03 244454000
May-03 230328000
Apr-03 21720
Mar-03 20463
Feb-03 193475000
Jan-03 182936000
Dec-02 17101
Nov-02 16215
Oct-02 15048
Sep-02 80733000
Aug-02 6699
Jul-02 59755000
Jun-02 5542
May-02 49259000
Apr-02 4779
Mar-02 44968000
Feb-02 3935
Jan-02 30582000
Dec-01 26832000
Nov-01 21994000
Oct-01 17244000
Sep-01 10982000
Aug-01 710
Jul-01 4186000
Jun-01 324
May-01 2373600
Apr-01 1295800
Mar-01 596904
Feb-01 186636
Jan-01 33800

__

Re: [Wikimedia-l] article bytes more meaningful than users or revisions (was Re: Updates on VE data analysis)

2013-07-27 Thread John Vandenberg
On Sat, Jul 27, 2013 at 6:29 PM, Denny Vrandečić wrote:
> Thank you for the observation.
>
> Is the graph  based on actual data? Because
> it looks just a tad bit too linear to me. (I do not disagree with the
> finding, just wondering about the graph itself).
>
> I still would worry, though: our content is increasing linearly, as you
> say, but the number of active contributors is not. If we take for granted
> that active contributors are the ones who provide quality control for the
> articles, this means that since 2006 or so the ratio of content per
> contributor is linearly declining, which would mean that our quality would
> suffer.

There are a few parts of this that I don't think can be taken for
granted, and I would love to see stats about quality rather than
quantity, as you're talking about quality, and that should be a
significant component of our analysis.

1) 'active contributors are the ones who provide quality control'

   bots do a lot of what used to be done by humans back in 2007,
rolling back most silly edits.
   and it is a small subset of active contributors who do the majority
of the maintenance.

2) the number of active contributors _doing quality control_ has declined.

   we know the number of overall editors is declining, and I think you
are right that the number doing quality control is declining, but is
there evidence to support it?  And does it support that this decline is
a problem?

My gut feeling is that the decline in 'quality control' edits is
tightly linked to the increase in bots doing quality control.

i.e. do we have research to support total article-to-editor ratio
having a bearing on average quality of content?
A proxy could be average number of references per article ..?

It seems unlikely, as our content over the last five years has
increased in quality, and our number of editors has declined.
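
The references-per-article proxy mentioned above could be sketched
roughly as follows. This is only an illustration: the two sample
articles are invented, the helper names are mine, and counting `<ref>`
tags in wikitext will miss citations made through templates, so treat
the number as a crude lower bound.

```python
import re

# Matches an opening or self-closing <ref ...> tag, but not </ref> or <references/>.
REF_TAG = re.compile(r"<ref(?=[\s>/])", re.IGNORECASE)


def count_refs(wikitext):
    """Number of <ref> tags in one article, counting reuses of named refs too."""
    return len(REF_TAG.findall(wikitext))


def average_refs_per_article(articles):
    """Mean <ref> count over a dict of {title: wikitext}; 0.0 if the dict is empty."""
    if not articles:
        return 0.0
    return sum(count_refs(text) for text in articles.values()) / len(articles)


# Hypothetical sample, not real article text.
SAMPLE = {
    "Referenced article": (
        'Foo.<ref>Smith 2010</ref> Bar.<ref name="j">Jones</ref> '
        'Baz.<ref name="j" />\n<references/>'
    ),
    "Unreferenced stub": "A stub with no sources.",
}

print(average_refs_per_article(SAMPLE))
```

Tracking this average across successive dumps would give one rough
quality series to set against the editor counts.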

> I see two effects to counter that:
>
> 1) as you already mentioned, contributors are getting increasingly more
> experienced and more effective in fulfilling their tasks.
>
> 2) we continue to have a strong increase in readers and even stronger in
> pageviews (i.e. more and more people consult Wikipedia more and more). They
> probably also provide a layer of quality assurance, even though they might
> not qualify to be counted as active contributors.
>
> I have the gut feeling that 1) cannot be sufficient, and I would be curious
> about the effects of 2) - especially considering that much of the Foundation
> development work can be seen as improving 2) further (visual editor,
> article rating, mobile editing, etc.)

I agree with James that (1) is significant, and (2 - 'the future')
brings many unknowns with it.

(1) consists of our entire potential editor base, which includes
all our currently active editors, and all of our inactive editors who
are able to resume editing at any time - i.e. not blocked, not ^&%ed
off, etc.  They all know the syntax, and have demonstrated their
commitment to the vision, _and_ the writers have a personal connection
to the articles that they worked on.  I see lots of them come back
occasionally to touch up or expand their work.

(2) brings different editors, for good or ill.  There are some
concerns in the community that simplifying editing will bring more
non-trivial vandalism that bots can't handle, and even more
well-meaning editors who are discouraged when they can't understand why
their edit has disappeared, because they don't read the history, the
talk pages, etc.  The ratio of experienced editors to newbies could
be a significant factor in the maintenance of a friendly environment.

More is not always better.

Don't get me wrong; a good VE will be very helpful, and the projects'
defensive mechanisms will adapt.  But I predict that if we see lots of
poor quality articles from VE, without adequate references, and the
community backlogs become problematic, the community will want to
develop tools to limit new poor quality articles.

Does anyone have stats for the number of blocked users per month over
the years (as that is hurting our potential editor base), and for the
number of reverts of edits by new users?

--
John Vandenberg



Re: [Wikimedia-l] article bytes more meaningful than users or revisions (was Re: Updates on VE data analysis)

2013-07-27 Thread Denny Vrandečić
Thank you for the observation.

Is the graph  based on actual data? Because
it looks just a tad bit too linear to me. (I do not disagree with the
finding, just wondering about the graph itself).

I still would worry, though: our content is increasing linearly, as you
say, but the number of active contributors is not. If we take for granted
that active contributors are the ones who provide quality control for the
articles, this means that since 2006 or so the ratio of content per
contributor is linearly declining, which would mean that our quality would
suffer.

I see two effects to counter that:

1) as you already mentioned, contributors are getting increasingly more
experienced and more effective in fulfilling their tasks.

2) we continue to have a strong increase in readers and even stronger in
pageviews (i.e. more and more people consult Wikipedia more and more). They
probably also provide a layer of quality assurance, even though they might
not qualify to be counted as active contributors.

I have the gut feeling that 1) cannot be sufficient, and I would be curious
about the effects of 2) - especially considering that much of the Foundation
development work can be seen as improving 2) further (visual editor,
article rating, mobile editing, etc.)





2013/7/27 James Salsman 

> MZMcBride wrote:
> >... the number of non-deleted revisions per day for the
> > English Wikipedia. The results are here:
> > https://en.wikipedia.org/wiki/Special:Permalink/565971356
>
> So, that looks terrible: http://i.imgur.com/Z9lYCWj.png
>
> It looks terrible in the same way that every other graph of active
> users and several other related measures does.
>
> But it isn't. It doesn't account for the power law of practice which
> causes everyone who has ever edited Wikipedia to get better at it with
> time. And since so many IP editors are obviously returning, that means
> a lot more than under the false but very common assumption that every
> IP editor is new.
>
> Here's what really matters, articlespace size:
> http://i.imgur.com/TfaD99V.png
>
> The size of the article text in bytes has been marching on linearly
> since the beginning of Wikipedia, with extremely low variation, just
> like the short popular vital articles and every other measure of
> quality content.
>
> There is no legitimate basis to worry about anything until the total
> article bytes break out of their 12-year linear trend.
>
> (If you multiply columns 'E' and 'I' from
> http://stats.wikimedia.org/EN/TablesWikipediaEN.htm the database size
> shows a cusp at around 2006, corresponding to the growth modes, but
> two separate linear trends fit both modes far better than any growth
> model fits the entire curve.)
>




-- 
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as charitable by the
Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.


[Wikimedia-l] article bytes more meaningful than users or revisions (was Re: Updates on VE data analysis)

2013-07-26 Thread James Salsman
MZMcBride wrote:
>... the number of non-deleted revisions per day for the
> English Wikipedia. The results are here:
> https://en.wikipedia.org/wiki/Special:Permalink/565971356

So, that looks terrible: http://i.imgur.com/Z9lYCWj.png

It looks terrible in the same way that every other graph of active
users and several other related measures does.

But it isn't. It doesn't account for the power law of practice which
causes everyone who has ever edited Wikipedia to get better at it with
time. And since so many IP editors are obviously returning, that means
a lot more than under the false but very common assumption that every
IP editor is new.

Here's what really matters, articlespace size:  http://i.imgur.com/TfaD99V.png

The size of the article text in bytes has been marching on linearly
since the beginning of Wikipedia, with extremely low variation, just
like the short popular vital articles and every other measure of
quality content.

There is no legitimate basis to worry about anything until the total
article bytes break out of their 12-year linear trend.

(If you multiply columns 'E' and 'I' from
http://stats.wikimedia.org/EN/TablesWikipediaEN.htm the database size
shows a cusp at around 2006, corresponding to the growth modes, but
two separate linear trends fit both modes far better than any growth
model fits the entire curve.)
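
One way to test the "two linear trends with a cusp" reading is to try
every possible breakpoint, fit a separate least-squares line to each
side, and compare the total squared error with that of a single line.
The sketch below does this in plain Python on a synthetic series whose
slope changes between months 5 and 6; the numbers are made up to show
the method and are not the stats.wikimedia.org values, and the function
names are my own.

```python
def fit_sse(xs, ys):
    """Least-squares line through the points; returns (slope, intercept, sse)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def best_two_segment_fit(xs, ys, min_points=2):
    """Try every split index k and return (k, total_sse, left_fit, right_fit) with least error."""
    best = None
    for k in range(min_points, len(xs) - min_points + 1):
        left = fit_sse(xs[:k], ys[:k])
        right = fit_sse(xs[k:], ys[k:])
        total = left[2] + right[2]
        if best is None or total < best[1]:
            best = (k, total, left, right)
    return best


# Synthetic series: slope 1 up to month 5, slope 3 from month 6 on.
months = list(range(12))
sizes = [float(m) if m <= 5 else 5.5 + 3.0 * (m - 5.5) for m in months]

single = fit_sse(months, sizes)
split, two_sse, left, right = best_two_segment_fit(months, sizes)
print(f"single line SSE:  {single[2]:.3f}")
print(f"two segments SSE: {two_sse:.3f} (split before month {months[split]})")
print(f"slopes: {left[0]:.2f} then {right[0]:.2f}")
```

A two-segment fit always has at least as low an error as a single line
because it has more free parameters, so on real data the comparison
should also penalize the extra parameters before concluding that the
cusp is real.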
