Re: [Wiki-research-l] Research:Anatomy of English Wikipedia Did You Know traffic

2013-08-04 Thread WereSpielChequers
Hi Laura and Kerry,

One point to remember when comparing views of DYKs with other processes
such as GAs is that DYKs get a slot on the Main Page. In that sense they are
best compared to In the News items and the Featured Article of the Day,
though I'm pretty sure they don't individually get as many hits as the
latter.

Longer term, the things that one would expect to increase readership
are incoming links, redirects, categories and article completeness. If
you add a section to an article covering a new aspect, such as this
particular hill fort being one of the few homes of a particular orchid or
having had a WWII anti-aircraft emplacement there in the forties, then you
can expect to come up in relevant searches and thereby get additional hits.

Some of this is straightforward: if something has some alternative names
then making sure we have redirects for them will enable more people to find
the article.

Some is more complex. I'm not sure how far down an article the search
engines will go, but I assume that the search engines give most weight to
the first paragraph and therefore the lede and the redirects need to
contain the words that people are most likely to be searching for when they
want to find this article.

Jonathan


On 3 August 2013 17:31, Laura Hale la...@fanhistory.com wrote:



 On Saturday, August 3, 2013, Kerry Raymond wrote:

 Hi, Laura!


 Hi Kerry.  Thanks for the comments. :)


 I wonder if a variable worth considering is the number of views of the
 DYK vs the average number of page views of the article(s) (per
 day/week/month or whatever) promoted by the DYK *before* the publication of
 the DYK (obviously this can only be measured for expanded articles rather than
 new ones). The hypothesis here is that more popular topics make more
 popular DYKs.


 This is actually one of the areas that is worth looking at further.
  People have attempted to time DYKs to coincide with certain events.
  TonyTheTiger is actually very good at doing this for some of his hooks.  It
 can and sometimes does create tension in the project as people try to get
 things timed for these events and not everyone wants to oblige them.  (One
 situation that particularly comes to mind is the Kony2012 article at
 http://en.wikipedia.org/wiki/Kony_2012 where the article was stalled at
 DYK because a reviewer did not want to time it to coincide with an already
 large media blitz.)  It would just require a lot of subject knowledge to
 do any in-depth research on this topic, and frequent looking through T:TDYK
 to see where things are in the special holding areas, to identify some of
 these.


 Another interesting variable is number of page views of the article in
 the days/weeks/months after the DYK. It would be interesting to know the
 extent to which DYKs drive additional interest in the topic both in the
 short term and whether any increase in interest is sustained longer term. I
 would hypothesize any initial sharp increase during the DYK, with a sharp
 fall-off after the DYK finishes but with a small sustained elevation.


 Yes, my casual observation has been that historically, articles get an
 average page views per month bump after DYK that they do not enjoy with
 other processes like GA or peer review.  (This casual observation, and the
 assumption that further research would bear it out as likely fact, is based
 on the fact that you have rapid content development other processes do not
 require, and then subsequent SEO strengthening by appearing on the front
 page.)  I think, having looked at the articles, the hypothesis is true, but
 it would need a great deal of additional data to confirm, as would the
 observation that you also have two mini traffic bumps prior to appearing at
 DYK, with the first being from the contributors working on the article, and
 the second as a result of the DYK review.


 It would also be interesting to see if articles mentioned in DYKs show
 any increased edit activity OR the creation of new inbound links to the
 article in the short or long term, but I am less sure about what is the
 baseline for comparison (given that a DYK article will have recently been
 created or expanded, suggesting an abnormally high level of edit activity
 immediately preceding the DYK). Possible proxies are articles in the same
 categories?


 The possible baseline would be new articles that meet the DYK criteria but do
 not appear at DYK, or conversely comparing the article's editing history in
 several periods: before DYK work, during DYK expansion, during DYK review,
 the day of and the week after DYK review, and the two month period after
 the DYK.  (I had actually considered doing this type of research to look at
 the contributions and DYK, but it would serve a completely different
 purpose.  Hence, it would need to be retooled.  I think this could
 potentially be one of the strengths of DYK that people fail to consider in
 that it does give new articles of a slightly higher caliber more eyes and
 potential contributors from the established editing pool than the 

Re: [Wiki-research-l] A wiki search engine

2013-08-04 Thread Samuel Klein
Hi, awesome to see this move forward.  This is solving a major namespace
style problem (for the namespace of queries) and I fully support it.  Good
luck with the work and I would love to help test the beta.

Sam.
On Aug 4, 2013 12:24 AM, Emilio J. Rodríguez-Posada emi...@gmail.com
wrote:

 Hi all again;

 After some months, we have the domain for LibreFind[1] and some usable
 results[2][3] (the bot is running). Also, there is a mailing list[4] and a
 Google Code project[5].

 I would like you to join the brainstorm. We need to establish some
 policies about how to sort results, bots to check dead links, crawlers to
 improve the results, and much more. You can request an account for the
 closed beta.

 Thanks for your time,
 emijrp

 [1] http://www.librefind.org
 [2] http://www.librefind.org/wiki/Spain
 [3] http://www.librefind.org/wiki/Edgar_Allan_Poe
 [4] http://groups.google.com/group/librefind
 [5] https://code.google.com/p/librefind/

 2012/10/27 emijrp emi...@gmail.com

 After some tests and usability improvements, I'm going to launch an
 English alpha version.

 I still need a cool name for the project, any idea?

 Stay tuned.


 2012/10/23 emijrp emi...@gmail.com

 Yes, there are some options: (semi)protections, blocks, spam
 blacklists, flaggedrevs, abuse filter and some more. All of them are well known
 MediaWiki features and extensions.

 Thanks for your interest.


 2012/10/23 ENWP Pine deyntest...@hotmail.com


 I agree that this sounds like an interesting experiment. I hope that
 you get good faith editors. I worry that you’ll get COI editors playing
 with the search rankings. Do you have a way in mind to deal with that 
 issue?

 Pine

  *From:* emijrp emi...@gmail.com
 *Sent:* Monday, 22 October, 2012 08:29
 *To:* Research into Wikimedia content and communities
 wiki-research-l@lists.wikimedia.org
 *Subject:* [Wiki-research-l] A wiki search engine

 Hi all;

 I'm starting a new project, a wiki search engine. It uses MediaWiki,
 Semantic MediaWiki and other minor extensions, and some tricky templates
 and bots.

 I remember Wikia Search and how it failed. It had the mini-article
 thingy for the introduction, and then a lot of links compiled by a crawler.
 Also something similar to a social network.

 My project idea (which still needs a cool name) is different. Although
 it uses an introduction and images copied from Wikipedia, and some links
 from the External links sections, that is only a start. The purpose is that
 the community adds, removes and orders the results for each term, and creates
 redirects for similar terms to avoid duplicates.

 Why this? I think that Google PageRank isn't enough. It is frequently
 abused by link farms, SEOs and other people trying to push their websites
 higher.

 Search for Shakira in Google, for example. You see 1) Official site, 2)
 Wikipedia 3) Twitter 4) Facebook, then some videos, some news, some images,
 Myspace. It wastes 3 or more results on obvious sites (WP, TW, FB).
 The wiki search engine puts these sites at the top, with an introduction and
 related terms, leaving all the space below for not so obvious but
 interesting websites. Also, if you search for semantic queries like
 right-wing newspapers in Google, you won't find real newspapers but
 people and sites discussing right-wing newspapers. Or latex and
 LaTeX are shown in the same results pages. These issues can be resolved
 with disambiguation result pages.

 How do we choose which results go above or below? The rules are not fully
 designed yet, but we can put official sites in first place, then .gov
 or .edu domains, which are important ones, and later unofficial websites
 and blogs, giving priority to the local language, etc., and reaching consensus.
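
 A minimal sketch of the ordering described above, in Python. The thread says
 the rules are not fully designed yet, so this is only one possible reading of
 them. The `Result` fields, the tier numbers and the `rank_results` helper are
 all hypothetical names chosen for illustration, not part of the project.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Result:
    url: str
    is_official: bool = False  # assumption: flagged by hand by the community
    is_blog: bool = False      # assumption: flagged by hand by the community
    language: str = "en"


def tier(result: Result) -> int:
    """Lower tier sorts first: official, then .gov/.edu, then other sites, then blogs."""
    host = urlparse(result.url).hostname or ""
    if result.is_official:
        return 0
    if host.endswith(".gov") or host.endswith(".edu"):
        return 1
    if result.is_blog:
        return 3
    return 2


def rank_results(results, local_language="en"):
    # Within a tier, results in the local language come first.
    # sorted() is stable, so any hand-made order is kept for ties.
    return sorted(results, key=lambda r: (tier(r), r.language != local_language))
```

 Because the sort is stable, an order agreed by consensus can still be
 expressed simply by arranging the input list before calling `rank_results`.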

 We can control aggressive spam with spam blacklists, semi-protect or
 protect highly visible pages, and use bots or tools to check changes.

 It obviously has a CC BY-SA license and results can be exported. I
 think that this approach is the opposite to Google today.

 For weird queries like Albert Einstein birthplace we can redirect to
 the most obvious results page (in this case Albert Einstein) using a
 hand-made redirect or by software (some little change in MediaWiki).

 You can check a pretty alpha version here: http://www.todogratix.es (only
 Spanish for now, sorry), which I'm feeding with some bots.

 I think that it is an interesting experiment. I'm open to your
 questions and feedback.

 Regards,
 emijrp

 --
 Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com
 Pre-doctoral student at the University of Cádiz (Spain)
 Projects: AVBOT http://code.google.com/p/avbot/ |
 StatMediaWiki http://statmediawiki.forja.rediris.es
 | WikiEvidens http://code.google.com/p/wikievidens/ |
 WikiPapers http://wikipapers.referata.com
 | WikiTeam http://code.google.com/p/wikiteam/
 Personal website: https://sites.google.com/site/emijrp/

  --
 ___
 Wiki-research-l mailing list
 

Re: [Wiki-research-l] Research:Anatomy of English Wikipedia Did You Know traffic

2013-08-04 Thread Kerry Raymond
I agree it may be difficult to disentangle the impact of having a DYK and
the impact of having a new/expanded article. However if one can view the
access logs (I am told there is about 3 months worth available), I think you
could tell from the referrer whether the access to the article is coming
from the homepage (and anywhere else that DYKs are listed) or elsewhere. So,
in theory, it's easy enough to tell which page accesses are coming as a
direct result of the DYK. But it's harder to tell what long-term accesses
are an indirect result of the DYK as opposed to the normal impacts of a
new/expanded article. I think you'd have to do some kind of paired
experiment using articles that were DYKed and another similar article that
had the same amount of new/expanded development in the same time frame but
wasn't DYKed. I don't know the likelihood of such article pairs naturally
occurring. It might be that the experiment would have to create its own
pairs, e.g. two members of the same rowing team with articles of the same
length and general content, one DYKed and one not.
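
The paired comparison could be analysed roughly as follows. This is a sketch
only: the daily view counts are made-up placeholders, and `uplift` and
`paired_difference` are hypothetical helper names. It compares each article's
mean daily views after the DYK date with its own mean before, then takes the
difference between the DYKed article and its non-DYKed partner.

```python
from statistics import mean


def uplift(views_before, views_after):
    """Ratio of mean daily views after the DYK date to mean daily views before it."""
    return mean(views_after) / mean(views_before)


def paired_difference(pairs):
    """Mean difference in uplift (DYKed minus control) across article pairs.

    Each pair is ((dyk_before, dyk_after), (ctl_before, ctl_after)),
    where every element is a list of daily page-view counts.
    """
    diffs = [uplift(*dyk) - uplift(*ctl) for dyk, ctl in pairs]
    return mean(diffs)


# Placeholder data, not real measurements.
pairs = [
    (([10, 10, 10], [20, 20, 20]), ([10, 10, 10], [15, 15, 15])),
    (([4, 4], [6, 6]), ([8, 8], [8, 8])),
]
print(paired_difference(pairs))
```

A positive value would suggest the DYKed articles gained more than their
partners, though with only a handful of pairs it would say very little.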

 

Kerry

 


From: wiki-research-l-boun...@lists.wikimedia.org
[mailto:wiki-research-l-boun...@lists.wikimedia.org] On Behalf Of
WereSpielChequers
Sent: Sunday, 4 August 2013 9:31 PM
To: Research into Wikimedia content and communities
Cc: Wikimedia Mailing List
Subject: Re: [Wiki-research-l] Research:Anatomy of English Wikipedia Did You
Know traffic

 


Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

2013-08-04 Thread Aaron Halfaker
(Note that I posted this yesterday, but the message bounced due to the
attached scatter plot.  I have just uploaded the plot to Commons and re-sent.)

I just replicated this analysis.  I think you might have made some
mistakes.

I took a random sample of non-redirect articles from English Wikipedia and
compared the byte_length (from database) to the content_length (from API,
tags and comments stripped).

I get a Pearson correlation coefficient of *0.9514766*.

See the scatter plot including a linear regression line:
http://commons.wikimedia.org/wiki/File:Bytes.content_length.scatter.correlation.enwiki.png
See also the regression output below.

Call:
lm(formula = byte_len ~ content_length, data = pages)

Residuals:
   Min     1Q Median     3Q    Max
-38263   -419     82    592  37605

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    -97.40412   72.46523  -1.344    0.179
content_length   1.14991    0.00832 138.210   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2722 on 1998 degrees of freedom
Multiple R-squared: 0.9053, Adjusted R-squared: 0.9053
F-statistic: 1.91e+04 on 1 and 1998 DF,  p-value: < 2.2e-16
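
For anyone who wants to repeat this kind of check without R, here is a
minimal Python sketch of the same two computations: a Pearson correlation and
a least-squares fit of byte length on content length. The sample values are
placeholders, not the data behind the output above, and `pearson` and
`fit_line` are hypothetical helpers written for this note.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


# Placeholder sample: one value per article.
content_length = [1000, 2000, 3000, 4000]
byte_len = [1200, 2300, 3400, 4700]

r = pearson(content_length, byte_len)
intercept, slope = fit_line(content_length, byte_len)
print(r, intercept, slope)
```

In a regression with a single predictor the squared correlation equals
R-squared, which agrees with the figures above: 0.9514766 squared is about
0.9053.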


On Mon, Aug 5, 2013 at 12:59 AM, WereSpielChequers 
werespielchequ...@gmail.com wrote:

 Hi Fabian,

 That's interesting. When you say you stripped out the html did you also
 strip out the other parts of the references? Some citation styles will take
 up more bytes than others, and citation style is supposed to be consistent
 at the article level.

 It would also make a difference whether you included or excluded alt text
 from readable material, as I suspect it is non-granular, i.e. if someone is
 going to create alt text for one picture in an article they will do so for
 all pictures.

 More significantly, there is a big difference in standards of referencing:
 broadly, the higher the assessed quality and/or the more contentious the
 article, the more references there will be.

 I would expect that if you factored that in there would be some
 correlation between readable length and bytes within assessed classes of
 quality, and the outliers would include some of the controversial articles
 like Jerusalem (353 references)

 Hope that helps.

 Jonathan


 On 2 August 2013 18:24, Floeck, Fabian (AIFB) fabian.flo...@kit.edu wrote:

 Hi,
 to whoever is interested in this (and I hope I didn't just repeat someone
 else's experiments on this):

 I wanted to know if a long or short article in terms of how much
 readable material (excluding pictures) is presented to the reader in the
 front-end is correlated to the byte size of the Wikisyntax which can be
 obtained from the DB or API; as people often define the length of an
 article by its length in bytes.

 TL;DR: Turns out size in bytes is a really, really bad indicator for the
 actual, readable content of a Wikipedia article, even worse than I thought.

 We curled the front-end HTML of all articles of the English Wikipedia
 (ns=0, no disambiguation, no redirects) between 5800 and 6000 bytes (as
 around 5900 bytes is the total en.wiki average for these articles), which
 gave 41981 articles.
 Results for size in characters (w/ whitespaces) after cleaning the HTML
 out:
 Min= 95 Max= 49441 Mean=4794.41 Std. Deviation=1712.748

 Especially the gap between Min and Max was interesting. But templates
 make it possible.
 (See e.g. Veer Teja Vidhya Mandir School, Martin Callanan --
 although for the latter you could argue that expandable template listings
 are not really main reading content.)

 Effectively, correlation for readable character size with byte size =
 0.04 (i.e. none) in the sample.
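
 One way to measure readable characters along these lines, sketched in Python
 with the standard library's `html.parser`. Which elements count as readable
 is an assumption here (the contents of script and style elements are skipped,
 everything else is kept), and `readable_length` is a hypothetical helper, not
 the script used for the figures above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def readable_length(html: str) -> int:
    """Number of characters (whitespace included) left after removing markup."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return len("".join(parser.parts))


page = "<p>Hill <b>fort</b></p><script>var x = 1;</script>"
print(readable_length(page), len(page.encode("utf-8")))
```

 Note that this compares against the size of the rendered HTML; the byte size
 discussed in the thread is that of the wikitext, which would have to come
 from the DB or API instead.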

 If someone already did this or a similar analysis, I'd appreciate
 pointers.

 Best,

 Fabian




 --
 Karlsruhe Institute of Technology (KIT)
 Institute of Applied Informatics and Formal Description Methods

 Dipl.-Medwiss. Fabian Flöck
 Research Associate

 Building 11.40, Room 222
 KIT-Campus South
 D-76128 Karlsruhe

 Phone: +49 721 608 4 6584
 Fax: +49 721 608 4 6580
 Skype: f.floeck_work
 E-Mail: fabian.flo...@kit.edu
 WWW: http://www.aifb.kit.edu/web/Fabian_Flöck

 KIT – University of the State of Baden-Wuerttemberg and
 National Research Center of the Helmholtz Association


 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l





[Wiki-research-l] WikiSym proceedings available

2013-08-04 Thread Heather Ford
WikiSym/OpenSym just began in Hong Kong http://opensym.org/wsos2013/program/day1

Proceedings at http://opensym.org/wsos2013/program/proceedings. Follow on 
Twitter #wikisym #opensym

Thanks, Dirk!

Heather Ford 
Oxford Internet Institute Doctoral Programme 
www.ethnographymatters.net 
@hfordsa on Twitter
http://hblog.org 



Re: [Wiki-research-l] WikiSym proceedings available

2013-08-04 Thread Samuel Klein
How great. Thanks for the link, and much love for your citations
analysis.  (please, please follow up with a comparison across
languages other than English!)

SJ
Just arrived in HKG

On Sun, Aug 4, 2013 at 9:33 PM, Heather Ford hfor...@gmail.com wrote:
 WikiSym/OpenSym just began in Hong Kong
 http://opensym.org/wsos2013/program/day1

 Proceedings at http://opensym.org/wsos2013/program/proceedings. Follow on
 Twitter #wikisym #opensym

 Thanks, Dirk!

 Heather Ford
 Oxford Internet Institute Doctoral Programme
 www.ethnographymatters.net
 @hfordsa on Twitter
 http://hblog.org






-- 
Samuel Klein  @metasj   w:user:sj  +1 617 529 4266



Re: [Wiki-research-l] WikiSym proceedings available

2013-08-04 Thread Heather Ford

On Aug 5, 2013, at 10:25 AM, Samuel Klein wrote:

 How great. Thanks for the link, and much love for your citations
 analysis.  (please, please follow up with a comparison across
 languages other than English!)

Thanks, SJ :) Yes! Shilad, Dave and I just met in Minneapolis to make plans :)


Heather Ford 
Oxford Internet Institute Doctoral Programme 
www.ethnographymatters.net 
@hfordsa on Twitter
http://hblog.org 
