Re: [Wiki-research-l] Research:Anatomy of English Wikipedia Did You Know traffic
Hi Laura and Kerry,

One point to remember when comparing views of DYKs with other processes such as GAs is that DYKs get a slot on the main page. In that sense they are best compared to In the News items and the Featured Article of the Day, though I'm pretty sure they don't individually get as many hits as the latter.

Longer term, the things that one would expect to increase readership are incoming links, redirects, categories and article completeness. If you add a section to an article covering a new aspect, such as this particular hill fort being one of the few homes of a particular orchid, or having had a WWII anti-aircraft emplacement there in the forties, then you can expect to come up in relevant searches and thereby get additional hits.

Some of this is straightforward: if something has alternative names, then making sure we have redirects for them will enable more people to find the article. Some is more complex. I'm not sure how far down an article the search engines will go, but I assume that they give most weight to the first paragraph, and therefore the lede and the redirects need to contain the words that people are most likely to be searching for when they want to find this article.

Jonathan

On 3 August 2013 17:31, Laura Hale la...@fanhistory.com wrote:

> On Saturday, August 3, 2013, Kerry Raymond wrote:
>
>> Hi, Laura!
>
> Hi Kerry. Thanks for the comments. :)
>
>> I wonder if a variable worth considering is the number of views of the DYK vs the average number of page views of the article(s) (per day/week/month or whatever) promoted by the DYK *before* the publication of the DYK (obviously this can only be measured for expanded articles rather than new ones). The hypothesis here is that more popular topics make more popular DYKs.
>
> This is actually one of the areas that is worth looking at further. People have attempted to time DYKs to coincide with certain events. TonyTheTiger is actually very good at doing this for some of his hooks. It can and sometimes does create tension in the project as people try to get things timed for these events and not everyone wants to oblige them. (One situation that particularly comes to mind is the Kony 2012 article at http://en.wikipedia.org/wiki/Kony_2012 where the article was stalled at DYK because a reviewer did not want to time it to coincide with an already large media blitz.) It would just require a lot of subject knowledge to do any in-depth research on this topic, and frequent checks of T:TDYK to see what is in the special holding areas, to identify some of these.
>
>> Another interesting variable is the number of page views of the article in the days/weeks/months after the DYK. It would be interesting to know the extent to which DYKs drive additional interest in the topic in the short term, and whether any increase in interest is sustained longer term. I would hypothesize an initial sharp increase during the DYK, with a sharp fall-off after the DYK finishes but with a small sustained elevation.
>
> Yes, my casual observation has been that historically, articles get a bump in average page views per month after DYK that they do not enjoy with other processes like GA or peer review. (This casual observation, and my assumption that further research would bear it out, is based on the fact that DYK requires rapid content development that other processes do not, followed by SEO strengthening from appearing on the front page.) Having looked at the articles, I think the hypothesis is true, though confirming it would need a great deal of additional data. I also think you have two mini traffic bumps prior to appearing at DYK, the first from the contributors working on the article and the second as a result of the DYK review.
>
>> It would also be interesting to see if articles mentioned in DYKs show any increased edit activity OR the creation of new inbound links to the article in the short or long term, but I am less sure about what the baseline for comparison is (given that a DYK article will have recently been created or expanded, suggesting an abnormally high level of edit activity immediately preceding the DYK). Possible proxies are articles in the same categories?
>
> The possible baseline would be new articles that meet the DYK criteria but do not appear at DYK or, conversely, comparing the article's editing history in several periods: before DYK work, during DYK expansion, during DYK review, the day of and the week after DYK review, and the two-month period after the DYK. (I had actually considered doing this type of research to look at the contributions and DYK, but it would serve a completely different purpose. Hence, it would need to be retooled. I think this could potentially be one of the strengths of DYK that people fail to consider, in that it does give new articles of a slightly higher caliber more eyes and potential contributors from the established editing pool than the
Re: [Wiki-research-l] A wiki search engine
Hi, awesome to see this move forward. This is solving a major namespace-style problem (for the namespace of queries) and I fully support it. Good luck with the work; I would love to help test the beta.

Sam.

On Aug 4, 2013 12:24 AM, Emilio J. Rodríguez-Posada emi...@gmail.com wrote:

Hi all again;

After some months, we have the domain for LibreFind[1] and some usable results[2][3] (the bot is running). There is also a mailing list[4] and a Google Code project[5]. I would like you to join the brainstorm. We need to establish some policies about how to sort results, bots to check dead links, crawlers to improve the results, and much more. You can request an account for the closed beta.

Thanks for your time,
emijrp

[1] http://www.librefind.org
[2] http://www.librefind.org/wiki/Spain
[3] http://www.librefind.org/wiki/Edgar_Allan_Poe
[4] http://groups.google.com/group/librefind
[5] https://code.google.com/p/librefind/

2012/10/27 emijrp emi...@gmail.com

After some tests and usability improvements, I'm going to launch an English alpha version. I still need a cool name for the project, any ideas? Stay tuned.

2012/10/23 emijrp emi...@gmail.com

Yes, there are some options: (semi)protections, blocks, spam blacklists, FlaggedRevs, abuse filter and some more. All of them are well-known MediaWiki features and extensions. Thanks for your interest.

2012/10/23 ENWP Pine deyntest...@hotmail.com

I agree that this sounds like an interesting experiment. I hope that you get good-faith editors. I worry that you'll get COI editors playing with the search rankings. Do you have a way in mind to deal with that issue?

Pine

From: emijrp emi...@gmail.com
Sent: Monday, 22 October, 2012 08:29
To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org
Subject: [Wiki-research-l] A wiki search engine

Hi all;

I'm starting a new project, a wiki search engine. It uses MediaWiki, Semantic MediaWiki and other minor extensions, and some tricky templates and bots.
I remember Wikia Search and how it failed. It had the mini-article thingy for the introduction, and then a lot of links compiled by a crawler, plus something similar to a social network. My project idea (which still needs a cool name) is different. Although it uses an introduction and images copied from Wikipedia, and some links from the External links sections, that is only a start. The purpose is for the community to add, remove and order the results for each term, and to create redirects for similar terms to avoid duplicates.

Why this? I think that Google PageRank isn't enough. It is frequently abused by link farms, SEOs and other people trying to push their websites up. Search for Shakira in Google, for example. You see 1) official site, 2) Wikipedia, 3) Twitter, 4) Facebook, then some videos, some news, some images, Myspace. It wastes 3 or more results on obvious sites (WP, TW, FB). The wiki search engine puts these sites at the top, along with an introduction and related terms, leaving all the space below for not-so-obvious but interesting websites.

Also, if you search for semantic queries like "right-wing newspapers" in Google, you won't find real newspapers but people and sites discussing right-wing newspapers. Or latex and LaTeX are shown in the same results pages. These issues can be resolved with disambiguation result pages.

How do we choose which results go above or below? The rules are not fully designed yet, but we can put official sites in first place, then .gov or .edu domains, which are important ones, and later unofficial websites and blogs, giving priority to the local language, etc. And reaching consensus.

We can control aggressive spam with spam blacklists, semi-protect or protect highly visible pages, and use bots or tools to check changes.

It obviously has a CC BY-SA license and results can be exported. I think that this approach is the opposite of Google today.
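The ordering rules floated above (official sites first, then .gov or .edu domains, then unofficial websites, then blogs, with the local language preferred) could be sketched roughly as follows. This is only an illustration in Python: the `Result` fields, the tier numbers and the `rank` function are hypothetical, the message itself says the rules are not fully designed, and in the proposal the final order is settled by community consensus rather than by code.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Result:
    """One entry on a results page. All fields are hypothetical."""
    url: str
    official: bool = False   # marked by an editor as the subject's official site
    blog: bool = False       # marked by an editor as a blog
    language: str = "en"


def tier(result):
    """Lower tier sorts first: official, .gov/.edu, other sites, blogs."""
    host = urlparse(result.url).hostname or ""
    if result.official:
        return 0
    if host.endswith(".gov") or host.endswith(".edu"):
        return 1
    if result.blog:
        return 3
    return 2


def rank(results, local_language="en"):
    """Sort by tier, then put local-language results first within a tier.

    sorted() is stable, so results that tie keep the order editors gave them.
    """
    return sorted(results, key=lambda r: (tier(r), r.language != local_language))
```

Because the sort is stable, a hand-ordered list keeps its editor-chosen order wherever the rules do not distinguish two results, which fits the idea that the community, not the software, has the last word.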
For unusual queries like "Albert Einstein birthplace" we can redirect to the most obvious results page (in this case "Albert Einstein") using a hand-made redirect or by software (some small change in MediaWiki).

You can check a pretty alpha version here: http://www.todogratix.es (only Spanish for now, sorry), which I'm feeding with some bots.

I think that it is an interesting experiment. I'm open to your questions and feedback.

Regards,
emijrp

--
Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com
Pre-doctoral student at the University of Cádiz (Spain)
Projects: AVBOT http://code.google.com/p/avbot/ | StatMediaWiki http://statmediawiki.forja.rediris.es | WikiEvidens http://code.google.com/p/wikievidens/ | WikiPapers http://wikipapers.referata.com | WikiTeam http://code.google.com/p/wikiteam/
Personal website: https://sites.google.com/site/emijrp/

___ Wiki-research-l mailing list
Re: [Wiki-research-l] Research:Anatomy of English Wikipedia Did You Know traffic
I agree it may be difficult to disentangle the impact of having a DYK and the impact of having a new/expanded article. However if one can view the access logs (I am told there is about 3 months worth available), I think you could tell from the referrer whether the access to the article is coming from the homepage (and anywhere else that DYKs are listed) or elsewhere. So, in theory, it's easy enough to tell which page accesses are coming as a direct result of the DYK. But it's harder to tell what long-term accesses are an indirect result of the DYK as opposed to the normal impacts of a new/expanded article. I think you'd have to do some kind of paired experiment using articles that were DYKed and another similar article that had the same amount of new/expanded development in the same time frame but wasn't DYKed. I don't know the likelihood of such article pairs naturally occurring. It might be that the experiment would have to create its own pairs, e.g. two members of the same rowing team with articles of the same length and general content, one DYKed and one not.

Kerry

From: wiki-research-l-boun...@lists.wikimedia.org [mailto:wiki-research-l-boun...@lists.wikimedia.org] On Behalf Of WereSpielChequers
Sent: Sunday, 4 August 2013 9:31 PM
To: Research into Wikimedia content and communities
Cc: Wikimedia Mailing List
Subject: Re: [Wiki-research-l] Research:Anatomy of English Wikipedia Did You Know traffic
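Kerry's suggestion of using the referrer to separate direct DYK traffic from everything else might look something like the Python sketch below. It rests on assumptions not stated in the thread: the log is taken to be a sequence of `(requested_title, referrer_url)` pairs, which is a hypothetical format and not the real access-log layout, and the set of pages that list DYK hooks is a guess that would need checking.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumption: pages on English Wikipedia where DYK hooks are listed.
DYK_LISTING_PATHS = {
    "/wiki/Main_Page",
    "/wiki/Wikipedia:Recent_additions",
}


def classify_referrer(referrer):
    """Return 'dyk', 'other' or 'none' for a single referrer URL."""
    if not referrer:
        return "none"
    parsed = urlparse(referrer)
    if parsed.hostname == "en.wikipedia.org" and parsed.path in DYK_LISTING_PATHS:
        return "dyk"
    return "other"


def count_sources(log_rows, article):
    """Count accesses to `article` by referrer class.

    `log_rows` is an iterable of (requested_title, referrer_url) pairs;
    this row format is hypothetical.
    """
    counts = Counter()
    for requested, referrer in log_rows:
        if requested == article:
            counts[classify_referrer(referrer)] += 1
    return counts
```

This only addresses the direct effect. The indirect, long-term effect would still need something like the paired comparison described in the message, since later readers arriving from search engines carry no trace of the DYK in their referrer.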
Re: [Wiki-research-l] Readable characters vs. size in bytes of articles
(Note that I posted this yesterday, but the message bounced due to the attached scatter plot. I just uploaded the plot to Commons and re-sent.)

I just replicated this analysis. I think you might have made some mistakes. I took a random sample of non-redirect articles from English Wikipedia and compared the byte_length (from the database) to the content_length (from the API, with tags and comments stripped). I get a Pearson correlation coefficient of 0.9514766. See the scatter plot including a linear regression line: http://commons.wikimedia.org/wiki/File:Bytes.content_length.scatter.correlation.enwiki.png. See also the regression output below.

Call:
lm(formula = byte_len ~ content_length, data = pages)

Residuals:
   Min     1Q Median     3Q    Max
-38263   -419     82    592  37605

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     -97.40412   72.46523  -1.344    0.179
content_length    1.14991    0.00832 138.210   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2722 on 1998 degrees of freedom
Multiple R-squared: 0.9053, Adjusted R-squared: 0.9053
F-statistic: 1.91e+04 on 1 and 1998 DF, p-value: < 2.2e-16

On Mon, Aug 5, 2013 at 12:59 AM, WereSpielChequers werespielchequ...@gmail.com wrote:

Hi Fabian,

That's interesting. When you say you stripped out the HTML, did you also strip out the other parts of the references? Some citation styles will take up more bytes than others, and citation style is supposed to be consistent at the article level. It would also make a difference whether you included or excluded alt text from readable material, as I suspect it is non-granular, i.e. if someone is going to create alt text for one picture in an article they will do so for all pictures.

More significantly, there is a big difference in standards of referencing: broadly, the higher the assessed quality and/or the more contentious the article, the more references there will be. I would expect that if you factored that in there would be some correlation between readable length and bytes within assessed classes of quality, and the outliers would include some of the controversial articles like Jerusalem (353 references).

Hope that helps.

Jonathan

On 2 August 2013 18:24, Floeck, Fabian (AIFB) fabian.flo...@kit.edu wrote:

Hi,

to whoever is interested in this (and I hope I didn't just repeat someone else's experiments on this):

I wanted to know whether the length of an article, in terms of how much readable material (excluding pictures) is presented to the reader in the front-end, is correlated with the byte size of the wikisyntax, which can be obtained from the DB or API; people often define the length of an article by its length in bytes.

TL;DR: Turns out size in bytes is a really, really bad indicator for the actual, readable content of a Wikipedia article, even worse than I thought.

We curled the front-end HTML of all articles of the English Wikipedia (ns=0, no disambiguation, no redirects) between 5800 and 6000 bytes (as around 5900 bytes is the total en.wiki average for these articles). This gives 41981 articles.

Results for size in characters (with whitespace) after cleaning the HTML out:

Min = 95
Max = 49441
Mean = 4794.41
Std. Deviation = 1712.748

The gap between Min and Max was especially interesting, but templates make it possible. (See e.g. Veer Teja Vidhya Mandir School, Martin Callanan. Although for the latter you could argue that expandable template listings are not really main reading content.)

Effectively, the correlation of readable character size with byte size = 0.04 (i.e. none) in the sample.

If someone already did this or a similar analysis, I'd appreciate pointers.

Best,

Fabian

--
Karlsruhe Institute of Technology (KIT)
Institute of Applied Informatics and Formal Description Methods

Dipl.-Medwiss. Fabian Flöck
Research Associate

Building 11.40, Room 222
KIT-Campus South
D-76128 Karlsruhe

Phone: +49 721 608 4 6584
Fax: +49 721 608 4 6580
Skype: f.floeck_work
E-Mail: fabian.flo...@kit.edu
WWW: http://www.aifb.kit.edu/web/Fabian_Flöck

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
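For anyone wanting to repeat the comparison discussed in this thread, the two steps involved (measuring readable characters after stripping markup, then correlating that with byte size) can be sketched in Python using only the standard library. This is an illustrative sketch, not the code either poster used: `readable_length` and `pearson` are hypothetical helper names, and the decisions about what counts as readable (here, all text nodes including whitespace, with comments, scripts and styles dropped) are assumptions that affect the result, as the replies about references and alt text point out.

```python
from html.parser import HTMLParser
from math import sqrt


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content.

    HTML comments are routed to handle_comment(), which is not
    overridden here, so they are dropped automatically.
    """

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def readable_length(html):
    """Number of characters (whitespace included) left after removing markup."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return len("".join(parser.parts))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

One point worth keeping in mind when comparing the two results above: a sample restricted to articles between 5800 and 6000 bytes has almost no variation in byte size, so a correlation computed on it says little about the relationship across the full range of article sizes that a random sample covers.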
[Wiki-research-l] WikiSym proceedings available
WikiSym/OpenSym just began in Hong Kong http://opensym.org/wsos2013/program/day1 Proceedings at http://opensym.org/wsos2013/program/proceedings. Follow on Twitter #wikisym #opensym Thanks, Dirk! Heather Ford Oxford Internet Institute Doctoral Programme www.ethnographymatters.net @hfordsa on Twitter http://hblog.org
Re: [Wiki-research-l] WikiSym proceedings available
How great. Thanks for the link, and much love for your citations analysis. (please, please follow up with a comparison across languages other than English!) SJ Just arrived in HKG On Sun, Aug 4, 2013 at 9:33 PM, Heather Ford hfor...@gmail.com wrote: WikiSym/OpenSym just began in Hong Kong http://opensym.org/wsos2013/program/day1 Proceedings at http://opensym.org/wsos2013/program/proceedings. Follow on Twitter #wikisym #opensym Thanks, Dirk! Heather Ford Oxford Internet Institute Doctoral Programme www.ethnographymatters.net @hfordsa on Twitter http://hblog.org -- Samuel Klein @metasj w:user:sj +1 617 529 4266
Re: [Wiki-research-l] WikiSym proceedings available
On Aug 5, 2013, at 10:25 AM, Samuel Klein wrote:

> How great. Thanks for the link, and much love for your citations analysis.
> (please, please follow up with a comparison across languages other than English!)

Thanks, SJ :) Yes! Shilad, Dave and I just met in Minneapolis to make plans :)

Heather Ford
Oxford Internet Institute Doctoral Programme
www.ethnographymatters.net
@hfordsa on Twitter
http://hblog.org