On 17/09/13 13:59, Jon Robson wrote:
> I would suggest taking a look at the number of 404s caused by people trying
> to access pages without the wiki prefix.... This would be interesting data
> to go alongside this interesting proposal...

There are lots of different sorts of 404s, so it's necessary to do some
filtering. For example:

* double-slashes, due to bug 52253
* sitemap.xml
* Apple touch icons
* bullet.gif in various directories
* vulnerability scanning, e.g. xmlrpc.php
* BlueCoat verify/notify, as described in
  <http://www.webmasterworld.com/search_engine_spiders/3859463.htm>
* Serial numbers like http://en.wikipedia.org/B008NAYASM .

I filtered out everything with a dot or slash in the prospective article
title, as well as the BlueCoat URLs and the UAs responsible for serial
number URLs. To simplify analysis, I took log lines from the English
Wikipedia only. Most of the remaining log entries were search engine
crawlers, so I took those out too.

The result was 149 log entries at a 1/1000 sample rate, for the week of
September 8-14, implying a request rate of about 639,000 per month. This
is about 0.006% of the English Wikipedia's page view rate.

The 149 URLs are at http://paste.tstarling.com/p/uhtFqg.html

-- Tim Starling
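
For anyone wanting to repeat this on another week of logs, here is a rough
sketch of the filtering and the extrapolation described above. It is not
the script that was actually used: the function names, the crawler regex
and the 30-day month are assumptions made for illustration, and the
BlueCoat and serial-number filters are left out because their exact
URL and UA patterns are not given here.

```python
import re

# Assumed crawler pattern, for illustration only; the real UA list is not given.
CRAWLER_UA = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)


def is_candidate(path: str, user_agent: str) -> bool:
    """Return True if a 404 looks like an article title requested
    without the /wiki/ prefix (hypothetical helper)."""
    if not path.startswith("/"):
        return False
    title = path[1:]  # prospective article title
    if not title:
        return False
    # Drop anything with a dot or slash in the title: this removes
    # sitemap.xml, bullet.gif, xmlrpc.php, touch icons and double slashes.
    if "." in title or "/" in title:
        return False
    # Drop search engine crawlers.
    if CRAWLER_UA.search(user_agent):
        return False
    return True


def monthly_estimate(sampled_hits: int, sample_denominator: int,
                     days_observed: int, days_per_month: int = 30) -> float:
    """Scale a sampled count up to an estimated monthly request count.
    The 30-day month is an assumption that matches the figure above."""
    return sampled_hits * sample_denominator * days_per_month / days_observed


if __name__ == "__main__":
    # 149 entries at a 1/1000 sample rate over 7 days.
    print(round(monthly_estimate(149, 1000, 7)))
```

With the numbers from this message (149 entries, 1/1000 sampling, 7 days)
the estimate comes out at roughly 639,000 requests per month.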
