On 17/09/13 13:59, Jon Robson wrote:
> I would suggest taking a look at the number of 404s caused by people trying
> to access pages without the wiki prefix.... This would be interesting data
> to go alongside this interesting proposal...

There are lots of different sorts of 404s, so it's necessary to do
some filtering. For example:

* double-slashes, due to bug 52253
* sitemap.xml
* Apple touch icons
* bullet.gif in various directories
* vulnerability scanning, e.g. xmlrpc.php
* BlueCoat verify/notify, as described in
<http://www.webmasterworld.com/search_engine_spiders/3859463.htm>
* serial numbers like http://en.wikipedia.org/B008NAYASM

I filtered out everything with a dot or slash in the prospective
article title, as well as the BlueCoat URLs and the UAs responsible
for serial number URLs. To simplify analysis, I took log lines from
the English Wikipedia only.

Most of the remaining log entries were search engine crawlers, so I
took those out too.
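
As a rough illustration, the filtering described above could be sketched
in Python along the following lines. This is not the script actually used.
The function names, the user-agent markers and the BlueCoat prefixes are
all placeholders I have assumed for the example, and the UA check stands in
for both the crawler filter and the serial-number UA filter.

```python
from urllib.parse import urlsplit

# Hypothetical markers: the message does not list the exact user agents
# or BlueCoat URL patterns that were excluded.
EXCLUDED_UA_MARKERS = ("bot", "crawler", "spider")
BLUECOAT_PREFIXES = ("verify-", "notify-")


def candidate_title(url, host="en.wikipedia.org"):
    """Return the prospective article title of a prefix-less URL, or None."""
    parts = urlsplit(url)
    if parts.netloc != host:
        return None  # English Wikipedia only
    title = parts.path[1:]  # drop the leading slash
    # Reject anything with a dot or slash in the prospective title.
    if not title or "." in title or "/" in title:
        return None
    return title


def keep_entry(url, user_agent):
    """True if a 404 log entry survives all of the filters."""
    title = candidate_title(url)
    if title is None:
        return False
    if title.lower().startswith(BLUECOAT_PREFIXES):
        return False
    ua = user_agent.lower()
    return not any(marker in ua for marker in EXCLUDED_UA_MARKERS)
```

Note that the dot-or-slash rule alone disposes of most of the list above
(double slashes, sitemap.xml, touch icons, bullet.gif, xmlrpc.php), which
is why only the BlueCoat and UA checks need separate handling.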

The result was 149 log entries at a 1/1000 sample rate, for the week
of September 8-14. That implies roughly 149,000 requests that week, or
about 639,000 per month. This is about 0.006% of the English
Wikipedia's page view rate.
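
For anyone who wants to check the arithmetic, here is a small sketch.
The 30-day month is my assumption about how the weekly figure was scaled,
and the implied page view total is simply back-calculated from the 0.006%
figure rather than taken from any traffic report.

```python
SAMPLED_ENTRIES = 149       # log entries left after filtering
SAMPLE_DENOMINATOR = 1000   # logs are sampled at 1/1000
DAYS_IN_SAMPLE = 7          # September 8-14
DAYS_PER_MONTH = 30         # assumption: a 30-day month

# Scale the sample up to the full week, then to a month.
weekly_requests = SAMPLED_ENTRIES * SAMPLE_DENOMINATOR
monthly_requests = weekly_requests * DAYS_PER_MONTH / DAYS_IN_SAMPLE

# Back out the monthly page view total implied by the 0.006% share.
SHARE_OF_PAGE_VIEWS = 0.006 / 100
implied_page_views = monthly_requests / SHARE_OF_PAGE_VIEWS

print(f"weekly:  {weekly_requests:,}")
print(f"monthly: {monthly_requests:,.0f}")
print(f"implied monthly page views: {implied_page_views:.2e}")
```

Since the estimate rests on only 149 sampled lines, it is best read as an
order-of-magnitude figure.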

The 149 URLs are at http://paste.tstarling.com/p/uhtFqg.html

-- Tim Starling


_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l