Thanks, Tim, for running those numbers. That seems to suggest the URL structure works in most cases.
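
As a sanity check on the figures quoted below, here is a rough Python sketch of the extrapolation and of the dot/slash filter. The function names and the 30-day month are my own assumptions, not anything taken from Tim's actual scripts, and the BlueCoat, user-agent and crawler filtering he describes is not covered.

```python
SAMPLE_RATE = 1000     # logs were sampled at 1/1000, per the quoted message
DAYS_SAMPLED = 7       # the week of September 8-14
DAYS_PER_MONTH = 30    # assumption: the message does not say how "month" was defined


def monthly_requests(sampled_entries: int) -> float:
    """Scale a sampled one-week count up to an estimated monthly request count."""
    weekly = sampled_entries * SAMPLE_RATE
    return weekly / DAYS_SAMPLED * DAYS_PER_MONTH


def looks_like_prefixless_title(path: str) -> bool:
    """Hypothetical filter: keep paths whose candidate title has no dot or slash."""
    # Drop only the single leading slash, so "//Foo" still counts as containing one.
    title = path[1:] if path.startswith("/") else path
    return bool(title) and "." not in title and "/" not in title


estimate = monthly_requests(149)
print(round(estimate))  # 638571, i.e. roughly the 639,000 per month quoted below
```

Under those assumptions the 149 sampled entries scale to about 639,000 requests per month, which matches the quoted estimate.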

On Wed, Sep 18, 2013 at 12:07 AM, Tim Starling <tstarl...@wikimedia.org> wrote:
> On 17/09/13 13:59, Jon Robson wrote:
>> I would suggest taking a look at the number of 404s caused by people trying
>> to access pages without the wiki prefix.... This would be interesting data
>> to go alongside this interesting proposal...
>
> There are lots of different sorts of 404s, so it's necessary to do
> some filtering. For example:
>
> * double-slashes, due to bug 52253
> * sitemap.xml
> * Apple touch icons
> * bullet.gif in various directories
> * vulnerability scanning, e.g. xmlrpc.php
> * BlueCoat verify/notify, as described in
>   <http://www.webmasterworld.com/search_engine_spiders/3859463.htm>
> * Serial numbers like http://en.wikipedia.org/B008NAYASM .
>
> I filtered out everything with a dot or slash in the prospective
> article title, as well as the BlueCoat URLs and the UAs responsible
> for serial number URLs. To simplify analysis, I took log lines from
> the English Wikipedia only.
>
> Most of the remaining log entries were search engine crawlers, so I
> took those out too.
>
> The result was 149 log entries at a 1/1000 sample rate, for the week
> of September 8-14, implying a request rate of about 639,000 per month.
> This is about 0.006% of the English Wikipedia's page view rate.
>
> The 149 URLs are at http://paste.tstarling.com/p/uhtFqg.html
>
> -- Tim Starling
>
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

--
Jon Robson
http://jonrobson.me.uk
@rakugojon