On 04/11/2012 01:45 AM, Erik Zachte wrote:
Here are some numbers on total bot burden:
1)
http://stats.wikimedia.org/wikimedia/squids/SquidReportCrawlers.htm states
for March 2012:
In total, 69.5 M page requests (mime type text/html only!) per day are
considered crawler requests, out of 696 M overall (roughly 10%).
My suggestion for how to filter these bots efficiently in a C program (no
costly nuanced regexps) before sending data to webstatscollector:
a) Find the 14th field in the space-delimited log line = user agent (but beware
of false delimiters in logs from Varnish, if still applicable)
b) Search this field for known crawler signatures
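A minimal C sketch of what a) and b) could look like, assuming single-space
delimited squid log lines with the user agent starting at field 14; the
crawler signature list below is purely illustrative and would need tuning
against the real logs:

#include <stdio.h>
#include <string.h>

/* Return 1 if the log line looks like a crawler request, 0 otherwise.
 * Assumes space-delimited fields with the user agent starting at field 14. */
static int is_crawler(const char *line)
{
    /* Illustrative signatures only; the real list would be tuned to the logs. */
    static const char *signatures[] = { "bot", "Bot", "crawler", "spider" };
    const char *p = line;
    int field = 1;

    /* Advance to the start of field 14 (user agent). */
    while (field < 14 && (p = strchr(p, ' ')) != NULL) {
        p++;
        field++;
    }
    if (p == NULL)
        return 0; /* malformed line: fewer than 14 fields */

    /* Plain substring search, no regexps. */
    for (size_t i = 0; i < sizeof(signatures) / sizeof(signatures[0]); i++)
        if (strstr(p, signatures[i]) != NULL)
            return 1;
    return 0;
}

int main(void)
{
    char line[8192];
    /* Pass only non-crawler lines through, e.g. upstream of webstatscollector. */
    while (fgets(line, sizeof(line), stdin) != NULL)
        if (!is_crawler(line))
            fputs(line, stdout);
    return 0;
}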
2012/4/8 Erik Zachte ezac...@wikimedia.org
Hi Lars,
You have a point here, especially for smaller projects:
For Swedish Wikisource:
zcat sampled-1000.log-20120404.gz | grep 'GET http://sv.wikisource.org' |
awk '{print $9, $11,$14}'
returns 20 lines from this 1:1000 sampled squid log file
after removing javascript/json/robots.txt there are 13 left,
which fits perfectly with 10,000 to 13,000 per day
however 9 of these are bots!!
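Spelled out, the scale-up from the 1:1000 sample (each sampled line standing
for roughly 1000 requests on that day):
20 lines × 1000 = ~20,000 requests/day before any filtering
13 lines × 1000 = ~13,000 page requests/day after dropping javascript/json/robots.txt
(13 − 9) lines × 1000 = ~4,000 requests/day if the 9 bot lines are excluded as well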
April 09, 2012 9:28 PM
To: Srikanth Lakshmanan
Cc: Wikimedia developers; Diederik van Liere; Lars Aronsson
Subject: Re: [Wikitech-l] Page views
Hi Srikanth,
Yes, we are looking into the growth percentages as they seem
unrealistically high.
Best,
Diederik
On Mon, Apr 9, 2012 at 3:30 AM, Srikanth Lakshmanan srik@gmail.com wrote:
Is this the same
I'm telling people that the Swedish Wikipedia has 90-100
million page views per month or on average ten per month
per Swedish citizen. This is based on stats.wikimedia.org
(Wikistats), but is it really true? It would be really
embarrassing if it were wrong by some order of magnitude.
There is of