Totally agree with Nick. I think the most accurate number comes from GA, since it only counts clients that execute JavaScript. We've been battling bots/crawlers since day 1 to track ow.ly URL clicks. We have about 100 filtered IPs, the ones that hit our servers the most, but new ones come in every day. The challenge is that we have to show those stats to our users broken down by different metrics, so GA does not help much here. Maybe we can work together and compile a list of top bots or something.
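As a starting point for a shared list, here is a minimal sketch of how the heaviest-hitting IPs that are not yet filtered could be tallied. The function name, the input shape (one IP string per click) and the sample addresses are all assumptions for illustration, not a description of how ow.ly actually does it.

```python
from collections import Counter

def top_unfiltered_ips(hits, filtered_ips, n=10):
    """Return the n most frequent client IPs that are not already filtered.

    hits: iterable of IP strings, one per click (hypothetical input shape).
    filtered_ips: set of IPs already known to be bots.
    """
    counts = Counter(ip for ip in hits if ip not in filtered_ips)
    return counts.most_common(n)

# Hypothetical sample data using documentation-range addresses.
hits = [
    "192.0.2.1", "192.0.2.1", "198.51.100.7",
    "192.0.2.1", "203.0.113.5", "198.51.100.7",
]
filtered = {"203.0.113.5"}
top = top_unfiltered_ips(hits, filtered, n=2)
```

The IPs at the top of that tally are only candidates for the filter list; a busy NAT or proxy can look the same, so each one would still need a manual check.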
On Jun 17, 12:19 pm, Nick Arnett <[email protected]> wrote:
> On Wed, Jun 17, 2009 at 10:45 AM, Chad Etzel <[email protected]> wrote:
>
> > Hi All,
>
> > Has anyone created a list or kept track of all the
> > services/bots/crawlers that ping a site as soon as a link is posted to
> > twitter?
>
> Below is a snapshot from my logs... just a quick listing of "unrecognized"
> user-agents.
>
> It's a little bit difficult to sort out which are just blog bots and which
> were triggered by Twitter posts. I'll have to merge data from a couple of
> sources to identify that. The only one I'm certain about is Twitturly. But
> there are several code library generic user agents that could easily be what
> you're looking for.
>
> However, if your goal is to sort out the human v. robot page views, that's a
> well-known problem in web analytics. As you can see from the list below,
> there are many, many non-browser user agents (there are tens of thousands
> and new ones every day)... and some robots that masquerade as human-operated
> browsers. Some robots can be identified by the fact that they aren't
> executing JavaScript, but there is no definitive way to make the distinction
> you are seeking. There's an advertising industry association service that
> tracks robot user agents, to allow advertisers to separate them in log file
> analysis.
>
> The simplest way to get a rough approximation of human visitors and views is
> to use script-based analytics, such as Google Analytics. Since virtually no
> robots execute JavaScript, you can have high confidence that the GA numbers
> are almost entirely made up of real people. If all you care about is page
> views, those numbers are reasonably believable. But ask any web analytics
> expert about counting unique visitors and you'll quickly learn that that is
> one of the messiest numbers around... don't get me started.
>
> Nick
>
> User Agent | Number of hits | Data Transferred (Kb)
> R6_CommentReader(www.radian6.com/crawler) | 50 | 2,261
> WordPress/2.8;http://www.TwURLedNews.com | 47 | 26
> Sphere Scout&v4.0 - scout at sphere dot com | 34 | 872
> xmlrpclib.py/1.0.1 (by www.pythonware.com) | 24 | 52
> PycURL/7.19.0 | 22 | 995
> R6_FeedFetcher(www.radian6.com/crawler) | 22 | 752
> WordPress/2.7 | 11 | 285
> LargeSmall Crawler | 10 | 105
> The Incutio XML-RPC PHP Library | 9 | 11
> WordPress/2.8;http://www.twurlednews.com/social_media | 6 | 2
> The Incutio XML-RPC PHP Library -- WordPress/2.7 | 5 | 6
> Technoratibot/8.0 | 5 | 168
> MOZILLA/5.0 (WINDOWS; U; WINDOWS NT 5.1; EN-US; RV:1.9.0.3) GECKO/2008092417 FIREFOX/3.0.3 | 5 | 229
> Python-urllib/2.5 | 4 | 131
> BlackBerry9530/4.7.0.148 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/105 | 3 | 23
> llssbot/1.0( http://labs.live.com;[email protected]) | 3 | 115
> SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 | 2 | 19
> Incutio XML-RPC | 1 | 1
> Jakarta Commons-HttpClient/3.1 | 1 | 54
> Python-urllib/2.4 | 1 | 51
> Twitturly / v0.6 | 1 | 41
> UniversalFeedParser/4.2-pre-294-svn http://feedparser.org/ | 1 | 49
> uberbot 1.0 | 1 | 0
> Python-urllib/1.17 | 1 | 14
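
For anyone who wants a rough server-side split of a list like the one above, here is a minimal sketch of substring matching on the user-agent string. The marker list is an assumption drawn from the names in that snapshot, not an authoritative bot list, and `looks_like_bot` and `split_hits` are hypothetical helper names. It cannot catch robots that masquerade as browsers.

```python
# Illustrative substrings only (an assumption, not an authoritative bot list).
BOT_MARKERS = (
    "bot", "crawler", "scout", "feedfetcher", "feedparser",
    "xmlrpc", "xml-rpc", "pycurl", "python-urllib", "httpclient",
    "wordpress", "twitturly",
)

def looks_like_bot(user_agent):
    """Return True if the user agent contains any known marker (case-insensitive)."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

def split_hits(rows):
    """Sum hit counts into (bot_hits, other_hits) for (user_agent, hits) pairs."""
    bot_hits = 0
    other_hits = 0
    for user_agent, hits in rows:
        if looks_like_bot(user_agent):
            bot_hits += hits
        else:
            other_hits += hits
    return bot_hits, other_hits

# A few rows taken from the snapshot above.
rows = [
    ("R6_CommentReader(www.radian6.com/crawler)", 50),
    ("Technoratibot/8.0", 5),
    ("Python-urllib/2.5", 4),
    ("BlackBerry9530/4.7.0.148 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/105", 3),
    ("Twitturly / v0.6", 1),
]
bot_hits, other_hits = split_hits(rows)
```

Anything that falls into the "other" bucket is only "not obviously a robot"; the all-caps MOZILLA/5.0 entry in the snapshot would land there too, which is exactly the masquerading case this approach misses.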
