On Wed, Jun 17, 2009 at 10:45 AM, Chad Etzel <[email protected]> wrote:
>
> Hi All,
>
> Has anyone created a list or kept track of all the
> services/bots/crawlers that ping a site as soon as a link is posted to
> twitter?

Below is a snapshot from my logs... just a quick listing of "unrecognized" user agents. It's a little difficult to sort out which are just blog bots and which were triggered by Twitter posts; I'll have to merge data from a couple of sources to identify that. The only one I'm certain about is Twitturly, but several of the generic code-library user agents could easily be what you're looking for.

However, if your goal is to separate human from robot page views, that's a well-known problem in web analytics. As you can see from the list below, there are many, many non-browser user agents (there are tens of thousands, with new ones every day)... and some robots masquerade as human-operated browsers. Some robots can be identified by the fact that they don't execute JavaScript, but there is no definitive way to make the distinction you are seeking.

There's an advertising industry association service that tracks robot user agents, so that advertisers can separate them out in log file analysis.

The simplest way to get a rough approximation of human visitors and views is to use script-based analytics, such as Google Analytics. Since virtually no robots execute JavaScript, you can have high confidence that the GA numbers are almost entirely made up of real people. If all you care about is page views, those numbers are reasonably believable. But ask any web analytics expert about counting unique visitors and you'll quickly learn that it is one of the messiest numbers around... don't get me started.
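
For a first pass at separating robots from browsers in a log, something like the following Python sketch may help. It is only a rough illustration: the ROBOT_TOKENS list and the classify/tally helpers are made up for this example, the sample rows are a few user agents and hit counts from my listing, and a "browser-like" label does not prove a human was behind the request.

```python
# Hypothetical substrings that suggest an automated client.
# This is an illustrative guess, not an authoritative robot list.
ROBOT_TOKENS = (
    "bot", "crawler", "python-urllib", "pycurl", "xmlrpc",
    "xml-rpc", "feedparser", "httpclient", "twitturly",
)


def classify(user_agent):
    """Return a rough label for a user-agent string."""
    ua = user_agent.lower()
    if any(token in ua for token in ROBOT_TOKENS):
        return "robot"
    if ua.startswith("mozilla/"):
        # Robots can masquerade as browsers, so this is only "browser-like".
        return "browser-like"
    return "unknown"


def tally(rows):
    """Sum hit counts per label for (user_agent, hits) pairs."""
    totals = {}
    for user_agent, hits in rows:
        label = classify(user_agent)
        totals[label] = totals.get(label, 0) + hits
    return totals


sample = [
    ("R6_CommentReader(www.radian6.com/crawler)", 50),
    ("PycURL/7.19.0", 22),
    ("Technoratibot/8.0", 5),
    ("MOZILLA/5.0 (WINDOWS; U; WINDOWS NT 5.1; EN-US; RV:1.9.0.3) "
     "GECKO/2008092417 FIREFOX/3.0.3", 5),
    ("BlackBerry9530/4.7.0.148 Profile/MIDP-2.0 "
     "Configuration/CLDC-1.1 VendorID/105", 3),
]

print(tally(sample))
```

Substring matching like this only catches robots that announce themselves; anything that copies a real browser's user agent will slip through, which is why the script-based numbers are the better estimate of human traffic.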
Nick

User Agent | Number of hits | Data Transferred (Kb)
R6_CommentReader(www.radian6.com/crawler) | 50 | 2,261
WordPress/2.8; http://www.TwURLedNews.com | 47 | 26
Sphere Scout&v4.0 - scout at sphere dot com | 34 | 872
xmlrpclib.py/1.0.1 (by www.pythonware.com) | 24 | 52
PycURL/7.19.0 | 22 | 995
R6_FeedFetcher(www.radian6.com/crawler) | 22 | 752
WordPress/2.7 | 11 | 285
LargeSmall Crawler | 10 | 105
The Incutio XML-RPC PHP Library | 9 | 11
WordPress/2.8; http://www.twurlednews.com/social_media | 6 | 2
The Incutio XML-RPC PHP Library -- WordPress/2.7 | 5 | 6
Technoratibot/8.0 | 5 | 168
MOZILLA/5.0 (WINDOWS; U; WINDOWS NT 5.1; EN-US; RV:1.9.0.3) GECKO/2008092417 FIREFOX/3.0.3 | 5 | 229
Python-urllib/2.5 | 4 | 131
BlackBerry9530/4.7.0.148 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/105 | 3 | 23
llssbot/1.0( http://labs.live.com;[email protected]) | 3 | 115
SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 | 2 | 19
Incutio XML-RPC | 1 | 1
Jakarta Commons-HttpClient/3.1 | 1 | 54
Python-urllib/2.4 | 1 | 51
Twitturly / v0.6 | 1 | 41
UniversalFeedParser/4.2-pre-294-svn http://feedparser.org/ | 1 | 49
uberbot 1.0 | 1 | 0
Python-urllib/1.17 | 1 | 14
