On Wed, Jun 17, 2009 at 10:45 AM, Chad Etzel <[email protected]> wrote:

>
> Hi All,
>
> Has anyone created a list or kept track of all the
> services/bots/crawlers that ping a site as soon as a link is posted to
> twitter?


Below is a snapshot from my logs... just a quick listing of "unrecognized"
user-agents.

It's a little bit difficult to sort out which are just blog bots and which
were triggered by Twitter posts.  I'll have to merge data from a couple of
sources to identify that.  The only one I'm certain about is Twitturly.  But
several of the generic code-library user agents (Python-urllib, PycURL and
the like) could easily be what you're looking for.

However, if your goal is to sort out the human v. robot page views, that's a
well-known problem in web analytics.  As you can see from the list below,
there are many, many non-browser user agents (there are tens of thousands
and new ones every day)... and some robots that masquerade as human-operated
browsers.  Some robots can be identified by the fact that they aren't
executing JavaScript, but there is no definitive way to make the distinction
you are seeking.  There's an advertising industry association service that
tracks robot user agents, to allow advertisers to separate them in log file
analysis.
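
For a first rough pass over a log, a substring check on the user-agent
string can be sketched as below.  This is only an illustration: the marker
lists and the classify_user_agent name are my own assumptions, not an
industry list, and a robot that sends a browser user agent will still come
out as "browser-like".

```python
# Rough user-agent bucketing by substring match.
# The marker tuples are hypothetical examples drawn from the log listing
# in this message, not an authoritative robot registry.
BOT_MARKERS = ("bot", "crawler", "spider", "scout", "fetcher", "twitturly")
LIBRARY_MARKERS = (
    "python-urllib", "pycurl", "xmlrpc", "xml-rpc",
    "httpclient", "feedparser", "wordpress",
)
BROWSER_MARKERS = ("mozilla/", "opera/", "blackberry", "up.browser")


def classify_user_agent(user_agent):
    """Return 'robot', 'library', 'browser-like' or 'unknown'."""
    ua = user_agent.lower()
    # Check self-declared robots first, then generic HTTP/XML-RPC libraries.
    if any(marker in ua for marker in BOT_MARKERS):
        return "robot"
    if any(marker in ua for marker in LIBRARY_MARKERS):
        return "library"
    # "browser-like" only means the string looks like a browser;
    # it cannot prove that a human was behind the request.
    if any(marker in ua for marker in BROWSER_MARKERS):
        return "browser-like"
    return "unknown"


if __name__ == "__main__":
    samples = [
        "R6_CommentReader(www.radian6.com/crawler)",
        "Python-urllib/2.5",
        "Twitturly / v0.6",
        "MOZILLA/5.0 (WINDOWS; U; WINDOWS NT 5.1; EN-US; RV:1.9.0.3) "
        "GECKO/2008092417 FIREFOX/3.0.3",
    ]
    for sample in samples:
        print(classify_user_agent(sample), "<-", sample)
```

Anything that lands in "library" or "unknown" still needs a manual look,
which is exactly the merging-of-sources work I mentioned above.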

The simplest way to get a rough approximation of human visitors and views is
to use script-based analytics, such as Google Analytics.  Since virtually no
robots execute JavaScript, you can have high confidence that the GA numbers
are almost entirely made up of real people.  If all you care about is page
views, those numbers are reasonably believable.  But ask any web analytics
expert about counting unique visitors and you'll quickly learn that it is
one of the messiest numbers around... don't get me started.

Nick

User Agent    Number of hits    Data Transferred (Kb)
R6_CommentReader(www.radian6.com/crawler)    50    2,261
WordPress/2.8; http://www.TwURLedNews.com    47    26
Sphere Scout&v4.0 - scout at sphere dot com    34    872
xmlrpclib.py/1.0.1 (by www.pythonware.com)    24    52
PycURL/7.19.0    22    995
R6_FeedFetcher(www.radian6.com/crawler)    22    752
WordPress/2.7    11    285
LargeSmall Crawler    10    105
The Incutio XML-RPC PHP Library    9    11
WordPress/2.8; http://www.twurlednews.com/social_media    6    2
The Incutio XML-RPC PHP Library -- WordPress/2.7    5    6
Technoratibot/8.0    5    168
MOZILLA/5.0 (WINDOWS; U; WINDOWS NT 5.1; EN-US; RV:1.9.0.3) GECKO/2008092417 FIREFOX/3.0.3    5    229
Python-urllib/2.5    4    131
BlackBerry9530/4.7.0.148 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/105    3    23
llssbot/1.0( http://labs.live.com;[email protected])    3    115
SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0    2    19
Incutio XML-RPC    1    1
Jakarta Commons-HttpClient/3.1    1    54
Python-urllib/2.4    1    51
Twitturly / v0.6    1    41
UniversalFeedParser/4.2-pre-294-svn  http://feedparser.org/    1    49
uberbot 1.0    1    0
Python-urllib/1.17    1    14
