Note that for polite and efficient fetching, you want to resolve the shortened 
links first, then treat the set of resolved URLs collected over some interval 
(e.g. 5-10 minutes) as the potential new links to be fetched.

Without this step, you can wind up hammering a site (lots of different 
shortened links will point to the same page).
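
In rough Python terms (an untested sketch - resolve() and fetch() stand in 
for whatever resolver and fetcher you're actually using), the windowing and 
dedup part looks something like:

import time

WINDOW_SECS = 5 * 60  # the 5-10 minute interval mentioned above

def crawl_in_windows(shortened_links, resolve, fetch):
    seen = set()            # resolved target URLs already queued or fetched
    batch = []
    window_start = time.time()
    for short_url in shortened_links:
        target = resolve(short_url)        # expand the shortened link
        if target and target not in seen:  # many short links -> same page
            seen.add(target)
            batch.append(target)
        if time.time() - window_start >= WINDOW_SECS:
            for url in batch:
                fetch(url)                 # fetch each unique page once
            batch = []
            window_start = time.time()
    for url in batch:                      # flush whatever's left
        fetch(url)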

Ideally you use a HEAD request when resolving shortened links, though not all 
shorteners support this properly (sigh).

A simple heuristic that works: if the base domain name is less than 4 letters 
long, try a HEAD request, and handle failure gracefully.
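
E.g. something like this (Python sketch using requests, untested; "gracefully" 
here just means falling back to a GET, and it doesn't bother with two-level 
TLDs like .co.uk):

import requests
from urllib.parse import urlparse

def resolve_short_url(url, timeout=10):
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    base = parts[-2] if len(parts) >= 2 else host  # "bit" for bit.ly, "t" for t.co
    try:
        if len(base) < 4:
            # Short base domain -- probably a shortener, so try HEAD first.
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            if resp.status_code < 400:
                return resp.url
        # HEAD not tried, or the shortener botched it -- fall back to GET.
        resp = requests.get(url, allow_redirects=True, timeout=timeout, stream=True)
        resp.close()
        return resp.url
    except requests.RequestException:
        return None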

-- Ken 

On Dec 30, 2011, at 10:51am, Markus Jelsma wrote:

> I'd not use Nutch for this, or any crawler. It's a feed and must be polled 
> frequently. With simple scripting it's easy to fetch the tweets and dump them 
> in some index.
> 
>> Hi,
>> 
>> I'm interested in crawling twitter feeds and haven't tried any
>> implementation yet. Does anyone know if this is possible? I haven't
>> seen anything on our archives to suggest that people are having
>> problems with this.
>> 
>> Thanks and happy NY to everyone when it comes around.

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr



