Note that for polite and efficient fetching, you want to resolve shortened links first, then treat the set collected over some interval (e.g. 5-10 minutes) as potential new links to be fetched.
Without this step, you can wind up hammering a site (lots of different
shortened links will point to the same page).

Ideally you use a HEAD request when resolving shortened links, though not
all shorteners support this properly (sigh). The simple heuristic that
works is: if the base domain name is less than 4 letters long, try a HEAD
request, and handle failure gracefully.

-- Ken

On Dec 30, 2011, at 10:51am, Markus Jelsma wrote:

> I'd not use Nutch for this, or any crawler. It's a feed and must be polled
> frequently. With simple scripting it's easy to fetch the tweets and dump them
> in some index.
>
>> Hi,
>>
>> I'm interested in crawling twitter feeds and haven't tried any
>> implementation yet. Does anyone know if this is possible? I haven't
>> seen anything on our archives to suggest that people are having
>> problems with this.
>>
>> Thanks and happy NY to everyone when it comes around.

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
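
A rough sketch of the approach Ken describes above, in Python. This is not
from the thread itself: it assumes the requests library, and names such as
looks_shortened, resolve, and unique_targets are illustrative only. It
collects a batch of candidate links (e.g. gathered over a 5-10 minute
window), resolves the likely-shortened ones with a HEAD request, falls back
to GET when a shortener mishandles HEAD, and dedupes the resolved targets
before anything is actually fetched.

from urllib.parse import urlsplit

import requests  # assumed available; any HTTP client would do


def looks_shortened(url):
    """Heuristic from the thread: base domain name under 4 letters (t.co, j.mp, ...)."""
    host = urlsplit(url).hostname or ""
    parts = host.split(".")
    # The base domain name is the label just left of the TLD, e.g. "t" in "t.co".
    base = parts[-2] if len(parts) >= 2 else host
    return len(base) < 4


def resolve(url, timeout=10):
    """Follow redirects to the final URL, preferring HEAD and falling back to GET."""
    if looks_shortened(url):
        try:
            r = requests.head(url, allow_redirects=True, timeout=timeout)
            return r.url
        except requests.RequestException:
            pass  # some shorteners mishandle HEAD; fall through to GET
    try:
        r = requests.get(url, allow_redirects=True, stream=True, timeout=timeout)
        r.close()  # we only want the final URL, not the page body
        return r.url
    except requests.RequestException:
        return None


def unique_targets(candidate_urls):
    """Resolve a batch of candidate links and yield each distinct final URL once."""
    seen = set()
    for url in candidate_urls:
        final = resolve(url)
        if final and final not in seen:
            seen.add(final)
            yield final

The point of batching before resolving is that many different shortened
links point at the same page; deduping on the resolved URL keeps the crawler
from fetching that page once per shortened variant.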

