Thanks guys. All the best when the bells come around.
Lewis

On Fri, Dec 30, 2011 at 7:06 PM, Ken Krugler <[email protected]> wrote:
> Note that for polite and efficient fetching, you want to resolve shortened
> links first, then treat some set (e.g. over a 5-10 minute interval) as
> potential new links to be fetched.
>
> Without this step, you can wind up hammering a site (lots of different
> shortened links will point to the same page).
>
> Ideally you use a HEAD request when resolving shortened links, though not all
> shorteners support this properly (sigh).
>
> The simple heuristic that works is if the base domain name is less than 4
> letters long, try a HEAD request, and handle failure gracefully.
>
> -- Ken
>
> On Dec 30, 2011, at 10:51am, Markus Jelsma wrote:
>
>> I'd not use Nutch for this, or any crawler. It's a feed and must be polled
>> frequently. With simple scripting it's easy to fetch the tweets and dump them
>> in some index.
>>
>>> Hi,
>>>
>>> I'm interested in crawling twitter feeds and haven't tried any
>>> implementation yet. Does anyone know if this is possible? I haven't
>>> seen anything on our archives to suggest that people are having
>>> problems with this.
>>>
>>> Thanks and happy NY to everyone when it comes around.
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
>
>
>

--
Lewis
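
[Editor's note] For anyone implementing Ken's resolve-then-dedupe approach outside of Nutch, here is a minimal Python sketch. It assumes the third-party "requests" library; the function names and the 5-10 minute batching window are illustrative, and only the short-domain heuristic and the HEAD-with-fallback behaviour come from the thread itself.

import urllib.parse
import requests

def looks_like_shortener(url):
    # Ken's heuristic: shorteners tend to have a very short base domain
    # (e.g. t.co, bit.ly, goo.gl), so try HEAD if the label before the
    # top-level domain is under 4 letters.
    host = urllib.parse.urlparse(url).hostname or ""
    parts = host.split(".")
    return len(parts) >= 2 and len(parts[-2]) < 4

def resolve(url, timeout=10):
    # Follow redirects to the final URL, preferring HEAD for likely shorteners.
    if looks_like_shortener(url):
        try:
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            if resp.status_code < 400:
                return resp.url
        except requests.RequestException:
            pass  # not all shorteners handle HEAD properly; fall back to GET
    # Streamed GET so the response body is not actually downloaded.
    resp = requests.get(url, allow_redirects=True, stream=True, timeout=timeout)
    resp.close()
    return resp.url

def dedupe_batch(short_urls):
    # Resolve a batch collected over some interval and drop duplicates,
    # so many shortened links pointing at the same page yield one fetch.
    return {resolve(u) for u in short_urls}

The deduplicated set returned by dedupe_batch() can then be handed to the crawler as an ordinary seed list, which avoids hammering a site that many different shortened links point to.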

