Thanks guys.

All the best when the bells come around.

Lewis

On Fri, Dec 30, 2011 at 7:06 PM, Ken Krugler
<[email protected]> wrote:
> Note that for polite and efficient fetching, you want to resolve shortened
> links first, then treat the set collected over some interval (e.g. 5-10
> minutes) as potential new links to be fetched.
>
> Without this step, you can wind up hammering a site (lots of different 
> shortened links will point to the same page).
>
> Ideally you use a HEAD request when resolving shortened links, though not all 
> shorteners support this properly (sigh).
>
> A simple heuristic that works: if the base domain name is fewer than 4
> letters long, try a HEAD request and handle failure gracefully.
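
A minimal sketch of the resolution step Ken describes, in Python (standard
library only). The under-4-letters heuristic and the HEAD-then-GET fallback
come straight from his note; the function name, the way the "base domain
name" is extracted, and the timeout are illustrative assumptions:

import urllib.parse
import urllib.request


def resolve_short_url(url, timeout=10):
    """Resolve a possibly-shortened URL to its final target URL."""
    host = urllib.parse.urlsplit(url).hostname or ""
    # Rough reading of "base domain name": the label before the TLD,
    # e.g. "bit" in bit.ly or "t" in t.co.
    base = host.split(".")[-2] if "." in host else host

    # Short base names are likely shorteners, so try HEAD first and fall
    # back to GET when the shortener mishandles HEAD.
    methods = ["HEAD", "GET"] if len(base) < 4 else ["GET"]

    for method in methods:
        try:
            req = urllib.request.Request(url, method=method)
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.geturl()  # final URL after following redirects
        except Exception:
            continue  # handle failure gracefully; try the next method
    return url  # could not resolve; keep the original link

The resolved URLs can then be collected into a set over the 5-10 minute
window and deduplicated before being handed to the fetcher, so the same
target page isn't fetched once per shortened alias.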
>
> -- Ken
>
> On Dec 30, 2011, at 10:51am, Markus Jelsma wrote:
>
>> I wouldn't use Nutch for this, or any other crawler. A Twitter feed must be
>> polled frequently, and with simple scripting it's easy to fetch the tweets
>> and dump them into some index.
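
To make Markus's "simple scripting" suggestion concrete, here is a rough
polling loop in Python. The feed endpoint, the response field names and the
Solr update URL are all assumptions (the 2011-era Twitter search API and a
default single-core Solr install); swap in whatever feed and index you
actually use:

import json
import time
import urllib.parse
import urllib.request

# Assumed endpoints; adjust to the feed you poll and your index.
SEARCH_URL = "http://search.twitter.com/search.json"
SOLR_URL = "http://localhost:8983/solr/update/json?commit=true"


def fetch_tweets(query, since_id=None):
    """Fetch one page of tweets matching `query`, newer than `since_id`."""
    params = {"q": query, "rpp": 100}
    if since_id is not None:
        params["since_id"] = since_id
    url = SEARCH_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp).get("results", [])


def index_tweets(tweets):
    """POST the tweets to the Solr JSON update handler."""
    # Field names (id_str, text, from_user) are assumed from the API of
    # the time; map them onto whatever schema your index uses.
    docs = [{"id": t.get("id_str"), "text": t.get("text"),
             "user": t.get("from_user")} for t in tweets]
    req = urllib.request.Request(SOLR_URL,
                                 data=json.dumps(docs).encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10):
        pass  # Solr replies with a small status body we don't need


if __name__ == "__main__":
    since_id = None
    while True:  # the feed has to be polled frequently
        tweets = fetch_tweets("nutch", since_id)
        if tweets:
            index_tweets(tweets)
            since_id = max(int(t["id_str"]) for t in tweets)
        time.sleep(60)  # polling interval; tune to the feed's rate

A cron job running a script like this every minute or two does the same job
without keeping a process alive.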
>>
>>> Hi,
>>>
>>> I'm interested in crawling Twitter feeds and haven't tried any
>>> implementation yet. Does anyone know if this is possible? I haven't
>>> seen anything in our archives to suggest that people are having
>>> problems with this.
>>>
>>> Thanks and happy NY to everyone when it comes around.
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
>
>
>
>



-- 
Lewis
