Hi,
I was experimenting with Nutch 1.9 when I came across some Twitter t.co
links. When I ran one through parsechecker, the fetch failed with protocol
status moved(12). I have set http.redirect.max to 5 (I also tried 10),
which works for other links, but here the redirect was not followed. I did
get the forwarding link back. For example,

bin/nutch parsechecker http://t.co/FcpZhY9FrL

Fetch failed with protocol status: moved(12), lastModified=0:
http://timesofindia.indiatimes.com/city/nashik/1st-seaplane-service-from-Ozar-to-Pune-begins-from-June-15/articleshow/47522116.cms?utm_source=twitter.com&utm_medium=referral&utm_campaign=timesofindia

Running the forwarding link separately works fine. I've also tried this with
a bitly link that redirected twice (first to goo.gl, then to the final site),
but each time I had to crawl the forwarding link in a separate command.

My regex filter has these rules to allow t.co:

+^http://t.co

+^http://t.co/[a-z0-9]*
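
As a quick sanity check of these two rules, here is a minimal Python sketch (not Nutch code). It assumes the filter accepts a URL when a pattern is found anywhere in it, which is why it uses re.search; the helper name accepted_by is made up for illustration. It only looks at the two rules quoted above, not at the rest of the filter file.

```python
import re

# The two accept rules quoted above, without the leading '+'.
RULES = [r"^http://t.co", r"^http://t.co/[a-z0-9]*"]

SHORT_URL = "http://t.co/FcpZhY9FrL"
TARGET_URL = (
    "http://timesofindia.indiatimes.com/city/nashik/"
    "1st-seaplane-service-from-Ozar-to-Pune-begins-from-June-15/"
    "articleshow/47522116.cms?utm_source=twitter.com"
    "&utm_medium=referral&utm_campaign=timesofindia"
)


def accepted_by(url):
    """Return the rules whose pattern is found in the URL (hypothetical helper).

    Assumption: matching is "find"-style, so a pattern does not have to
    cover the whole URL.
    """
    return [rule for rule in RULES if re.search(rule, url)]


if __name__ == "__main__":
    for url in (SHORT_URL, TARGET_URL):
        print(url[:40], "->", accepted_by(url))
```

Under that assumption both rules match the t.co link, even though [a-z0-9]* does not cover the uppercase letters, because the star may match zero characters. Neither rule matches the redirect target, so that URL would have to be accepted by some other rule in the file.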

Is there a way to crawl shortened URLs seamlessly in Nutch?

-- 
Regards,
Ankit Goel
http://about.me/ankitgoel
