I tried it with 1.10, but the shortened URLs still don't get followed through. I think there's another issue here, maybe with how shortened URLs work. Not to mention that the crawl command for 1.10 is different from previous versions and completely undocumented; figuring it out became another task in itself. The tutorial and wiki only talk about the previous bin/crawl command structure. Thanks, though.
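
For reference, here is a rough sketch of the behaviour this thread is about: following a redirect chain up to a hop limit, and handing the target back for a later cycle once the limit is hit (which is what setting the limit to 0 amounts to). This is only an illustration in Python, not Nutch's actual code. The names `follow`, `fake_fetch` and `CHAIN`, and the three URLs, are made up for the example, and the fetcher is an in-memory stand-in so nothing touches the network.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

# Hypothetical fetcher type: url -> (HTTP status, Location header or None).
Fetcher = Callable[[str], Tuple[int, Optional[str]]]

REDIRECT_CODES = {301, 302, 303, 307, 308}

# Made-up redirect chain standing in for a shortener that forwards twice.
CHAIN = {
    "http://bit.ly/abc": (301, "http://goo.gl/xyz"),
    "http://goo.gl/xyz": (301, "http://example.com/article"),
    "http://example.com/article": (200, None),
}


def fake_fetch(url: str) -> Tuple[int, Optional[str]]:
    """In-memory stand-in for an HTTP fetch (assumption, no network access)."""
    return CHAIN.get(url, (404, None))


def follow(url: str, fetch: Fetcher, max_redirects: int = 5) -> Tuple[str, bool]:
    """Follow redirects starting at url, allowing at most max_redirects hops.

    Returns (url, True) when a non-redirect response is reached, or
    (pending_target, False) when the hop limit or a loop stops the walk,
    so the caller can queue pending_target for a later cycle.
    """
    hops = 0
    seen = {url}
    while True:
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return url, True
        # Location may be relative; resolve it against the current URL.
        target = urljoin(url, location)
        if hops == max_redirects or target in seen:
            return target, False
        seen.add(target)
        url = target
        hops += 1


if __name__ == "__main__":
    print(follow("http://bit.ly/abc", fake_fetch, max_redirects=5))
    print(follow("http://bit.ly/abc", fake_fetch, max_redirects=0))
```

With a limit of 0 the sketch never follows anything itself and just returns the first target, which mirrors the "follow the redirect in the next cycle" workaround. With a limit at least as long as the chain it ends on the final page in one go.
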
On Wed, Jun 3, 2015 at 7:43 AM, Jorge Luis Betancourt González <[email protected]> wrote:

> Since you're using Nutch 1.9 you should check [1]; there is a bug with the
> http.redirect.max setting (fixed in 1.10). A workaround is basically to set
> http.redirect.max = 0 and follow the redirect in the next cycle. This is
> basically without taking into account that you're dealing with shortened
> URLs, but it should work as a normal redirect.
>
> Regards,
>
> [1] https://issues.apache.org/jira/browse/NUTCH-1939
>
> ----- Original Message -----
> From: "Ankit Goel" <[email protected]>
> To: [email protected]
> Sent: Tuesday, June 2, 2015 9:59:40 PM
> Subject: [MASSMAIL]Can Nutch crawling shortened url?
>
> Hi,
> I was playing around with Nutch 1.9 when I came across some Twitter t.co
> links. When I ran one through parsechecker, the fetch failed with protocol
> status moved(12). I have set my http.redirect.max count to 5
> (experimented with 10), which works for other links, but it didn't
> redirect me here. I did get a forwarding link. For example,
>
> bin/nutch parsechecker http://t.co/FcpZhY9FrL
>
> Fetch failed with protocol status: moved(12), lastModified=0:
> http://timesofindia.indiatimes.com/city/nashik/1st-seaplane-service-from-Ozar-to-Pune-begins-from-June-15/articleshow/47522116.cms?utm_source=twitter.com&utm_medium=referral&utm_campaign=timesofindia
>
> Running the forwarding link separately works fine. I've tried this with a
> bitly link which had a double forward, to goo.gl and then the final site, but
> each time I had to crawl the forwarding link in a separate command.
>
> My regex filter has the rules to allow t.co:
>
> +^http://t.co
>
> +^http://t.co/[a-z0-9]*
>
> Is there a way to crawl shortened URLs seamlessly in Nutch?
>
> --
> Regards,
> Ankit Goel
> http://about.me/ankitgoel

--
Regards,
Ankit Goel
http://about.me/ankitgoel

