Hello,

I would like to let you know that Nutch 2.x currently does not index redirected pages, regardless of whether they are parsed or not.
Thanks.
Alex.

-----Original Message-----
From: Sebastian Nagel <[email protected]>
To: user <[email protected]>
Sent: Mon, Mar 25, 2013 3:52 pm
Subject: Re: parsechecker and redirection

Hi Lewis,

let's address NUTCH-1038, NUTCH-1389, NUTCH-1419, and NUTCH-1501!

On 03/25/2013 11:22 PM, Lewis John Mcgibbney wrote:
> Thanks for the clarification on this one, Seb.
> I was aware that you were clued up on this and hoped you would drop in.
>
> On Monday, March 25, 2013, Sebastian Nagel <[email protected]> wrote:
>> Hi Canan, hi Lewis,
>>
>> parsechecker cannot follow redirects, also in trunk / 1.x.
>>
>> It would be nice, at least, if parsechecker would report
>> clearly that there is a redirect. Currently, you have to check
>> content metadata for the redirect target, which is easy to overlook.
>>
>> % nutch parsechecker http://apachecon.eu
>> ...
>> Content Metadata: Date=Mon, 25 Mar 2013 21:51:22 GMT Location=http://www.apachecon.eu/
>> ...
>>
>> There is already NUTCH-1419: report redirect and do not parse.
>> @Lewis: I'll review the latest patch soon, so we can sort this out.
>>
>> @Canan: feel free to open a new Jira to make parsechecker follow redirects. Thanks!
>>
>> Sebastian
>>
>>
>> On 03/25/2013 10:27 PM, Lewis John Mcgibbney wrote:
>>> Hi Canan,
>>> Thank you for bringing this up. I just noticed that 2.x does not have the
>>> configurable property in nutch-default.xml
>>>
>>> <property>
>>>   <name>http.redirect.max</name>
>>>   <value>0</value>
>>>   <description>The maximum number of redirects the fetcher will follow when
>>>   trying to fetch a page. If set to negative or 0, fetcher won't immediately
>>>   follow redirected URLs, instead it will record them for later fetching.
>>>   </description>
>>> </property>
>>>
>>> I've also looked over the trunk and 2.x branches and it seems that, with
>>> regard to handling redirects, trunk is more functionally capable.
>>> I don't have time to look into this just now.
>>> You can begin by looking into the trunk code before the 2.x code, to
>>> see how redirects should be handled and how a configurable depth can be
>>> specified for fetching such URLs.
>>> It seems that we need to add such functionality to 2.x.
>>> Contributions would be very, very welcome on this issue.
>>> Lewis
>>>
>>> On Mon, Mar 25, 2013 at 1:17 PM, Canan GİRGİN <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I use the "bin/nutch parsechecker" command (Nutch 2.1). It works fine. But when I
>>>> try the parsechecker command with a redirected page, parseFilters returns wrong
>>>> results, because the parse text contains redirect descriptions.
>>>>
>>>> Is there any problem?
>>>>
>>>> Thanks, Canan
>>>>
>>>> Nutch 2.1 / Ubuntu 12.04 / MySQL
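
For reference, the behaviour described for `http.redirect.max` in the quoted property can be sketched roughly as follows. This is only an illustration in Python, not Nutch code: `resolve_redirects`, `fake_fetch`, `FAKE_PAGES`, the example URLs and the 301 status are all made up for the sketch, and what happens once the limit is exceeded (here the target is simply recorded) is an assumption rather than something stated in the thread.

```python
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)

def resolve_redirects(url, fetch, max_redirects):
    """Follow redirects immediately, at most max_redirects times.

    fetch(url) is a caller-supplied function returning (status, headers).
    Returns (last_url_fetched, status, deferred), where deferred lists
    redirect targets that were recorded for later instead of followed.
    """
    deferred = []
    hops = 0
    while True:
        status, headers = fetch(url)
        if status not in REDIRECT_CODES or "Location" not in headers:
            # Not a redirect: this is the page that would be parsed.
            return url, status, deferred
        # Location may be relative, so resolve it against the current URL.
        target = urljoin(url, headers["Location"])
        if max_redirects <= 0 or hops >= max_redirects:
            # Do not follow now; record the target for a later fetch
            # (assumption: the same is done once the limit is exceeded).
            deferred.append(target)
            return url, status, deferred
        url = target
        hops += 1

# Hypothetical pages standing in for real HTTP responses.
FAKE_PAGES = {
    "http://example.org": (301, {"Location": "http://www.example.org/"}),
    "http://www.example.org/": (200, {}),
}

def fake_fetch(url):
    return FAKE_PAGES[url]

# With 0 the redirect is only recorded; with 1 it is followed at once.
print(resolve_redirects("http://example.org", fake_fetch, 0))
print(resolve_redirects("http://example.org", fake_fetch, 1))
```

With a limit of 0 the caller gets back the original URL together with the recorded target, which is the case where a tool like parsechecker would ideally report the redirect rather than parse the redirect response.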

