Hi Canan, hi Lewis,

parsechecker cannot follow redirects, also in trunk / 1.x.
It would be nice, at least, if parsechecker would report clearly that there is a redirect. Currently, you have to check the content metadata for the redirect target, which is easy to overlook.

% nutch parsechecker http://apachecon.eu
...
Content Metadata: Date=Mon, 25 Mar 2013 21:51:22 GMT Location=http://www.apachecon.eu/ ...

There is already NUTCH-1419: report redirect and do not parse.

@Lewis: I'll review the latest patch soon, so we can sort this out.
@Canan: feel free to open a new Jira to make parsechecker follow redirects.

Thanks!
Sebastian

On 03/25/2013 10:27 PM, Lewis John Mcgibbney wrote:
> Hi Canan,
> Thank you for bringing this up, I just noticed that 2.x does not have the
> configurable property in nutch-default.xml
>
> <property>
> <name>http.redirect.max</name>
> <value>0</value>
> <description>The maximum number of redirects the fetcher will follow when
> trying to fetch a page. If set to negative or 0, fetcher won't immediately
> follow redirected URLs, instead it will record them for later fetching.
> </description>
> </property>
>
> I've also looked over the trunk and 2.x branches and it seems that with
> regards to handling redirects, trunk is more functionally capable.
> I don't have time to look into this just now.
> You can begin looking into the trunk code before the 2.x in an attempt to
> see how redirects should be handled and how a configurable depth can be
> specified for fetching of such URLs.
> It seems that we need to add such functionality to 2.x.
> Contributions would be very very welcome on this issue.
> Lewis
>
> On Mon, Mar 25, 2013 at 1:17 PM, Canan GİRGİN <[email protected]> wrote:
>
>> Hi,
>>
>> I use the "bin/nutch parsechecker" command (Nutch 2.1). It works fine. But when I
>> try the parsechecker command with a redirected page, parseFilters returns wrong
>> results, because the parse text contains redirect descriptions.
>>
>> Is there any problem?
>>
>> Thanks, Canan
>>
>> Nutch 2.1 / Ubuntu 12.04 / MySQL

