Hi Canan, hi Lewis,

parsechecker cannot follow redirects; this is also the case in trunk / 1.x.

It would be nice if parsechecker at least reported clearly
that there is a redirect. Currently, you have to check the
content metadata for the redirect target, which is easy to overlook.

% nutch parsechecker http://apachecon.eu
...
Content Metadata: Date=Mon, 25 Mar 2013 21:51:22 GMT Location=http://www.apachecon.eu/
...
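
To illustrate the kind of explicit report meant here, below is a minimal
sketch in Python. It is not Nutch code and not what NUTCH-1419 implements;
the function names redirect_target and report are hypothetical, and the
301 status in the example is an assumption, since the output above only
shows the Location header.

```python
from urllib.parse import urljoin

# HTTP status codes that signal a redirect carrying a Location header.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(url, status, headers):
    """Return the absolute redirect target, or None if there is no redirect.

    Hypothetical helper: `headers` is a plain dict of response headers.
    """
    if status not in REDIRECT_CODES:
        return None
    # Header names are case-insensitive, so compare in lower case.
    location = next(
        (value for name, value in headers.items() if name.lower() == "location"),
        None,
    )
    if not location:
        return None
    # A Location header may be relative; resolve it against the fetched URL.
    return urljoin(url, location.strip())


def report(url, status, headers):
    """Build a one-line message that states clearly whether a redirect occurred."""
    target = redirect_target(url, status, headers)
    if target is None:
        return f"{url}: no redirect (HTTP {status})"
    return f"{url}: redirected (HTTP {status}) to {target}"


if __name__ == "__main__":
    # The status code 301 is assumed for illustration only.
    print(report("http://apachecon.eu", 301,
                 {"Location": "http://www.apachecon.eu/"}))
```

A single line like this at the top of the output would be much harder to
overlook than a Location entry buried in the content metadata.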

There is already NUTCH-1419: report redirect and do not parse.
@Lewis: I'll review the latest patch soon, so we can sort this out.

@Canan: feel free to open a new Jira to make parsechecker follow redirects. 
Thanks!

Sebastian


On 03/25/2013 10:27 PM, Lewis John Mcgibbney wrote:
> Hi Canan,
> Thank you for bringing this up. I just noticed that 2.x does not have the
> following configurable property in nutch-default.xml:
> 
> <property>
>   <name>http.redirect.max</name>
>   <value>0</value>
>   <description>The maximum number of redirects the fetcher will follow when
>   trying to fetch a page. If set to negative or 0, fetcher won't immediately
>   follow redirected URLs, instead it will record them for later fetching.
>   </description>
> </property>
> 
> I've also looked over the trunk and 2.x branches, and it seems that with
> regard to handling redirects, trunk is more functionally capable.
> I don't have time to look into this just now.
> You can begin by looking into the trunk code before the 2.x code to
> see how redirects should be handled and how a configurable depth can be
> specified for fetching such URLs.
> It seems that we need to add this functionality to 2.x.
> Contributions would be very welcome on this issue.
> Lewis
> 
> On Mon, Mar 25, 2013 at 1:17 PM, Canan GİRGİN <[email protected]> wrote:
> 
>> Hi,
>>
>> I use the "bin/nutch parsechecker" command (Nutch 2.1). It works fine. But when I
>> try the parsechecker command on a redirected page, parseFilters return wrong
>> results, because the parse text contains redirect descriptions.
>>
>> Is there any problem?
>>
>> Thanks, Canan
>>
>> Nutch 2.1 / Ubuntu 12.04 / MySQL
>>
> 
> 
> 
