See also https://issues.apache.org/jira/browse/NUTCH-1939
(it's a bug in Nutch 1.9)

On 03/19/2015 10:10 PM, Sebastian Nagel wrote:
> Hi Marko,
> 
> even with
>   http.redirect.max == 0
> Nutch follows redirects, but the targets are recorded like ordinary
> links and fetched in the next round(s).
> 
>> The first fetch seems to download something, but the second generate job
>> doesn't appear to produce a new segment,
> Are the redirect targets accepted by the URL filter patterns?
> 
>> How can I look at the crawl db and segment data contents (esp. fetch list)?
>> I'm running Nutch in local mode.
> % bin/nutch readdb ...
> % bin/nutch readseg ...
> Help is shown when called without arguments.
> 
> Best,
> Sebastian
> 
> On 03/18/2015 11:02 AM, Marko Asplund wrote:
>> Hi,
>>
>> I'm a newbie having trouble getting Nutch 1.9 to crawl a site that does an
>> HTTP 301 redirect from http/80 to https/443.
>> The Nutch fetch job issues the following message:
>>
>> redirect count exceeded http://www.foo.com/
>>
>> and it seems that nothing actually gets fetched.
>> I've set http.redirect.max parameter value to 50.
>>
>> I've only injected one seed URL into Nutch.
>> The first fetch seems to download something, but the second generate job
>> doesn't appear to produce a new segment,
>> since there's only one segment in crawl DB after running it.
>>
>> How can I debug the problem?
>>
>> Is there a way to make Nutch logging more verbose? I've set
>> http.verbose, but that didn't help.
>>
>> How can I look at the crawl db and segment data contents (esp. fetch list)?
>> I'm running Nutch in local mode.
>>
>> marko
>>
> 
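
On the question of whether the redirect targets are accepted by the URL filter patterns: the sketch below mimics, in Python, how a rule file in the style of conf/regex-urlfilter.txt is commonly understood to work (each rule is "+" or "-" followed by a regex, the first matching rule decides, and a URL that matches no rule is rejected). The rules shown are hypothetical, not taken from Marko's setup, and Python's re module is not identical to Java's regex engine, so treat this as a rough check rather than a replacement for Nutch's own filter.

```python
import re

# Hypothetical rules in the style of conf/regex-urlfilter.txt.
# The http-only accept pattern is an assumption made for illustration.
RULES = r"""
# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):
# accept only plain-http URLs on the seed host
+^http://www\.foo\.com/
"""


def parse_rules(text):
    """Return (accept, compiled_pattern) pairs, skipping blanks and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def is_accepted(url, rules):
    """First matching rule decides; a URL matching no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules(RULES)
for url in ("http://www.foo.com/", "https://www.foo.com/"):
    print(url, "accepted" if is_accepted(url, rules) else "rejected")
```

With an accept pattern that is anchored to http:// like the one assumed here, the seed URL passes but its https:// redirect target does not, which would leave nothing to generate in the next round.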
