Hi Alex,

This is not really a bug; it's an "undocumented" feature.
db.ignore.external.links prevents the fetcher from breaking
out of your set of domains, which is exactly what you want
if you don't intend to crawl the whole web.
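
For reference, both settings live in conf/nutch-site.xml (overriding
the defaults in conf/nutch-default.xml). A minimal sketch, with
example values you should adapt to your own crawl:

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>Ignore outlinks that point to a different host than
    the source page. Note that domain.com and www.domain.com count as
    different hosts, which is why a redirect between them gets dropped
    while this is true.</description>
  </property>

  <property>
    <name>http.redirect.max</name>
    <value>3</value>
    <description>Follow up to 3 redirects immediately during the fetch.
    With the default of 0, redirects are only recorded in the crawldb
    and fetched in a later round.</description>
  </property>

You can check the effect with "bin/nutch readdb <crawldb> -stats",
which prints the page counts per status (db_unfetched, db_redir_perm,
and so on) - e.g. "bin/nutch readdb crawl/crawldb -stats" if your
crawldb lives at crawl/crawldb.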

Best regards,
Rafael.


On 17 Nov 2011, at 23:05, [email protected] wrote:

> 
> Hi,
> 
> Is this issue resolved in https://issues.apache.org/jira/browse/NUTCH-1044
> for the case when db.ignore.external.links is set to true?
> 
> Thanks.
> Alex.
> 
> -----Original Message-----
> From: Ferdy Galema <[email protected]>
> To: user <[email protected]>
> Sent: Thu, Nov 17, 2011 6:01 am
> Subject: Re: http.redirect.max
> 
> 
> Thanks for updating the list.
> 
> On 11/17/2011 02:52 PM, Rafael Pappert wrote:
>> Hi,
>> 
>> after some investigation I found the problem:
>> I had db.ignore.external.links set to true, which is why the
>> fetcher wasn't following the redirect from domain.com to
>> www.domain.com.
>> 
>> Rafael.
>> 
>> On 16 Nov 2011, at 20:17, Rafael Pappert wrote:
>> 
>>> Hello List,
>>> 
>>> Is it possible to follow HTTP 301 redirects immediately?
>>> 
>>> I tried setting http.redirect.max to 3, but the page is
>>> still not indexed. readdb still shows 1 page as unfetched /
>>> db_redir_perm, and I can't find the redirect target in the
>>> crawldb.
>>> 
>>> How does Nutch handle redirects?
>>> 
>>> Thanks in advance,
>>> Rafael.
>>> 