There is one more thing.

You can also do it outside of Nutch: write a program that
validates the seed list URLs and saves the redirects as input for Nutch.
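A minimal sketch of such a pre-processing step, assuming a small helper that issues a HEAD request per URL (here injected as `fetch_head` so the redirect-following logic itself needs no network; the function names are illustrative, not part of Nutch):

```python
# Sketch: resolve seed-list redirects before handing the list to Nutch.
# fetch_head(url) is assumed to return (status_code, location_header);
# in a real script it would issue an HTTP HEAD request.

def resolve_redirect(url, fetch_head, max_hops=5):
    """Follow redirects starting at url; return the final URL."""
    seen = {url}
    for _ in range(max_hops):
        status, location = fetch_head(url)
        if status not in (301, 302, 303, 307, 308) or not location:
            return url
        if location in seen:  # redirect loop; stop here
            return url
        seen.add(location)
        url = location
    return url
```

Writing the resolved URLs into the seed file means the redirect targets are already "internal" when db.ignore.external.links is true.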
 

Sent: Monday, November 26, 2018 at 2:43 PM
From: "Semyon Semyonov" <[email protected]>
To: [email protected]
Subject: Re: Ignore external links but allow redirections to external websites
Hi Patricia,

I wish I had a generic solution for this problem, but I managed to fix
the http://www.abc.com -> http://abc.com problem with an extension of
the url exemption filter for both ways (www.abc.com -> abc.com and
abc.com -> www.abc.com):
https://jira.apache.org/jira/browse/NUTCH-2522

You need to replicate this logic in the indexer if you want to have
www.abc.com and abc.com under the same hostname.
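One way to replicate that "same hostname" logic at indexing time is to normalize the host before it is used as an index key. A sketch (the function name is illustrative; this is not a Nutch API):

```python
from urllib.parse import urlsplit

def normalized_host(url):
    """Treat www.abc.com and abc.com as the same host by stripping
    a leading 'www.' from the hostname."""
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host
```

Applied consistently on both the crawl side and the index side, documents from either variant of the host end up grouped together.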
 
Semyon

  
 

Sent: Monday, November 26, 2018 at 12:19 PM
From: "Patricia Helmich" <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Ignore external links but allow redirections to external websites
Hi,

I am using Nutch with a seed set of URLS and I want to crawl all internal links 
found on the crawled websites. The external links should be ignored in my 
crawler, so I set the "db.ignore.external.links" in nutch-site.xml to "true". 
This works perfectly for ignoring the external links. However, when a
seed URL redirects to another URL, I want to crawl the redirected URL,
even if it's external. For example, if I have a seed URL like
http://www.abc.com and it redirects to http://abc.com, the crawl
process stops because the domain without www is an external link.
(If I set "db.ignore.external.links" in nutch-site.xml to "false", the crawl 
process does continue, but in that case, it also crawls all external links on 
the site which I don't want it to.)
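For reference, the setting described above is a standard property in conf/nutch-site.xml:

```xml
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>Ignore outlinks that point to a different host.</description>
</property>
```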

So, my question is: Is there a possibility to ignore external links but allow 
redirections to external websites?

Thanks for your help,
Patricia
 
