Hi Patricia,

I wish I had a generic solution for this problem, but I managed to fix the http://www.abc.com -> http://abc.com case with an extension of the URL exemption filter that works both ways (www.abc.com -> abc.com and abc.com -> www.abc.com): https://jira.apache.org/jira/browse/NUTCH-2522
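To illustrate the both-ways matching described above, here is a minimal, language-neutral sketch in Python. It is not the code from NUTCH-2522 (an actual Nutch plugin would be written in Java); the function names `strip_www` and `same_site` are hypothetical, and the only rule assumed is that two hosts count as the same site when they differ by nothing more than a leading "www.".

```python
from urllib.parse import urlsplit


def strip_www(host: str) -> str:
    """Lowercase the host and drop a single leading 'www.' label, if present."""
    host = host.lower().rstrip(".")
    return host[4:] if host.startswith("www.") else host


def same_site(from_url: str, to_url: str) -> bool:
    """Return True when both URLs point at the same host, ignoring a 'www.' prefix.

    Works in both directions: www.abc.com -> abc.com and abc.com -> www.abc.com.
    URLs without a parseable host (for example, no scheme) are never exempted.
    """
    from_host = urlsplit(from_url).hostname
    to_host = urlsplit(to_url).hostname
    if not from_host or not to_host:
        return False
    return strip_www(from_host) == strip_www(to_host)
```

A redirect from http://www.abc.com to http://abc.com would pass this check in either direction, while a link to an unrelated host would still be treated as external.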
You need to replicate this logic in the indexer if you want to have www.abc.com and abc.com under the same hostname.

Semyon

Sent: Monday, November 26, 2018 at 12:19 PM
From: "Patricia Helmich" <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Ignore external links but allow redirections to external websites

Hi,

I am using Nutch with a seed set of URLs and I want to crawl all internal links found on the crawled websites. The external links should be ignored by my crawler, so I set "db.ignore.external.links" in nutch-site.xml to "true". This works perfectly for ignoring the external links.

However, when a seed URL redirects to another URL, I want to crawl the redirected URL, even if it is external. For example, if I have a seed URL like http://www.abc.com and it redirects to http://abc.com, the crawl process stops because the domain without www is treated as an external link. (If I set "db.ignore.external.links" in nutch-site.xml to "false", the crawl process does continue, but in that case it also crawls all external links on the site, which I don't want.)

So, my question is: is it possible to ignore external links but allow redirections to external websites?

Thanks for your help,
Patricia

