Ignore external links but allow redirections to external websites

Patricia Helmich Mon, 26 Nov 2018 03:19:27 -0800

Hi,

I am using Nutch with a seed set of URLs and I want to crawl all internal links 
found on the crawled websites. The external links should be ignored in my 
crawler, so I set "db.ignore.external.links" in nutch-site.xml to "true". 
This works perfectly for ignoring the external links. However, when a 
seed URL redirects to another URL, I want to crawl the redirected URL, even if 
it's external. For example, if I have a seed URL like http://www.abc.com and it 
redirects to http://abc.com, the crawl process stops because the domain without 
www is treated as an external link. (If I set "db.ignore.external.links" in 
nutch-site.xml to "false", the crawl process does continue, but in that case it 
also crawls all external links on the site, which I don't want.)
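
For reference, the setting described above looks roughly like this in 
nutch-site.xml. This is a minimal sketch: only the property name and value come 
from my setup, and all other properties in the file are omitted.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Ignore outlinks that point to a different host than the page they were found on. -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
```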


So, my question is: Is there a possibility to ignore external links but allow 
redirections to external websites?

Thanks for your help,
Patricia
