Crawling subdomains, but not external links

Gaspar Pizarro Wed, 18 Nov 2015 13:11:58 -0800

Hi group.

I want to crawl a set of sites that have subdomains. I know I can filter out
external links (external with respect to the seed list) with the
db.ignore.external.links option, but if I do that, Nutch also ignores links
to subdomains. I also know that I can use URL filtering with the
regex-urlfilter.txt file, but in that case I have to copy the seeds into the
filter file and modify it every time I want to crawl another site. Is there
a transparent way (that is, one that doesn't require editing the URL filter
for each new site) to ignore external links without ignoring subdomain
links?
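
To make the manual workaround concrete, here is a minimal sketch of how the
filter rules could be generated from the seed list instead of being copied by
hand. It assumes the usual regex-urlfilter.txt convention of one "+regex" or
"-regex" rule per line, and it naively treats the seed host minus a leading
"www." as the domain to allow. The function names host_rule and build_filter
are made up for illustration and are not part of Nutch.

```python
import re
from urllib.parse import urlparse


def host_rule(seed_url, strip_www=True):
    """Return a '+regex' rule that accepts the seed's host and its subdomains."""
    host = urlparse(seed_url).hostname
    if host is None:
        raise ValueError("seed has no host: %r" % seed_url)
    # Assumption: dropping a leading "www." yields the domain to allow.
    # This is a simplification, not a real registrable-domain lookup.
    if strip_www and host.startswith("www."):
        host = host[len("www."):]
    # Optional subdomain labels, then the escaped host, then "/" or end of URL.
    return "+^https?://([a-z0-9-]+\\.)*" + re.escape(host) + "(/|$)"


def build_filter(seed_urls):
    """Return the filter lines: one accept rule per distinct host, then reject all."""
    rules = []
    for seed in seed_urls:
        rule = host_rule(seed)
        if rule not in rules:  # skip duplicates, keep seed order
            rules.append(rule)
    rules.append("-.")  # reject every URL not accepted above
    return rules


if __name__ == "__main__":
    seeds = ["http://www.example.com/", "http://example.org/start"]
    print("\n".join(build_filter(seeds)))
```

The printed lines would still have to be written to the filter file before
each crawl, so this only automates the copying; it is not the transparent
option I am asking about.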


Thanks
