Hi Gaspar,

Have a look at https://issues.apache.org/jira/browse/NUTCH-2069; it should allow you to restrict the crawl to the domain and not just the hostname. It hasn't been committed yet, as Seb suggested some improvements.
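If you apply the patch from that issue, the configuration in nutch-site.xml would look roughly like this. This is only a sketch: the property name db.ignore.external.links.mode and the byDomain value are taken from the patch as it currently stands on the issue, so they may still change before anything is committed.

    <!-- nutch-site.xml: sketch based on the patch attached to NUTCH-2069;
         names and values may change before the patch is committed -->
    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
      <description>Ignore links that point outside the current scope.</description>
    </property>
    <property>
      <name>db.ignore.external.links.mode</name>
      <value>byDomain</value>
      <description>byHost keeps the current host-only behaviour;
      byDomain also follows links to subdomains of each seed's domain.</description>
    </property>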
HTH

Julien

On 18 November 2015 at 21:10, Gaspar Pizarro <[email protected]> wrote:

> Hi group.
>
> I want to crawl a bunch of sites that have subdomains. I know I can
> filter out external links (external with respect to the set of seeds)
> with the db.ignore.external.links option, but if I do that, Nutch also
> ignores subdomain links. I also know that I can filter URLs with the
> regex-urlfilter.txt file, but then I have to copy the seeds into the
> URL filter, and every time I want to crawl another site I have to edit
> it again. Is there a transparent way (one that doesn't require touching
> the URL filter for each new site) to ignore external links without also
> ignoring subdomain links?
>
> Thanks

--
*Open Source Solutions for Text Engineering*
http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
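P.S. For comparison, the regex-urlfilter.txt workaround described in the quoted message would look something like this. Here example.com and example.org are placeholders for the actual seed sites, and every new site needs its own line, which is exactly the maintenance burden in question:

    # regex-urlfilter.txt: hypothetical entries for two seed sites.
    # Accept each seed domain and any of its subdomains.
    +^https?://([a-z0-9-]+\.)*example\.com/
    +^https?://([a-z0-9-]+\.)*example\.org/
    # Reject everything else.
    -.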

