Hi Gaspar

Have a look at https://issues.apache.org/jira/browse/NUTCH-2069; this
should allow you to restrict the crawl to the domain rather than just the
hostname. It hasn't been committed yet, as Seb suggested some improvements.
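
If the patch lands as proposed, the configuration in nutch-site.xml might
look something like the sketch below. Note this is provisional: the property
name and values are taken from the patch under discussion and could still
change before it is committed.

```xml
<!-- nutch-site.xml: provisional settings based on the NUTCH-2069 patch -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <!-- "byDomain" would treat subdomains as internal;
       "byHost" would keep the current per-hostname behaviour -->
  <value>byDomain</value>
</property>
```

With a mode like this, seeds such as http://example.com could follow links
to sub.example.com without any per-site entries in regex-urlfilter.txt.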

HTH

Julien

On 18 November 2015 at 21:10, Gaspar Pizarro <[email protected]>
wrote:

> Hi group.
>
> I want to crawl a bunch of sites which have subdomains. I know I can filter
> external links (external with respect to the set of seeds) with the
> db.ignore.external.links option, but if I do that, Nutch ignores subdomain
> links as well. I also know that I can use URL filtering via the
> regex-urlfilter.txt file, but then I have to copy the seeds into the
> urlfilter and modify it each time I want to crawl another site. Is there a
> transparent way (i.e. one that doesn't require editing the urlfilter for
> each new crawl) to ignore external links without also ignoring subdomain
> links?
>
> Thanks
>



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
