Hi everyone,

I'm new to Nutch (version 2.3). I have it up and running, crawling some of
our websites to a particular depth, and it mostly seems to work fine.

One thing I have been tasked with is crawling our main website to a
particular depth n, but then only going to depth 1 on external websites.
This is mainly for documents (non-HTML content such as PDFs) linked
externally, but it could also apply to external HTML pages that themselves
link elsewhere. I guess that means I'd want to parse the fetched content
but stop adding newly discovered URLs once the crawl leaves a whitelisted
set of domains.

I don't think this exists as a configuration option, so I have been trying
to simulate it by other means. I have tried whitelisting the good domains
via the urlfilter-domain plugin and setting db.ignore.external.links to
true, hoping that the whitelisted domains would then be treated as
"non-external" (which doesn't seem to be the case). But even if that did
work, it still wouldn't limit an external domain to a depth of m (or 1 in
this case) once it was encountered.
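
For reference, the relevant bits of what I've been trying look roughly like
this (the domain names below are just placeholders, and plugin.includes is
trimmed to the parts that matter here):

In conf/nutch-site.xml:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(regex|domain)|parse-(html|tika)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>

And in conf/domain-urlfilter.txt, one whitelisted domain per line:

  ourmainsite.com
  partner-site.org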

Has anyone else needed to do something like this? Are there other options I
can try?

Thanks in advance,
AJ
