Hi everyone, I'm new to Nutch (version 2.3). I have it up and running, crawling some of our websites to a particular depth, and it mostly seems to work fine.
One thing I have been tasked with is crawling our main website to a particular depth n, but then only going to depth 1 on external websites. This is mainly for documents linked externally (non-HTML content such as PDFs), but it could also apply to external HTML pages that themselves link elsewhere. I guess that means I'd want to parse the content of an external page but not inject any new URLs once the crawl has left a whitelisted set of domains.

I don't think this exists as a configuration option, so I have been trying to simulate it by other means. I have played with whitelisting the good domains via the urlfilter-domain plugin and then setting db.ignore.external.links to true, hoping that the whitelisted domains would be treated as "non-external" (which doesn't seem to work). But even if it did, that still wouldn't limit an external domain to a depth of m (or 1 in this case) once it was encountered. I have pasted the relevant configuration below for reference.

Has anyone else needed to do something like this? Are there other options I can try?
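Here is roughly what I have in place. The domain names are placeholders for our real sites, and the plugin.includes value is simply my existing plugin list with urlfilter-domain added:

conf/nutch-site.xml:

  <property>
    <!-- add urlfilter-domain to the plugins that are already enabled -->
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-(regex|domain)|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>

  <property>
    <!-- ignore outlinks that point to a host other than the one the page was fetched from -->
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>

conf/domain-urlfilter.txt (the whitelist read by urlfilter-domain, one domain per line):

  www.ourmainsite.com
  docs.partner-example.com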
Thanks in advance, AJ
