Try using the DomainUrlFilter.  You will need to do the following:

  1. Activate the domain urlfilter in plugin.includes,
     urlfilter-(prefix|suffix|domain)... in the nutch-site.xml file
     (see the sketch after this list).
  2. In the conf directory, add your domains, one per line, to the
     domain-urlfilter.txt file.  Entries can be domains
     (something.com), subdomains (www.something.com), or top-level
     identifiers (.com).
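
For example (a sketch; keep the rest of your version's default
plugin.includes value and just make sure urlfilter-domain is part of
it), the nutch-site.xml entry might look like:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(regex|domain)|parse-(text|html)|index-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

and conf/domain-urlfilter.txt would just list, one per line:

  something.com
  www.something.com
  .com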

This should work both with the crawl command and when calling the individual nutch commands directly.
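
For instance (paths and parameters here are placeholders):

  # one-shot crawl
  bin/nutch crawl urls -dir crawl -depth 15 -topN 100

  # or step by step (the url filters run during inject and generate,
  # so filtered-out links never enter the fetch lists)
  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments -topN 100
  bin/nutch fetch crawl/segments/<segment>
  bin/nutch updatedb crawl/crawldb crawl/segments/<segment>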

Dennis

On 06/22/2010 10:06 PM, Max Lynch wrote:
I know this is a very popular question based on the searching I've done...but I'm still really confused.

I have a seed list that I want nutch to crawl, and I want to do very deep crawling on each of those domains. However, I don't want nutch to venture outside of the domains on that list. Also, the list is large, which prevents me from building a crawl-urlfilter entry for each domain.

I have tried:

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
But that seems only to fetch the URLs I specify in the seed list, and doesn't let nutch venture more deeply into each domain. For example, it doesn't seem to follow a link on http://mydomain.com/index.html to http://mydomain.com/about.html

I have also tried db.max.outlinks.per.page, but that doesn't seem to do what I want either.
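
What I set was along these lines (the value is just an example;
db.max.outlinks.per.page only caps how many outlinks are kept from
each page, so it can't restrict crawling to a domain):

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value> <!-- -1 removes the per-page cap; the default is 100 -->
  </property>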

Here is the crawl command I'm issuing:

JAVA_HOME=/usr/lib/jvm/java-6-sun bin/nutch crawl urls -dir crawl2 -depth 15 -topN 100 &> crawl.log

Out of a list of ~4500 seed urls, nutch only found 800 docs (404s account for most of those).

Is there an easy way to do this?

Thanks,
Max
