I know this is a very popular question, based on the searching I've done, but I'm still really confused.

I have a seed list that I want nutch to crawl, and I want to crawl each of those domains very deeply. However, I don't want nutch to venture outside the domains on that list. Also, the list is large, which prevents me from writing a crawl-urlfilter entry for each domain by hand.
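In case it helps show what I mean, here's roughly what I imagined automating the per-domain filter entries would look like. This is just a sketch on my part, not something I've battle-tested; the file paths and the host-matching regex form are my own guesses:

```shell
#!/bin/sh
# Sketch: derive one crawl-urlfilter accept rule per seed domain.
# Paths (urls/seed.txt, conf/crawl-urlfilter.txt) are illustrative.
mkdir -p urls conf
printf 'http://mydomain.com/index.html\nhttp://example.org/\n' > urls/seed.txt

# 1) Strip the scheme and any path to get just the host.
# 2) Escape dots so they match literally, then wrap each host in an
#    accept rule that also admits subdomains.
sed -E 's|^https?://||; s|/.*||' urls/seed.txt \
  | sort -u \
  | sed -E 's|\.|\\.|g; s|^|+^https?://([a-z0-9-]+\\.)*|' \
  > conf/crawl-urlfilter.txt

# Reject everything that didn't match an accept rule above.
echo '-.' >> conf/crawl-urlfilter.txt
cat conf/crawl-urlfilter.txt
```

That would at least avoid typing ~4500 entries by hand, though I don't know if a filter file that large performs acceptably.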

I have tried:

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
But that seems to fetch only the URLs I specify in the seed list, and doesn't let nutch venture more deeply into each domain. For example, it doesn't seem to follow a link from http://mydomain.com/index.html to http://mydomain.com/about.html

I have also tried db.max.outlinks.per.page but that doesn't seem to do what I want either.
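For completeness, what I tried was along these lines in nutch-site.xml (the value is just an example; as I understand the docs, a negative value means no limit):

```xml
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>
```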

Here is the crawl command I'm issuing:

JAVA_HOME=/usr/lib/jvm/java-6-sun bin/nutch crawl urls -dir crawl2 -depth 15 -topN 100 &> crawl.log

Out of a list of ~4500 seed URLs, nutch only found 800 docs, and 404s account for most of those.

Is there an easy way to do this?

Thanks,
Max
