I know this is a very common question, based on the searching I've
done, but I'm still confused.
I have a seed list that I want Nutch to crawl, and I want to crawl
each of those domains very deeply. However, I don't want Nutch to
venture outside the domains on that list. The list is also large
enough that maintaining a crawl-urlfilter entry for each domain isn't
practical.
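For illustration, this is the sort of per-domain rule I'd otherwise
have to maintain, one line per domain (the domain names here are just
placeholders):

# in conf/crawl-urlfilter.txt (or conf/regex-urlfilter.txt)
+^http://([a-z0-9]*\.)*example-one.com/
+^http://([a-z0-9]*\.)*example-two.com/
# ...repeated for all ~4500 domains, which is what I'm trying to avoid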
I have tried setting

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
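(In case it matters, that property is set in conf/nutch-site.xml;
I'm assuming that's the right place for overrides, so the whole file
is essentially:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
)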
But with that setting Nutch seems to fetch only the exact URLs in the
seed list and doesn't venture any deeper into each domain. For
example, it doesn't appear to follow a link on
http://mydomain.com/index.html to http://mydomain.com/about.html.
I have also tried adjusting db.max.outlinks.per.page, but that
doesn't seem to do what I want either.
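For reference, this is the property I mean; the value below is only
an example (my understanding is that the default is 100 and that a
negative value means "process all outlinks", but please correct me if
that's wrong):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>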
Here is the crawl command I'm issuing:
JAVA_HOME=/usr/lib/jvm/java-6-sun bin/nutch crawl urls -dir crawl2
-depth 15 -topN 100 &> crawl.log
Out of a list of ~4500 seed URLs, Nutch only found 800 docs (404s
account for most of those).
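(Could -topN be part of the problem? If I understand it correctly,
-topN 100 caps each generate/fetch round at 100 URLs, so with
-depth 15 the whole crawl would top out at roughly 15 * 100 = 1500
fetches. I may well be misreading what -topN does, though.)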
Is there an easy way to do this?
Thanks,
Max