Try using the DomainUrlFilter. You will need to do the following:
1. Activate the domain urlfilter in the plugin.includes property of your
nutch-site.xml file, i.e. urlfilter-(prefix|suffix|domain) (see the
snippet after this list).
2. In the conf directory, add your domains, one per line, to the
domain-urlfilter.txt file. Entries can be domains (something.com),
subdomains (www.something.com), or top-level identifiers (.com);
example entries are shown below.
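For reference, the two changes look roughly like this. The
plugin.includes value below is only illustrative -- keep the rest of
whatever value you already have and just make sure the
urlfilter-(prefix|suffix|domain) part is in it:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(prefix|suffix|domain)|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>

And conf/domain-urlfilter.txt, filled with the domains from your seed
list:

  something.com
  www.something.com
  .com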
This should work both when using the crawl command and when calling the
individual nutch commands directly (a sketch of the latter follows).
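For the step-by-step route, one round looks roughly like the sketch
below. The crawl/crawldb and crawl/segments paths are just example
names, and you may need a bin/nutch parse $segment step between fetch
and updatedb if fetcher.parse is false in your configuration:

  bin/nutch inject crawl/crawldb urls
  # repeat the next four commands once per depth level
  bin/nutch generate crawl/crawldb crawl/segments -topN 100
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment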
Dennis
On 06/22/2010 10:06 PM, Max Lynch wrote:
I know this is a very popular question based on the searching I've
done, but I'm still really confused.
I have a seed list that I want nutch to crawl, and I want to do very
deep crawling on each of those domains. However, I don't want nutch
to venture outside the domains on that list. Also, the list is large,
which prevents me from building a crawl-urlfilter entry for each domain.
I have tried
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
But that seems to only crawl the exact URLs I specify in the seed
list, and doesn't seem to allow nutch to venture more deeply into the
domain itself. For example, it doesn't seem to follow a link on
http://mydomain.com/index.html to http://mydomain.com/about.html.
I have also tried db.max.outlinks.per.page, but that doesn't seem to
do what I want either.
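As far as I can tell from nutch-default.xml, that property only caps
how many outlinks are kept from each page (the default is 100), so it
can't restrict which domains get followed, e.g.:

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>100</value>
  </property>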
Here is the crawl command I'm issuing:
JAVA_HOME=/usr/lib/jvm/java-6-sun bin/nutch crawl urls -dir crawl2
-depth 15 -topN 100 &> crawl.log
Out of a list of ~4500 seed urls, nutch only found 800 docs (404s
account for most of those).
Is there an easy way to do this?
Thanks,
Max