I know this is a very common question, based on the searching I've
done, but I'm still confused.
I have a seed list that I want Nutch to crawl, and I want to crawl
each of those domains very deeply. However, I don't want Nutch to
venture outside the domains on that list. The list is also large
enough that maintaining a crawl-urlfilter entry for each domain isn't
practical.
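For illustration, this is the sort of per-domain rule I'd otherwise
have to maintain, one line per domain (the domain names here are just
placeholders):

# in conf/crawl-urlfilter.txt (or conf/regex-urlfilter.txt)
+^http://([a-z0-9]*\.)*example-one.com/
+^http://([a-z0-9]*\.)*example-two.com/
# ...repeated for all ~4500 domains, which is what I'm trying to avoid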
I have tried setting

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
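(In case it matters, that property is set in conf/nutch-site.xml;
I'm assuming that's the right place for overrides, so the whole file
is essentially:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
)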
But with that setting Nutch seems to fetch only the exact URLs in the
seed list and doesn't venture any deeper into each domain. For
example, it doesn't appear to follow a link on
http://mydomain.com/index.html to http://mydomain.com/about.html.
I have also tried adjusting db.max.outlinks.per.page, but that
doesn't seem to do what I want either.
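For reference, this is the property I mean; the value below is only
an example (my understanding is that the default is 100 and that a
negative value means "process all outlinks", but please correct me if
that's wrong):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>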
Here is the crawl command I'm issuing:
JAVA_HOME=/usr/lib/jvm/java-6-sun bin/nutch crawl urls -dir crawl2
-depth 15 -topN 100 &> crawl.log
Out of a list of ~4500 seed URLs, Nutch only found 800 docs (404s
account for most of those).
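(Could -topN be part of the problem? If I understand it correctly,
-topN 100 caps each generate/fetch round at 100 URLs, so with
-depth 15 the whole crawl would top out at roughly 15 * 100 = 1500
fetches. I may well be misreading what -topN does, though.)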
Is there an easy way to do this?
Thanks,
Max