Try using the DomainUrlFilter.  You will need to do the following:

  1. Activate the domain urlfilter in plugin.includes,
     urlfilter-(prefix|suffix|domain)... in the nutch-site.xml file
     (see the sketch after this list).
  2. In the conf directory, add your domains, one per line, to the
     domain-urlfilter.txt file.  Entries can be domains
     (something.com), subdomains (www.something.com), or top-level
     identifiers (.com).
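
For example (a sketch; keep the rest of your version's default
plugin.includes value and just make sure urlfilter-domain is part of
it), the nutch-site.xml entry might look like:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(regex|domain)|parse-(text|html)|index-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

and conf/domain-urlfilter.txt would just list, one per line:

  something.com
  www.something.com
  .com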

This should work both with the crawl command and when calling the individual nutch commands directly.
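
For instance (paths and parameters here are placeholders):

  # one-shot crawl
  bin/nutch crawl urls -dir crawl -depth 15 -topN 100

  # or step by step (the url filters run during inject and generate,
  # so filtered-out links never enter the fetch lists)
  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments -topN 100
  bin/nutch fetch crawl/segments/<segment>
  bin/nutch updatedb crawl/crawldb crawl/segments/<segment>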

Dennis

On 06/22/2010 10:06 PM, Max Lynch wrote:
I know this is a very popular question based on the searching I've done...but I'm still really confused.

I have a seed list that I want nutch to crawl, and I want to do very deep crawling on each of those domains. However, I don't want nutch to venture outside of the domains on that list. Also, the list is large, which prevents me from building a crawl-urlfilter entry for each domain.

I have tried:

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
But that seems only to fetch the URLs I specify in the seed list, and doesn't let nutch venture more deeply into each domain. For example, it doesn't seem to follow a link on http://mydomain.com/index.html to http://mydomain.com/about.html

I have also tried db.max.outlinks.per.page, but that doesn't seem to do what I want either.
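
What I set was along these lines (the value is just an example;
db.max.outlinks.per.page only caps how many outlinks are kept from
each page, so it can't restrict crawling to a domain):

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value> <!-- -1 removes the per-page cap; the default is 100 -->
  </property>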

Here is the crawl command I'm issuing:

JAVA_HOME=/usr/lib/jvm/java-6-sun bin/nutch crawl urls -dir crawl2 -depth 15 -topN 100 &> crawl.log

Out of a list of ~4500 seed urls, nutch only found 800 docs (404s account for most of those).

Is there an easy way to do this?

Thanks,
Max
