If the seed list is large, it would probably be easier to write a custom
scoring filter combined with metadata in the Injector, e.g.

1. Add a metadata entry to each line of your seed list, e.g. '_origin_',
   with the seed URL as its value:

   http://www.cnn.com/ _origin_=http://www.cnn.com/

   (a bit tautological, I know, but never mind)

2. The custom scoring filter would take care of:
   - transmitting the origin metadata to its outlinks
   - removing from the outlinks the ones which do not have the same
     host / domain as the origin

   The method *distributeScoreToOutlinks* is where it all happens.

Of course, if your seed list is not large, or if you don't mind editing the
list of domains by hand, then Dennis' suggestion is a very good solution.

J.

PS: we've implemented that for one of my clients. I'll check with them
whether they are happy to donate it to the project and, if so, add it to JIRA.

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com

On 23 June 2010 04:35, Dennis Kubes wrote:

> Try using the DomainUrlFilter. You will need to do the following:
>
> 1. Activate the domain urlfilter in plugin.includes,
>    urlfilter-(prefix|suffix|domain)... in the nutch-site.xml file.
> 2. In the conf directory add your domains one per line to the
>    domain-urlfilter.txt file. Entries can be domains
>    (something.com), subdomains (www.something.com), or top level
>    identifiers (.com)
>
> This should work using both the crawl command and calling the individual
> nutch commands directly.
>
> Dennis
>
>
> On 06/22/2010 10:06 PM, Max Lynch wrote:
>
>> I know this is a very popular question based on the searching I've
>> done...but I'm still really confused.
>>
>> I have a seed list that I want nutch to crawl. And I want to do very deep
>> crawling on each of those domains. However, I don't want nutch to venture
>> out of each domain on that list. Also, the list is large which prevents me
>> from building a crawl-urlfilter entry for each domain.
>>
>> I have tried
>> <property>
>>   <name>db.ignore.external.links</name>
>>   <value>true</value>
>> </property>
>> But that seems to only hit the URL I specify in the url seed list, and
>> doesn't seem to allow nutch to venture more deeply into the domain itself.
>> For example, it doesn't seem like it will follow a link on
>> http://mydomain.com/index.html to http://mydomain.com/about.html
>>
>> I have also tried db.max.outlinks.per.page but that doesn't seem to do
>> what I want either.
>>
>> Here is the crawl command I'm issuing:
>>
>> JAVA_HOME=/usr/lib/jvm/java-6-sun bin/nutch crawl urls -dir crawl2 -depth 15 -topN 100 &> crawl.log
>>
>> Out of a list of ~4500 seed urls, nutch only found 800 docs (404s account
>> for most of those).
>>
>> Is there an easy way to do this?
>>
>> Thanks,
>> Max
>>
>
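
As a rough illustration of the host check described in step 2 of the reply at the top of this message, here is a minimal sketch in plain Java. It is not the Nutch ScoringFilter API and not the client implementation mentioned in the PS: the class OriginOutlinkFilter and its methods hostOf and keepSameHost are hypothetical names, and the sketch compares full host names only, not registered domains. A real filter would apply equivalent logic inside distributeScoreToOutlinks.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, NOT part of Nutch: it only illustrates the check a
// custom scoring filter could apply to a page's outlinks.
final class OriginOutlinkFilter {

    // Metadata key taken from the seed list example in the message.
    static final String ORIGIN_KEY = "_origin_";

    // Returns the lower-cased host of a URL, or null if it cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Keeps only the outlinks whose host equals the origin's host. Each kept
    // outlink is mapped to the origin value it should carry forward, which
    // mimics transmitting the _origin_ metadata to the outlinks.
    static Map<String, String> keepSameHost(String origin, List<String> outlinks) {
        Map<String, String> kept = new LinkedHashMap<>();
        String originHost = hostOf(origin);
        if (originHost == null) {
            return kept; // unparseable origin: drop everything
        }
        for (String link : outlinks) {
            if (originHost.equals(hostOf(link))) {
                kept.put(link, origin);
            }
        }
        return kept;
    }
}
```

Calling keepSameHost with the origin http://www.cnn.com/ and a list of outlinks would keep only the links on www.cnn.com, each paired with the origin value to store under the _origin_ key. Matching on the registered domain instead (so that edition.cnn.com also passes) would need a domain-suffix lookup, which this sketch leaves out.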

