See https://issues.apache.org/jira/browse/NUTCH-830
This patch has been kindly donated by Ant.com

On 23 June 2010 13:22, Julien Nioche <[email protected]> wrote:

> If the seed list is large, it would probably be easier to write a custom
> ScoringFilter combined with metadata in the Injector, e.g.:
>
> 1. Add a metadata entry to your seed list, e.g. '_origin_', with the seed
> URL as its value:
>
> http://www.cnn.com/ _origin_=http://www.cnn.com/
>
> (a bit tautological, I know, but never mind)
>
> 2. The custom scoring filter would take care of:
>
> - transmitting the origin metadata to its outlinks
> - removing from the outlinks the ones which do not have the same host /
> domain as the origin
>
> The method *distributeScoreToOutlinks* is where it all happens.
>
> Of course, if your seed list is not large, or if you don't mind editing
> the list of domains by hand, then Dennis' suggestion is a very good
> solution.
>
> J.
>
> PS: we've implemented that for one of my clients. I'll check with them
> whether they are happy to donate it to the project and if so add it to
> JIRA.
>
> --
> DigitalPebble Ltd
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>
>
> On 23 June 2010 04:35, Dennis Kubes <[email protected]> wrote:
>
>> Try using the DomainUrlFilter. You will need to do the following:
>>
>> 1. Activate the domain urlfilter in plugin.includes,
>>    urlfilter-(prefix|suffix|domain)..., in the nutch-site.xml file.
>> 2. In the conf directory, add your domains one per line to the
>>    domain-urlfilter.txt file. Entries can be domains (something.com),
>>    subdomains (www.something.com), or top-level identifiers (.com).
>>
>> This should work using both the crawl command and calling the individual
>> nutch commands directly.
>>
>> Dennis
>>
>>
>> On 06/22/2010 10:06 PM, Max Lynch wrote:
>>
>>> I know this is a very popular question based on the searching I've
>>> done... but I'm still really confused.
>>>
>>> I have a seed list that I want Nutch to crawl, and I want to do very
>>> deep crawling on each of those domains.
>>> However, I don't want Nutch to venture out of each domain on that
>>> list. Also, the list is large, which prevents me from building a
>>> crawl-urlfilter entry for each domain.
>>>
>>> I have tried
>>>
>>> <property>
>>>   <name>db.ignore.external.links</name>
>>>   <value>true</value>
>>> </property>
>>>
>>> but that seems to only hit the URLs I specify in the seed list, and
>>> doesn't seem to allow Nutch to venture more deeply into the domain
>>> itself. For example, it doesn't seem like it will follow a link on
>>> http://mydomain.com/index.html to http://mydomain.com/about.html.
>>>
>>> I have also tried db.max.outlinks.per.page, but that doesn't seem to
>>> do what I want either.
>>>
>>> Here is the crawl command I'm issuing:
>>>
>>> JAVA_HOME=/usr/lib/jvm/java-6-sun bin/nutch crawl urls -dir crawl2 -depth 15 -topN 100 &> crawl.log
>>>
>>> Out of a list of ~4500 seed URLs, Nutch only found 800 docs (404s
>>> account for most of those).
>>>
>>> Is there an easy way to do this?
>>>
>>> Thanks,
>>> Max

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
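The core of Julien's scoring-filter approach can be sketched in plain Java. This is only an illustration of the outlink-filtering step he describes for *distributeScoreToOutlinks* (keep an outlink only if its host matches the '_origin_' URL carried in the page metadata); in a real plugin the origin would live in the CrawlDatum/ParseData metadata and this logic would sit inside a ScoringFilter implementation. The class and method names below are hypothetical, not part of the donated patch.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Sketch of the filtering step inside distributeScoreToOutlinks:
// keep only the outlinks whose host matches the '_origin_' seed URL.
public class OriginFilterSketch {

    /** Returns the outlinks that share a host with the origin URL. */
    static List<String> keepSameHost(String origin, List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        String originHost = URI.create(origin).getHost();
        for (String link : outlinks) {
            String host = URI.create(link).getHost();
            if (originHost != null && originHost.equalsIgnoreCase(host)) {
                kept.add(link); // same host as the seed: follow it
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // origin as injected with the seed, e.g.
        // "http://www.cnn.com/  _origin_=http://www.cnn.com/"
        String origin = "http://www.cnn.com/";
        List<String> outlinks = List.of(
                "http://www.cnn.com/world/",
                "http://ads.example.com/banner",
                "http://www.cnn.com/about.html");
        // prints only the two www.cnn.com links
        System.out.println(keepSameHost(origin, outlinks));
    }
}
```

A domain-level (rather than host-level) comparison would strip the subdomain before comparing, so that www.cnn.com and edition.cnn.com count as the same origin.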
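For Dennis' alternative, the configuration might look like the following. The exact plugin.includes value depends on your Nutch version and which other plugins you use; this snippet only shows urlfilter-domain added to a typical list, as he describes.

```xml
<!-- nutch-site.xml: activate the domain URL filter (urlfilter-domain)
     alongside whatever other plugins your setup needs -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-(html)|index-basic|scoring-opic</value>
</property>
```

Then list the allowed domains in conf/domain-urlfilter.txt, one per line, e.g. `something.com`, `www.something.com`, or `.com`.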

