Hello everyone. I am trying to crawl only the .us TLD: for example, I want all the domains under com.us, net.us, and so on, and of course I have them all in my seed list.
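To be concrete, my seed file is just the plain-text list of start URLs that gets passed to the injector, one URL per line, along these lines (hypothetical hosts):

http://www.example.com.us/
http://www.example.net.us/
http://www.example.state.ut.us/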
I set the internal and external link properties in nutch-default.xml:

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from the
  same host are ignored. This is an effective way to limit the size of
  the link database, keeping only the highest quality links.</description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.</description>
</property>

But I still get some documents that are not in my seed list. Am I missing something?
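I know the description above says this avoids the need for complex URLFilters, but if a filter is the recommended route anyway, would something like the following in conf/regex-urlfilter.txt work? A rough sketch, assuming the default urlfilter-regex plugin is enabled; the host pattern is only illustrative:

# accept only URLs whose host ends in .us
+^http://([a-zA-Z0-9-]+\.)+us(/|$)
# reject everything else
-.

My understanding is that the rules are applied top-down and the first match wins, so the final "-." line would reject anything that did not already match the .us pattern, even if a redirect or outlink tried to pull in an outside host.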

