You can use the domain URL filter (the urlfilter-domain plugin) to crawl only URLs in the listed domains. The db.ignore.external.links property alone is not enough here: it only suppresses outlinks to other hosts at update time, so any URL that slips in (e.g. via redirects or your seeds themselves) is still fetched. A URL filter rejects unwanted URLs outright.
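A minimal sketch, assuming a stock Nutch layout (file names and the plugin list below are illustrative; merge with your existing plugin.includes rather than copying it verbatim). First enable the plugin in conf/nutch-site.xml:

```xml
<!-- conf/nutch-site.xml: add urlfilter-domain to the active plugins -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Then list the accepted suffixes in conf/domain-urlfilter.txt. Entries match by domain suffix, so a single line is enough to accept any host under .us:

```
# conf/domain-urlfilter.txt
# accept only hosts whose domain ends in .us (covers com.us, net.us, ...)
us
```

URLs whose host does not fall under a listed suffix are filtered out before fetching, regardless of how they entered the crawldb.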
> Hello everyone,
> I am trying to crawl only .us. For example, I want all domains under .us,
> i.e. com.us, net.us, etc. Of course I have them all in my seed list.
>
> I set internal and external links in nutch-default.xml:
> ......
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>false</value>
>   <description>If true, when adding new links to a page, links from
>   the same host are ignored. This is an effective way to limit the
>   size of the link database, keeping only the highest quality
>   links.
>   </description>
> </property>
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
>   <description>If true, outlinks leading from a page to external hosts
>   will be ignored. This is an effective way to limit the crawl to include
>   only initially injected hosts, without creating complex URLFilters.
>   </description>
> </property>
> ....
>
> But I still get some documents that are not in my seed list!
> Am I missing something?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Crawl-only-us-tp3639778p3639778.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

