Hello everyone,
I am trying to crawl only .us domains - for example, I want all domains
under com.us, net.us, etc.
Of course I have all of them in my seed list.

I set db.ignore.internal.links and db.ignore.external.links in nutch-default.xml:
......
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
....

But I still get some documents whose hosts are not in my seed list.
Am I missing something?
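
In case the properties alone aren't enough, I was also considering a rule in
conf/regex-urlfilter.txt to keep the crawl on .us hosts. A sketch (the exact
patterns are my guess, not tested):

# accept only URLs whose host ends in .us
+^https?://([a-zA-Z0-9-]+\.)+us(/|$)
# reject everything else (must come after the accept rule,
# since the first matching rule wins)
-.

Would that be the recommended way to do it, or should the two db.ignore
properties be sufficient on their own?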

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Crawl-only-us-tp3639778p3639778.html
Sent from the Nutch - User mailing list archive at Nabble.com.
