You can use the domain URL filter (the urlfilter-domain plugin) to restrict the crawl to URLs in the domains you list.
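A minimal sketch of that setup, assuming Nutch 1.x (the plugin list below is illustrative; keep your own entries and adjust for your version): enable urlfilter-domain in conf/nutch-site.xml, then list the allowed domains in conf/domain-urlfilter.txt, one per line. An entry can be a hostname, a domain, or a bare TLD such as "us".

```xml
<!-- conf/nutch-site.xml: add urlfilter-domain to the active plugins.
     The surrounding plugin names here are examples, not requirements. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-domain|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

```
# conf/domain-urlfilter.txt: a URL is kept only if its host matches a
# line below (TLD, domain, or hostname); everything else is rejected.
us
```

With the filter in place, outlinks outside the listed domains are dropped during generate/update, independently of the db.ignore.* settings.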

> Hello everyone
> I am trying to crawl only .us; for example, I want all domains under
> com.us, net.us, etc.
> Of course I have them all in my seed list.
> 
> I set the internal and external link properties in nutch-default.xml:
> ......
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>false</value>
>   <description>If true, when adding new links to a page, links from
>   the same host are ignored.  This is an effective way to limit the
>   size of the link database, keeping only the highest quality
>   links.
>   </description>
> </property>
> 
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
>   <description>If true, outlinks leading from a page to external hosts
>   will be ignored. This is an effective way to limit the crawl to include
>   only initially injected hosts, without creating complex URLFilters.
>   </description>
> </property>
> ....
> 
> But I still get some documents that are not in my seed list.
> Am I missing something?
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Crawl-only-us-tp3639778p3639778.html
> Sent from the Nutch - User mailing list archive at Nabble.com.