Re: Crawl only *.*.us

Sebastian Nagel Sun, 08 Jan 2012 09:25:25 -0800

Hi Waleed,

> in nutch-default.xml:
>
> <property>
>   <name>plugin.includes</name>
>   <value>domain-urlfilter.txt</value>
> </property>


No, plugin.includes is a regular expression over plugin names, not a
file name. You have to adapt the property so that, among the other
plugins, urlfilter-domain is accepted by the regular expression. E.g.:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-...</value>
</property>

> And in domain-urlfilter.txt :
> I add just :
> .us
> And then I 'll be OK to go ?

No, it should be just:

us

This thread might also help:
http://lucene.472066.n3.nabble.com/Getting-domain-urlfilter-to-work-td618253.html

But your first solution
> <property>
>    <name>db.ignore.external.links</name>
>    <value>true</value>
> </property>
should do the same. Only documents from the hosts in your seed list are crawled.

> But I still get some documents not in my seed !!??
If you want to crawl only the seed list, it's easier to set -depth to 1 and
set -topN high enough that your whole seed list fits in.
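
A rough sketch of that sizing, not a tested recipe: the helper names
count_seeds and crawl_cmd and the paths urls, urls/seed.txt and crawl are
placeholders made up for illustration. It counts the non-empty lines of the
seed file and prints (rather than runs) a one-round crawl command with
-topN set to that count.

```shell
#!/bin/sh
# Count seed URLs, assuming one URL per non-empty line in the seed file.
count_seeds() {
  grep -c '[^[:space:]]' "$1"
}

# Print a single-round crawl command sized to the seed list.
# $1 = seed directory passed to Nutch, $2 = seed file used for counting.
# The output directory "crawl" is a placeholder.
crawl_cmd() {
  printf 'bin/nutch crawl %s -dir crawl -depth 1 -topN %s\n' \
    "$1" "$(count_seeds "$2")"
}
```

For example, `crawl_cmd urls urls/seed.txt` only prints the command line, so
you can check the numbers before running it yourself.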

Bye, Sebastian
