Hi Waleed,
> in nutch-default.xml:
> <property>
> <name>plugin.includes</name>
> <value>domain-urlfilter.txt</value>
> </property>
No, plugin.includes is a regular expression over plugin IDs, not a file name.
You have to adapt the property so that urlfilter-domain is accepted by the
regular expression along with the other plugins you use. E.g.:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-(regex|domain)|parse-...</value>
</property>
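
To see why the file name does not work as a value, here is a rough sketch in Python (not Nutch code). It assumes a plugin is activated only when its whole ID matches the expression, and it uses a shortened, illustrative pattern rather than a complete plugin.includes value. The helper name is_included is made up for this example.

```python
import re

# Shortened, illustrative pattern; a real plugin.includes value lists more plugins.
PLUGIN_INCLUDES = r"protocol-http|urlfilter-(regex|domain)"


def is_included(plugin_id: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin ID matches the pattern (assumed semantics)."""
    return re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in ("protocol-http", "urlfilter-regex", "urlfilter-domain"):
        print(plugin_id, is_included(plugin_id))
    # With the file name as the pattern, the plugin ID is not matched at all.
    print(is_included("urlfilter-domain", r"domain-urlfilter.txt"))
```

The last call shows the problem with the original setting: the pattern domain-urlfilter.txt does not match the plugin ID urlfilter-domain, so the filter would never be loaded.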
> And in domain-urlfilter.txt :
> I add just :
> .us
> And then I 'll be OK to go ?
No, it should be just:
us
This thread might also help:
http://lucene.472066.n3.nabble.com/Getting-domain-urlfilter-to-work-td618253.html
But your first solution
> <property>
> <name>db.ignore.external.links</name>
> <value>true</value>
> </property>
should have the same effect: with it, only documents from the hosts in your
seed list are crawled.
> But I still get some documents not in my seed !!??
If you want to crawl only the URLs in the seed list, it's easier to set -depth
to 1 and to choose -topN at least as large as the number of seed URLs, so that
the whole seed list fits in.
Bye, Sebastian