On Tue, Jun 22, 2010 at 10:35 PM, Dennis Kubes <[email protected]> wrote:

> Try using the DomainUrlFilter.  You will need to do the following:
>
>  1. Activate the domain urlfilter in plugin.includes,
>     urlfilter-(prefix|suffix|domain)... in the nutch-site.xml file.
>  2. In the conf directory add your domains one per line to the
>     domain-urlfilter.txt file.  Entries can be domains
>     (something.com), subdomains (www.something.com), or top level
>     identifiers (.com)
>
> This should work using both the crawl command and calling the individual
> nutch commands directly.
>
> Dennis


That seems to be working, but there are so many documents that I can't tell
whether the filter is actually excluding everything it should.  Is there a way
to verify it?  I guess I could just parse the log output.
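
One way to check, as a rough sketch only: collect the fetched URLs into a
plain text file with one URL per line (from the logs or any other dump) and
compare each host against the entries in domain-urlfilter.txt.  The script
below assumes that a URL passes when its host, or any dot-separated suffix of
its host, appears in the file, and that lines starting with '#' are comments.
This approximates the behaviour described above and is not Nutch's own
matching code; the function names and the urls.txt file are hypothetical.

```python
import sys
from urllib.parse import urlparse


def load_domains(lines):
    """Read filter entries, one per line; skip blanks and '#' comments (assumed)."""
    domains = set()
    for line in lines:
        entry = line.strip().lower()
        if not entry or entry.startswith("#"):
            continue
        # Treat ".com" and "com" the same way (assumption).
        domains.add(entry.lstrip("."))
    return domains


def host_allowed(host, domains):
    """True if the host or any dot-separated suffix of it is a listed entry."""
    parts = host.lower().rstrip(".").split(".")
    return any(".".join(parts[i:]) in domains for i in range(len(parts)))


def find_violations(urls, domains):
    """Return the URLs whose host matches no entry in the filter list."""
    bad = []
    for url in urls:
        url = url.strip()
        if not url:
            continue
        host = urlparse(url).hostname or ""
        if not host_allowed(host, domains):
            bad.append(url)
    return bad


# Usage: python check_filter.py conf/domain-urlfilter.txt urls.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f:
        allowed = load_domains(f)
    with open(sys.argv[2]) as f:
        for url in find_violations(f, allowed):
            print(url)
```

If the script prints nothing, every URL in the list belongs to one of the
configured domains under the assumptions above.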
