On Tue, Jun 22, 2010 at 10:35 PM, Dennis Kubes <[email protected]> wrote:
> Try using the DomainUrlFilter. You will need to do the following:
>
> 1. Activate the domain urlfilter in plugin.includes,
>    urlfilter-(prefix|suffix|domain)... in the nutch-site.xml file.
> 2. In the conf directory add your domains one per line to the
>    domain-urlfilter.txt file. Entries can be domains
>    (something.com), subdomains (www.something.com), or top level
>    identifiers (.com)
>
> This should work using both the crawl command and calling the individual
> nutch commands directly.
>
> Dennis

That seems to be working, but there are so many documents that I can't tell
whether the filter is actually being applied. Is there a way to verify it? I
guess I could just parse the log output.
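One way to check without reading the logs might be to compare the crawled URLs against the same domain list. The following is only a rough sketch, not Nutch code: it assumes the URLs have already been extracted into a plain text file, one per line (for example from a crawldb dump, which I believe `bin/nutch readdb <crawldb> -dump <dir>` can produce), and it only approximates the filter's matching by accepting a URL when its host or any parent suffix of the host appears in domain-urlfilter.txt. The function names and the `#` comment handling are my own assumptions.

```python
import sys
from urllib.parse import urlparse


def load_domains(lines):
    """Read filter entries, one per line; blank lines and '#' lines are skipped.

    A leading dot (".com") is stripped so that "com" and ".com" are treated alike.
    """
    entries = set()
    for line in lines:
        entry = line.strip().lower()
        if entry and not entry.startswith("#"):
            entries.add(entry.lstrip("."))
    return entries


def is_allowed(url, entries):
    """Return True if the URL's host, or any parent suffix of it, is listed."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        return False
    labels = host.split(".")
    # For www.something.com this checks: www.something.com, something.com, com
    return any(".".join(labels[i:]) in entries for i in range(len(labels)))


def find_violations(urls, entries):
    """Return the URLs that the domain list would not have accepted."""
    return [u for u in urls if not is_allowed(u, entries)]


if __name__ == "__main__":
    # Usage (hypothetical file names):
    #   python check_domains.py conf/domain-urlfilter.txt urls.txt
    with open(sys.argv[1]) as f:
        allowed = load_domains(f)
    with open(sys.argv[2]) as f:
        crawled = [line.strip() for line in f if line.strip()]
    bad = find_violations(crawled, allowed)
    for url in bad:
        print(url)
    print(f"{len(bad)} of {len(crawled)} URLs fall outside the domain list")
```

If the script prints no URLs, every crawled URL belongs to a listed domain, which would suggest the filter is doing its job. Any URL it does print is worth checking by hand, since this matching is an approximation of what the plugin does.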

