You may still see some URLs that *seem* to be outside of your domain list while using the domain urlfilter. Remember the following:

  1. URLs are checked in order of domain suffix, domain name, and
     hostname.  If your list has .com and something.net, URLs in
     something.com will also get picked up (via the .com suffix).
  2. This doesn't handle redirects; it only handles generated URLs.  If
     your domain urls file has something.com and the original URL is
     http://something.com/something.html but it redirects to
     http://ww2.something.net/redirect/login.html, for example, the URL
     will still get crawled and saved.
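The checking order above can be sketched roughly as follows. This is an illustrative Python sketch of suffix/domain/hostname matching, not Nutch's actual Java code; the DOMAINS set stands in for the entries in domain-urlfilter.txt.

```python
from urllib.parse import urlparse

# Hypothetical domain list, as it might appear in domain-urlfilter.txt.
DOMAINS = {".com", "something.net"}

def accepted(url):
    """Rough sketch (not Nutch's code): the hostname is matched against
    the list as full hostname, registered domain, and top-level suffix."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    candidates = [host]                          # hostname, e.g. www.something.com
    if len(parts) >= 2:
        candidates.append(".".join(parts[-2:]))  # domain, e.g. something.com
    candidates.append("." + parts[-1])           # suffix, e.g. .com
    return any(c in DOMAINS for c in candidates)

print(accepted("http://something.com/page.html"))       # True, via the .com suffix
print(accepted("http://www.something.net/index.html"))  # True, via something.net
print(accepted("http://example.org/"))                  # False
```

This is why a .com entry pulls in something.com even though something.com is not listed explicitly.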

For verification, grep through the logs to be sure. Be aware of redirects if you see a few URLs that don't match your patterns; if you see a lot that don't match, then something isn't working.
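One way to do that check is to pull every URL out of the log and flag the ones whose host falls outside your expected domains. A minimal sketch, assuming your fetch log lines contain the URLs in plain text; the EXPECTED tuple, sample lines, and log format here are illustrative, so adjust them to your actual hadoop.log output.

```python
import re

# Hypothetical expected domains; in practice, mirror domain-urlfilter.txt.
EXPECTED = ("something.com", "something.net")

def off_domain_urls(log_lines):
    """Return URLs found in the log whose host is outside EXPECTED."""
    urls = []
    for line in log_lines:
        for url in re.findall(r"https?://[^\s\"']+", line):
            host = re.sub(r"^https?://", "", url).split("/")[0]
            if not host.endswith(EXPECTED):
                urls.append(url)
    return urls

# Illustrative log lines, not real Nutch output.
sample = [
    "fetching http://something.com/a.html",
    "fetching http://ww2.other.org/redirect/login.html",
]
print(off_domain_urls(sample))  # only the other.org URL is flagged
```

A handful of flagged URLs usually means redirects, as described above; a flood of them means the filter isn't active.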

Dennis


On 06/23/2010 02:52 PM, Max Lynch wrote:
On Tue, Jun 22, 2010 at 10:35 PM, Dennis Kubes<[email protected]>  wrote:

Try using the DomainUrlFilter.  You will need to do the following:

  1. Activate the domain urlfilter in plugin.includes,
     urlfilter-(prefix|suffix|domain)... in the nutch-site.xml file.
  2. In the conf directory, add your domains, one per line, to the
     domain-urlfilter.txt file.  Entries can be domains
     (something.com), subdomains (www.something.com), or top-level
     identifiers (.com).

This should work using both the crawl command and calling the individual
nutch commands directly.
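For step 1, the relevant nutch-site.xml property looks something like the fragment below. The exact plugin list varies by install, so this value is only illustrative; the important part is that urlfilter-domain appears in it. Keep whatever other plugins your setup already uses.

```xml
<!-- In conf/nutch-site.xml: ensure urlfilter-domain is in plugin.includes.
     The surrounding plugin names here are examples only. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-(html|tika)|index-basic|scoring-opic</value>
</property>
```

For step 2, domain-urlfilter.txt is just the bare entries, one per line, in any of the three forms listed above.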

Dennis

That seems to be working, but there are so many documents that I can't tell
whether the filter is working.  Is there a way to verify it?  I guess I could
just parse the log output.
