You may still see some URLs that *seem* to fall outside your domain
list while using the domain urlfilter. Keep the following in mind:
1. URLs are checked against entries in order of domain suffix, domain
   name, and hostname. If your file contains .com and something.net,
   URLs on something.com will also get picked up (they match the .com
   suffix).
2. The filter doesn't follow redirects; it only checks the generated
   URLs. If your domain file contains something.com and the original
   URL http://something.com/something.html redirects to
   http://ww2.something.net/redirect/login.html, for example, the
   redirect target will still get crawled and saved.
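The matching order described above can be sketched in a few lines of Python. This is not the actual Nutch DomainURLFilter source, just an illustration of the behavior as described: an entry can be a suffix (.com), a domain (something.com), or a hostname (www.something.com), and a URL passes if any of its host's suffix, domain, or full hostname appears in the entry set.

```python
# Sketch of the described matching behavior (not the real Nutch code).
from urllib.parse import urlparse

def load_entries(lines):
    # Normalize entries: strip whitespace and any leading dot.
    return {line.strip().lstrip(".").lower() for line in lines if line.strip()}

def accepts(url, entries):
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    suffix = parts[-1]              # e.g. "com"
    domain = ".".join(parts[-2:])   # e.g. "something.com"
    # Checked in order: suffix, domain name, hostname.
    return suffix in entries or domain in entries or host in entries

entries = load_entries([".com", "something.net"])
print(accepts("http://something.com/page.html", entries))  # True: ".com" suffix matches
print(accepts("http://www.other.org/page.html", entries))  # False: no entry matches
```

This is why something.com gets picked up when your file only lists .com and something.net: the suffix check matches before the domain name is ever consulted.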
To verify, grep through the logs. A few URLs that don't match your
patterns can usually be attributed to redirects; if a lot of them
don't match, then something isn't working.
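One way to do that grep is sketched below. The exact log line format depends on your Nutch version and log4j settings, so the sample log here is an assumption; substitute your real log (e.g. logs/hadoop.log) and your own domain patterns. The idea is to pull every fetched URL out of the log and list any hosts that fall outside your domains.

```shell
# Create a tiny fake log just to demonstrate the pipeline
# (replace /tmp/sample.log with your actual Nutch log file).
cat > /tmp/sample.log <<'EOF'
fetching http://something.com/index.html
fetching http://www.something.net/about.html
fetching http://ww2.other.org/redirect/login.html
EOF

# Extract hostnames from fetched URLs, then show any that do not
# end in one of the expected domains.
grep -o 'https*://[^/ ]*' /tmp/sample.log \
  | sed 's|https*://||' \
  | grep -v -E '(something\.com|something\.net)$' \
  | sort -u
```

In this sample, only ww2.other.org is reported; a short list like that is the redirect case described above, while a long list suggests the filter isn't active.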
Dennis
On 06/23/2010 02:52 PM, Max Lynch wrote:
On Tue, Jun 22, 2010 at 10:35 PM, Dennis Kubes <[email protected]> wrote:
Try using the DomainUrlFilter. You will need to do the following:
1. Activate the domain urlfilter in plugin.includes,
urlfilter-(prefix|suffix|domain)... in the nutch-site.xml file.
2. In the conf directory, add your domains, one per line, to the
   domain-urlfilter.txt file. Entries can be domains
   (something.com), subdomains (www.something.com), or top-level
   identifiers (.com).
This should work using both the crawl command and calling the individual
nutch commands directly.
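Putting the two steps together might look like the fragment below. This is a sketch, not a complete configuration: the plugin.includes value shown is illustrative, and you should start from the default value in your version's nutch-default.xml and add the domain urlfilter to it rather than copy this verbatim.

```
<!-- nutch-site.xml: enable the domain urlfilter alongside your
     existing plugins (extend your version's default value). -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>

# conf/domain-urlfilter.txt: one entry per line
something.com
www.something.com
.com
```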
Dennis
That seems to be working, but there are so many documents that I can't
tell whether the filter is effective. Is there a way to verify it? I
guess I could just parse the log output.