Answering my own question here; please correct me if I'm wrong. For the entries in regex-urlfilter.txt to apply to your crawl and indexing, you need to manually edit 'bin/crawl' and remove -noFilter from the 'nutch generate' command.
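For what it's worth, here is a rough sketch of that first edit as a sed one-liner rather than a hand edit. It assumes the 'nutch generate' call and its -noFilter flag sit on a single line of bin/crawl, which may not hold for every Nutch version, and strip_nofilter is just a name I made up for illustration.

```shell
# Sketch only: remove " -noFilter" from lines that invoke 'nutch generate'.
# Assumption: the generate call and the flag are on the same line.
# strip_nofilter is a hypothetical helper name, not part of Nutch.
strip_nofilter() {
  sed '/nutch generate/ s/ -noFilter//'
}

# Example use (writes a patched copy instead of editing in place):
#   strip_nofilter < bin/crawl > bin/crawl.filtered
```

Lines that do not contain 'nutch generate' pass through unchanged, so it is worth diffing the patched copy against the original before swapping it in.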
Additionally, you need to edit the portion that calls 'nutch solrindex' and add '-filter' to the solrindex call.

________________________________
From: Os Tyler
Sent: Tuesday, July 30, 2013 3:26 PM
To: [email protected]
Subject: regex-urlfilter test shows negative, but URL still crawled

I have an entry in regex-urlfilter.txt designed to prevent crawling of urls that are part of our UPS search app.

# skip URLs from the UPS search app
-\?ups=
-index.php/ups\?aa

When I test the urls, it appears that regex-urlfilter should exclude them, for example:

echo "http://redacted.com/index.php/ups?aa" | /usr/local/apache-nutch/bin/nutch org/apache/nutch/net/URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
-http://redacted.com/index.php/ups?aa

But when I run 'crawl', it does not skip these urls.

Thanks for any help in showing me what I'm missing here.

