What is the correct way to verify a pattern using URLFilterChecker after adding it to conf/regex-urlfilter.txt? I know I'll need to rerun ant to get the conf change into the MapReduce job once the pattern excludes what I intend.
In conf/regex-urlfilter.txt, before my whitelist, I added:

    -.*cabinetobituaries/.*

At a command prompt I run:

    runtime/deploy/bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

The output says "Checking combination of all URLFilters available". I then press Enter and get the following:

    15/02/09 19:28:25 INFO plugin.PluginRepository: Plugins: looking in: /data/hadoop/hadoop-unjar490367744495018237/classes/plugins
    15/02/09 19:28:25 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
    15/02/09 19:28:25 INFO plugin.PluginRepository: Registered Plugins:
    <--- SNIP --->
    15/02/09 19:28:25 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
    <--- SNIP --->
    15/02/09 19:28:25 INFO conf.Configuration: found resource regex-urlfilter.txt at file:/data/hadoop/hadoop-unjar490367744495018237/regex-urlfilter.txt
    -

If I enter

    -http://www.cabinet.com/cabinet/cabinetobituaries/1054824-435/robert-g.-judy.html

and press Enter, the output is:

    --http://www.cabinet.com/cabinet/cabinetobituaries/1054824-435/robert-g.-judy.html

If I press Enter without providing a URL, the output is a blank line followed by a dash:

    -

I'm not sure what to expect as a response, or whether these results indicate a pass or a failure.

Scott Lundgren
Software Engineer
QuietStream Financial, LLC
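
To help read the output above, here is a minimal Python sketch of how I understand the checker to behave. It is not Nutch code. It assumes that the rules in regex-urlfilter.txt are tried top to bottom, that the first pattern found anywhere in the URL decides the result, that a URL matching no rule is rejected, and that the checker echoes the input prefixed with "+" when accepted and "-" when rejected. The whitelist pattern and the second sample URL are hypothetical stand-ins, since the real whitelist is not shown in the message.

```python
import re

# Hypothetical rule list mirroring conf/regex-urlfilter.txt, in file order.
# Each entry is (sign, pattern): "-" excludes, "+" includes.
RULES = [
    ("-", r".*cabinetobituaries/.*"),       # the exclusion quoted in the message
    ("+", r"^http://www\.cabinet\.com/"),   # assumed stand-in for the whitelist
]


def check(url, rules=RULES):
    """Return the line the checker is assumed to print for one input URL."""
    for sign, pattern in rules:
        # First rule whose pattern is found anywhere in the URL wins.
        if re.search(pattern, url):
            return sign + url
    # No rule matched: treated as rejected.
    return "-" + url


if __name__ == "__main__":
    obit = ("http://www.cabinet.com/cabinet/cabinetobituaries/"
            "1054824-435/robert-g.-judy.html")
    print(check(obit))                            # URL typed without a leading dash
    print(check("-" + obit))                      # URL typed with a leading dash
    print(check("http://www.cabinet.com/news/"))  # hypothetical non-obituary URL
    print(check(""))                              # empty input line
```

Under these assumptions, a single leading "-" in the output means the typed URL was rejected and a leading "+" means it was accepted. An input that itself starts with "-" comes back with two dashes whether or not the new exclusion rule is present, because it would also fail a whitelist anchored at "http", so typing the URL without the leading dash is the more informative test.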