If my seeds file contains only http://www.bizjournals.com/triangle/ and url-regexfilter.txt contains

    # whitelist
    +^https?://www.bizjournals.com/triangle/blog/techflash/.*
    # blacklist
    -^https?://www.bizjournals.com/.*

will Nutch crawl http://www.bizjournals.com/triangle/blog/techflash/ ?

The problem I'm trying to solve is that I want Nutch to crawl http://www.bizjournals.com/triangle/news/ and http://www.bizjournals.com/triangle/blog/techflash/ but ignore other URLs within the site, such as http://www.bizjournals.com/boston/, http://www.bizjournals.com/, and http://www.bizjournals.com/triangle/blog/.

Do the whitelist patterns overrule the patterns in the blacklist? Or do I need a more complex regex pattern that allows the subdirectories I'm interested in crawling while excluding the parent directories of those subdirectories?

Scott Lundgren
Software Engineer
(704) 973-7388
slundgren@quietstreamfinancial.com
QuietStream Financial, LLC
11121 Carmel Commons Boulevard | Suite 250
Charlotte, North Carolina 28226
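[Editor's note] As I understand it, Nutch's default RegexURLFilter evaluates the rules in regex-urlfilter.txt top to bottom and the first matching rule decides: a `+` rule accepts, a `-` rule rejects, and a URL matching no rule is rejected. So the "whitelist" rules don't overrule the "blacklist" as a category; they win only because they appear first. Under that assumption, ordering `+` rules for the two wanted subdirectories before a `-` catch-all for the rest of the site should give the behavior described above. The Python sketch below (RULES and accepts() are illustrative names, not part of Nutch) simulates that first-match-wins evaluation against the URLs in the question:

```python
# Minimal sketch, ASSUMING Nutch's default RegexURLFilter semantics:
# rules are applied top to bottom, the FIRST matching rule decides
# ('+' accepts, '-' rejects), and a URL matching no rule is rejected.
import re

# Mirrors a proposed regex-urlfilter.txt ordering: include the two wanted
# subdirectories first, then exclude everything else on the site.
RULES = [
    ("+", r"^https?://www\.bizjournals\.com/triangle/news/.*"),
    ("+", r"^https?://www\.bizjournals\.com/triangle/blog/techflash/.*"),
    ("-", r"^https?://www\.bizjournals\.com/.*"),
]

def accepts(url):
    """Return True if the first matching rule is a '+' rule."""
    for sign, pattern in RULES:
        if re.match(pattern, url):
            return sign == "+"
    return False  # no rule matched: reject

# The two wanted subdirectories pass; /boston/, the site root, and
# /triangle/blog/ fall through to the catch-all '-' rule and are rejected.
```

One caveat, hedged: with these rules the seed URL http://www.bizjournals.com/triangle/ itself matches only the `-` catch-all, and if Nutch applies URL filters at inject time as well as during fetching, the seed would be filtered out; in that case it may be necessary to seed the wanted subdirectories directly or add a non-recursive `+` rule for the seed.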

