If my seeds file contains only http://www.bizjournals.com/triangle/ and 
url-regexfilter.txt contains

# whitelist
+^https?://www.bizjournals.com/triangle/blog/techflash/.*
# blacklist
-^https?://www.bizjournals.com/.*

will Nutch crawl http://www.bizjournals.com/triangle/blog/techflash/?

The problem I’m trying to solve is that I want Nutch to crawl 
http://www.bizjournals.com/triangle/news/ and 
http://www.bizjournals.com/triangle/blog/techflash/ but ignore other URLs 
within the site, such as http://www.bizjournals.com/boston/, 
http://www.bizjournals.com/, and http://www.bizjournals.com/triangle/blog/.

Do the whitelist patterns overrule the patterns in the blacklist? Or do I 
need a more complex regex pattern that allows the subdirectories I’m 
interested in crawling while excluding the parent directories of those 
subdirectories?
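For concreteness, here is the ordering I would try next, on the assumption 
that Nutch applies the regex filter rules top to bottom and uses the first 
rule that matches (that first-match behavior is my understanding, not 
something I’ve verified):

```
# whitelist: the two subdirectories I want crawled
+^https?://www\.bizjournals\.com/triangle/news/
+^https?://www\.bizjournals\.com/triangle/blog/techflash/
# blacklist: everything else on the site
-^https?://www\.bizjournals\.com/
```

One wrinkle I’m unsure about: whether the seed URL 
http://www.bizjournals.com/triangle/ itself also has to pass the filter for 
the crawl to start at all, since the blacklist rule above would reject it.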

Scott Lundgren
Software Engineer
(704) 973-7388
[email protected]

QuietStream Financial, LLC (http://www.quietstreamfinancial.com)
11121 Carmel Commons Boulevard | Suite 250
Charlotte, North Carolina 28226

