I want to blacklist certain top-level domains for a very large web crawl. I 
tried using the domainblacklist urlfilter in Nutch 1.12, but that doesn't seem 
to work.

My domainblacklist-urlfilter.txt contains lines like the following.

cn
jp
line.me
albooked.com
booked.co.il


The TLDs do not get blocked, but the other listed domains do get blocked.

I suppose I could compose regexes, but that is tricky to do accurately because I 
don't want to block URLs that happen to have ".cn" or ".jp" in the middle of 
them.
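
To illustrate the regex approach, here is a minimal sketch of a pattern anchored to the host part of the URL, so that ".cn" or ".jp" only counts when it is the last label of the hostname. The class name TldBlockSketch, the method isBlocked, and the pattern itself are my own assumptions for illustration, not anything taken from Nutch, and the pattern only covers http/https URLs without userinfo.

```java
import java.util.regex.Pattern;

class TldBlockSketch {
    // Hypothetical pattern: scheme, one or more dot-terminated host labels,
    // then the TLD, which must be followed by a port, '/', '?', '#' or the
    // end of the URL. This keeps ".cn" in a path or mid-host from matching.
    static final Pattern BLOCKED_TLD = Pattern.compile(
        "^https?://([a-z0-9-]+\\.)+(cn|jp)(:[0-9]+)?(/|\\?|#|$)",
        Pattern.CASE_INSENSITIVE);

    // Returns true when the URL's hostname ends in one of the listed TLDs.
    static boolean isBlocked(String url) {
        return BLOCKED_TLD.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.cn/",            // host ends in .cn
            "http://www.example.jp/page",    // host ends in .jp
            "http://example.jp:8080/x",      // .jp followed by a port
            "http://example.com/foo.cn/bar", // .cn only in the path
            "http://example.cn.com/",        // .cn in the middle of the host
            "http://news.cnn.com/"           // "cn" is only a prefix of a label
        };
        for (String u : urls) {
            System.out.println(isBlocked(u) + "  " + u);
        }
    }
}
```

If I understand the regex urlfilter correctly, the same expression prefixed with "-" could go on its own line in regex-urlfilter.txt, but I have not verified that against 1.12.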

Would I need to change the source code of DomainBlacklistUrlFilter, or is there 
an easier solution?
