I want to blacklist certain top-level domains for a very large web crawl. I tried using the domainblacklist urlfilter in Nutch 1.12, but that doesn't seem to work.
My domainblacklist-urlfilter.txt contains lines like the following:

cn
jp
line.me
albooked.com
booked.co.il

The TLDs do not get blocked, but the other listed domains do. I suppose I could compose regexes, but that is tricky to do accurately because I don't want to block URLs that merely happen to have ".cn" or ".jp" in the middle of them. Would I need to change the source code of DomainBlacklistUrlFilter, or is there an easier solution?
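
To illustrate the matching I am after, here is a minimal sketch in plain Java. It is not Nutch code and does not use the Nutch plugin API: the class name TldCheck, the method isBlocked, and the hard-coded set of TLDs are all hypothetical, chosen only to show that the check should look at the last label of the host and ignore the rest of the URL.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

public class TldCheck {
    // Hypothetical blacklist: lowercase TLDs without a leading dot.
    private static final Set<String> BLOCKED_TLDS = Set.of("cn", "jp");

    // Returns true only when the host's last label is a blocked TLD.
    public static boolean isBlocked(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: leave the decision to other filters
        }
        if (host == null) {
            return false; // no host component to inspect
        }
        // Take everything after the last dot; if there is no dot, the whole host.
        String tld = host.substring(host.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
        return BLOCKED_TLDS.contains(tld);
    }

    public static void main(String[] args) {
        // Host ends in .cn, so this should be blocked.
        System.out.println(isBlocked("http://www.example.cn/index.html"));
        // ".cn" and ".jp" appear only in the path, so this should pass.
        System.out.println(isBlocked("http://example.com/mirror.cn/page.jp"));
    }
}
```

A regex applied to the whole URL string would need anchoring to the host part to get the same result, which is the part I find error-prone.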

