Hi Michael,

on the Common Crawl Nutch fork there is a plugin "urlfilter-fast" which does this, see
https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java

It uses exactly this concept of "domain", i.e., a suffix of host-name parts. You would have to write your rules as

  Domain cn
  DenyPath .*

  Domain line.me
  DenyPath .*

The name "fast" is not really informative. It's usually faster than regex-urlfilter for two reasons:
- regex rules are applied per host or "domain"
- matching regex patterns can be expensive on long strings; by limiting the match to the path only, the strings get somewhat shorter

I plan to push this URL filter to the main branch of Nutch. Currently, the filter can hold several 100,000s of denied domains. I've also used it with 2 million, but then at least 2 GB of memory are recommended for the Nutch tasks. I hope to scale it up by replacing the hash that holds the domains with a trie or automaton.

Best,
Sebastian

On 06/14/2018 07:46 PM, Michael Coffey wrote:
> I want to blacklist certain top-level domains for a very large web crawl. I
> tried using the domainblacklist urlfilter in Nutch 1.12, but that doesn't
> seem to work.
>
> My domainblacklist-urlfilter.txt contains lines like the following.
>
> cn
> jp
> line.me
> albooked.com
> booked.co.il
>
> The TLDs do not get blocked, but the other listed domains do get blocked.
>
> I suppose I could compose regexes, but that is tricky to do accurately because
> I don't want to block urls that happen to have ".cn" or ".jp" in the middle
> of them.
>
> Would I need to change the source code of DomainBlacklistUrlFilter, or is
> there an easier solution?
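P.S.: To make the "suffix of host-name parts" idea concrete, here is a minimal sketch in Java (not the actual FastURLFilter code; class and method names are made up for illustration). The point is that the match happens on a part boundary, so a rule "cn" blocks "example.cn" but never a host that merely contains ".cn" in the middle:

```java
/**
 * Hypothetical sketch of domain-suffix matching as used by the
 * "Domain" rules described above. A blocked domain matches a host
 * if the host equals the domain or ends with "." + domain.
 */
public class DomainSuffixMatch {

    /** True if host matches the blocked domain on a host-name-part boundary. */
    static boolean matchesDomainSuffix(String host, String domain) {
        if (host.equals(domain)) {
            return true;
        }
        // The leading dot guarantees we only match whole host-name parts,
        // so "cn.example.com" is NOT blocked by the rule "cn".
        return host.endsWith("." + domain);
    }

    public static void main(String[] args) {
        System.out.println(matchesDomainSuffix("www.example.cn", "cn"));        // true
        System.out.println(matchesDomainSuffix("cn.example.com", "cn"));        // false
        System.out.println(matchesDomainSuffix("timeline.line.me", "line.me")); // true
    }
}
```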

