Hi Michael,

on the Common Crawl Nutch fork there is a plugin "urlfilter-fast" which does this, see
https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java

It uses exactly this concept of "domain", i.e., a suffix of host-name parts. You would have to write your rules as

  Domain cn
  DenyPath .*

  Domain line.me
  DenyPath .*

The name "fast" is not really informative. It's usually faster than regex-urlfilter for two reasons:
- regex rules are applied per host or "domain"
- matching regex patterns can be expensive on long strings; by limiting the match to the path only, the strings get somewhat shorter

I plan to push this URL filter to the main branch of Nutch. Currently, the filter can hold several 100,000s of denied domains. I've also used it with 2 million, but then at least 2 GB of memory are recommended for the Nutch tasks. I hope to scale it up by replacing the hash that holds the domains with a trie or automaton.

Best,
Sebastian

On 06/14/2018 07:46 PM, Michael Coffey wrote:
> I want to blacklist certain top-level domains for a very large web crawl. I
> tried using the domainblacklist urlfilter in Nutch 1.12, but that doesn't
> seem to work.
>
> My domainblacklist-urlfilter.txt contains lines like the following.
>
> cn
> jp
> line.me
> albooked.com
> booked.co.il
>
> The TLDs do not get blocked, but the other listed domains do get blocked.
>
> I suppose I could compose regexes, but that is tricky to do accurately because
> I don't want to block urls that happen to have ".cn" or ".jp" in the middle
> of them.
>
> Would I need to change the source code of DomainBlacklistUrlFilter, or is
> there an easier solution?
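P.S.: To make the "suffix of host-name parts" idea concrete, here is a minimal sketch in Java (not the actual FastURLFilter code; class and method names are made up for illustration). The point is that the match happens on a part boundary, so a rule "cn" blocks "example.cn" but never a host that merely contains ".cn" in the middle:

```java
/**
 * Hypothetical sketch of domain-suffix matching as used by the
 * "Domain" rules described above. A blocked domain matches a host
 * if the host equals the domain or ends with "." + domain.
 */
public class DomainSuffixMatch {

    /** True if host matches the blocked domain on a host-name-part boundary. */
    static boolean matchesDomainSuffix(String host, String domain) {
        if (host.equals(domain)) {
            return true;
        }
        // The leading dot guarantees we only match whole host-name parts,
        // so "cn.example.com" is NOT blocked by the rule "cn".
        return host.endsWith("." + domain);
    }

    public static void main(String[] args) {
        System.out.println(matchesDomainSuffix("www.example.cn", "cn"));        // true
        System.out.println(matchesDomainSuffix("cn.example.com", "cn"));        // false
        System.out.println(matchesDomainSuffix("timeline.line.me", "line.me")); // true
    }
}
```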

