On 30/11/2011 22:00, Ken Krugler wrote:
Normally when I see a 1M entry URL filter, it's doing domain-level filtering.
If that's the case, I'd use a BloomFilter, which has worked well for us in the
past during large-scale crawls.
There's an implementation of a Bloom filter in Hadoop. Since the number of
items is known in advance, it's possible to pick the right size of the
filter to keep the error rate at an acceptable level.
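Hadoop ships this as org.apache.hadoop.util.bloom.BloomFilter, whose
constructor takes the bit-vector size and hash count directly. As a rough
sketch (class and method names here are illustrative, not Hadoop's API),
the standard formulas for sizing a filter from a known item count n and a
target false-positive rate p look like this:

```java
// Sketch: sizing a Bloom filter when the item count is known in advance.
// Uses the standard optimal-size formulas; names are illustrative.
public class BloomSizing {

    // Optimal bit-array size m for n items at false-positive rate p:
    // m = -n * ln(p) / (ln 2)^2
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Optimal number of hash functions k = (m / n) * ln 2
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000;          // known number of entries
        double p = 0.01;             // acceptable error rate
        long m = optimalBits(n, p);  // roughly 9.6M bits, i.e. about 1.2 MB
        int k = optimalHashes(n, m);
        System.out.println("bits=" + m + " hashes=" + k);
    }
}
```

For 1M entries at a 1% error rate this comes out to just over a megabyte
of bit vector, which is why the structure works so well for crawl-scale
URL filtering.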
One trick you may consider when using Bloom filters is to keep an
additional list of exceptions, i.e. common items that are known to give
false positives. If you properly balance the size of the filter against
the size of the exception list, you can keep the total size of the
structure down while improving the error rate.
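A minimal sketch of that trick, assuming a hand-rolled filter (the class,
the double-hashing scheme, and all names here are illustrative; a real
deployment would use Hadoop's or another tested implementation):

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Sketch: a small Bloom filter paired with an exact exception set of
// known false positives. Membership is "in the filter AND not a known
// false positive".
public class FilteredBloom {
    private final BitSet bits;
    private final int size;
    private final int hashes;
    private final Set<String> exceptions = new HashSet<>();

    FilteredBloom(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Simple double hashing; illustrative only, a real filter would use
    // stronger, independent hash functions.
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1;  // force odd so the stride is nonzero
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String item) {
        for (int i = 0; i < hashes; i++) bits.set(index(item, i));
    }

    // Record a common item that the filter wrongly reports as present.
    void addException(String item) { exceptions.add(item); }

    boolean mightContain(String item) {
        if (exceptions.contains(item)) return false;  // known false positive
        for (int i = 0; i < hashes; i++)
            if (!bits.get(index(item, i))) return false;
        return true;
    }
}
```

The exception list is built offline: run the frequent non-member items
through the finished filter and record the ones it wrongly accepts. Since
only the common false positives need exact storage, the list stays small
relative to the filter itself.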
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com