On 30/11/2011 22:00, Ken Krugler wrote:
Normally when I see a 1M entry URL filter, it's doing domain-level filtering.

If that's the case, I'd use a BloomFilter, which has worked well for us in the 
past during large-scale crawls.

There's an implementation of a Bloom filter in Hadoop. Since the number of items is known in advance, it's possible to pick the right size of filter to keep the error rate at an acceptable level.
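For reference, the textbook sizing formulas are m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hash functions, for n items at target false-positive rate p. A quick sketch of the arithmetic (method names are mine, not Hadoop's API):

```java
// Sizing a Bloom filter when the number of items n is known in advance.
// Standard formulas: m = -n*ln(p) / (ln 2)^2 bits, k = (m/n)*ln(2) hashes.
public class BloomSizing {
    // Optimal number of bits for n items at false-positive rate p.
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Optimal number of hash functions for m bits and n items.
    static int optimalHashes(long m, long n) {
        return (int) Math.round((double) m / n * Math.log(2));
    }

    public static void main(String[] args) {
        long m = optimalBits(1_000_000, 0.01);   // ~9.6M bits, a bit over 1 MB
        int k = optimalHashes(m, 1_000_000);     // 7 hash functions
        System.out.println("bits=" + m + " hashes=" + k);
    }
}
```

So a 1M-entry filter at a 1% error rate costs roughly a megabyte of memory, which is why this works so well for crawl-scale URL filtering.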

One trick you may consider when using Bloom filters is to keep an additional list of exceptions, i.e. common items that are known to give false positives. If you properly balance the size of the filter and the size of the exception list, you can still keep the total size of the structure down while improving the effective error rate.
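The exception-list trick amounts to "in the filter AND not in the exception list". A minimal sketch, assuming a toy filter with double hashing (class and method names are illustrative; a real crawler would use Hadoop's org.apache.hadoop.util.bloom.BloomFilter rather than this):

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Sketch of a Bloom filter combined with an exception list: membership is
// "bits set in the filter AND not listed as a known false positive".
public class BloomWithExceptions {
    private final BitSet bits;
    private final int m;  // number of bits in the filter
    private final int k;  // number of hash probes per item
    private final Set<String> exceptions = new HashSet<>();  // known false positives

    public BloomWithExceptions(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Double hashing: probe i uses h1 + i*h2, with h2 forced odd.
    private int probe(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1;
        return Math.floorMod(h1 + i * h2, m);
    }

    public void add(String item) {
        for (int i = 0; i < k; i++) bits.set(probe(item, i));
    }

    public void addException(String item) {
        exceptions.add(item);
    }

    // The filter never gives false negatives; the exception list cancels
    // the known false positives.
    public boolean contains(String item) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(probe(item, i))) return false;
        }
        return !exceptions.contains(item);
    }
}
```

The exception list only needs to hold the false positives you actually observe on common inputs, so it stays far smaller than the set the filter represents.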

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
