On 30/11/2011 22:00, Ken Krugler wrote:
Normally when I see a 1M entry URL filter, it's doing domain-level filtering.
If that's the case, I'd use a BloomFilter, which has worked well for us in the
past during large-scale crawls.
There's an implementation of a Bloom filter in Hadoop. Since the number of
items is known in advance, it's possible to pick the right size of the
filter to keep the error rate at an acceptable level.
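Hadoop ships this as org.apache.hadoop.util.bloom.BloomFilter, whose
constructor takes the bit-vector size and hash count directly. As a rough
sketch (class and method names here are illustrative, not Hadoop's API),
the standard formulas for sizing a filter from a known item count n and a
target false-positive rate p look like this:

```java
// Sketch: sizing a Bloom filter when the item count is known in advance.
// Uses the standard optimal-size formulas; names are illustrative.
public class BloomSizing {

    // Optimal bit-array size m for n items at false-positive rate p:
    // m = -n * ln(p) / (ln 2)^2
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Optimal number of hash functions k = (m / n) * ln 2
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000;          // known number of entries
        double p = 0.01;             // acceptable error rate
        long m = optimalBits(n, p);  // roughly 9.6M bits, i.e. about 1.2 MB
        int k = optimalHashes(n, m);
        System.out.println("bits=" + m + " hashes=" + k);
    }
}
```

For 1M entries at a 1% error rate this comes out to just over a megabyte
of bit vector, which is why the structure works so well for crawl-scale
URL filtering.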
One trick you may consider when using Bloom filters is to keep an
additional list of exceptions, i.e. common items that are known to give
false positives. If you properly balance the size of the filter against
the size of the exception list, you can keep the total size of the
structure down while improving the error rate.
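A minimal sketch of that trick, assuming a hand-rolled filter (the class,
the double-hashing scheme, and all names here are illustrative; a real
deployment would use Hadoop's or another tested implementation):

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Sketch: a small Bloom filter paired with an exact exception set of
// known false positives. Membership is "in the filter AND not a known
// false positive".
public class FilteredBloom {
    private final BitSet bits;
    private final int size;
    private final int hashes;
    private final Set<String> exceptions = new HashSet<>();

    FilteredBloom(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Simple double hashing; illustrative only, a real filter would use
    // stronger, independent hash functions.
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1;  // force odd so the stride is nonzero
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String item) {
        for (int i = 0; i < hashes; i++) bits.set(index(item, i));
    }

    // Record a common item that the filter wrongly reports as present.
    void addException(String item) { exceptions.add(item); }

    boolean mightContain(String item) {
        if (exceptions.contains(item)) return false;  // known false positive
        for (int i = 0; i < hashes; i++)
            if (!bits.get(index(item, i))) return false;
        return true;
    }
}
```

The exception list is built offline: run the frequent non-member items
through the finished filter and record the ones it wrongly accepts. Since
only the common false positives need exact storage, the list stays small
relative to the filter itself.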
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com