Re: Very large filter lists

Markus Jelsma Tue, 06 Dec 2011 03:56:42 -0800


On Monday 05 December 2011 18:37:25 Markus Jelsma wrote:
> We use Bloom filters as well, but instead of a domain filter, for
> which a Bloom filter would be a good choice, we have a subdomain
> normalizer. We need to look up a key and get something back.
> 
> Now, I've checked the code again and both the normalizers and the filters
> are instantiated in each thread. This takes significant additional heap
> space.
> 
> Are there any objections to sharing them between threads? I assume things
> will get a lot slower. Or could I just share the HashMap between instances?
> Suggestions?


Well, I remembered some pieces of Java concurrency. The map is now static 
final, and the method that builds the structure is synchronized and checks 
whether it has to rebuild the map. It seems to run fine. It is a plain HashMap 
because it is read-only once built, so there is no need for a 
ConcurrentHashMap.
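
For anyone following along, a minimal sketch of that arrangement could look
like the block below. It is not the actual plugin code: the class name
SubDomainNormalizer, the method names, and the sample entries are all
hypothetical, and a real build step would presumably read its rules from a
configuration file instead.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical normalizer; names and sample data are assumptions for illustration.
class SubDomainNormalizer {
    // One map shared by all instances, and therefore by all fetcher threads.
    private static final Map<String, String> RULES = new HashMap<String, String>();

    // Guarded by the class lock taken in buildIfNeeded().
    private static boolean built = false;

    // Every instance passes through this method before its first lookup.
    // Only the first caller populates the map; later callers see built == true
    // and return. Because they acquire the same lock, they also see the
    // fully populated map.
    private static synchronized void buildIfNeeded() {
        if (built) {
            return;
        }
        // Assumption: stand-in entries; real rules would be loaded from a file.
        RULES.put("www.example.org", "example.org");
        RULES.put("m.example.org", "example.org");
        built = true;
    }

    SubDomainNormalizer() {
        buildIfNeeded();
    }

    // Read-only lookup; unknown hosts are returned unchanged.
    String normalize(String host) {
        String normalized = RULES.get(host);
        return normalized != null ? normalized : host;
    }
}
```

The plain HashMap is only safe here because nothing writes to it after the
build step has finished. If the map ever had to be modified while threads are
reading it, this sketch would no longer hold.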

> 
> This is about a custom fetcher that does parsing and outlink processing as
> well.
> 
> On Wednesday 30 November 2011 22:41:58 Andrzej Bialecki wrote:
> > There's an implementation of a Bloom filter in Hadoop. Since the number
> > of items is known in advance, it's possible to pick the right size for
> > the filter to keep the error rate at an acceptable level.
> > 
> > One trick that you may consider when using Bloom filters is to have an
> > additional list of exceptions, i.e. common items that give false
> > positives. If you properly balance the size of the filter and the size
> > of the exception list you can still keep the total size of the structure
> > down while improving the error rate.
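
To make the sizing and the exception-list trick above concrete, here is a
rough, self-contained sketch. It deliberately does not use the Hadoop Bloom
filter class; it is a toy filter built on java.util.BitSet, and the class
name BloomWithExceptions, its methods, and the hashing scheme are assumptions
made for illustration. The sizing uses the standard formulas
m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Toy Bloom filter plus an exact exception list; all names are hypothetical.
class BloomWithExceptions {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;
    // Keys known NOT to be members but which the filter reports as present.
    private final Set<String> exceptions = new HashSet<String>();

    BloomWithExceptions(int expectedItems, double falsePositiveRate) {
        double ln2 = Math.log(2);
        // m = -n ln(p) / (ln 2)^2
        numBits = Math.max(1, (int) Math.ceil(
                -expectedItems * Math.log(falsePositiveRate) / (ln2 * ln2)));
        // k = (m / n) ln 2
        numHashes = Math.max(1, (int) Math.round(
                (double) numBits / expectedItems * ln2));
        bits = new BitSet(numBits);
    }

    // Second hash (FNV-1a over the chars), used for double hashing.
    private static int fnv1a(String s) {
        int h = 0x811C9DC5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x01000193;
        }
        return h;
    }

    // i-th bit position for a key: (h1 + i * h2) mod m.
    private int index(String key, int i) {
        return Math.floorMod(key.hashCode() + i * (fnv1a(key) | 1), numBits);
    }

    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    // Record a common non-member that the filter would wrongly accept.
    void addException(String key) {
        exceptions.add(key);
    }

    // True only if the filter accepts the key and it is not a listed exception.
    boolean contains(String key) {
        if (exceptions.contains(key)) {
            return false;
        }
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

The exception list only helps for false positives that are known ahead of
time, for example by running frequent non-member keys through the filter
once after it has been built.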

-- 
Markus Jelsma - CTO - Openindex
