You can use the one that comes with Hadoop,
org.apache.hadoop.util.bloom.DynamicBloomFilter. It is in the core jar.
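For anyone who hasn't used one: a Bloom filter is a bit array plus k hash functions, giving fast, memory-cheap membership tests with a tunable false-positive rate and no false negatives. Below is a minimal, self-contained fixed-size sketch of the idea -- the Hadoop DynamicBloomFilter additionally grows as keys are added, and the class name and hash-combining scheme here are illustrative, not the Hadoop API:

```java
import java.util.BitSet;

// Minimal fixed-size Bloom filter sketch. Hadoop's DynamicBloomFilter
// wraps a growing list of such filters; this class is illustrative only.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    SimpleBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from two base hashes
    // (the Kirsch-Mitzenmacher double-hashing trick).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16); // cheap second hash
        return Math.abs((h1 + i * h2) % size);
    }

    void add(String key) {
        for (int i = 0; i < numHashes; i++) bits.set(position(key, i));
    }

    // True if the key *may* have been added; false means definitely not.
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++)
            if (!bits.get(position(key, i))) return false;
        return true;
    }
}
```

Keys can never be removed, and mightContain can return false positives; sizing the bit array and hash count to the expected number of collocations controls that rate.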

On Thu, May 27, 2010 at 11:41 AM, Grant Ingersoll <[email protected]>wrote:

> So, do we have a Bloom Filter handy in Mahout?  I see a BSD licensed one at
> http://wwwse.inf.tu-dresden.de/xsiena/bloom_filter, but don't have any
> idea of its performance.
>
> On May 27, 2010, at 2:27 PM, Jake Mannix wrote:
>
> > Grant,
> >
> >  At LinkedIn, we do something very similar to what Drew is describing
> > here as part of our content-based recommender: effectively use the
> > CollocDriver to get the highest N (ranked by LLR) collocations in one
> > job, then load those into a bloom filter which is inserted into a simple
> > custom Analyzer which is chained with a ShingleAnalyzer to index (using
> > Lucene normalization, not LLR score!) the "good" phrases for each
> > document.
> >
> >  There are various tricks and techniques for cleaning this up better,
> > but even just the above does a pretty good job, and requires very little
> > on top of Mahout's current codebase.
> >
> >  -jake
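The first step in the chain Jake describes -- keeping only the N collocations with the highest LLR -- can be sketched in plain Java. The Map input and class name below are stand-ins for illustration; in practice the scored ngrams come out of the CollocDriver job:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: select the N highest-LLR collocations to seed
// the "good phrase" filter. The scored-ngram Map stands in for the
// CollocDriver's output.
class CollocSelector {
    static Set<String> topByLlr(Map<String, Double> llrByNgram, int n) {
        List<Map.Entry<String, Double>> entries =
                new ArrayList<>(llrByNgram.entrySet());
        // Sort descending by LLR score.
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        Set<String> good = new LinkedHashSet<>();
        for (Map.Entry<String, Double> e :
                entries.subList(0, Math.min(n, entries.size()))) {
            good.add(e.getKey());
        }
        return good;
    }
}
```

Each surviving ngram would then be add()ed to the Bloom filter that the custom Analyzer consults at indexing time.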
> >
> >
> > On May 27, 2010 8:53 AM, "Drew Farris" <[email protected]> wrote:
> >
> > On Thu, May 27, 2010 at 10:47 AM, Grant Ingersoll <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I'm running the Collocation stuff (
> >> https://cwiki.apache.org/confluence/display/MAHOUT/... )
> >
> > Delroy/Jeff recently ran into this, but I'm having problems finding the
> > thread in the archive that I can link to. I'll open a JIRA with the patch
> > Jeff posted.
> >
> >> 2. How can I, given a vector, get the top collocations for that Vector,
> >> as ranked by LLR?
> > If I recall correctly, the LLR score gets dropped in seq2sparse in favor
> > of TF or TFIDF, depending on the nature of the vectors being generated.
> > Meanwhile, CollocDriver simply emits a list of collocations in a
> > collection ranked by LLR, so neither is strictly what you are interested
> > in. Is there a good way to include both something like TF *and* LLR in
> > the output of seq2sparse, or would it be necessary to resort to emitting
> > two separate sets of vectors?
> >
> >> Am I off base in wanting to do something like this?
> >
> > Not at all.
> >
> > The alternative that's been discussed here in the past would involve some
> > custom analyzer work. The general idea is to load the output from the
> > CollocDriver into a bloom filter, and then, when processing documents at
> > indexing time, set up a field where you generate shingles and only index
> > those that appear in the bloom filter. This way you wind up with a set of
> > ngrams indexed that are ranked high across the entire corpus instead of
> > simply the best ones for each document.
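A plain-Java sketch of that indexing-time gate, assuming bigram shingles and using a Set where the real chain would consult the Bloom filter (in Lucene terms this would be a TokenFilter sitting after the shingle stage; the class name and method are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch: emit only the shingles that appear in the
// corpus-wide "good phrase" set. A Set stands in for the Bloom filter.
class ShingleGate {
    static List<String> keptShingles(List<String> tokens,
                                     Set<String> goodPhrases) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String shingle = tokens.get(i) + " " + tokens.get(i + 1); // bigram
            if (goodPhrases.contains(shingle)) kept.add(shingle);
        }
        return kept;
    }
}
```

Because the "good phrase" set is built from corpus-wide LLR rankings, every document is gated by the same list rather than by per-document scores.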
> >
> > You'll want to take a look at the ngram list emitted from the
> > CollocDriver; ngrams composed of high-frequency terms tend to get a high
> > LLR score. For some of the work I've done, filtering out ngrams composed
> > of two or more terms from the StandardAnalyzer's stoplist worked pretty
> > well, although there always seem to be corpus-specific high-frequency
> > terms worth filtering out as well.
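That stoplist heuristic can be sketched directly; the two-stopword threshold follows the description above, and the class name is illustrative:

```java
import java.util.Set;

// Drop an ngram when two or more of its terms are stopwords:
// phrases like "of the" often get spuriously high LLR scores
// simply because their component terms are so frequent.
class NgramStoplistFilter {
    static boolean keep(String ngram, Set<String> stopwords) {
        int stopCount = 0;
        for (String term : ngram.split(" ")) {
            if (stopwords.contains(term)) stopCount++;
        }
        return stopCount < 2;
    }
}
```

Corpus-specific high-frequency terms would simply be added to the stopword set passed in.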
> >
> > Hope this helps,
> >
> > Drew
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
