You can use the one that comes with Hadoop, org.apache.hadoop.util.bloom.DynamicBloomFilter. It is in the core jar.
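For context, DynamicBloomFilter exposes the add / membershipTest contract of org.apache.hadoop.util.bloom.Filter. Below is a toy, self-contained sketch of that contract in plain Java; the class and method names mirror the Hadoop API, but the sizing, hashing, and internals here are purely illustrative, not the real implementation (the real class also handles Writable serialization and dynamic growth).

```java
import java.util.BitSet;

/**
 * Toy Bloom filter: illustrates the add / membershipTest contract that
 * org.apache.hadoop.util.bloom.DynamicBloomFilter provides. Hash scheme
 * and sizing are illustrative only.
 */
public class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    public ToyBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive k bit indexes from two base hashes (double-hashing style).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16); // cheap second hash for illustration
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String key) {
        for (int i = 0; i < numHashes; i++) bits.set(index(key, i));
    }

    /** May return false positives, never false negatives. */
    public boolean membershipTest(String key) {
        for (int i = 0; i < numHashes; i++)
            if (!bits.get(index(key, i))) return false;
        return true;
    }
}
```

For the use case in this thread, you would add each high-LLR ngram string once up front, then call membershipTest on every shingle at indexing time.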
On Thu, May 27, 2010 at 11:41 AM, Grant Ingersoll <[email protected]> wrote:

> So, do we have a Bloom Filter handy in Mahout? I see a BSD-licensed one at
> http://wwwse.inf.tu-dresden.de/xsiena/bloom_filter, but don't have any
> idea of its performance.
>
> On May 27, 2010, at 2:27 PM, Jake Mannix wrote:
>
> > Grant,
> >
> > At LinkedIn, we do something very similar to what Drew is describing
> > here as part of our content-based recommender: use effectively the
> > CollocDriver to get the highest N (ranked by LLR) collocations in one
> > job, then load those into a bloom filter which is inserted into a simple
> > custom Analyzer which is chained with a ShingleAnalyzer to index (using
> > Lucene normalization, not LLR score!) the "good" phrases for each
> > document.
> >
> > There are various tricks and techniques for cleaning this up better, but
> > even just the above does a pretty good job, and is very little on top of
> > Mahout's current codebase.
> >
> >   -jake
> >
> > On May 27, 2010 8:53 AM, "Drew Farris" <[email protected]> wrote:
> >
> > On Thu, May 27, 2010 at 10:47 AM, Grant Ingersoll <[email protected]> wrote:
> >
> > > Hi,
> > > I'm running the Collocation stuff (
> > > https://cwiki.apache.org/confluence/display/MAHOUT/...
> >
> > Delroy/Jeff recently ran into this, but I'm having problems finding the
> > thread in the archive that I can link to. I'll open a jira with the
> > patch Jeff posted.
> >
> > > 2. How can I, given a vector, get the top collocations for that
> > > Vector, as ranked by LLR?
> >
> > If I recall correctly, the LLR score gets dropped in seq2sparse in favor
> > of TF or TFIDF depending on the nature of the vectors being generated.
> > Meanwhile, CollocDriver simply emits a list of collocations in a
> > collection ranked by LLR, so neither is strictly what you are
> > interested in.
> > Is there a good way to include both something like TF *and* LLR in the
> > output of seq2sparse, or would it be necessary to resort to emitting two
> > separate sets of vectors?
> >
> > > Am I off base in wanting to do something like this?
> >
> > Not at all.
> >
> > The alternative that's been discussed here in the past would involve
> > some custom analyzer work. The general idea is to load the output from
> > the CollocDriver into a bloom filter, and then when processing documents
> > at indexing time, set up a field where you generate shingles and only
> > index those that appear in the bloom filter. This way you wind up
> > getting a set of ngrams indexed that are ranked high across the entire
> > corpus, instead of simply the best ones for each document.
> >
> > You'll want to take a look at the ngram list emitted from the
> > CollocDriver: ngrams composed of high-frequency terms tend to get a high
> > LLR score. For some of the work I've done, filtering out ngrams composed
> > of two or more terms in the StandardAnalyzer's stoplist worked pretty
> > well, although there always seem to be corpus-specific high-frequency
> > terms worth filtering out as well.
> >
> > Hope this helps,
> >
> > Drew
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
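The indexing-time scheme Jake and Drew describe, shingle each document and keep only the ngrams the corpus-level collocation pass ranked highly, can be sketched in plain Java. Here a HashSet stands in for the bloom filter loaded from CollocDriver output, and the stopword check is Drew's cleanup heuristic; in a real pipeline this logic would live in a custom Lucene TokenFilter chained after shingle generation. All names and data are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Sketch of indexing-time shingle filtering: emit word bigrams, then keep
 * only those that passed a corpus-level "good phrase" test (high LLR) and
 * are not composed entirely of stopwords.
 */
public class ShingleWhitelist {

    /** Emit word bigrams ("shingles") from a token list. */
    static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++)
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        return out;
    }

    /** Keep only whitelisted shingles whose terms aren't all stopwords. */
    static List<String> filter(List<String> shingles,
                               Set<String> goodPhrases,
                               Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String s : shingles) {
            String[] terms = s.split(" ");
            boolean allStop = true;
            for (String t : terms) {
                if (!stopwords.contains(t)) { allStop = false; break; }
            }
            if (!allStop && goodPhrases.contains(s)) out.add(s);
        }
        return out;
    }
}
```

Note the effect: a shingle like "of the" is dropped even if it scored a high LLR (high-frequency terms inflate the score, as Drew points out), while only corpus-wide good phrases survive for any given document.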
