So, do we have a Bloom filter handy in Mahout? I see a BSD-licensed one at http://wwwse.inf.tu-dresden.de/xsiena/bloom_filter, but I don't have any idea about its performance.
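For what it's worth, the structure itself is small. A minimal sketch in plain Java (BitSet plus double hashing) — class and method names here are illustrative, not an existing Mahout or third-party API:

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: k double-hashed probes into a BitSet.
// Class/method names are illustrative, not an existing Mahout API.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Kirsch-Mitzenmacher scheme: derive the k probe positions
    // from two base hashes, g_i(x) = h1(x) + i * h2(x).
    private int probe(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16); // cheap second hash; use murmur in practice
        return Math.floorMod(h1 + i * h2, numBits);
    }

    public void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(probe(key, i));
        }
    }

    // May return false positives, never false negatives.
    public boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(probe(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Sizing (bits per entry vs. number of hashes) is the usual false-positive-rate trade-off; for a top-N collocation list the whole thing fits comfortably in memory.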
On May 27, 2010, at 2:27 PM, Jake Mannix wrote:

> Grant,
>
> At LinkedIn, we do something very similar to what Drew is describing here
> as part of our content-based recommender: use effectively the CollocDriver
> to get the highest N (ranked by LLR) collocations in one job, then load
> those into a bloom filter which is inserted into a simple custom Analyzer
> which is chained with a ShingleAnalyzer to index (using Lucene
> normalization, not LLR score!) the "good" phrases for each document.
>
> There are various tricks and techniques for cleaning this up better, but
> even just the above does a pretty good job, and is very little on top of
> Mahout's current codebase.
>
>   -jake
>
> On May 27, 2010 8:53 AM, "Drew Farris" <[email protected]> wrote:
>
> On Thu, May 27, 2010 at 10:47 AM, Grant Ingersoll <[email protected]> wrote:
>
>> Hi,
>>
>> I'm running the Collocation stuff (
>> https://cwiki.apache.org/confluence/display/MAHOUT/...
>
> Delroy/Jeff recently ran into this, but I'm having problems finding the
> thread in the archive that I can link to. I'll open a jira with the patch
> Jeff posted.
>
>> 2. How can I, given a vector, get the top collocations for that Vector, as
>> ranked by LLR?
>
> If I recall correctly, the LLR score gets dropped in seq2sparse in favor of
> TF or TFIDF depending on the nature of the vectors being generated.
> Meanwhile, CollocDriver simply emits a list of collocations in a collection
> ranked by LLR, so neither is strictly what you are interested in. Is there a
> good way to include both something like TF *and* LLR in the output of
> seq2sparse -- would it be necessary to resort to emitting 2 separate sets of
> vectors?
>
>> Am I off base in wanting to do something like this?
>
> Not at all.
>
> The alternative that's been discussed here in the past would involve some
> custom analyzer work.
> The general idea is to load the output from the CollocDriver into a bloom
> filter and then, when processing documents at indexing time, set up a field
> where you generate shingles and only index those that appear in the bloom
> filter. This way you wind up getting a set of ngrams indexed that are ranked
> highly across the entire corpus, instead of simply the best ones for each
> document.
>
> You'll want to take a look at the ngram list emitted from the CollocDriver:
> ngrams composed of high-frequency terms tend to get a high LLR score. For
> some of the work I've done, filtering out ngrams composed of two or more
> terms in the StandardAnalyzer's stoplist worked pretty well, although there
> always seem to be corpus-specific high-frequency terms worth filtering out
> as well.
>
> Hope this helps,
>
> Drew

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
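The two filtering steps Jake and Drew describe can be sketched in plain Java. In the real pipeline this logic would live in a custom Lucene TokenFilter chained after the shingle stage, and the `goodPhrases` set below would be the bloom filter loaded from CollocDriver output — the names here are hypothetical stand-ins, not Mahout or Lucene APIs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Plain-Java sketch of the filtering steps described in the thread.
// In practice: a custom TokenFilter after a ShingleFilter, with a bloom
// filter (loaded from CollocDriver output) in place of goodPhrases.
public class PhraseFilterSketch {

    // Emit bigram shingles from a token stream, keeping only the "good"
    // phrases that appear in the corpus-wide top-N collocation set.
    public static List<String> goodShingles(List<String> tokens, Set<String> goodPhrases) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String shingle = tokens.get(i) + " " + tokens.get(i + 1);
            if (goodPhrases.contains(shingle)) {
                out.add(shingle);
            }
        }
        return out;
    }

    // Drew's cleanup heuristic: drop ngrams in which two or more of the
    // constituent terms appear in the stoplist.
    public static boolean isMostlyStopwords(String ngram, Set<String> stopwords) {
        int hits = 0;
        for (String term : ngram.split(" ")) {
            if (stopwords.contains(term)) {
                hits++;
            }
        }
        return hits >= 2;
    }
}
```

The second helper would be applied once to the CollocDriver output before loading it into the bloom filter, so high-LLR-but-junk ngrams like "of the" never reach the index.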
