So, do we have a Bloom filter handy in Mahout?  I see a BSD-licensed one at 
http://wwwse.inf.tu-dresden.de/xsiena/bloom_filter, but I don't have any idea 
of its performance.
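
For what it's worth, the core data structure is small; here's a minimal, 
self-contained sketch of a Bloom filter in plain Java (the class and method 
names are my own, hypothetical -- this is not the Mahout or xsiena code):

```java
import java.util.BitSet;

/** Minimal Bloom filter sketch (hypothetical names, not library code).
 *  Derives k hash indexes from two base hashes of the key. */
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    public SimpleBloomFilter(int size, int numHashes) {
        this.size = size;
        this.numHashes = numHashes;
        this.bits = new BitSet(size);
    }

    // i-th hash index: combine two base hashes as h1 + i * h2.
    private int indexFor(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16); // crude second hash: rotate h1
        return Math.abs((h1 + i * h2) % size);
    }

    public void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(indexFor(key, i));
        }
    }

    /** May return false positives, never false negatives. */
    public boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(indexFor(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter f = new SimpleBloomFilter(1 << 16, 3);
        f.add("machine learning");
        f.add("log likelihood");
        System.out.println(f.mightContain("machine learning")); // true (no false negatives)
        System.out.println(f.mightContain("random access"));    // false, barring a false positive
    }
}
```

The false-positive rate is fine for this use case -- the worst that happens 
is an extra shingle gets indexed.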

On May 27, 2010, at 2:27 PM, Jake Mannix wrote:

> Grant,
> 
>  At LinkedIn, we do something very similar to what Drew is describing here
> as part of our content-based recommender: use effectively the CollocDriver
> to get the highest N (ranked by LLR) collocations in one job, then load
> those into a bloom filter which is inserted into a simple custom Analyzer
> which is chained with a ShingleAnalyzer to index (using lucene
> normalization, not LLR score!) the "good" phrases for each document.
> 
>  There are various tricks and techniques for cleaning this up better, but
> even just the above does a pretty good job, and is very little on top of
> Mahout's current codebase.
> 
>  -jake
> 
> 
> On May 27, 2010 8:53 AM, "Drew Farris" <[email protected]> wrote:
> 
> On Thu, May 27, 2010 at 10:47 AM, Grant Ingersoll <[email protected]>
> wrote:
> 
>> Hi, I'm running the Collocation stuff
>> (https://cwiki.apache.org/confluence/display/MAHOUT/...)
> Delroy/Jeff recently ran into this, but I'm having problems finding the
> thread in the archive that I can link to. I'll open a jira with the patch
> Jeff posted.
> 
>> 2. How can I, given a vector, get the top collocations for that Vector, as
>> ranked by LLR?
> If I recall correctly, the LLR score gets dropped in seq2sparse in favor of
> TF or TFIDF depending on the nature of the vectors being generated.
> Meanwhile, CollocDriver simply emits a list of collocations in a collection
> ranked by llr, so neither is strictly what you are interested in. Is there a
> good way to include both something like TF >and< LLR in the output of
> seq2sparse -- would it be necessary to resort to emitting 2 separate sets of
> vectors?
> 
>> Am I off base in wanting to do something like this?
> Not at all.
> 
> The alternative that's been discussed here in the past would involve some
> custom analyzer work. The general idea is to load the output from the
> CollocDriver into a bloom filter and then when processing documents at
> indexing time, set up a field where you generate shingles and only index
> those that appear in the bloom filter. This way you wind up getting a set of
> ngrams indexed that are ranked high across the entire corpus instead of
> simply the best ones for each document.
> 
> You'll want to take a look at the ngram list emitted from the CollocDriver;
> ngrams composed of high-frequency terms tend to get a high LLR score. For
> some of the work I've done, filtering out ngrams composed of two or more
> terms from the StandardAnalyzer's stoplist worked pretty well, although there
> always seem to be corpus-specific high-frequency terms worth filtering out
> as well.
> 
> Hope this helps,
> 
> Drew
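
Just to make sure I follow the indexing-time step you two describe: something 
roughly like the sketch below, where shingles are generated per document and 
only the ones appearing in the top-LLR collocation set survive? (Plain Java, 
no Lucene; class and method names are hypothetical, and a real version would 
be a TokenFilter backed by the Bloom filter rather than a HashSet.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the shingle-filtering step: build bigram
 *  shingles from a token list and keep only those that appear in the
 *  corpus-wide set of top-LLR collocations. */
public class CollocationShingleFilter {
    private final Set<String> goodCollocations;

    public CollocationShingleFilter(Set<String> goodCollocations) {
        this.goodCollocations = goodCollocations;
    }

    /** Emit each adjacent-token bigram that the collocation set admits. */
    public List<String> filterShingles(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String shingle = tokens.get(i) + " " + tokens.get(i + 1);
            if (goodCollocations.contains(shingle)) {
                kept.add(shingle);
            }
        }
        return kept;
    }
}
```

So per-document output ends up containing only ngrams that rank high across 
the whole corpus, as Drew says, rather than the best ngrams per document.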

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
