Grant,

  At LinkedIn, we do something very similar to what Drew is describing here
as part of our content-based recommender: we effectively use the CollocDriver
to get the top N collocations (ranked by LLR) in one job, then load
those into a bloom filter that is inserted into a simple custom Analyzer,
which is chained with a ShingleAnalyzer to index (using Lucene
normalization, not the LLR score!) the "good" phrases for each document.

  There are various tricks and techniques for cleaning this up further, but
even just the above does a pretty good job, and it requires very little code
on top of Mahout's current codebase.
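To make the analyzer-chain idea above concrete, here is a minimal stand-in
sketch: generate 2-word shingles from a token stream and keep only those that
pass a "good phrase" membership test. In the real pipeline the membership
test is a bloom filter loaded from CollocDriver output and the shingling is
done by Lucene's ShingleAnalyzer; the plain Set and the GoodShingleFilter
class below are illustrative stand-ins, not actual Mahout or Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified stand-in for the ShingleAnalyzer + bloom-filter analyzer chain:
// emit each 2-word shingle only if the membership set (standing in for the
// bloom filter) says it is a high-LLR collocation.
public class GoodShingleFilter {

    public static List<String> goodShingles(List<String> tokens, Set<String> goodPhrases) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String shingle = tokens.get(i) + " " + tokens.get(i + 1);
            // With a real bloom filter, occasional false positives would let
            // a few extra shingles through, which is acceptable here.
            if (goodPhrases.contains(shingle)) {
                kept.add(shingle);
            }
        }
        return kept;
    }
}
```

Because only shingles that survive the filter are indexed, the "good" phrases
get normal Lucene term weighting at search time rather than their LLR scores.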

  -jake


On May 27, 2010 8:53 AM, "Drew Farris" <[email protected]> wrote:

On Thu, May 27, 2010 at 10:47 AM, Grant Ingersoll <[email protected]> wrote:

> Hi,
>
> I'm running the Collocation stuff (
> https://cwiki.apache.org/confluence/display/MAHOUT/...
Delroy/Jeff recently ran into this, but I'm having problems finding the
thread in the archive that I can link to. I'll open a jira with the patch
Jeff posted.

> 2. How can I, given a vector, get the top collocations for that Vector, as
> ranked by LLR?
If I recall correctly, the LLR score gets dropped in seq2sparse in favor of
TF or TFIDF weights, depending on the nature of the vectors being generated.
Meanwhile, CollocDriver simply emits a list of collocations ranked by LLR,
so neither is strictly what you are interested in. Is there a good way to
include both something like TF *and* LLR in the output of seq2sparse -- or
would it be necessary to resort to emitting 2 separate sets of vectors?

> Am I off base in wanting to do something like this?
Not at all.

The alternative that's been discussed here in the past would involve some
custom analyzer work. The general idea is to load the output from the
CollocDriver into a bloom filter, and then, when processing documents at
indexing time, set up a field where you generate shingles and only index
those that appear in the bloom filter. This way you wind up indexing the set
of ngrams that rank highly across the entire corpus, instead of simply the
best ones for each document.
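The "load the CollocDriver output into a bloom filter" step boils down to a
corpus-wide top-N selection by LLR. A rough sketch, assuming the
(collocation, LLR score) pairs have already been read out of the
CollocDriver's output into a Map -- the Map input and the plain Set result
here are hypothetical stand-ins for the sequence files and the bloom filter:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

// Keep only the n collocations with the highest LLR scores across the
// whole corpus; the resulting set is what would be loaded into the bloom
// filter used at indexing time.
public class TopCollocations {

    public static Set<String> topByLlr(Map<String, Double> llrScores, int n) {
        // Min-heap of size n: the root is always the weakest survivor,
        // so anything better displaces it.
        PriorityQueue<Map.Entry<String, Double>> heap =
            new PriorityQueue<>((a, b) -> Double.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Double> e : llrScores.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) {
                heap.poll(); // drop the lowest-LLR entry
            }
        }
        Set<String> top = new HashSet<>();
        for (Map.Entry<String, Double> e : heap) {
            top.add(e.getKey());
        }
        return top;
    }
}
```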

You'll want to take a look at the ngram list emitted from the CollocDriver;
ngrams composed of high-frequency terms tend to get a high LLR score. For
some of the work I've done, filtering out ngrams composed of two or more
terms in the StandardAnalyzer's stoplist worked pretty well, although there
always seem to be corpus-specific high-frequency terms worth filtering out
as well.
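That stoplist-based cleanup is simple to sketch: reject any ngram in which
two or more of the component terms are stopwords. The tiny stopword set in
the test is illustrative only; in practice you'd use the StandardAnalyzer's
stoplist plus whatever corpus-specific high-frequency terms you've found.

```java
import java.util.Set;

// Drop ngrams dominated by stopwords before loading them into the bloom
// filter: an ngram survives only if fewer than two of its terms are stops.
public class NgramStopFilter {

    public static boolean keep(String ngram, Set<String> stopwords) {
        int stopCount = 0;
        for (String term : ngram.split("\\s+")) {
            if (stopwords.contains(term)) {
                stopCount++;
            }
        }
        return stopCount < 2; // reject ngrams with two or more stop terms
    }
}
```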

Hope this helps,

Drew
