Grant,

At LinkedIn, we do something very similar to what Drew is describing here as part of our content-based recommender: we effectively use the CollocDriver to get the top N collocations (ranked by LLR) in one job, then load those into a bloom filter, which is inserted into a simple custom Analyzer chained with a ShingleAnalyzer to index (using Lucene normalization, not the LLR score!) the "good" phrases for each document.
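The "top N by LLR" selection step can be sketched in plain Java with a bounded min-heap. This is an illustrative stand-in for consuming CollocDriver output; the class and method names are hypothetical, and the real pipeline reads Hadoop sequence files rather than an in-memory map.

```java
import java.util.*;

// Illustrative sketch: select the top-N collocations by LLR score from an
// in-memory (ngram -> llr) map. A hypothetical stand-in for consuming
// CollocDriver output; the real job reads sequence files.
public class TopCollocations {
    public static List<String> topN(Map<String, Double> llrScores, int n) {
        // Min-heap of size n keyed on LLR: the head is always the weakest
        // collocation kept so far, so it can be evicted cheaply.
        PriorityQueue<Map.Entry<String, Double>> heap = new PriorityQueue<>(
            Comparator.comparingDouble((Map.Entry<String, Double> e) -> e.getValue()));
        for (Map.Entry<String, Double> e : llrScores.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) heap.poll(); // evict the lowest-scoring entry
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        Collections.reverse(result); // highest LLR first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("machine learning", 120.5);
        scores.put("of the", 3.2);
        scores.put("random forest", 98.1);
        System.out.println(topN(scores, 2)); // [machine learning, random forest]
    }
}
```

In practice the surviving ngrams would then be inserted into the bloom filter consulted at indexing time.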
There are various tricks and techniques for cleaning this up further, but even just the above does a pretty good job, and is very little on top of Mahout's current codebase.

-jake

On May 27, 2010 8:53 AM, "Drew Farris" <[email protected]> wrote:

On Thu, May 27, 2010 at 10:47 AM, Grant Ingersoll <[email protected]> wrote:
> Hi,
>
> I'm running the Collocation stuff (
> https://cwiki.apache.org/confluence/display/MAHOUT/...

Delroy/Jeff recently ran into this, but I'm having trouble finding the thread in the archive that I can link to. I'll open a JIRA with the patch Jeff posted.

> 2. How can I, given a vector, get the top collocations for that vector, as
> ranked by LLR?

If I recall correctly, the LLR score gets dropped in seq2sparse in favor of TF or TFIDF, depending on the nature of the vectors being generated. Meanwhile, CollocDriver simply emits a list of collocations in a collection ranked by LLR, so neither is strictly what you are interested in.

> Is there a good way to include both something like TF *and* LLR in the
> output of seq2sparse -- would it be necessary to resort to emitting 2
> separate sets of vectors? Am I off base in wanting to do something like
> this?

Not at all. The alternative that's been discussed here in the past would involve some custom analyzer work. The general idea is to load the output from the CollocDriver into a bloom filter, and then, when processing documents at indexing time, set up a field where you generate shingles and only index those that appear in the bloom filter. This way you wind up indexing a set of ngrams that are ranked high across the entire corpus instead of simply the best ones for each document.

You'll want to take a look at the ngram list emitted from the CollocDriver: ngrams composed of high-frequency terms tend to get a high LLR score.
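The indexing-time flow described above — generate shingles, keep only those present in the corpus-wide collocation filter — can be sketched without any Lucene dependencies. Here a HashSet stands in for the bloom filter, and bigram shingling is done by hand; a real implementation would use an actual Bloom filter and wrap this logic in a Lucene TokenFilter chained after a ShingleFilter.

```java
import java.util.*;

// Sketch of the indexing-time idea: produce word shingles (here, bigrams)
// from a token stream and keep only those found in the "good collocations"
// set built from CollocDriver output. A HashSet stands in for the bloom
// filter; a real implementation would be a Lucene TokenFilter.
public class ShingleFilterSketch {
    public static List<String> goodShingles(List<String> tokens, Set<String> goodCollocations) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String shingle = tokens.get(i) + " " + tokens.get(i + 1);
            if (goodCollocations.contains(shingle)) {
                kept.add(shingle); // only corpus-wide "good" phrases get indexed
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> good = new HashSet<>(Arrays.asList("machine learning", "bloom filter"));
        List<String> tokens = Arrays.asList(
            "we", "use", "a", "bloom", "filter", "for", "machine", "learning");
        System.out.println(goodShingles(tokens, good)); // [bloom filter, machine learning]
    }
}
```

Note that a real Bloom filter can produce false positives, so a stray shingle may occasionally slip through; that is usually acceptable here since the cost is just an extra indexed ngram.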
For some of the work I've done, filtering out ngrams composed of two or more terms from the StandardAnalyzer's stoplist worked pretty well, although there always seem to be corpus-specific high-frequency terms worth filtering out as well.

Hope this helps,

Drew
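The stoplist heuristic mentioned above — drop any ngram in which two or more terms are stopwords — is straightforward to sketch. The stopword set below is a tiny illustrative subset, not the StandardAnalyzer's actual list.

```java
import java.util.*;

// Sketch of the cleanup heuristic: discard an ngram if two or more of its
// terms appear in a stoplist. The stopword set is a tiny illustrative
// subset, not the StandardAnalyzer's actual list.
public class NgramStopFilter {
    static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "of", "in", "to", "and"));

    public static boolean keep(String ngram) {
        int stopCount = 0;
        for (String term : ngram.split("\\s+")) {
            if (STOPWORDS.contains(term)) stopCount++;
        }
        return stopCount < 2; // drop ngrams dominated by stopwords
    }

    public static void main(String[] args) {
        System.out.println(keep("of the"));           // false
        System.out.println(keep("log likelihood"));   // true
        System.out.println(keep("state of the art")); // false
    }
}
```

Corpus-specific high-frequency terms would simply be appended to the stopword set after inspecting the CollocDriver output.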
