On Thu, May 27, 2010 at 10:47 AM, Grant Ingersoll <[email protected]> wrote:

> Hi,
>
> I'm running the Collocation stuff (
> https://cwiki.apache.org/confluence/display/MAHOUT/Collocations) and have
> a few questions.
>
> Here's what I am doing for now:
>
> I have the Reuters stuff as TXT files.  I convert that to a Seq File.  Then
> I'm running seq2sparse:
>  ./mahout seq2sparse --input ./content/reuters/seqfiles3 --output
> ./content/reuters/vectors2  --maxNGramSize 3
>
> I then want to index my content into Solr/Lucene and I wish to supplement
> the main content with a new field that contains the top collocations for
> each document.  I see a couple of things that I'm not sure of how to proceed
> with:
>
> 1. I need labels on the vectors so that I can look up/associate my input
> document with the appropriate vector that was created by Mahout.  It doesn't
> seem like Seq2Sparse supports NamedVector, so how would I do this?
>

Delroy/Jeff recently ran into this, but I'm having problems finding the
thread in the archive that I can link to. I'll open a jira with the patch
Jeff posted.


> 2. How can I, given a vector, get the top collocations for that Vector, as
> ranked by LLR?
>

If I recall correctly, the LLR score gets dropped in seq2sparse in favor of
TF or TFIDF, depending on the nature of the vectors being generated.
Meanwhile, the CollocDriver simply emits a list of collocations for the
collection, ranked by LLR, so neither is strictly what you are interested
in. Is there a good way to include both something like TF *and* LLR in the
output of seq2sparse, or would it be necessary to emit two separate sets of
vectors?
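For reference, the LLR that collocations get ranked by is Dunning's
log-likelihood ratio over a 2x2 contingency table of ngram counts. Here's a
minimal standalone sketch in the entropy formulation (the class and helper
names are illustrative, not Mahout's actual API):

```java
// Sketch of Dunning's log-likelihood ratio, the score collocations are
// ranked by. Counts for a candidate bigram "A B":
//   k11 = count(A B), k12 = count(A, not B),
//   k21 = count(not A, B), k22 = count(neither).
public class Llr {

    // x * ln(x), with the 0 * ln(0) = 0 convention
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a set of counts
    static double entropy(long... counts) {
        long sum = 0;
        double sumXLogX = 0.0;
        for (long c : counts) {
            sum += c;
            sumXLogX += xLogX(c);
        }
        return xLogX(sum) - sumXLogX;
    }

    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // LLR = 2 * N * mutual information of the 2x2 table
        return 2.0 * (rowEntropy + colEntropy - matrixEntropy);
    }

    public static void main(String[] args) {
        // A bigram whose terms always co-occur scores high:
        System.out.println(logLikelihoodRatio(10, 0, 0, 90)); // ~65.0
        // Statistically independent terms score ~0:
        System.out.println(logLikelihoodRatio(1, 9, 9, 81));
    }
}
```

Terms that co-occur far more often than chance would predict get a large
score; independent terms score near zero.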

> Am I off base in wanting to do something like this?
>

Not at all.

The alternative that's been discussed here in the past involves some custom
analyzer work. The general idea is to load the output from the CollocDriver
into a bloom filter, then, when processing documents at indexing time, set
up a field where you generate shingles and index only those that appear in
the bloom filter. That way the ngrams you index are the ones ranked highly
across the entire corpus, rather than simply the best ones for each
document.
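A rough sketch of that filtering step (names here are illustrative; a
HashSet stands in for the bloom filter so the sketch is self-contained, and
a real implementation would do this inside a custom Lucene TokenFilter):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: generate word shingles for a document and keep only those that
// appear in the corpus-wide collocation set built from the CollocDriver
// output. A HashSet stands in for the bloom filter here; in practice you'd
// use an actual bloom filter to keep memory bounded, and run this inside a
// custom Lucene TokenFilter at indexing time.
public class ShingleFilterSketch {

    // Emit 2..maxSize word shingles, keeping only whitelisted ones
    static List<String> topShingles(List<String> tokens,
                                    Set<String> collocations,
                                    int maxSize) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            StringBuilder sb = new StringBuilder(tokens.get(i));
            for (int n = 2; n <= maxSize && i + n - 1 < tokens.size(); n++) {
                sb.append(' ').append(tokens.get(i + n - 1));
                String shingle = sb.toString();
                if (collocations.contains(shingle)) {
                    kept.add(shingle);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> collocs = new HashSet<>(
            List.of("crude oil", "interest rates", "crude oil prices"));
        List<String> doc =
            List.of("crude", "oil", "prices", "fell", "sharply");
        System.out.println(topShingles(doc, collocs, 3));
        // [crude oil, crude oil prices]
    }
}
```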

You'll want to take a look at the ngram list emitted from the CollocDriver;
ngrams composed of high-frequency terms tend to get a high LLR score. For
some of the work I've done, filtering out ngrams where two or more terms
appear in the StandardAnalyzer's stoplist worked pretty well, although
there always seem to be corpus-specific high-frequency terms worth
filtering out as well.
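That stoplist filter is simple to sketch (the stopword set below is a tiny
illustrative sample, not StandardAnalyzer's actual list):

```java
import java.util.Set;

// Sketch: drop candidate ngrams where two or more component terms are
// stopwords. The stopword set here is a tiny illustrative sample, not
// StandardAnalyzer's actual list.
public class NgramStopFilter {

    static final Set<String> STOPWORDS =
        Set.of("the", "of", "to", "and", "a", "in", "is", "it");

    // true if the ngram should be kept for indexing
    static boolean keep(String ngram) {
        int stopCount = 0;
        for (String term : ngram.split("\\s+")) {
            if (STOPWORDS.contains(term)) {
                stopCount++;
            }
        }
        return stopCount < 2;
    }

    public static void main(String[] args) {
        System.out.println(keep("of the"));              // false
        System.out.println(keep("balance of payments")); // true
    }
}
```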

Hope this helps,

Drew
