Hi, I'm running the Collocations code (https://cwiki.apache.org/confluence/display/MAHOUT/Collocations) and have a few questions.
Here's what I am doing for now: I have the Reuters content as TXT files, which I convert to a SequenceFile. Then I run seq2sparse:

./mahout seq2sparse --input ./content/reuters/seqfiles3 --output ./content/reuters/vectors2 --maxNGramSize 3

I then want to index my content into Solr/Lucene, and I wish to supplement the main content with a new field that contains the top collocations for each document. There are a couple of things I'm not sure how to proceed with:

1. I need labels on the vectors so that I can associate each input document with the vector that Mahout created for it. It doesn't seem like Seq2Sparse supports NamedVector, so how would I do this?
2. Given a vector, how can I get the top collocations for that vector, as ranked by LLR? Perhaps I should be using the CollocDriver directly?

Am I off base in wanting to do something like this?

Thanks,
Grant
