Collocation and Seq2Sparse Questions

Grant Ingersoll Thu, 27 May 2010 07:47:54 -0700

Hi,

I'm running the collocation code 
(https://cwiki.apache.org/confluence/display/MAHOUT/Collocations) and have a 
few questions.


Here's what I am doing for now:

I have the Reuters collection as text files, which I convert to a SequenceFile.
Then I run seq2sparse:
 ./mahout seq2sparse --input ./content/reuters/seqfiles3 --output ./content/reuters/vectors2 --maxNGramSize 3

I then want to index my content into Solr/Lucene and supplement the main 
content with a new field that contains the top collocations for each 
document.  There are a couple of things I'm not sure how to proceed with:

1. I need labels on the vectors so that I can look up/associate my input 
document with the appropriate vector that was created by Mahout.  It doesn't 
seem like Seq2Sparse supports NamedVector, so how would I do this?

2. How can I, given a vector, get the top collocations for that Vector, as 
ranked by LLR?
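
For concreteness, "ranked by LLR" in item 2 can be read as something like the
minimal sketch below. It is plain Python, not Mahout code, and the names llr
and top_bigrams are made up for illustration. It assumes Dunning's
log-likelihood ratio over a 2x2 contingency table of bigram counts taken from
a single document's tokens, which may well differ from what CollocDriver
computes over the whole corpus.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: count of the bigram (a, b)
    k12: count of bigrams (a, not b)
    k21: count of bigrams (not a, b)
    k22: count of bigrams (not a, not b)
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for count, i, j in cells:
        if count > 0:  # a zero cell contributes nothing to the sum
            total += count * math.log(count * n / (rows[i] * cols[j]))
    return 2.0 * total


def top_bigrams(tokens, top_n=5):
    """Return the top_n (score, bigram) pairs, highest LLR first."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())
    first = Counter()   # how often a token starts a bigram
    second = Counter()  # how often a token ends a bigram
    for (a, b), count in bigrams.items():
        first[a] += count
        second[b] += count
    scored = []
    for (a, b), k11 in bigrams.items():
        k12 = first[a] - k11
        k21 = second[b] - k11
        k22 = n - k11 - k12 - k21
        scored.append((llr(k11, k12, k21, k22), a + " " + b))
    # Sort by descending score, breaking ties alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_n]
```

The output of top_bigrams for one document is the kind of list that would go
into the extra Solr/Lucene field described above.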

Perhaps I should be using the CollocDriver directly?

Am I off base in wanting to do something like this? 

Thanks,
Grant
