I'm following the walkthrough at:

https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line

…to do some vector-space similarity comparisons on text documents with Mahout 
0.7. Everything's working well, but I'm losing my named vectors at the 
rowsimilarity stage. I can reconnect the texts to their original names via a 
post-processing script that reads docIndex, but I wonder if there isn't a 
missing step or other hiccup in the instructions.  

Obviously I'm importing the files with -nv and have verified that it's working 
correctly via seqdumper. The labels are also kept through the transformation 
from vectors to matrix via rowid (inspecting the matrix file confirms this). 
They disappear, however, when I execute:

mahout rowsimilarity \
   -i named-matrix/matrix \
   -o named-similarity \
   -r [column number here] \
   --similarityClassname SIMILARITY_COSINE \
   -m 10 \
   -ess

… having been replaced with sequential integers. What's curious is that there's 
a bug in the walkthrough related to this issue: the rowid command outputs 
"reuters-matrix" but the next step, rowsimilarity, specifies 
"reuters-named-matrix" as its input.  So I'm wondering if there might have been 
an interstitial step that involved invoking docIndex somehow and reconnecting 
the numeric vectors to their labels? I can't find any command-line argument in 
rowsimilarity that would preserve the labels.
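
For reference, the docIndex workaround I mentioned looks roughly like the sketch below. It is only an illustration, not anything from the walkthrough: it assumes both docIndex and the rowsimilarity output have already been dumped to text with seqdumper, and that each dumped line has the shape "Key: <int>: Value: <payload>", with similarity rows printed as {column:score,...}. The function names parse_doc_index and relabel_similarities are my own, hypothetical helpers.

```python
import re

# Assumed shape of a seqdumper text line: "Key: <int>: Value: <payload>".
# Lines that do not match (headers, counts) are skipped.
LINE_RE = re.compile(r"^Key: (\d+): Value: (.*)$")
# Assumed shape of one sparse-vector entry inside "{...}": "<column>:<score>".
PAIR_RE = re.compile(r"(\d+):([-+0-9.eE]+)")


def parse_doc_index(lines):
    """Map each integer row id to the original document name."""
    index = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            index[int(match.group(1))] = match.group(2)
    return index


def relabel_similarities(lines, index):
    """Replace integer row and column ids with document names.

    Returns {row_name: [(other_name, score), ...]}, highest score first.
    Ids missing from the index are kept as their original strings.
    """
    result = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        row = index.get(int(match.group(1)), match.group(1))
        pairs = [
            (index.get(int(col), col), float(score))
            for col, score in PAIR_RE.findall(match.group(2))
        ]
        result[row] = sorted(pairs, key=lambda pair: -pair[1])
    return result
```

In practice the two line lists would come from the saved seqdumper output for docIndex and for the similarity part files, read with open(...).readlines().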

Thanks in advance for any pointers!
