I'm following the walkthrough at: https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line
…to do some vector-space similarity comparisons on text documents with Mahout 0.7. Everything's working well, but I'm losing my named vectors at the rowsimilarity stage. I can reconnect the texts to their original names via a post-processing script that reads docIndex, but I wonder whether there's a missing step or other hiccup in the instructions.

I'm importing the files with -nv and have verified via seqdumper that the names are stored correctly. The labels are also kept through the transformation from vectors to matrix via rowid (inspecting the matrix file confirms this). They disappear, however, when I execute:

    mahout rowsimilarity \
      -i named-matrix/matrix \
      -o named-similarity \
      -r [column number here] --similarityClassname SIMILARITY_COSINE -m 10 -ess

…having been replaced with sequential integers.

What's curious is that there's a bug in the walkthrough related to this issue: the rowid command outputs "reuters-matrix", but the next step, rowsimilarity, specifies "reuters-named-matrix" as its input. So I'm wondering whether there was an interstitial step that invoked docIndex somehow and reconnected the numeric vectors to their labels. I can't find any command-line argument in rowsimilarity that would preserve the labels rather than discard them.

Thanks in advance for any pointers!
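
The docIndex post-processing workaround mentioned above can be sketched roughly as follows. This is only an illustration, not the actual script: it assumes that both docIndex and the rowsimilarity output have first been dumped to text with seqdumper, and that each dumped record has the shape `Key: <int>: Value: <payload>`, with similarity rows printed as `{index:score,...}`. The function names `parse_doc_index` and `relabel_similarities` are hypothetical.

```python
import re

# Assumed seqdumper record shape: "Key: 0: Value: /doc-name.txt"
RECORD = re.compile(r"^Key: (\d+): Value: (.*)$")
# Assumed vector payload shape: "{12:0.83,40:0.51}"
PAIR = re.compile(r"(\d+):([-+0-9.eE]+)")


def parse_doc_index(lines):
    """Map each integer row id to its original document name."""
    names = {}
    for line in lines:
        m = RECORD.match(line.strip())
        if m:  # header and footer lines of the dump do not match
            names[int(m.group(1))] = m.group(2)
    return names


def relabel_similarities(lines, names):
    """Yield (doc_name, [(other_name, score), ...]), best match first."""
    for line in lines:
        m = RECORD.match(line.strip())
        if not m:
            continue  # skip anything that is not a record line
        row = names.get(int(m.group(1)), m.group(1))
        pairs = [
            (names.get(int(idx), idx), float(score))
            for idx, score in PAIR.findall(m.group(2))
        ]
        pairs.sort(key=lambda p: p[1], reverse=True)
        yield row, pairs
```

Feeding the dumped docIndex lines to `parse_doc_index` and then the dumped similarity lines to `relabel_similarities` gives each row back under its original name. Ids that are missing from docIndex are passed through unchanged as strings rather than dropped.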
