I understand from https://issues.apache.org/jira/browse/MAHOUT-767 that the rowsimilarity job was recently improved to handle less-sparse input.

This week I had some fun using rowsimilarity with a matrix of items (books) and librarian-assigned topic codes, to generate (and finally prune) similarities that could be fed into Gephi for visualization. With relatively little hacking and fairly modest initial data (100k items), it worked pretty well, even on a laptop, and gave a rather geographical 'map' of books clustered by similar topics. See pretty pics and blather at http://danbri.org/words/2011/10/11/720 ... (btw I'm quite happy with what I got back from Gephi given < 1 day's time, and encourage others to investigate the tool.)

So --- as I said in http://www.mail-archive.com/[email protected]/msg06602.html --- flushed with initial success, I revisited svd/lanczos looking for a more sophisticated analysis of the item/topic associations. Initially I hit some boring problems with the rowid job that was needed before I could transpose. But getting past that, I am now trying to run 'rowsimilarity' against the output of Lanczos, where my rows are items (TV shows this time, but they may as well be books) and my columns are an SVD-reorganized view of topic space.

I started this way (after having seqdirectory'd, then rowid'd and transposed the original input):

    mahout rowsimilarity --input lonclassland/postsvd/transpose-128/part-00000 --output lonclassland/svdsims2 -Dmapred.map.tasks=18 -Dmapred.reduce.tasks=18 --numberOfColumns 280 --similarityClassname SIMILARITY_LOGLIKELIHOOD

...before reading/realising that rowsimilarity prefers sparse data. And indeed its progress running on a cluster seems glacial.

Looking at MAHOUT-767 and 'bin/mahout rowsimilarity --help', I don't see any obvious way forward.

* Would choosing a different similarity measure make any big difference? (I'd guess for cosine...)
* Should I experiment with values for --threshold?
* Or somehow try to "re-sparsify" the input first?
  If I read the output of 'mahout seqdumper --seqFile postsvd/transpose-128/part-00000' correctly, there are many very small values; can they be approximated to zero and discarded somehow?

My high-level goal is, for each item, to find a handful of the most similar items, and then feed that to Gephi to generate topical maps of the 'item landscape', grouping like items together. My intuition was that doing this post-SVD might give a deeper insight into what the bulk of these item/topic associations tell us, compared to doing rowsimilarity against the raw item/topic matrix. However, the tool I found to explore this, rowsimilarity, seems not to be the right fit here.

Thanks for any pointers,

Dan
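
P.S. To make the "re-sparsify" idea concrete, here is a rough Python sketch of what I have in mind. It is not anything Mahout provides: it zeroes out entries whose magnitude falls below a cutoff, then keeps the few most cosine-similar rows per item. The names (sparsify, cosine, top_k_similar), the eps cutoff and the toy vectors are all made up for illustration. Real input would come from the transposed Lanczos output, and the all-pairs loop here would not scale to the full data set.

```python
import heapq
import math


def sparsify(row, eps):
    """Drop entries whose absolute value is below eps (hypothetical cutoff)."""
    return {col: val for col, val in row.items() if abs(val) >= eps}


def cosine(a, b):
    """Cosine similarity of two sparse rows stored as {column: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter row
    dot = sum(val * b[col] for col, val in a.items() if col in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero row is treated as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_similar(rows, k, eps):
    """For each item, return its k most similar other items after pruning."""
    pruned = {item: sparsify(row, eps) for item, row in rows.items()}
    result = {}
    for item, row in pruned.items():
        scored = ((cosine(row, other), name)
                  for name, other in pruned.items() if name != item)
        result[item] = heapq.nlargest(k, scored)
    return result


# Toy data (made up): three items in a 3-dimensional "topic space".
rows = {
    "show_a": {0: 0.90, 1: 0.001, 2: 0.10},
    "show_b": {0: 0.80, 1: 0.002, 2: 0.20},
    "show_c": {0: 0.001, 1: 0.95, 2: 0.003},
}

for item, neighbours in top_k_similar(rows, k=1, eps=0.01).items():
    print(item, neighbours)
```

Each (score, neighbour) pair could then be written out as a weighted edge for Gephi. Whether pruning like this preserves enough of the post-SVD structure is exactly what I'm unsure about.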
