RowSimilarityJob will have quadratic runtime for dense input and might generate large intermediate outputs. I'd argue against using it for such purposes.
--sebastian

On 17.10.2011 14:43, Dan Brickley wrote:
> I understand from https://issues.apache.org/jira/browse/MAHOUT-767
> that the rowsimilarity job was recently improved, to handle
> less-sparse input.
>
> This week I had some fun using rowsimilarity with a matrix of items
> (books) and librarian-assigned topic codes, to generate (and finally
> prune) similarities that could be fed into Gephi for visualization.
> With relatively little hacking, and fairly modest initial data (100k
> items) it worked pretty fine, even on a laptop, and gave a rather
> geographical 'map' of books clustered by similar topics.
>
> See pretty pics and blather at http://danbri.org/words/2011/10/11/720
> ... (btw I'm quite happy with what I got back from Gephi given < 1
> day's time, and encourage others to investigate the tool.)
>
> So --- as I said in
> http://www.mail-archive.com/[email protected]/msg06602.html ---
> flushed with initial success, I revisited svd/lanczos looking for a
> more sophisticated analysis of the item/topic associations. Initially I
> hit some boring problems with the rowid job that was needed before I
> could transpose. But getting past that, I am now trying to run
> 'rowsimilarity' against the output of Lanczos, where my rows are items
> (TV shows this time, but may as well be books). And my columns are an
> SVD-reorganized view of topic space.
>
> I started this way (after having seqdirectory'd, then rowid'd,
> transposed the original input):
>
> mahout rowsimilarity --input
> lonclassland/postsvd/transpose-128/part-00000 --output
> lonclassland/svdsims2
> -Dmapred.map.tasks=18 -Dmapred.reduce.tasks=18 --numberOfColumns 280
> --similarityClassname SIMILARITY_LOGLIKELIHOOD
>
> ...before reading/realising that rowsimilarity prefers sparse data.
> And indeed its progress running on a cluster seems glacial.
>
> Looking at MAHOUT-767 and 'bin/mahout rowsimilarity --help', I don't
> see any obvious way forward.
>
> * Would choosing a different similarity measure make any big
> difference? (I'd guess for cosine...)
> * Should I experiment with values for --threshold ?
> * Or somehow try to "re-sparsify" the input first? If I read the
> output of 'mahout seqdumper --seqFile postsvd/transpose-128/part-00000
> ' correctly, there are many very small values; can they be
> approximated to zero and discarded somehow?
>
> My high level goal is, for each item, to find a handful of the most
> similar items, and then feed that to Gephi to generate topical maps of
> the 'item landscape', grouping like items together. My intuition was
> that doing this post-SVD might give a deeper insight into what the
> bulk of these item/topic associations tell us, compared to doing
> rowsimilarity against the raw item/topic matrix. However the tool I
> found to explore this, rowsimilarity, seems not to be the right fit
> here.
>
> Thanks for any pointers,
>
> Dan
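
For illustration only, here is a minimal Python sketch of the "re-sparsify, then keep a handful of neighbours per item" idea from the quoted message. It is not Mahout code and not how RowSimilarityJob works internally. The function names, the `eps` cutoff and the toy input are all assumptions made for the example. It compares every pair of rows, so it is still quadratic in the number of rows and only practical for modest inputs.

```python
import heapq
import math


def sparsify(row, eps):
    """Return {column: value}, keeping only entries with |value| >= eps.

    eps is a hypothetical cutoff; a sensible value depends on the data.
    """
    return {j: v for j, v in enumerate(row) if abs(v) >= eps}


def cosine(a, b):
    """Cosine similarity of two sparse rows stored as {column: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter row
    dot = sum(v * b[j] for j, v in a.items() if j in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero row is treated as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_similar(rows, k, eps=0.0):
    """Map each row index to its k most similar other rows as (score, index)."""
    sparse = [sparsify(r, eps) for r in rows]
    result = {}
    for i, a in enumerate(sparse):
        scored = ((cosine(a, b), j) for j, b in enumerate(sparse) if j != i)
        result[i] = heapq.nlargest(k, scored)  # best score first
    return result


if __name__ == "__main__":
    # Toy dense rows standing in for post-SVD item vectors (made-up values).
    items = [[1.0, 0.0, 0.001], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
    for item, neighbours in top_k_similar(items, k=1, eps=0.01).items():
        print(item, neighbours)
```

Keeping only the top k neighbours per item bounds the size of the output, which is what matters when the pairs are later loaded into a graph tool. It does not reduce the number of comparisons performed.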
