rowsimilarity and non-sparse input

Dan Brickley Mon, 17 Oct 2011 05:44:18 -0700

I understand from https://issues.apache.org/jira/browse/MAHOUT-767
that the rowsimilarity job was recently improved, to handle
less-sparse input.


This week I had some fun using rowsimilarity with a matrix of items
(books) and librarian-assigned topic codes, to generate (and finally
prune) similarities that could be fed into Gephi for visualization.
With relatively little hacking, and fairly modest initial data (100k
items), it worked pretty well, even on a laptop, and gave a rather
geographical 'map' of books clustered by similar topics.

See pretty pics and blather at http://danbri.org/words/2011/10/11/720
... (btw I'm quite happy with what I got back from Gephi given < 1
day's time, and encourage others to investigate the tool.)

So --- as I said in
http://www.mail-archive.com/[email protected]/msg06602.html ---
flushed with initial success, I revisited svd/lanczos looking for a
more sophisticated analysis of the item/topic associations. Initially I
hit some boring problems with the rowid job that was needed before I
could transpose. But getting past that, I am now trying to run
'rowsimilarity' against the output of Lanczos, where my rows are items
(TV shows this time, but they may as well be books) and my columns are
an SVD-reorganized view of topic space.

I started this way (after having seqdirectory'd, then rowid'd and
transposed the original input):

mahout rowsimilarity \
  --input lonclassland/postsvd/transpose-128/part-00000 \
  --output lonclassland/svdsims2 \
  -Dmapred.map.tasks=18 -Dmapred.reduce.tasks=18 \
  --numberOfColumns 280 \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD

...before reading/realising that rowsimilarity prefers sparse data.
And indeed its progress running on a cluster seems glacial.

Looking at MAHOUT-767 and 'bin/mahout rowsimilarity --help', I don't
see any obvious way forward.

* Would choosing a different similarity measure make any big
difference? (I'd guess for cosine...)
* Should I experiment with values for --threshold?
* Or somehow try to "re-sparsify" the input first? If I read the
output of 'mahout seqdumper --seqFile postsvd/transpose-128/part-00000
' correctly, there are many very small values; can they be
approximated to zero and discarded somehow?
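
To make the "re-sparsify" idea in the last bullet concrete, here is a
minimal sketch in plain Python. It is not Mahout code: the function name
`sparsify`, the sample row and the cutoff of 0.01 are all made up for
illustration, and whether discarding small post-SVD values preserves
useful similarities is an open question, not something this shows.

```python
def sparsify(row, cutoff):
    """Keep only entries whose absolute value is at least `cutoff`.

    Returns a {column_index: value} mapping, i.e. a sparse view of the row.
    Both the name and the cutoff are hypothetical, for illustration only.
    """
    return {i: v for i, v in enumerate(row) if abs(v) >= cutoff}


# A made-up dense row with several near-zero values.
row = [0.41, -0.0003, 0.0, 0.002, -0.27]
sparse_row = sparsify(row, cutoff=0.01)
print(sparse_row)
```

The same filter would have to be applied to every row before handing the
matrix to a job that expects sparse vectors.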

My high level goal is, for each item, to find a handful of the most
similar items, and then feed that to Gephi to generate topical maps of
the 'item landscape', grouping like items together. My intuition was
that doing this post-SVD might give a deeper insight into what the
bulk of these item/topic associations tell us, compared to doing
rowsimilarity against the raw item/topic matrix. However the tool I
found to explore this, rowsimilarity, seems not to be the right fit
here.
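
The goal described above (a handful of nearest neighbours per item) can
be sketched in plain Python as follows. This is only an illustration
under assumptions: it uses cosine similarity and a brute-force
comparison of every pair, which is quadratic in the number of items and
so would not scale to the real data; the item names and vectors are
invented, and none of this is how rowsimilarity works internally.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_similar(items, k):
    """For each item, return its k most similar other items.

    `items` maps a (hypothetical) item name to its dense vector.
    The result maps each name to a list of (similarity, other_name),
    highest similarity first. Brute force: O(n^2) comparisons.
    """
    result = {}
    for name, vec in items.items():
        scored = [
            (cosine(vec, other_vec), other_name)
            for other_name, other_vec in items.items()
            if other_name != name
        ]
        result[name] = heapq.nlargest(k, scored)
    return result


# Invented three-dimensional "post-SVD" rows for three items.
items = {
    "show_a": [1.0, 0.0, 0.2],
    "show_b": [0.9, 0.1, 0.3],
    "show_c": [0.0, 1.0, 0.0],
}
neighbours = top_k_similar(items, k=1)
```

Each (item, neighbour, similarity) triple in the result corresponds to
one weighted edge of the kind that could be loaded into Gephi.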

Thanks for any pointers,

Dan
