Re: rowsimilarity and non-sparse input

Sebastian Schelter Mon, 17 Oct 2011 05:48:04 -0700

RowSimilarityJob will have quadratic runtime for dense input and might
generate large intermediate outputs. I'd argue against using it for such
purposes.
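
To illustrate the quadratic growth, here is a rough back-of-the-envelope sketch. It is not Mahout code. It assumes the cost is dominated by the number of row pairs that share a non-zero entry in some column, and the row count and per-column densities are hypothetical; only the 280 columns come from the command quoted below.

```python
def cooccurring_pairs(column_nonzero_counts):
    """Sum, over all columns, of the number of row pairs sharing that column.

    A column with n non-zero entries contributes n * (n - 1) / 2 pairs.
    """
    return sum(n * (n - 1) // 2 for n in column_nonzero_counts)


# Hypothetical sizes, chosen only to show the difference in scale.
num_rows = 100_000
num_columns = 280

# Sparse case: assume each column is non-zero in only 50 rows.
sparse_pairs = cooccurring_pairs([50] * num_columns)

# Dense case (e.g. SVD output): every column is non-zero in every row.
dense_pairs = cooccurring_pairs([num_rows] * num_columns)

print("sparse pairs:", sparse_pairs)
print("dense pairs: ", dense_pairs)
print("ratio:       ", dense_pairs // sparse_pairs)
```

With dense rows every column pairs every row with every other row, so the pair count grows with the square of the number of rows.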


--sebastian

On 17.10.2011 14:43, Dan Brickley wrote:
> I understand from https://issues.apache.org/jira/browse/MAHOUT-767
> that the rowsimilarity job was recently improved, to handle
> less-sparse input.
> 
> This week I had some fun using rowsimilarity with a matrix of items
> (books) and librarian-assigned topic codes, to generate (and finally
> prune) similarities that could be fed into Gephi for visualization.
> With relatively little hacking, and fairly modest initial data (100k
> items) it worked pretty fine, even on a laptop, and gave a rather
> geographical 'map' of books clustered by similar topics.
> 
> See pretty pics and blather at http://danbri.org/words/2011/10/11/720
> ... (btw I'm quite happy with what I got back from Gephi given < 1
> day's time, and encourage others to investigate the tool.)
> 
> So --- as I said in
> http://www.mail-archive.com/[email protected]/msg06602.html ---
> flushed with initial success, I revisited svd/lanczos looking for a
> more sophisticated analysis of the item/topic associations. Initially I
> hit some boring problems with the rowid job that was needed before I
> could transpose. But getting past that, I am now trying to run
> 'rowsimilarity' against the output of Lanczos, where my rows are items
> (TV shows this time, but may as well be books) and my columns are an
> SVD-reorganized view of topic space.
> 
> I started this way (after having seqdirectory'd, then rowid'd and
> transposed the original input):
> 
> mahout rowsimilarity --input
> lonclassland/postsvd/transpose-128/part-00000  --output
> lonclassland/svdsims2
> -Dmapred.map.tasks=18 -Dmapred.reduce.tasks=18  --numberOfColumns 280
> --similarityClassname SIMILARITY_LOGLIKELIHOOD
> 
> ...before reading/realising that rowsimilarity prefers sparse data.
> And indeed its progress running on a cluster seems glacial.
> 
> Looking at MAHOUT-767 and 'bin/mahout rowsimilarity --help', I don't
> see any obvious way forward.
> 
> * Would choosing a different similarity measure make any big
> difference? (I'd guess for cosine...)
> * Should I experiment with values for --threshold?
> * Or somehow try to "re-sparsify" the input first? If I read the
> output of 'mahout seqdumper --seqFile postsvd/transpose-128/part-00000
> ' correctly, there are many very small values; can they be
> approximated to zero and discarded somehow?
> 
> My high level goal is, for each item, to find a handful of the most
> similar items, and then feed that to Gephi to generate topical maps of
> the 'item landscape', grouping like items together. My intuition was
> that doing this post-SVD might give a deeper insight into what the
> bulk of these item/topic associations tell us, compared to doing
> rowsimilarity against the raw item/topic matrix. However the tool I
> found to explore this, rowsimilarity, seems not to be the right fit
> here.
> 
> Thanks for any pointers,
> 
> Dan
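
The "re-sparsify, then keep a handful of most similar items" idea from the question can be sketched in a few lines of plain Python. This is only an in-memory illustration, not a Mahout job: the names `sparsify`, `cosine` and `top_k_similar`, the cutoff `epsilon`, and the sample rows are all assumptions made up for the example. Dropping small values changes the similarities, so whether the approximation is acceptable depends on the data, and the comparison below is still all-pairs.

```python
import heapq
import math


def sparsify(row, epsilon):
    """Keep only entries whose absolute value is at least epsilon."""
    return {col: v for col, v in enumerate(row) if abs(v) >= epsilon}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {column: value}."""
    dot = sum(v * b[c] for c, v in a.items() if c in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(rows, k, epsilon):
    """For each item, return its k most similar items as (score, name) pairs.

    Brute force over all pairs, so only suitable for small inputs.
    """
    sparse = {name: sparsify(vec, epsilon) for name, vec in rows.items()}
    result = {}
    for name, vec in sparse.items():
        scored = (
            (cosine(vec, other), other_name)
            for other_name, other in sparse.items()
            if other_name != name
        )
        result[name] = heapq.nlargest(k, scored)
    return result


# Hypothetical post-SVD rows: item name -> dense vector with some tiny values.
rows = {
    "show_a": [0.9, 0.001, 0.1],
    "show_b": [0.8, -0.002, 0.2],
    "show_c": [0.0005, 0.7, -0.6],
}

top = top_k_similar(rows, k=1, epsilon=0.01)
for name, neighbours in top.items():
    print(name, neighbours)
```

The resulting (item, neighbour, score) triples are the kind of pruned edge list that could then be exported for a graph tool such as Gephi.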
