Andrew, you would be better off using Mahout's RowSimilarityJob for what you 
are trying to accomplish.

 1. It gives you pair-wise distances.
 2. You can specify the distance measure you want to use.
 3. There is the old MapReduce implementation and the Spark DSL 
    implementation, whichever you prefer.
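
To make "pair-wise distances with a pluggable distance measure" concrete, here 
is a minimal single-machine sketch in plain Python. It is not Mahout's or 
Spark's API; the names pairwise_distances, euclidean and cosine_distance are 
made up for illustration, and a distributed job would partition this work 
rather than loop over every pair in memory.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def pairwise_distances(rows, measure=euclidean):
    """Map each row-index pair (i, j) with i < j to measure(rows[i], rows[j])."""
    return {
        (i, j): measure(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    }


# Hypothetical toy data: three 2-dimensional rows.
rows = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
euclid = pairwise_distances(rows)
cosine = pairwise_distances(rows, measure=cosine_distance)
```

Swapping the measure argument is the only change needed to move between 
distance definitions, which is the same idea as point 2 above.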

From: Andrew Musselman <andrew.mussel...@gmail.com>
To: Reza Zadeh <r...@databricks.com>
Cc: user <user@spark.apache.org>
Sent: Saturday, January 17, 2015 11:29 AM
Subject: Re: Row similarities

Thanks Reza, interesting approach.  I think what I actually want is to 
calculate pair-wise distance, on second thought.  Is there a pattern for that?


On Jan 16, 2015, at 9:53 PM, Reza Zadeh <r...@databricks.com> wrote:


You can use K-means with a suitably large k. Each cluster should correspond to 
rows that are similar to one another.
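
As a rough illustration of the clustering idea, the sketch below runs a few 
rounds of Lloyd's algorithm in plain Python and returns a cluster label per 
row. It is not MLlib's implementation; the kmeans helper, its naive "first k 
rows" initialization and the toy data are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=10):
    """Return a cluster index for each row after a fixed number of Lloyd steps."""
    # Naive deterministic initialization: use the first k rows as centroids.
    centroids = [list(r) for r in rows[:k]]
    assignment = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: attach every row to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [rows[i] for i, a in enumerate(assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical toy data: two tight groups of rows.
rows = [[0.0, 0.0], [10.0, 10.0], [0.1, 0.0], [10.0, 10.1]]
labels = kmeans(rows, k=2)
```

Rows that share a label are the "similar" ones. Note that this gives group 
membership only, not a distance or similarity score for each pair of rows.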

On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman <andrew.mussel...@gmail.com> 
wrote:

What's a good way to calculate similarities between all vector-rows in a matrix 
or RDD[Vector]?

I see that RowMatrix has a columnSimilarities method, but I'm not sure that 
transposing the matrix just to run it is a good approach.
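
The transpose idea can be checked on a small in-memory matrix: the cosine 
similarity between two rows of a matrix equals the cosine similarity between 
the corresponding columns of its transpose. The sketch below is plain Python, 
not Spark; column_similarities and transpose are hypothetical helpers that 
only mimic the shape of the computation.

```python
import math


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def column_similarities(matrix):
    """Cosine similarity for each column pair (i, j) with i < j.

    Pairs involving an all-zero column are skipped.
    """
    columns = list(zip(*matrix))
    norms = [math.sqrt(sum(x * x for x in col)) for col in columns]
    sims = {}
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if norms[i] == 0 or norms[j] == 0:
                continue
            dot = sum(x * y for x, y in zip(columns[i], columns[j]))
            sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims


# Hypothetical toy data: rows 0 and 1 point the same way, row 2 is orthogonal.
matrix = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]

# Column similarities of the transpose are row similarities of the original.
row_sims = column_similarities(transpose(matrix))
```

On a distributed matrix the transpose itself is the expensive part, which is 
the concern raised above; this sketch only shows that the result is the one 
being asked for.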





  
