Yeah, that's the kind of thing I'm looking for; I was looking at SPARK-4259 and poking around to see how to do things.
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4259

> On Jan 17, 2015, at 8:35 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
>
> Andrew, u would be better off using Mahout's RowSimilarityJob for what u r
> trying to accomplish.
>
> 1. It does give u pair-wise distances
> 2. U can specify the Distance measure u r looking to use
> 3. There's the old MapReduce impl and the Spark DSL impl per ur preference.
>
> From: Andrew Musselman <andrew.mussel...@gmail.com>
> To: Reza Zadeh <r...@databricks.com>
> Cc: user <user@spark.apache.org>
> Sent: Saturday, January 17, 2015 11:29 AM
> Subject: Re: Row similarities
>
> Thanks Reza, interesting approach. I think what I actually want is to
> calculate pair-wise distance, on second thought. Is there a pattern for that?
>
>> On Jan 16, 2015, at 9:53 PM, Reza Zadeh <r...@databricks.com> wrote:
>>
>> You can use K-means with a suitably large k. Each cluster should correspond
>> to rows that are similar to one another.
>>
>> On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman
>> <andrew.mussel...@gmail.com> wrote:
>> What's a good way to calculate similarities between all vector-rows in a
>> matrix or RDD[Vector]?
>>
>> I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm
>> going down a good path to transpose a matrix in order to run that.
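
For anyone following the thread, here is a rough single-machine sketch of what "pair-wise row similarity" means, in plain Python with no Spark or Mahout involved. The function names (`cosine_similarity`, `row_similarities`) and the choice of cosine as the measure are my own assumptions for illustration; this is not the API of RowMatrix or RowSimilarityJob, and the brute-force loop over all pairs is only meant for small inputs.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: an all-zero row is
        # treated as having similarity 0 to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def row_similarities(rows, measure=cosine_similarity):
    """Return {(i, j): similarity} for every pair of rows with i < j.

    `measure` is any function of two vectors, so a different
    similarity or distance can be swapped in (hypothetical hook).
    """
    return {
        (i, j): measure(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    }


if __name__ == "__main__":
    matrix = [
        [1.0, 0.0],
        [0.0, 1.0],
        [1.0, 1.0],
    ]
    for pair, sim in sorted(row_similarities(matrix).items()):
        print(pair, sim)
```

The number of pairs grows as n(n-1)/2 in the number of rows, which is why a distributed or approximate approach comes up as soon as the matrix gets large.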