Hi Jaonary,

What are the sizes involved, i.e. the number of points you're trying to do
all-pairs similarity on, and the dimension of each?

Have you tried the new implementation of columnSimilarities in RowMatrix?
Setting the threshold high enough (potentially above 1.0) might solve your
problem; here is an example:
<https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala>

This implements the DIMSUM sampling scheme, recently merged into master
<https://github.com/apache/spark/pull/1778>.
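
For reference, here is a minimal pure-Python sketch of the quantity
columnSimilarities computes: the cosine similarity between every pair of
columns, keeping only pairs at or above a threshold. (This is the brute-force
version; the DIMSUM scheme gets its speedup by sampling rows rather than
computing every pair exactly. The function name here is just illustrative, not
part of any Spark API.)

```python
import math

def column_similarities(rows, threshold=0.0):
    """Brute-force cosine similarity between all column pairs of a
    row-major matrix, dropping pairs below `threshold`. Illustrative
    only; RowMatrix.columnSimilarities with a threshold approximates
    this via DIMSUM sampling instead of exact all-pairs computation."""
    n = len(rows[0])
    # L2 norm of each column.
    norms = [math.sqrt(sum(r[j] ** 2 for r in rows)) for j in range(n)]
    sims = {}
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(r[i] * r[j] for r in rows)
            denom = norms[i] * norms[j]
            s = dot / denom if denom else 0.0
            if s >= threshold:
                sims[(i, j)] = s
    return sims
```

With a high threshold, most column pairs are filtered out, which is what makes
the sampled version cheap on large inputs.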

Best,
Reza

On Fri, Oct 17, 2014 at 3:43 AM, Jaonary Rabarisoa <jaon...@gmail.com>
wrote:

> Hi all,
>
> I need to compute a similarity between elements of two large sets of
> high-dimensional feature vectors.
> Naively, I create all possible pairs of vectors with
> *features1.cartesian(features2)* and then map the resulting paired RDD
> with my similarity function.
>
> The problem is that the cartesian operation takes a lot of time, more time
> than computing the similarity itself. If I save each of my feature vectors
> to disk, form a list of file-name pairs, and compute the similarity by
> reading the files, it runs significantly faster.
>
> Any ideas will be helpful,
>
> Cheers,
>
> Jao
>
