Cartesian joins of large datasets are usually slow. If you can reduce the
problem space so that you only join relevant subsets with each other, that
should help. If you explain your problem in more detail, people on the
list may be able to come up with more suggestions.
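
For instance, here is a rough sketch of that idea (Scala): key both RDDs by a
coarse bucket and join on the key, so only vectors in the same bucket get
compared. The bucket and similarity functions below are just placeholders,
not something from your code; you would substitute your own (an LSH scheme,
clustering, or some domain-specific key).

import org.apache.spark.SparkContext._   // pair RDD functions (join, etc.)
import org.apache.spark.rdd.RDD

// Placeholder bucketing: assign each vector a coarse key so that only
// vectors sharing a key need to be compared. Replace with something
// meaningful for your data.
def bucket(v: Array[Double]): Int = (v.head * 10).toInt

// Placeholder similarity, here a plain dot product.
def similarity(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def blockedSimilarities(
    features1: RDD[Array[Double]],
    features2: RDD[Array[Double]]): RDD[Double] = {
  val keyed1 = features1.keyBy(bucket)
  val keyed2 = features2.keyBy(bucket)
  // Join only pairs that share a bucket instead of taking the full
  // cartesian product of the two sets.
  keyed1.join(keyed2).map { case (_, (a, b)) => similarity(a, b) }
}

With a reasonable bucketing function this avoids materializing all N x M
pairs. One caveat: if many vectors land in the same bucket, that bucket's
join still degenerates toward a cartesian product, so the choice of key
matters.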

Best Regards,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>



On Fri, Oct 17, 2014 at 4:13 PM, Jaonary Rabarisoa <jaon...@gmail.com>
wrote:

> Hi all,
>
> I need to compute a similarity between the elements of two large sets of
> high-dimensional feature vectors.
> Naively, I create all possible pairs of vectors with
> features1.cartesian(features2) and then map the resulting paired RDD
> with my similarity function.
>
> The problem is that the cartesian operation takes a lot of time, more time
> than computing the similarity itself. If I save each of my feature vectors
> to disk, form a list of file name pairs, and compute the similarity by
> reading the files, it runs significantly faster.
>
> Any ideas will be helpful,
>
> Cheers,
>
> Jao
>
