You could try dimensionality reduction (PCA or SVD) first. I would imagine that even if you managed to compute all the similarities in the high-dimensional space, you would still run into the curse of dimensionality: at d that large, pairwise distances tend to concentrate, so the nearest neighbors become hard to distinguish from everything else. A rough sketch of the PCA-first approach follows the quoted message below.

> On 26 Aug 2015, at 12:35, Jaonary Rabarisoa <[email protected]> wrote:
>
> Dear all,
>
> I'm trying to find an efficient way to build a k-NN graph for a large
> dataset. Precisely, I have a large set of high-dimensional vectors (say
> d >>> 10000) and I want to build a graph where those high-dimensional
> points are the vertices and each one is linked to its k nearest
> neighbors based on some kind of similarity defined on the vertex space.
> My problem is implementing an efficient algorithm to compute the weight
> matrix of the graph. I need to compute N*N similarities, and the only
> way I know is to use a "cartesian" operation followed by a "map"
> operation on the RDD. But this is very slow when N is large. Is there a
> cleverer way to do this for an arbitrary similarity function?
>
> Cheers,
>
> Jao
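
For what it's worth, here is a minimal RDD-based sketch of that idea using MLlib's RowMatrix (Spark 1.x API). The object name, the random placeholder data, and the choices of 50 components and k = 10 are arbitrary illustrations, not recommendations; the cartesian step is still O(N^2) pairs, the projection only makes each comparison cheaper:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.{SparkConf, SparkContext}

    object KnnGraphSketch {

      // Cosine similarity; swap in whatever similarity you actually need.
      def cosine(a: Vector, b: Vector): Double = {
        val x = a.toArray
        val y = b.toArray
        val dot = x.indices.map(i => x(i) * y(i)).sum
        dot / (math.sqrt(x.map(v => v * v).sum) *
               math.sqrt(y.map(v => v * v).sum))
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("knn-graph-sketch"))

        // Placeholder input: replace with your own RDD[Vector] of
        // d-dimensional points.
        val d = 10000
        val points = sc.parallelize(0 until 1000)
          .map(_ => Vectors.dense(Array.fill(d)(scala.util.Random.nextDouble())))

        // Step 1: project onto the top 50 principal components. Note that
        // computePrincipalComponents assembles a d x d Gramian on the driver,
        // so this only scales to d in the tens of thousands.
        val mat = new RowMatrix(points)
        val pc = mat.computePrincipalComponents(50)
        val reduced = mat.multiply(pc).rows.zipWithIndex().map(_.swap).cache()

        // Step 2: brute-force k-NN edges in the reduced space. Still O(N^2)
        // pairs, but each is now 50-dimensional rather than 10000-dimensional.
        val k = 10
        val edges = reduced.cartesian(reduced)
          .filter { case ((i, _), (j, _)) => i != j }
          .map { case ((i, vi), (j, vj)) => (i, (j, cosine(vi, vj))) }
          .groupByKey()
          .mapValues(_.toSeq.sortBy { case (_, sim) => -sim }.take(k))

        edges.take(5).foreach(println)
        sc.stop()
      }
    }

If you would rather work with the singular vectors directly, computeSVD on the same RowMatrix is the alternative.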
