Thank you all for these links. I'll check them. On Wed, Aug 26, 2015 at 5:05 PM, Charlie Hack <[email protected]> wrote:
> +1 to all of the above esp. Dimensionality reduction and locality > sensitive hashing / min hashing. > > There's also an algorithm implemented in MLlib called DIMSUM which was > developed at Twitter for this purpose. I've been meaning to try it and > would be interested to hear about results you get. > > https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum > > Charlie > > > — Sent from Mailbox > > On Wednesday, Aug 26, 2015 at 09:57, Michael Malak < > [email protected]>, wrote: > >> Yes. And a paper that describes using grids (actually varying grids) is >> http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf >> In >> the Spark GraphX In Action book that Robin East and I are writing, we >> implement a drastically simplified version of this in chapter 7, which >> should become available in the MEAP mid-September. >> http://www.manning.com/books/spark-graphx-in-action >> >> >> ------------------------------ >> >> If you don't want to compute all N^2 similarities, you need to implement >> some kind of blocking first. For example, LSH (locally sensitive hashing). >> A quick search gave this link to a Spark implementation: >> >> >> http://stackoverflow.com/questions/27718888/spark-implementation-for-locality-sensitive-hashing >> >> >> >> On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa <[email protected]> >> wrote: >> >> Dear all, >> >> I'm trying to find an efficient way to build a k-NN graph for a large >> dataset. Precisely, I have a large set of high dimensional vector (say d >> >>> 10000) and I want to build a graph where those high dimensional points >> are the vertices and each one is linked to the k-nearest neighbor based on >> some kind similarity defined on the vertex spaces. >> My problem is to implement an efficient algorithm to compute the weight >> matrix of the graph. I need to compute a N*N similarities and the only way >> I know is to use "cartesian" operation follow by "map" operation on RDD. >> But, this is very slow when the N is large. Is there a more cleaver way to >> do this for an arbitrary similarity function ? >> >> Cheers, >> >> Jao >> >> >> >> >>
