Re: Build k-NN graph for large dataset

2015-08-29 Thread Maruf Aytekin
Yes, you need to use dimensionality reduction and/or locality-sensitive
hashing to reduce the number of pairs to compare. There is also an LSH
implementation for collections of vectors that I have just published here:
https://github.com/marufaytekin/lsh-spark. The implementation is based on this
paper:
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
I hope it helps.
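
For what it's worth, here is a minimal sketch of the sign-random-projection scheme from that paper (plain Scala, illustrative names only, not the lsh-spark API): each vector gets a bit signature from the signs of random projections, cosine similarity can be estimated from the Hamming distance between signatures, and only vectors with close signatures need to be compared.

import scala.util.Random

// Random hyperplanes used to hash d-dimensional vectors into bit signatures.
def randomHyperplanes(dim: Int, numBits: Int, seed: Long): Array[Array[Double]] = {
  val rng = new Random(seed)
  Array.fill(numBits)(Array.fill(dim)(rng.nextGaussian()))
}

// One bit per hyperplane: the sign of the dot product.
def signature(v: Array[Double], planes: Array[Array[Double]]): Array[Boolean] =
  planes.map(h => v.zip(h).map { case (a, b) => a * b }.sum >= 0.0)

// Pr[bits agree] = 1 - theta/pi, so theta is roughly pi * (fraction of differing bits).
def estimatedCosine(s1: Array[Boolean], s2: Array[Boolean]): Double = {
  val differing = s1.zip(s2).count { case (a, b) => a != b }
  math.cos(math.Pi * differing.toDouble / s1.length)
}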

Maruf



Re: Build k-NN graph for large dataset

2015-08-27 Thread Jaonary Rabarisoa
Thank you all for these links. I'll check them.



Build k-NN graph for large dataset

2015-08-26 Thread Jaonary Rabarisoa
Dear all,

I'm trying to find an efficient way to build a k-NN graph for a large
dataset. Precisely, I have a large set of high-dimensional vectors (say d >> 1),
and I want to build a graph where those high-dimensional points are the
vertices and each one is linked to its k nearest neighbors based on some kind
of similarity defined on the vertex space.
My problem is to implement an efficient algorithm to compute the weight
matrix of the graph. I need to compute N*N similarities, and the only way
I know is to use the cartesian operation followed by a map operation on the RDD.
But this is very slow when N is large. Is there a more clever way to
do this for an arbitrary similarity function?
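
For concreteness, the cartesian-plus-map approach I mean looks roughly like the sketch below (illustrative names; cosine stands in for the arbitrary similarity function). The groupByKey over all N*N pairs is the part that becomes unbearably slow.

import org.apache.spark.rdd.RDD

// Brute-force baseline: every pair via cartesian, then keep the top-k per vertex.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  if (norms == 0.0) 0.0 else dot / norms
}

def knnEdgesBruteForce(points: RDD[(Long, Array[Double])],
                       k: Int): RDD[(Long, Seq[(Long, Double)])] =
  points.cartesian(points)
        .filter { case ((i, _), (j, _)) => i != j }
        .map    { case ((i, a), (j, b)) => (i, (j, cosine(a, b))) }
        .groupByKey()                                   // shuffles O(N^2) pairs
        .mapValues(_.toSeq.sortBy(-_._2).take(k))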

Cheers,

Jao


Re: Build k-NN graph for large dataset

2015-08-26 Thread Robin East
You could try dimensionality reduction (PCA or SVD) first. I would imagine that 
even if you could successfully compute similarities in the high-dimensional 
space you would probably run into the curse of dimensionality.
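
As a minimal sketch of the PCA route with MLlib's RowMatrix (Spark 1.x API); keeping track of which projected row belongs to which original point is left out here:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Project d-dimensional vectors onto the top-k principal components before
// doing any pairwise similarity work.
def reduceWithPCA(vectors: RDD[Vector], k: Int): RowMatrix = {
  val mat = new RowMatrix(vectors)
  val pc  = mat.computePrincipalComponents(k)   // d x k local matrix
  mat.multiply(pc)                              // N x k matrix of projected rows
}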


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Build k-NN graph for large dataset

2015-08-26 Thread Kristina Rogale Plazonic
If you don't want to compute all N^2 similarities, you need to implement
some kind of blocking first. For example, LSH (locality-sensitive hashing).
A quick search gave this link to a Spark implementation:

http://stackoverflow.com/questions/2771/spark-implementation-for-locality-sensitive-hashing
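
As a rough illustration of the blocking pattern (approximate by construction, since neighbours that land in different buckets are never compared; blockKey and sim are placeholders for whatever hashing and similarity you choose):

import org.apache.spark.rdd.RDD

// Only compare points that share a block key (e.g. an LSH bucket) instead of all N^2 pairs.
def knnWithBlocking(points: RDD[(Long, Array[Double])],
                    blockKey: Array[Double] => String,
                    sim: (Array[Double], Array[Double]) => Double,
                    k: Int): RDD[(Long, Seq[(Long, Double)])] = {
  val keyed = points.map { case (id, v) => (blockKey(v), (id, v)) }
  keyed.join(keyed)                                     // candidate pairs within a block only
       .values
       .filter { case ((i, _), (j, _)) => i != j }
       .map    { case ((i, a), (j, b)) => (i, (j, sim(a, b))) }
       .groupByKey()
       .mapValues(_.toSeq.sortBy(-_._2).take(k))
}

Using several independent hash functions (and unioning the candidate pairs) reduces the chance of missing true neighbours.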



Re: Build k-NN graph for large dataset

2015-08-26 Thread Michael Malak
Yes. And a paper that describes using grids (actually varying grids) is
http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf
In the Spark GraphX In Action book that Robin East and I are writing, we
implement a drastically simplified version of this in chapter 7, which should
become available in the MEAP in mid-September.
http://www.manning.com/books/spark-graphx-in-action
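
Not the book's code, but a drastically simplified illustration of the grid idea in two dimensions (e.g. after projecting the data down): snap each point to a cell and only generate candidate pairs within a cell and its eight neighbours.

import org.apache.spark.rdd.RDD

// Replicate each point into its own cell and the neighbouring cells so that
// points in adjacent cells still meet as candidate pairs.
def gridCandidates(points: RDD[(Long, (Double, Double))],
                   cellSize: Double): RDD[((Long, (Double, Double)), (Long, (Double, Double)))] = {
  def cell(x: Double, y: Double) = ((x / cellSize).floor.toInt, (y / cellSize).floor.toInt)

  val homes = points.map { case p @ (_, (x, y)) => (cell(x, y), p) }
  val spread = points.flatMap { case p @ (_, (x, y)) =>
    val (cx, cy) = cell(x, y)
    for (dx <- -1 to 1; dy <- -1 to 1) yield ((cx + dx, cy + dy), p)
  }
  homes.join(spread)
       .values
       .filter { case ((i, _), (j, _)) => i != j }
}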


Re: Build k-NN graph for large dataset

2015-08-26 Thread Charlie Hack
+1 to all of the above, especially dimensionality reduction and locality-sensitive
hashing / min-hashing.

There's also an algorithm implemented in MLlib called DIMSUM, which was
developed at Twitter for this purpose. I've been meaning to try it and would be
interested to hear about the results you get.

https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
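
If it helps, a small sketch of calling it: DIMSUM is exposed in MLlib as RowMatrix.columnSimilarities, and it computes cosine similarities between the columns of the matrix, so the data points go in as columns (the transpose of the usual row-per-point layout). The toy numbers below are made up.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Three points in R^4, stored as the columns of a 4 x 3 matrix.
def dimsumExample(sc: SparkContext): Unit = {
  val rows = sc.parallelize(Seq(
    Vectors.dense(1.0, 0.0, 1.0),
    Vectors.dense(0.0, 1.0, 1.0),
    Vectors.dense(1.0, 1.0, 0.0),
    Vectors.dense(0.0, 0.0, 1.0)
  ))
  val mat = new RowMatrix(rows)
  // A threshold > 0 switches on the approximate (sampled) DIMSUM algorithm.
  val sims = mat.columnSimilarities(0.1)
  sims.entries.collect().foreach(println)   // MatrixEntry(i, j, cosineSimilarity)
}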

Charlie



