Re: Build k-NN graph for large dataset
Yes, you need to use dimensionality reduction and/or locality-sensitive hashing to reduce the number of pairs to compare. There is also an LSH implementation for collections of vectors that I have just published here: https://github.com/marufaytekin/lsh-spark. The implementation is based on this paper: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf I hope it helps.
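For reference, the core of the scheme in that paper is simple enough to sketch in a few lines of plain Scala (this is an illustration, not the library's API): each vector gets a bit signature from the signs of random projections, and the cosine similarity of two vectors can be estimated from how many bits their signatures share.

    import scala.util.Random

    // Bit i of the signature is the sign of the dot product with the i-th
    // random Gaussian hyperplane (Charikar's rounding scheme).
    def signature(v: Array[Double], planes: Array[Array[Double]]): Array[Boolean] =
      planes.map(p => p.zip(v).map { case (a, b) => a * b }.sum >= 0.0)

    // Two signatures agree on a bit with probability 1 - theta/pi, so
    // theta ~ pi * (1 - agreement), and cos(theta) estimates the similarity.
    def estimatedCosine(s1: Array[Boolean], s2: Array[Boolean]): Double = {
      val agree = s1.zip(s2).count { case (a, b) => a == b }
      math.cos(math.Pi * (1.0 - agree.toDouble / s1.length))
    }

    // Example: 64 hyperplanes for 100-dimensional input.
    val rnd = new Random(42)
    val planes = Array.fill(64, 100)(rnd.nextGaussian())

Maruf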
Re: Build k-NN graph for large dataset
Thank you all for these links. I'll check them.
Build k-NN graph for large dataset
Dear all, I'm trying to find an efficient way to build a k-NN graph for a large dataset. Precisely, I have a large set of high-dimensional vectors (say d >> 1), and I want to build a graph where those high-dimensional points are the vertices and each one is linked to its k nearest neighbors based on some kind of similarity defined on the vertex space. My problem is to implement an efficient algorithm to compute the weight matrix of the graph. I need to compute N*N similarities, and the only way I know is to use a cartesian operation followed by a map operation on an RDD. But this is very slow when N is large. Is there a cleverer way to do this for an arbitrary similarity function?
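To make the current approach concrete, here is a minimal sketch of the cartesian-plus-map version (the ids, the Array[Double] representation, and the similarity function are placeholders); the cartesian step is what makes it O(N^2):

    import org.apache.spark.rdd.RDD

    // points: (id, vector). cartesian materializes all N^2 pairs, which is
    // exactly the cost that blows up for large N.
    def bruteForceKnn(points: RDD[(Long, Array[Double])],
                      k: Int,
                      sim: (Array[Double], Array[Double]) => Double)
        : RDD[(Long, Array[(Long, Double)])] =
      points.cartesian(points)
        .filter { case ((i, _), (j, _)) => i != j }
        .map { case ((i, a), (j, b)) => (i, (j, sim(a, b))) }
        .groupByKey()
        .mapValues(_.toArray.sortBy(-_._2).take(k))

Cheers, Jao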
Re: Build k-NN graph for large dataset
You could try dimensionality reduction (PCA or SVD) first. I would imagine that even if you could successfully compute similarities in the high-dimensional space, you would probably run into the curse of dimensionality.
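For example, with MLlib's RowMatrix (a sketch; targetDim is an arbitrary choice, and PCA here assumes d is small enough for a local covariance matrix, i.e. up to a few thousand):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    // Project the vectors onto their top principal components before any
    // pairwise step; similarities are then computed in targetDim dimensions.
    def reduceDim(vectors: RDD[Vector], targetDim: Int): RowMatrix = {
      val mat = new RowMatrix(vectors)
      val pc = mat.computePrincipalComponents(targetDim) // d x targetDim
      mat.multiply(pc) // same rows, now with targetDim columns
    }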
Re: Build k-NN graph for large dataset
If you don't want to compute all N^2 similarities, you need to implement some kind of blocking first, for example LSH (locality-sensitive hashing). A quick search gave this link to a Spark implementation: http://stackoverflow.com/questions/2771/spark-implementation-for-locality-sensitive-hashing
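The blocking itself fits in a few lines; a rough sketch using random-hyperplane bucket keys (one hash table only; a real scheme would use several tables, or fewer bits, to recover neighbors that land in different buckets):

    import scala.util.Random
    import org.apache.spark.rdd.RDD

    // Vectors sharing a key are the only candidate pairs, so the number of
    // comparisons drops roughly by a factor of 2^numBits.
    def lshBuckets(points: RDD[(Long, Array[Double])],
                   dim: Int, numBits: Int, seed: Long)
        : RDD[(String, Iterable[(Long, Array[Double])])] = {
      val rnd = new Random(seed)
      val planes = Array.fill(numBits, dim)(rnd.nextGaussian())
      val bc = points.sparkContext.broadcast(planes)
      points.map { case (id, v) =>
        val key = bc.value.map { p =>
          if (p.zip(v).map { case (a, b) => a * b }.sum >= 0) '1' else '0'
        }.mkString
        (key, (id, v))
      }.groupByKey()
    }

You would then compute similarities only within each bucket.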
Re: Build k-NN graph for large dataset
Yes. And a paper that describes using grids (actually varying grids) is http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf In the Spark GraphX In Action book that Robin East and I are writing, we implement a drastically simplified version of this in chapter 7, which should become available in the MEAP mid-September. http://www.manning.com/books/spark-graphx-in-action
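To give the flavor (not the book's code, just a toy 2-D illustration): points are keyed by grid cell, and neighbor candidates come only from the same cell. A fuller version would also probe adjacent cells and repeat with shifted or varying grids, as in the paper.

    import org.apache.spark.rdd.RDD

    // Key each 2-D point by its grid cell; candidates share a cell.
    def gridBuckets(points: RDD[(Long, (Double, Double))], cellWidth: Double)
        : RDD[((Int, Int), Iterable[(Long, (Double, Double))])] =
      points.map { case (id, (x, y)) =>
        ((math.floor(x / cellWidth).toInt, math.floor(y / cellWidth).toInt), (id, (x, y)))
      }.groupByKey()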
Re: Build k-NN graph for large dataset
+1 to all of the above, esp. dimensionality reduction and locality-sensitive hashing / min-hashing. There's also an algorithm implemented in MLlib called DIMSUM, which was developed at Twitter for this purpose. I've been meaning to try it and would be interested to hear about the results you get. https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
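MLlib exposes DIMSUM through RowMatrix.columnSimilarities. One thing to watch (a sketch, with the threshold left to the caller): it computes cosine similarities between the columns of the matrix, so the N points have to be laid out as columns, with the d feature dimensions as rows.

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, RowMatrix}

    // With threshold > 0, DIMSUM samples aggressively: pairs whose cosine
    // similarity is below the threshold may be missed, in exchange for a
    // much cheaper shuffle.
    def dimsumSimilarities(matByDim: RowMatrix, threshold: Double): CoordinateMatrix =
      matByDim.columnSimilarities(threshold)

Charlie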