K-NN by efficient sparse matrix product

2014-05-28 Thread Christian Jauvin
Hi,

I'm new to Spark and Hadoop, and I'd like to know if the following
problem is solvable in terms of Spark's primitives.

To compute the K-nearest neighbours of a N-dimensional dataset, I can
multiply my very large normalized sparse matrix by its transpose. As
this yields all pairwise distance values (N x N), I can then sort each
row and only keep the K highest elements for each, resulting in a N x
K dense matrix.

As this Quora answer suggests:

http://qr.ae/v03lY

rather than the row-wise dot product, which would be O(N^2), it's
better to compute the sum of the column outer products, which is O(N x
K^2).

However, given the number of non-zero elements in the resulting
matrix, it seems I could not afford to first perform the full
multiplication (N x N) and then prune it afterward (N x K).. So I need
a way to prune it on the fly.

The original algorithm I came up with is roughly this, for an input matrix M:

for each row i:
__outer_i = [0] * N
__for j in nonzero elements of row i:
for k in nonzero elements of col j:
__outer_i[k] += M[i][j] * M[k][j]
__nearest_i = {sort outer_i and keep best K}

which can be parallelized in an embarrassing way, i.e. each compute
node can simply process a slice of the the rows.

Would there be a way to do something similar (or related) with Spark?

Christian


Re: K-NN by efficient sparse matrix product

2014-05-28 Thread Christian Jauvin
Thank you for your answer. Would you have by any chance some example
code (even fragmentary) that I could study?

On 28 May 2014 14:04, Tom Vacek minnesota...@gmail.com wrote:
 Maybe I should add: if you can hold the entire matrix in memory, then this
 is embarrassingly parallel.  If not, then the complications arise.


 On Wed, May 28, 2014 at 1:00 PM, Tom Vacek minnesota...@gmail.com wrote:

 The problem with matrix multiplication is that the amount of data blows up
 between the mapper and the reducer, and the shuffle operation is very slow.
 I have not ever tried this, but the shuffle can be avoided by making use of
 the broadcast.  Say we have M = L*R.  We do a column decomposition on R, and
 we collect rows of L to the master and broadcast them (in manageably-sized
 blocks).  Each worker does a dot product and discards the row block when
 finished.  In theory, this has complexity max(nnz(L)*log p, nnz(L)*n/p).  I
 have to warn though: when I played with matrix multiplication, I was getting
 nowhere near serial performance.


 On Wed, May 28, 2014 at 11:00 AM, Christian Jauvin cjau...@gmail.com
 wrote:

 Hi,

 I'm new to Spark and Hadoop, and I'd like to know if the following
 problem is solvable in terms of Spark's primitives.

 To compute the K-nearest neighbours of a N-dimensional dataset, I can
 multiply my very large normalized sparse matrix by its transpose. As
 this yields all pairwise distance values (N x N), I can then sort each
 row and only keep the K highest elements for each, resulting in a N x
 K dense matrix.

 As this Quora answer suggests:

 http://qr.ae/v03lY

 rather than the row-wise dot product, which would be O(N^2), it's
 better to compute the sum of the column outer products, which is O(N x
 K^2).

 However, given the number of non-zero elements in the resulting
 matrix, it seems I could not afford to first perform the full
 multiplication (N x N) and then prune it afterward (N x K).. So I need
 a way to prune it on the fly.

 The original algorithm I came up with is roughly this, for an input
 matrix M:

 for each row i:
 __outer_i = [0] * N
 __for j in nonzero elements of row i:
 for k in nonzero elements of col j:
 __outer_i[k] += M[i][j] * M[k][j]
 __nearest_i = {sort outer_i and keep best K}

 which can be parallelized in an embarrassing way, i.e. each compute
 node can simply process a slice of the the rows.

 Would there be a way to do something similar (or related) with Spark?

 Christian