The problem with matrix multiplication is that the amount of data blows up
between the mapper and the reducer, and the shuffle operation is very slow.
 I have never tried this, but the shuffle can be avoided by using a
broadcast.  Say we have M = L*R.  We do a column decomposition on R,
collect rows of L to the master, and broadcast them in manageably-sized
blocks.  Each worker computes the dot products against its local columns
and discards the row block when finished.  In theory, this has complexity
max(nnz(L)*log p, nnz(L)*n/p).  I have to warn, though: when I played with
matrix multiplication, I was getting nowhere near serial performance.
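
Concretely, here is an untested PySpark sketch of that scheme.  It assumes
L fits on the driver as a scipy.sparse CSR matrix and that R is already
distributed as (column index, column) pairs; all of the names are
illustrative:

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-multiply")

def multiply(L, R_cols, block_rows=1024):
    # R_cols: RDD of (j, r_j) pairs, r_j a dense numpy column of R
    parts = []
    for start in range(0, L.shape[0], block_rows):
        blk = sc.broadcast(L[start:start + block_rows])  # one row block of L
        # default args pin start/blk per iteration (lambdas close over
        # loop variables late in Python)
        parts.append(R_cols.map(
            lambda jc, s=start, b=blk: (jc[0], (s, b.value.dot(jc[1])))))
    # each record is (j, (row offset, that slice of column j of M));
    # only broadcasts are involved, no shuffle
    return sc.union(parts)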


On Wed, May 28, 2014 at 11:00 AM, Christian Jauvin <cjau...@gmail.com> wrote:

> Hi,
>
> I'm new to Spark and Hadoop, and I'd like to know if the following
> problem is solvable in terms of Spark's primitives.
>
> To compute the K-nearest neighbours of a N-dimensional dataset, I can
> multiply my very large normalized sparse matrix by its transpose. As
> this yields all pairwise distance values (N x N), I can then sort each
> row and only keep the K highest elements for each, resulting in a N x
> K dense matrix.
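>
> For example, a toy NumPy version of that idea (illustrative only, not at
> scale):
>
> import numpy as np
>
> M = np.random.rand(5, 3)                       # toy 5 x 3 dataset
> M /= np.linalg.norm(M, axis=1, keepdims=True)  # normalize the rows
> S = M.dot(M.T)                                 # 5 x 5 pairwise similarities
> K = 2
> nearest = np.argsort(S, axis=1)[:, -K:]        # K highest entries per row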
>
> As this Quora answer suggests:
>
> http://qr.ae/v03lY
>
> rather than the row-wise dot product, which would be O(N^2), it's
> better to compute the sum of the column outer products, which is O(N x
> K^2).
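>
> (A quick NumPy check that the two routes give the same matrix:)
>
> import numpy as np
>
> M = np.random.rand(4, 6)
> S_rows = M.dot(M.T)                            # row-wise dot products
> S_cols = sum(np.outer(M[:, j], M[:, j]) for j in range(M.shape[1]))
> assert np.allclose(S_rows, S_cols)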
>
> However, given the number of non-zero elements in the resulting
> matrix, it seems I could not afford to first perform the full
> multiplication (N x N) and then prune it afterward (N x K). So I need
> a way to prune it on the fly.
>
> The original algorithm I came up with is roughly this, for an input matrix
> M:
>
> import numpy as np
>
> # M is a scipy.sparse matrix; CSR for row access, CSC for column access
> M_csr, M_csc = M.tocsr(), M.tocsc()
> for i in range(N):
>     outer_i = np.zeros(N)
>     for j in M_csr.getrow(i).indices:          # nonzero elements of row i
>         for k in M_csc.getcol(j).indices:      # nonzero elements of col j
>             outer_i[k] += M_csr[i, j] * M_csr[k, j]
>     nearest_i = np.argsort(outer_i)[-K:]       # sort and keep best K
>
> which can be parallelized in an "embarrassing" way, i.e. each compute
> node can simply process a slice of the rows.
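>
> For instance, a rough, untested sketch (top_k_row stands in for the
> per-row loop above):
>
> from multiprocessing import Pool
> import numpy as np
>
> def top_k_slice(row_indices):
>     # hypothetical wrapper: run the loop above for each row in the slice
>     return [(i, top_k_row(i)) for i in row_indices]
>
> with Pool(8) as pool:                          # 8 workers, just an example
>     slices = np.array_split(np.arange(N), 8)
>     results = pool.map(top_k_slice, slices)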
>
> Would there be a way to do something similar (or related) with Spark?
>
> Christian
>
