Maybe I should add: if you can hold the entire matrix in memory, then this is embarrassingly parallel. If not, then the complications arise.
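For the in-memory case, I mean roughly something like the following (an untested sketch; the toy matrix, K, numSlices, and the SparkContext name are just placeholders, and I'm assuming the rows are already L2-normalized as in Christian's setup, so a dot product is the similarity score):

import numpy as np
from scipy.sparse import csr_matrix
from pyspark import SparkContext

sc = SparkContext(appName="knn-sketch")   # or reuse an existing context

# Toy stand-in for the real data; rows L2-normalized so dot = similarity.
rng = np.random.RandomState(0)
D = rng.rand(1000, 200) * (rng.rand(1000, 200) < 0.05)
D /= np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1e-12)
M = csr_matrix(D)
K = 10

M_bc = sc.broadcast(M)                    # ship the whole matrix to every worker

def top_k_for_rows(rows):
    A = M_bc.value
    for i in rows:
        sims = A[i].dot(A.T).toarray().ravel()   # similarities of row i to every row
        sims[i] = -np.inf                        # drop the self-match
        best = np.argpartition(-sims, K)[:K]     # indices of the K largest, unordered
        yield i, sorted(best, key=lambda j: -sims[j])

nearest = (sc.parallelize(range(M.shape[0]), numSlices=32)
             .mapPartitions(top_k_for_rows)
             .collect())

The only cost beyond the dot products is shipping M once to each executor; if M doesn't fit, this falls apart and you're back to something like the blocked broadcast I described below.
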
On Wed, May 28, 2014 at 1:00 PM, Tom Vacek <minnesota...@gmail.com> wrote:

> The problem with matrix multiplication is that the amount of data blows up
> between the mapper and the reducer, and the shuffle operation is very slow.
> I have not ever tried this, but the shuffle can be avoided by making use of
> the broadcast. Say we have M = L*R. We do a column decomposition on R, and
> we collect rows of L to the master and broadcast them (in manageably-sized
> blocks). Each worker does a dot product and discards the row block when
> finished. In theory, this has complexity max(nnz(L)*log p, nnz(L)*n/p). I
> have to warn though: when I played with matrix multiplication, I was
> getting nowhere near serial performance.
>
>
> On Wed, May 28, 2014 at 11:00 AM, Christian Jauvin <cjau...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm new to Spark and Hadoop, and I'd like to know if the following
>> problem is solvable in terms of Spark's primitives.
>>
>> To compute the K-nearest neighbours of an N-dimensional dataset, I can
>> multiply my very large normalized sparse matrix by its transpose. As
>> this yields all pairwise distance values (N x N), I can then sort each
>> row and only keep the K highest elements for each, resulting in an N x
>> K dense matrix.
>>
>> As this Quora answer suggests:
>>
>> http://qr.ae/v03lY
>>
>> rather than the row-wise dot product, which would be O(N^2), it's
>> better to compute the sum of the column outer products, which is O(N x
>> K^2).
>>
>> However, given the number of non-zero elements in the resulting
>> matrix, it seems I could not afford to first perform the full
>> multiplication (N x N) and then prune it afterward (N x K). So I need
>> a way to prune it on the fly.
>>
>> The original algorithm I came up with is roughly this, for an input
>> matrix M:
>>
>> for each row i:
>>     outer_i = [0] * N
>>     for j in nonzero elements of row i:
>>         for k in nonzero elements of col j:
>>             outer_i[k] += M[i][j] * M[k][j]
>>     nearest_i = {sort outer_i and keep best K}
>>
>> which can be parallelized in an "embarrassing" way, i.e. each compute
>> node can simply process a slice of the rows.
>>
>> Would there be a way to do something similar (or related) with Spark?
>>
>> Christian
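
And for the case where the matrix doesn't fit, here is roughly what I meant above by broadcasting rows of L in blocks against a column decomposition of R (again an untested sketch; the shapes, block sizes, and names are placeholders, and I'm using dense arrays only to keep it short):

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="block-matmul-sketch")   # or reuse an existing context

n, k, m = 2000, 100, 3000                 # toy sizes: L is n x k, R is k x m
L = np.random.rand(n, k)
R = np.random.rand(k, m)

# Column decomposition of R: block b owns the columns in col_blocks[b].
col_blocks = np.array_split(np.arange(m), 8)
R_rdd = sc.parallelize([(b, R[:, cols]) for b, cols in enumerate(col_blocks)],
                       numSlices=len(col_blocks)).cache()

row_block = 500
products = []
for start in range(0, n, row_block):
    L_bc = sc.broadcast(L[start:start + row_block])      # one row block of L
    # Every worker multiplies the broadcast row block by its column slice;
    # nothing is shuffled, only the small per-block results come back.
    block = R_rdd.map(lambda t: (t[0], L_bc.value.dot(t[1]))).collect()
    products.append((start, dict(block)))
    L_bc.unpersist()                                      # discard the row block

Sparse matrices would slot in the same way with scipy.sparse; the point is only that the product is built from broadcasts and collects rather than a shuffle.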