Thus the right approach is to transpose first, then do an in-memory "map" (in
parallel) over the rows of the result, outer multiplying each and sending
the rows of this outer product to the "reduce" step, which sums these rows
(again in parallel; since they're independent there should be no
synchronization cost).
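
To make that concrete, here is a minimal sketch in plain numpy - not Mahout
code, just an illustration of the map/reduce structure, written for a generic
product A * B of small dense arrays (the function and variable names are my
own, purely for illustration):

from collections import defaultdict
import numpy as np

def map_phase(a_row, b_row):
    # a_row is one row of A' (i.e. one column of A); emit every row of the
    # outer product a_row * b_row', keyed by its row index in the result
    outer = np.outer(a_row, b_row)
    return list(enumerate(outer))

def reduce_phase(emitted, n_rows, n_cols):
    # sum the emitted rows per key; the keys are independent, so the
    # reducers need no synchronization with one another
    acc = defaultdict(lambda: np.zeros(n_cols))
    for i, row in emitted:
        acc[i] = acc[i] + row
    return np.vstack([acc[i] for i in range(n_rows)])

def matmul(A, B):
    At = A.T                                   # transpose first
    emitted = [p for a, b in zip(At, B)        # "map": one outer product
               for p in map_phase(a, b)]       #   per row of A'
    return reduce_phase(emitted, A.shape[0], B.shape[1])   # equals A @ B

# hypothetical toy data, just to check the sketch against numpy's own product
A = np.array([[1., 2.], [3., 4.], [5., 6.]])
B = np.array([[1., 0., 2.], [0., 1., 1.]])
assert np.allclose(matmul(A, B), A @ B)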

Even though it's in-memory, it's still classic MapReduce, and well suited to
parallelism (although it's a good example of a nicely parallel algorithm
which is *not* "embarrassingly parallel" - both the map and the reduce are
nontrivial and different).

Now that I think about it further, *self* multiplication (i.e. A * A'), where
the entries are all +/- 1, lends itself to even more optimization chances:
the rows of the outer product matrix (produced in the Map phase, for each
column of the input matrix) are all equal to either that column (as a row
vector) or its negative.  Calculating these is just changing 16 signs, not
256 multiplications.
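
A rough sketch of that sign trick, again in plain numpy with a hypothetical
16-entry +/-1 column just for illustration (outer_pm1 is my own name, not
anything in Mahout):

import numpy as np

def outer_pm1(v):
    # v is a +/-1 vector (e.g. a 16-entry column); row i of the outer
    # product v * v' is v itself when v[i] == +1 and -v when v[i] == -1,
    # so building it is 16 sign choices rather than 256 multiplications
    return np.array([v if s > 0 else -v for s in v])

v = np.array([1, -1, 1, 1, -1, 1, -1, -1, 1, 1, -1, 1, 1, -1, 1, -1])
assert np.array_equal(outer_pm1(v), np.outer(v, v))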

  -jake

On Jun 26, 2011 6:51 PM, "Ted Dunning" <[email protected]> wrote:

And going down the columns in a sparse matrix could do this to you.

On Sun, Jun 26, 2011 at 6:40 PM, Jake Mannix <[email protected]> wrote:
> On Sun, Jun 26, 2011...
