Thus the right approach is to transpose first, then do an in-memory "map" (in parallel) over the rows of the result, outer-multiplying each one and sending the rows of the resulting outer product to the "reduce" step, which sums these rows (again in parallel; since they're independent there should be no synchronization cost).
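[A minimal sketch of this transpose / map / outer-product / reduce-sum scheme, written for the self-multiplication case (A * A') discussed further down; the class name and the parallel-streams approach are illustrative, not from the thread:]

    import java.util.Arrays;
    import java.util.stream.IntStream;

    public class SelfOuterProductSum {

        // Compute A * A' as the sum, over the columns of A (i.e. the rows of the
        // transpose), of each column's outer product with itself.
        static double[][] selfMultiply(double[][] a) {
            int m = a.length;       // rows of A
            int n = a[0].length;    // columns of A

            // "Transpose first": the rows of at are the columns of A.
            double[][] at = new double[n][m];
            for (int i = 0; i < m; i++)
                for (int j = 0; j < n; j++)
                    at[j][i] = a[i][j];

            double[][] result = new double[m][m];

            // Parallelism is over the output rows: each task owns exactly one row of
            // the result, so the "reduce" accumulation needs no synchronization.
            IntStream.range(0, m).parallel().forEach(i -> {
                for (double[] col : at) {          // "map" over the rows of A'
                    double scale = col[i];         // row i of (col (x) col) is scale * col
                    for (int k = 0; k < m; k++)
                        result[i][k] += scale * col[k];
                }
            });
            return result;
        }

        public static void main(String[] args) {
            double[][] a = {{1, -1}, {-1, 1}, {1, 1}};
            System.out.println(Arrays.deepToString(selfMultiply(a)));
        }
    }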
Even though it's in-memory, it's still classic MapReduce, and well suited to parallelism (although it's a good example of a nicely parallel algorithm which is *not* "embarrassingly parallel" - both the map and the reduce are nontrivial and different).

Now that I think about it further, *self* multiplication (i.e. A * A'), where the entries are all +/- 1, lends itself to even more optimization chances: the rows of the outer product matrix (produced in the Map phase, for each column of the input matrix) are all equal to either that column itself or its negative. Calculating these is just flipping 16 signs, not doing 256 multiplications (sketched below).

  -jake

On Jun 26, 2011 6:51 PM, "Ted Dunning" <[email protected]> wrote:

And going down the columns in a sparse matrix could do this to you.

On Sun, Jun 26, 2011 at 6:40 PM, Jake Mannix <[email protected]> wrote:

> On Sun, Jun 26, 2011...
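[A minimal sketch of that +/- 1 shortcut, assuming 16-element columns whose entries are restricted to +1/-1; the class and method names are illustrative only:]

    import java.util.Arrays;

    public class SignedOuterProduct {

        // When every entry of the column is +1 or -1, row i of the outer product
        // (col (x) col) is either col itself (when col[i] == +1) or its negation
        // (when col[i] == -1), so building the 16x16 outer product is 16 sign
        // decisions rather than 256 multiplications.
        static int[][] signedOuterProduct(int[] col) {
            int m = col.length;                             // e.g. 16
            int[] neg = new int[m];
            for (int k = 0; k < m; k++) neg[k] = -col[k];   // one negation pass, reused below

            int[][] outer = new int[m][m];
            for (int i = 0; i < m; i++)
                outer[i] = (col[i] > 0) ? col.clone() : neg.clone();
            return outer;
        }

        public static void main(String[] args) {
            int[] col = {1, -1, -1, 1};
            System.out.println(Arrays.deepToString(signedOuterProduct(col)));
        }
    }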
