I tested the performance of the threaded matrix multiplication on one machine; my results are below. I can multiply large matrices, but the running time is proportional to the number of rows in matrix A, and the problem size is limited by how much memory the machine has. I was able to get the product of 2 matrices with 1,000,000 non-zero elements each in about 30 minutes.
My implementation of sparse matrix multiplication is similar to what Jake describes. During "load" I store each matrix as 2 OpenIntObjectHashMaps of Vectors, one keyed by row and one by column. Though this uses 2x as much memory per matrix, it works well for sparse matrices. For the multiplication, I queue up the (row, column) tuples and distribute them to be processed across threads. The results are a bit inconsistent, but I think this is an artifact of the random sparse matrix generator I used.

Concluding remarks: The amount of time a calculation takes is determined largely by the number of rows in the matrix (because each row is handled by a thread in the system). The implementation scales well with large numbers of columns.

====================== Details Below ======================

Here is how I generated my results:
1) Choose 2 dimensions and a density
2) Create 2 random matrices, Matrix A (row x column) and Matrix B (column x row)
3) Multiply A x B

I tested its ability to scale by:
- Scaling up the dimensions of the matrix while keeping the density the same (the number of non-zero elements still increases)
- Scaling up the dimensions of the matrix while keeping the number of non-zero elements the same
- Scaling up the number of non-zero elements for a large, fixed dimension
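The row/column dual-map storage and per-row threading described above can be sketched roughly as follows. This is a minimal sketch, not the actual implementation: it uses plain HashMaps in place of Colt's OpenIntObjectHashMap and Mahout Vectors, and the class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.*;

// Sketch of the scheme described above: A is stored by rows, B by columns,
// so each entry of the product is a sparse dot product, and each row of the
// result is computed as an independent task on a thread pool.
public class SparseMultiplySketch {

    // Sparse vector as a column-index -> value map; iterate the smaller map.
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        Map<Integer, Double> small = a.size() <= b.size() ? a : b;
        Map<Integer, Double> big = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double v = big.get(e.getKey());
            if (v != null) sum += e.getValue() * v;
        }
        return sum;
    }

    // C = A * B, given the rows of A and the columns of B.
    static Map<Integer, Map<Integer, Double>> multiply(
            Map<Integer, Map<Integer, Double>> rowsOfA,
            Map<Integer, Map<Integer, Double>> colsOfB,
            int threads) throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<Integer, Future<Map<Integer, Double>>> futures = new HashMap<>();
        for (Map.Entry<Integer, Map<Integer, Double>> row : rowsOfA.entrySet()) {
            Map<Integer, Double> aRow = row.getValue();
            // One task per row of A; each task writes only its own output row.
            futures.put(row.getKey(), pool.submit(() -> {
                Map<Integer, Double> cRow = new HashMap<>();
                for (Map.Entry<Integer, Map<Integer, Double>> col : colsOfB.entrySet()) {
                    double v = dot(aRow, col.getValue());
                    if (v != 0.0) cRow.put(col.getKey(), v);
                }
                return cRow;
            }));
        }
        Map<Integer, Map<Integer, Double>> result = new HashMap<>();
        for (Map.Entry<Integer, Future<Map<Integer, Double>>> f : futures.entrySet()) {
            result.put(f.getKey(), f.getValue().get());
        }
        pool.shutdown();
        return result;
    }
}
```

Because each task touches only its own output row, no synchronization between threads is needed, which matches the observation that running time tracks the number of rows in A.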
=================================================
Started by increasing row dimensions
Making: 10000 10 that are 0.01 full
Non Zero Elements: 1000
Non Zero Elements: 1000
Took this long to make the random matrices: 47
Using 16 processors
Rows: 10000 Columns: 10
Finished In Time
Finished and Took: 4509
=================================================
Non Zero Elements: 10000
Non Zero Elements: 10000
Took this long to make the random matrices: 167
Using 16 processors
Rows: 100000 Columns: 10
Finished In Time
Finished and Took: 419494 ~ 6.99156667 minutes
=================================================
Making: 100000 100 that are 0.01 full
Non Zero Elements: 100000
Non Zero Elements: 100000
Took this long to make the random matrices: 284
Using 16 processors
Rows: 100000 Columns: 100
Finished In Time
Finished and Took: 328268 ~ 5.47113333 minutes
=================================================
Making: 100000 1000 that are 0.01 full
Non Zero Elements: 1000000
Non Zero Elements: 1000000
Took this long to make the random matrices: 4349
Using 16 processors
Rows: 100000 Columns: 1000
Finished In Time
Finished and Took: 1408320 ~ 23.47200 minutes (Need Lots of Ram)
=================================================
Started scaling up column dimensions here (constant non-zeros)
Making: 100000 1000 that are 0.0010 full
Non Zero Elements: 100000
Non Zero Elements: 100000
Took this long to make the random matrices: 338
Using 16 processors
Rows: 100000 Columns: 1000
Finished In Time
Finished and Took: 339435 ~ 5.65725 minutes
=================================================
Making: 100000 10000 that are 1.0E-4 full
Non Zero Elements: 100000
Non Zero Elements: 100000
Took this long to make the random matrices: 270
Using 16 processors
Rows: 100000 Columns: 10000
Finished In Time
Finished and Took: 386642 ~ 6.44403333 minutes
=================================================
Making: 100000 100000 that are 1.0E-5 full
Non Zero Elements: 100000
Non Zero Elements: 100000
Took this long to make the random matrices: 273
Using 16 processors
Rows: 100000 Columns: 100000
Finished In Time
Finished and Took: 412046 ~ 6.86743333 minutes
=================================================
Making: 100000 1000000 that are 1.0E-6 full
Non Zero Elements: 99999
Non Zero Elements: 100000
Took this long to make the random matrices: 376
Using 16 processors
Rows: 100000 Columns: 1000000
Finished In Time
Finished and Took: 352324 ~ 5.87206667 minutes
=================================================
Making: 100000 10000000 that are 1.0E-7 full
Non Zero Elements: 100000
Non Zero Elements: 100000
Took this long to make the random matrices: 337
Using 16 processors
Rows: 100000 Columns: 10000000
Finished In Time
Finished and Took: 411278 ~ 6.85463333 minutes
=================================================
Started scaling up density here
Making: 100000 100000 that are 2.0E-5 full
Non Zero Elements: 200000
Non Zero Elements: 200000
Took this long to make the random matrices: 531
Using 16 processors
Rows: 100000 Columns: 100000
Finished In Time
Finished and Took: 375450 ~ 6.2575 minutes
=================================================
Making: 100000 100000 that are 3.0E-5 full
Non Zero Elements: 300000
Non Zero Elements: 300000
Took this long to make the random matrices: 706
Using 16 processors
Rows: 100000 Columns: 100000
Finished In Time
Finished and Took: 411555 ~ 6.85925 minutes
=================================================
Making: 100000 100000 that are 4.0E-5 full
Non Zero Elements: 400000
Non Zero Elements: 400000
Took this long to make the random matrices: 897
Using 16 processors
Rows: 100000 Columns: 100000
Finished In Time
Finished and Took: 509539 ~ 8.49231667 minutes

On Tue, Jun 28, 2011 at 4:31 PM, Jeff Hansen <[email protected]> wrote:

> I think you've just described the approach laid out in this document --
>
> http://www.google.com/url?sa=t&source=web&cd=1&ved=0CB8QFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.144.9712%26rep%3Drep1%26type%3Dpdf&ei=WfAJTtr8Ju6FsgKz6cyeAQ&usg=AFQjCNFxQPXXnYw93eBpUknGy2iZJESmHw&sig2=zFxncZg4n7b_mueF4mbGPQ
>
> There's one other trick you could employ -- since the resulting matrix is
> symmetric, you really only have to bother with half of the multiplications
> and sums (plus the diagonal). After all, if A*A' = B, then Bij = Bji. So
> if you only calculate Bij for i <= j, you can flip the matrix to fill in
> the other half. That means when you're doing your cross product in the
> mapper phase you've mentioned, you only need to do half the cross
> multiplications:
>
> for (int j = i; j <= n; j++) emit(i, j, a[i] * a[j]);
>
> Of course this is all based on the assumption that we're multiplying a
> matrix with itself transposed.
>
> On Sun, Jun 26, 2011 at 10:34 PM, Jake Mannix <[email protected]> wrote:
>
>> Thus the right approach is to transpose first, then do an in-memory "map"
>> (in parallel) over the rows of the result, outer multiplying each and
>> sending the rows of this outer product to the "reduce" step, which sums
>> these rows (again in parallel; since they're independent there should be
>> no synchronization cost).
>>
>> Even though it's in-memory, it's still classic MapReduce, and well suited
>> to parallelism (although it's a good example of a nicely parallel
>> algorithm which is *not* "embarrassingly parallel" -- both the map and
>> reduce are nontrivial and different).
>>
>> Now that I think about it further, *self* multiplication (i.e. A * A'),
>> where the entries are all +/- 1, lends itself to even more optimization
>> chances: the rows of the outer product matrix (produced in the Map phase,
>> for each column of the input matrix) are all equal to either that row or
>> its negative. Calculating these is just changing 16 signs, not 256
>> multiplications.
>>
>> -jake
>>
>> On Jun 26, 2011 6:51 PM, "Ted Dunning" <[email protected]> wrote:
>>
>> And going down the columns in a sparse matrix could do this to you.
>>
>> On Sun, Jun 26, 2011 at 6:40 PM, Jake Mannix <[email protected]>
>> wrote:
>> > On Sun, Jun 26, 2011...
>> >
