I tested the performance of the threaded matrix multiplication on one
machine and my results are provided below. I can multiply large
matrices, however the running time is proportional to the number of
rows in matrix A and is limited by how much memory the machine has. I
was able to get the product of 2 matrices with 1,000,000 non-zero
elements each in about 30 minutes.

My implementation of sparse matrix multiplication is similar to what
Jake describes. During "load" I store each matrix as 2
OpenIntObjectHashMaps of Vectors, one keyed by row and one by column.
Though this uses up 2x as much memory per matrix, it works well for
sparse matrices. For the multiplication, I queue up the tuples and
distribute them to be processed on threads.
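For reference, here is a minimal sketch of the scheme described above. It uses plain java.util.HashMaps standing in for Colt's OpenIntObjectHashMap, and a fixed thread pool standing in for my actual queueing code; class and variable names are illustrative, not taken from the real implementation.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch: store the sparse matrix twice (keyed by row and by column),
// then hand each row of A to a thread pool as one multiplication task.
public class SparseMultSketch {
    // row index -> sparse vector (column index -> value); plain HashMaps
    // stand in for the OpenIntObjectHashMaps of Vectors described above.
    static Map<Integer, Map<Integer, Double>> rowsA = new HashMap<>();
    static Map<Integer, Map<Integer, Double>> colsB = new HashMap<>();

    // Sparse dot product: iterate the smaller map, probe the larger one.
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        if (a.size() > b.size()) { Map<Integer, Double> t = a; a = b; b = t; }
        double sum = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double v = b.get(e.getKey());
            if (v != null) sum += e.getValue() * v;
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        // Tiny example: A = [[1,2],[0,3]], B = [[4,0],[5,6]]
        rowsA.put(0, new HashMap<>(Map.of(0, 1.0, 1, 2.0)));
        rowsA.put(1, new HashMap<>(Map.of(1, 3.0)));
        colsB.put(0, new HashMap<>(Map.of(0, 4.0, 1, 5.0))); // column 0 of B
        colsB.put(1, new HashMap<>(Map.of(1, 6.0)));         // column 1 of B

        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        Map<Integer, Map<Integer, Double>> product = new ConcurrentHashMap<>();

        // One task per row of A: compute that row of A*B.
        // A*B here is row 0 = [14, 12], row 1 = [15, 18].
        for (Map.Entry<Integer, Map<Integer, Double>> row : rowsA.entrySet()) {
            pool.submit(() -> {
                Map<Integer, Double> out = new HashMap<>();
                for (Map.Entry<Integer, Map<Integer, Double>> col : colsB.entrySet()) {
                    double v = dot(row.getValue(), col.getValue());
                    if (v != 0) out.put(col.getKey(), v);
                }
                product.put(row.getKey(), out);
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(product.get(0) + " " + product.get(1));
    }
}
```

One task per row is also why the running time tracks the row count of A: the work queue length equals the number of rows.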

The results are a bit inconsistent, but I think this is an artifact
of the random sparse matrix generator I used.

Concluding remarks:
The amount of time a calculation takes is largely determined by the
number of rows in the matrix. (This is because each row becomes a
thread in the system.)
The implementation scales well with large numbers of columns.


======================
Details Below
======================

Here is how I generated my results:
1) Choose 2 dimensions and a density
2) Create 2 random matrices, Matrix A (row x column) and Matrix B (column x row)
3) Multiply AxB
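A rough sketch of the generation step in miniature (the class name and sampling strategy here are assumptions, not my actual generator; note that sampling positions uniformly and overwriting duplicates makes the non-zero count approximate, which could account for some of the inconsistency mentioned above):

```java
import java.util.*;

// Sketch of steps 1-2: build a random sparse matrix of the given density
// as a map of row -> (column -> value). About rows*cols*density entries
// are placed at uniformly random positions; colliding positions are
// overwritten, so the final non-zero count can fall slightly short.
public class RandomSparse {
    static Map<Integer, Map<Integer, Double>> make(int rows, int cols,
                                                   double density, long seed) {
        Random rnd = new Random(seed);
        long target = Math.round((double) rows * cols * density);
        Map<Integer, Map<Integer, Double>> m = new HashMap<>();
        for (long i = 0; i < target; i++) {
            int r = rnd.nextInt(rows);
            int c = rnd.nextInt(cols);
            m.computeIfAbsent(r, k -> new HashMap<>()).put(c, rnd.nextDouble());
        }
        return m;
    }

    public static void main(String[] args) {
        // Steps 1-3 in miniature: A is (rows x cols), B is (cols x rows).
        Map<Integer, Map<Integer, Double>> a = make(1000, 100, 0.01, 42);
        Map<Integer, Map<Integer, Double>> b = make(100, 1000, 0.01, 43);
        long nnz = a.values().stream().mapToLong(Map::size).sum();
        System.out.println("A non-zero elements: " + nnz); // close to 1000
    }
}
```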

I tested its ability to scale by:
- Scaling up the dimensions of the matrix while keeping the density
the same (the number of non-zero elements still increases)
- Scaling up the dimensions of the matrix while keeping the number of
non-zero elements the same
- Scaling up the number of non-zero elements for a large, fixed dimension.
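For the constant-non-zero sweep, the density has to shrink as the column dimension grows; the densities in the log below follow from density = nnz / (rows * cols). A quick sketch of that arithmetic (names are illustrative):

```java
// The constant-nnz sweep: keep 100,000 non-zeros per matrix while the
// column dimension grows from 1,000 to 10,000,000, so the density drops
// from 1.0E-3 down to 1.0E-7 (matching the "that are ... full" lines).
public class DensitySweep {
    public static void main(String[] args) {
        long rows = 100_000, nnz = 100_000;
        for (long cols = 1_000; cols <= 10_000_000; cols *= 10) {
            double density = (double) nnz / ((double) rows * cols);
            System.out.printf("cols=%d density=%s%n", cols, density);
        }
    }
}
```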


=================================================
Started by increasing row dimensions
Making: 10000 10 that are 0.01 full
Non Zero Elements: 1000
Non Zero Elements: 1000
Took this long to make the random matrices: 47
Using 16 processors
Rows: 10000
Columns: 10
Finished In Time
Finished and Took: 4509
=================================================
Non Zero Elements: 10000
Non Zero Elements: 10000
Took this long to make the random matrices: 167
Using 16 processors
Rows: 100000
Columns: 10
Finished In Time
Finished and Took: 419494 ~ 6.99156667 minutes
=================================================
Making: 100000 100 that are 0.01 full
Non Zero Elements: 100000
Non Zero Elements: 100000
Took this long to make the random matrices: 284
Using 16 processors
Rows: 100000
Columns: 100
Finished In Time
Finished and Took: 328268 ~ 5.47113333 minutes
=================================================
Making: 100000 1000 that are 0.01 full
Non Zero Elements: 1000000
Non Zero Elements: 1000000
Took this long to make the random matrices: 4349
Using 16 processors
Rows: 100000
Columns: 1000
Finished In Time
Finished and Took: 1408320 ~ 23.47200 minutes (Need Lots of Ram)
=================================================
Started scaling up column dimensions here (constant non-zeros)
Making: 100000 1000 that are 0.0010 full
Non Zero Elements: 100000
Non Zero Elements: 100000
Took this long to make the random matrices: 338
Using 16 processors
Rows: 100000
Columns: 1000
Finished In Time
Finished and Took: 339435 ~ 5.65725 minutes
=================================================
Making: 100000 10000 that are 1.0E-4 full
Non Zero Elements: 100000
Non Zero Elements: 100000
Took this long to make the random matrices: 270
Using 16 processors
Rows: 100000
Columns: 10000
Finished In Time
Finished and Took: 386642 ~ 6.44403333 minutes
=================================================
Making: 100000 100000 that are 1.0E-5 full
Non Zero Elements: 100000
Non Zero Elements: 100000
Took this long to make the random matrices: 273
Using 16 processors
Rows: 100000
Columns: 100000
Finished In Time
Finished and Took: 412046 ~  6.86743333 minutes
=================================================
Making: 100000 1000000 that are 1.0E-6 full
Non Zero Elements: 99999
Non Zero Elements: 100000
Took this long to make the random matrices: 376
Using 16 processors
Rows: 100000
Columns: 1000000
Finished In Time
Finished and Took: 352324 ~ 5.87206667 minutes
=================================================
Making: 100000 10000000 that are 1.0E-7 full
Non Zero Elements: 100000
Non Zero Elements: 100000
Took this long to make the random matrices: 337
Using 16 processors
Rows: 100000
Columns: 10000000
Finished In Time
Finished and Took: 411278 ~ 6.85463333 minutes
=================================================
Started scaling up density here
Making: 100000 100000 that are 2.0E-5 full
Non Zero Elements: 200000
Non Zero Elements: 200000
Took this long to make the random matrices: 531
Using 16 processors
Rows: 100000
Columns: 100000
Finished In Time
Finished and Took: 375450 ~ 6.2575 minutes
=================================================
Making: 100000 100000 that are 3.0E-5 full
Non Zero Elements: 300000
Non Zero Elements: 300000
Took this long to make the random matrices: 706
Using 16 processors
Rows: 100000
Columns: 100000
Finished In Time
Finished and Took: 411555 ~ 6.85925 minutes
=================================================
Making: 100000 100000 that are 4.0E-5 full
Non Zero Elements: 400000
Non Zero Elements: 400000
Took this long to make the random matrices: 897
Using 16 processors
Rows: 100000
Columns: 100000
Finished In Time
Finished and Took: 509539 ~ 8.49231667 minutes



On Tue, Jun 28, 2011 at 4:31 PM, Jeff Hansen <[email protected]> wrote:
> I think you've just described the approach laid out in this document --
> http://www.google.com/url?sa=t&source=web&cd=1&ved=0CB8QFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.144.9712%26rep%3Drep1%26type%3Dpdf&ei=WfAJTtr8Ju6FsgKz6cyeAQ&usg=AFQjCNFxQPXXnYw93eBpUknGy2iZJESmHw&sig2=zFxncZg4n7b_mueF4mbGPQ
>
> There's one other trick you could employ -- since the resulting matrix is
> symmetric, you really only have to bother with half of the multiplications
> and sums (plus the diagonal).  After all, if A*A' = B, then Bij = Bji.  So
> if you only calculate Bij for i <= j, you can flip the matrix to fill in the
> other half.  That means when you're doing your cross product in the mapper
> phase you've mentioned, you only need to do half the cross multiplications:
>
> for(int j = i; j < n; j++) emit i, j, i*j;
>
> Of course this is all based on the assumption that we're multiplying a
> matrix with itself transposed.
>
> On Sun, Jun 26, 2011 at 10:34 PM, Jake Mannix <[email protected]> wrote:
>
>> Thus the right approach is transpose first, then do an in-memory "map" (in
>> parallel) over the rows of the result, outer multiplying each and sending
>> the rows of this outer product to the "reduce" step, which sums (again in
>> parallel; since they're independent there should be no synchronization cost)
>> these rows.
>>
>> Even though it's in-memory, it's still classic MapReduce, and well suited to
>> parallelism (although it's a good example of a nicely parallel algorithm
>> which is *not* "embarrassingly parallel" - both the map and reduce are
>> nontrivial and different).
>>
>> Now that I think about it further, *self* multiplication (ie A * A'), where
>> the entries are all +/- 1 lends itself to even more optimization chances:
>> the rows of the outer product matrix (produced in the Map phase, for each
>> column of the input matrix) are all equal to either that row, or its
>> negative.  Calculating these is just changing 16 signs, not 256
>> multiplications.
>>
>>  -jake
>>
>> On Jun 26, 2011 6:51 PM, "Ted Dunning" <[email protected]> wrote:
>>
>> And going down the columns in a sparse matrix could do this to you.
>>
>> On Sun, Jun 26, 2011 at 6:40 PM, Jake Mannix <[email protected]>
>> wrote:
>> > On Sun, Jun 26, 2011...
>>
>
