Thanks for your reply.

1. The version of Mahout I'm using is 0.9.
2. DistributedRowMatrix is MapReduce based, and the Mapper and Reducer implementations are in the MatrixMultiplicationJob class.

I googled it for a while, and my conclusion is that matrix multiplication based on DistributedRowMatrix can run on only one mapper because of CompositeInputFormat.
http://stackoverflow.com/questions/8654200/hadoop-file-splits-compositeinputformat-inner-join

The author of this post thinks that manually splitting the matrix into multiple files could solve this problem, but he is wrong.
http://comments.gmane.org/gmane.comp.apache.mahout.user/15550
For example, if folder A contains matrices a and b, and folder B contains matrices c and d, there will be 2 mappers, but the result will be a*c+b*d rather than concat(a, b)*concat(c, d).


On 2014-6-17 15:40, Suneel Marthi wrote:
DRM is not for demo and is used across several Mahout jobs like
RowSimilarityJob etc...

a) What's the Mahout version u r working off of?
b) Have u tried using MatrixMultiplicationJob which is MapReduce based?


On Tue, Jun 17, 2014 at 3:05 AM, Han Fan <[email protected]> wrote:

I have a 6kx10k matrix T and I need the result of T'*T, which should be
10kx10k. I want to do this using Mahout DistributedRowMatrix, but I found
that Hadoop calculates with only one mapper, which is very slow.
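
For reference, T'*T can be written as a sum of outer products of the rows of T, so blocks of rows can in principle be processed independently and their partial results added. The following is a minimal sketch in plain Python, not Mahout code: the function names and the small 4x3 matrix are made up for illustration, and it says nothing about how CompositeInputFormat actually pairs input files.

```python
def transpose_times_self(rows):
    """Return T' * T for a matrix given as a list of rows.

    Accumulates one outer product per row: T'T = sum_i (t_i' * t_i).
    """
    n = len(rows[0])
    result = [[0.0] * n for _ in range(n)]
    for row in rows:
        for i, ri in enumerate(row):
            if ri == 0:
                continue  # skip zero entries, as a sparse implementation would
            for j, rj in enumerate(row):
                result[i][j] += ri * rj
    return result


def add_matrices(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


# Hypothetical small stand-in for the 6kx10k matrix T.
T = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0],
     [4.0, 0.0, 1.0],
     [2.0, 2.0, 2.0]]

# Whole matrix in one pass (analogous to a single mapper).
full = transpose_times_self(T)

# Two row blocks processed separately, then summed (analogous to two mappers).
partial = add_matrices(transpose_times_self(T[:2]), transpose_times_self(T[2:])) 

assert full == partial
```

The result is 3x3 here, matching the 10kx10k shape expected for the real matrix: the output size depends only on the number of columns.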

I dug into the source code of DistributedRowMatrix and found that the
input format of DistributedRowMatrix is CompositeInputFormat.class, which
has a method named getSplits that sets mapred.min.split.size to
Long.MAX_VALUE.

So my question is: is DistributedRowMatrix only a demo to show that
matrix multiplication can be done using MapReduce, with no practical
value? Is there any way to do matrix multiplication quickly using Hadoop?

Thanks for your time and sorry for my broken English.




