Re: Mahout DistributedRowMatrix run with only one mapper

Han Fan Tue, 17 Jun 2014 02:20:13 -0700

Thanks for your reply.

1. The version of Mahout I'm using is 0.9.

2. DistributedRowMatrix is MapReduce based, and the implementation code of the Mapper and Reducer is in the MatrixMultiplicationJob class.

I googled it for a while and my conclusion is that matrix multiplication based on DistributedRowMatrix can run on only one mapper because of CompositeInputFormat.

http://stackoverflow.com/questions/8654200/hadoop-file-splits-compositeinputformat-inner-join

The author of this post thinks that manually splitting the matrix into multiple files could solve this problem, but he is wrong.

http://comments.gmane.org/gmane.comp.apache.mahout.user/15550

For example, if folder A contains matrices a and b and folder B contains matrices c and d, there will be 2 maps, but the result will be a*c+b*d rather than concat(a, b)*concat(c, d).
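
To make the two expressions concrete, here is a minimal sketch in plain Python. It rests on assumptions that are not stated in this thread: a and b are taken to be row blocks of the left matrix, c and d the matching row blocks of the right matrix, and the product is the transposed one (A'*B), as in the T'*T case from the original question. The helper names are illustrative only and are not Mahout API. Other split layouts may behave differently.

```python
# Illustrative helpers only; none of these names come from Mahout.

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Plain dense matrix product x * y."""
    yt = transpose(y)
    return [[sum(p * q for p, q in zip(row, col)) for col in yt] for row in x]

def add(x, y):
    """Element-wise sum of two matrices of the same shape."""
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(x, y)]

def transposed_product(x, y):
    """Compute x' * y (assumed here to be the product of interest)."""
    return matmul(transpose(x), y)

# Assumed layout: folder A holds row blocks a and b, folder B holds c and d.
a = [[1, 2], [3, 4]]
b = [[5, 6]]
c = [[1, 0], [0, 1]]
d = [[2, 2]]

# One partial product per pair of files, summed afterwards.
per_split = add(transposed_product(a, c), transposed_product(b, d))

# Product of the vertically stacked matrices.
stacked = transposed_product(a + b, c + d)

print("per-split sum:", per_split)
print("stacked product:", stacked)
print("match:", per_split == stacked)
```

Running this on a small case before splitting real input files shows whether a given layout preserves the product that is actually wanted.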



On 2014-6-17 15:40, Suneel Marthi wrote:

DRM is not for demo and is used across several Mahout jobs like
RowSimilarityJob etc...

a) What's the Mahout version u r working off of?
b) Have u tried using MatrixMultiplicationJob which is MapReduce based?


On Tue, Jun 17, 2014 at 3:05 AM, Han Fan <[email protected]> wrote:

I have a 6kx10k matrix T and I need the result of T'*T which should be
10kx10k. I want to do this using Mahout DistributedRowMatrix but I found
Hadoop calculates with only one mapper which is very slow.

I dug into the source code of DistributedRowMatrix and found that the
input format of DistributedRowMatrix is CompositeInputFormat.class which
has a method named getSplits that sets mapred.min.split.size to
Long.MAX_VALUE.

So my question is: is DistributedRowMatrix only a demo to show that
matrix multiplication could be done using MapReduce, with no practical
value? Is there any way to do matrix multiplication quickly using Hadoop?

Thanks for your time and sorry for my broken English.
