I'm having trouble with the transpose method of Mahout 0.4's
DistributedRowMatrix. The matrix I'm transposing is about 12 million
rows by 2.5 million columns. It's quite sparse (no more than 10
non-zero elements per row), so memory shouldn't be a problem. However,
running transpose always runs out of memory in the reduce step:
2011-04-27 10:51:29,910 FATAL org.apache.hadoop.mapred.TaskTracker:
Error running child : java.lang.OutOfMemoryError: Java heap space
    at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
    at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
    at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:134)
    at org.apache.mahout.math.hadoop.TransposeJob$TransposeReducer.reduce(TransposeJob.java:142)
    at org.apache.mahout.math.hadoop.TransposeJob$TransposeReducer.reduce(TransposeJob.java:122)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Digging into the problem, I found that the reduce task is being run
with -Xmx200m. That is the Hadoop default for mapred.child.java.opts,
since I didn't override it in the mapred configuration on the machine
running the job. I would expect to be able to pass configuration
parameters to the TransposeJob that the transpose method launches, but
there doesn't seem to be a way to do so.
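The only workaround I can see for now is raising the child heap
cluster-wide in mapred-site.xml, which affects every job on the
cluster, not just this one (the 1024m value below is just a guess at a
sufficient heap, not something I've verified):

```xml
<!-- mapred-site.xml on the JobTracker/TaskTracker nodes -->
<property>
  <name>mapred.child.java.opts</name>
  <!-- Default is -Xmx200m; bumping it so the TransposeReducer's
       RandomAccessSparseVector can grow without an OOM. -->
  <value>-Xmx1024m</value>
</property>
```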
Did I miss some other way of transposing the matrix, or some way to
configure the transpose job?