I'm having trouble using Mahout's (0.4) DistributedRowMatrix transpose method. The matrix I'm transposing is about 12 million rows by 2.5 million columns. It's quite sparse (no more than 10 non-zero elements per row), so memory shouldn't be a problem. However, running transpose always runs out of memory in the reduce step:

2011-04-27 10:51:29,910 FATAL org.apache.hadoop.mapred.TaskTracker: Error running child : java.lang.OutOfMemoryError: Java heap space
        at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
        at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
        at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:134)
        at org.apache.mahout.math.hadoop.TransposeJob$TransposeReducer.reduce(TransposeJob.java:142)
        at org.apache.mahout.math.hadoop.TransposeJob$TransposeReducer.reduce(TransposeJob.java:122)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

In digging into the problem I found that the reduce task is being run with -Xmx200m. That is the default Hadoop mapred.child.java.opts, since I didn't override it in the mapred configuration on the machine running the job. Ideally it would be possible to pass job parameters through to TransposeJob when it is launched from the transpose method, but there doesn't seem to be a way to do that.
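For reference, the workaround I'm considering looks roughly like this: build the transpose job's JobConf myself instead of going through DistributedRowMatrix.transpose(), so I can raise the child heap before submitting. This is just a sketch based on my reading of the 0.4 TransposeJob sources (the buildTransposeJobConf method and its signature, and all paths/sizes below, are my assumptions), so treat it as untested:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.mahout.math.hadoop.TransposeJob;

public class TransposeWithBiggerHeap {
  public static void main(String[] args) throws Exception {
    // Build the same JobConf that DistributedRowMatrix.transpose() would,
    // assuming TransposeJob.buildTransposeJobConf(input, output, numRows)
    // exists as in the 0.4 sources.
    JobConf conf = TransposeJob.buildTransposeJobConf(
        new Path("/path/to/matrix"),      // input row path (hypothetical)
        new Path("/path/to/transpose"),   // output path (hypothetical)
        12000000);                        // number of rows in the input

    // Override the 200m default before submission.
    conf.set("mapred.child.java.opts", "-Xmx2048m");

    JobClient.runJob(conf);
  }
}
```

If transpose() used the Configuration already set on the matrix, that would be even simpler, but as far as I can tell it builds its own conf internally.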

Did I miss some other way of transposing the matrix or some way to configure the transpose job?
