I'm having trouble with the transpose method of Mahout 0.4's
DistributedRowMatrix. The matrix I'm transposing is about 12 million
rows by 2.5 million columns. It's quite sparse (no more than 10
non-zero elements per row), so memory shouldn't be a problem. However,
running transpose always runs out of memory in the reduce step:
2011-04-27 10:51:29,910 FATAL org.apache.hadoop.mapred.TaskTracker:
Error running child : java.lang.OutOfMemoryError: Java heap space
    at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
    at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
    at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:134)
    at org.apache.mahout.math.hadoop.TransposeJob$TransposeReducer.reduce(TransposeJob.java:142)
    at org.apache.mahout.math.hadoop.TransposeJob$TransposeReducer.reduce(TransposeJob.java:122)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Digging into the problem, I found that the reduce task is being run
with -Xmx200m. That is the Hadoop default for mapred.child.java.opts,
since I didn't override it in the mapred configuration on the machine
running the job. I would expect to be able to pass configuration
parameters to the TransposeJob that the transpose method launches, but
there doesn't seem to be a way to do so.
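The only workaround I can see for now is raising the child heap
cluster-wide in mapred-site.xml, which affects every job on the
cluster, not just this one (the 1024m value below is just a guess at a
sufficient heap, not something I've verified):

```xml
<!-- mapred-site.xml on the JobTracker/TaskTracker nodes -->
<property>
  <name>mapred.child.java.opts</name>
  <!-- Default is -Xmx200m; bumping it so the TransposeReducer's
       RandomAccessSparseVector can grow without an OOM. -->
  <value>-Xmx1024m</value>
</property>
```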
Did I miss some other way of transposing the matrix, or some way to
configure the transpose job?