Hi all,
Recently I tried Mahout's Hadoop SSVD job (mahout-0.9 or mahout-1.0) and hit a Java heap space OutOfMemoryError in ABtDenseOutJob. I tracked down the cause; the ABtDenseOutJob map code is as follows:

```java
protected void map(Writable key, VectorWritable value, Context context)
    throws IOException, InterruptedException {
  Vector vec = value.get();
  int vecSize = vec.size();
  if (aCols == null) {
    aCols = new Vector[vecSize];
  } else if (aCols.length < vecSize) {
    aCols = Arrays.copyOf(aCols, vecSize);
  }
  if (vec.isDense()) {
    for (int i = 0; i < vecSize; i++) {
      extendAColIfNeeded(i, aRowCount + 1);
      aCols[i].setQuick(aRowCount, vec.getQuick(i));
    }
  } else if (vec.size() > 0) {
    for (Vector.Element vecEl : vec.nonZeroes()) {
      int i = vecEl.index();
      extendAColIfNeeded(i, aRowCount + 1);
      aCols[i].setQuick(aRowCount, vecEl.get());
    }
  }
  aRowCount++;
}
```

If the input is a RandomAccessSparseVector, which is common with big data, its vec.size() is typically Integer.MAX_VALUE (2^31 - 1), so `aCols = new Vector[vecSize]` triggers the OutOfMemory error. The obvious workaround is to enlarge every TaskTracker's maximum heap:

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```

However, if you are NOT a Hadoop administrator or ops, you have no permission to modify that config.
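To see why that allocation alone is fatal, here is a rough back-of-the-envelope sketch (not Mahout code; it assumes roughly 8 bytes per object reference on a 64-bit JVM without compressed oops). The reference array by itself needs ~15 GB before a single Vector is even created:

```java
// Illustration: memory needed just for the reference slots of
// new Vector[Integer.MAX_VALUE], before any column Vector exists.
public class ArrayAllocDemo {
  public static void main(String[] args) {
    long vecSize = Integer.MAX_VALUE;      // 2^31 - 1, reported by a RandomAccessSparseVector
    long refBytes = 8;                     // assumed size of one reference on a 64-bit JVM
    long arrayBytes = vecSize * refBytes;  // bytes for the empty reference array alone
    System.out.println(arrayBytes / (1024L * 1024L * 1024L) + " GB"); // prints "15 GB"
  }
}
```

So no realistic -Xmx setting can make the dense array approach work for a vector whose logical size is Integer.MAX_VALUE.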
So I modified the ABtDenseOutJob map code to support the RandomAccessSparseVector case: I replaced the original `Vector[] aCols` array with a HashMap, so columns are only allocated when they are actually touched. The modified code is as follows (note that `int vecSize = vec.size();` now lives inside the dense branch, where it is still needed):

```java
private Map<Integer, Vector> aColsMap = new HashMap<Integer, Vector>();

protected void map(Writable key, VectorWritable value, Context context)
    throws IOException, InterruptedException {
  Vector vec = value.get();
  if (vec.isDense()) {
    int vecSize = vec.size();
    for (int i = 0; i < vecSize; i++) {
      //extendAColIfNeeded(i, aRowCount + 1);
      if (aColsMap.get(i) == null) {
        aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
      }
      aColsMap.get(i).setQuick(aRowCount, vec.getQuick(i));
      //aCols[i].setQuick(aRowCount, vec.getQuick(i));
    }
  } else if (vec.size() > 0) {
    for (Vector.Element vecEl : vec.nonZeroes()) {
      int i = vecEl.index();
      //extendAColIfNeeded(i, aRowCount + 1);
      if (aColsMap.get(i) == null) {
        aColsMap.put(i, new RandomAccessSparseVector(Integer.MAX_VALUE, 100));
      }
      aColsMap.get(i).setQuick(aRowCount, vecEl.get());
      //aCols[i].setQuick(aRowCount, vecEl.get());
    }
  }
  aRowCount++;
}
```

With this change the OutOfMemory error no longer occurs. Thank you!
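The idea behind the patch can be shown without any Mahout dependencies. This is a minimal, self-contained sketch of the same lazy-column pattern (the class and method names here are made up for illustration): a map of columns keyed by column index, where each column is itself a sparse map, so a huge logical dimension costs nothing until a cell is written:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the lazy-column pattern: columns are created on first write
// instead of preallocating an array sized by the logical vector dimension.
public class LazyColumns {
  private final Map<Integer, Map<Integer, Double>> cols = new HashMap<>();

  void set(int col, int row, double v) {
    // Materialize the column only when it is first touched.
    cols.computeIfAbsent(col, k -> new HashMap<>()).put(row, v);
  }

  double get(int col, int row) {
    Map<Integer, Double> c = cols.get(col);
    return c == null ? 0.0 : c.getOrDefault(row, 0.0);
  }

  public static void main(String[] args) {
    LazyColumns m = new LazyColumns();
    m.set(1_000_000_000, 3, 2.5);                // huge column index, no huge array needed
    System.out.println(m.get(1_000_000_000, 3)); // prints "2.5"
    System.out.println(m.cols.size());           // prints "1": only one column materialized
  }
}
```

Memory usage is now proportional to the number of columns actually seen in the input, not to vec.size(), which is exactly why the HashMap version survives where `new Vector[Integer.MAX_VALUE]` cannot.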