I am running the Fuzzy K-Means algorithm on the Reuters corpus, using Mahout 0.7 on Hadoop 1.1 on Ubuntu 12.04 machines.
The Hadoop cluster consists of two machines:

* master: 8 GB RAM (4 cores)
* slave: 4 GB RAM (a KVM VM with only 1 core)

When I run this command, the clustering fails at iteration 3 (cluster-2):

```
$ mahout fkmeans -cd 1.0 -k 21 -m 2 -ow -x 10 -dm $DISTMETRIC -i $TFIDF_VEC -c $F_INITCLUSTERS -o $F_CLUSTERS
```

I see the same error in the map tasks' syslog logs (on both master and slave):

```
2014-03-27 17:01:42,598 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2014-03-27 17:01:42,807 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2014-03-27 17:01:42,871 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2014-03-27 17:01:42,873 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4d7c07
2014-03-27 17:01:42,944 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2014-03-27 17:01:42,969 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2014-03-27 17:01:42,969 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2014-03-27 17:01:43,640 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2014-03-27 17:01:43,658 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2014-03-27 17:01:43,658 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName hduser for UID 1002 from the native implementation
2014-03-27 17:01:43,660 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
	at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
	at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
	at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
	at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:118)
	at org.apache.mahout.math.VectorWritable.readVector(VectorWritable.java:190)
	at org.apache.mahout.clustering.AbstractCluster.readFields(AbstractCluster.java:99)
	at org.apache.mahout.clustering.iterator.DistanceMeasureCluster.readFields(DistanceMeasureCluster.java:55)
	at org.apache.mahout.clustering.kmeans.Kluster.readFields(Kluster.java:72)
	at org.apache.mahout.classifier.sgd.PolymorphicWritable.read(PolymorphicWritable.java:43)
	at org.apache.mahout.clustering.iterator.ClusterWritable.readFields(ClusterWritable.java:46)
	at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1941)
	at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:76)
	at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:35)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
	at com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
	at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
	at org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:208)
	at org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:36)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)
```

I have tried setting the maximum heap size to 4000 MB on both the master and slave machines (in `bin/hadoop`):

```
JAVA_HEAP_MAX=-Xmx4000m
```

However, I still see the same error as above. What else could I do to avoid this problem? A related question: can this be resolved by using a later version of Mahout?

The complete list of commands used to arrive at this error is in this Gist:

* https://gist.github.com/tuxdna/9808278

The output of Mahout fkmeans is here:

* http://fpaste.org/89169/

And the task tracker logs are located here:

* http://fpaste.org/89166/

Thanks and regards,
Saleem
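For completeness, my understanding is that in Hadoop 1.x the map tasks run in separate child JVMs whose heap is controlled by `mapred.child.java.opts` (default `-Xmx200m`), while `JAVA_HEAP_MAX` in `bin/hadoop` only affects the client and daemon JVMs. Below is a minimal sketch of that setting in `conf/mapred-site.xml`; I have not confirmed that this alone resolves the error, and the 2048 MB value is purely illustrative, not taken from my cluster:

```xml
<!-- conf/mapred-site.xml on every node (master and slave).
     Requires a TaskTracker restart to take effect.
     The -Xmx value is an illustrative assumption, not a recommendation. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
```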
