I am running the Fuzzy K-Means algorithm on the Reuters corpus, using Mahout 0.7 on Hadoop 1.1 on Ubuntu 12.04 machines.
The Hadoop cluster consists of two machines:

* master: 8 GB RAM (4 cores)
* slave: 4 GB RAM (a KVM VM with only 1 core)

When I run this command, the clustering fails at iteration 3 (cluster-2):

```
$ mahout fkmeans -cd 1.0 -k 21 -m 2 -ow -x 10 -dm $DISTMETRIC -i $TFIDF_VEC -c $F_INITCLUSTERS -o $F_CLUSTERS
```

I see the same error in the map tasks' syslog logs (on both master and slave):

```
2014-03-27 17:01:42,598 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2014-03-27 17:01:42,807 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2014-03-27 17:01:42,871 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2014-03-27 17:01:42,873 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4d7c07
2014-03-27 17:01:42,944 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2014-03-27 17:01:42,969 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2014-03-27 17:01:42,969 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2014-03-27 17:01:43,640 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2014-03-27 17:01:43,658 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2014-03-27 17:01:43,658 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName hduser for UID 1002 from the native implementation
2014-03-27 17:01:43,660 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
	at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
	at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
	at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
	at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:118)
	at org.apache.mahout.math.VectorWritable.readVector(VectorWritable.java:190)
	at org.apache.mahout.clustering.AbstractCluster.readFields(AbstractCluster.java:99)
	at org.apache.mahout.clustering.iterator.DistanceMeasureCluster.readFields(DistanceMeasureCluster.java:55)
	at org.apache.mahout.clustering.kmeans.Kluster.readFields(Kluster.java:72)
	at org.apache.mahout.classifier.sgd.PolymorphicWritable.read(PolymorphicWritable.java:43)
	at org.apache.mahout.clustering.iterator.ClusterWritable.readFields(ClusterWritable.java:46)
	at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1941)
	at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:76)
	at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:35)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
	at com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
	at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
	at org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:208)
	at org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:36)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)
```

I have tried setting the maximum heap size to 4000 MB on both the master and slave machines (in `bin/hadoop`):

```
JAVA_HEAP_MAX=-Xmx4000m
```

However, I still see the same error as above. What else could I do to avoid this problem? A related question: can this be resolved by using a later version of Mahout?

The complete list of commands used to arrive at this error is in this Gist:

* https://gist.github.com/tuxdna/9808278

The output of Mahout fkmeans is here:

* http://fpaste.org/89169/

And the task tracker logs are located here:

* http://fpaste.org/89166/

Thanks and regards,
Saleem
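For completeness, my understanding is that in Hadoop 1.x the map tasks run in separate child JVMs whose heap is controlled by `mapred.child.java.opts` (default `-Xmx200m`), while `JAVA_HEAP_MAX` in `bin/hadoop` only affects the client and daemon JVMs. Below is a minimal sketch of that setting in `conf/mapred-site.xml`; I have not confirmed that this alone resolves the error, and the 2048 MB value is purely illustrative, not taken from my cluster:

```xml
<!-- conf/mapred-site.xml on every node (master and slave).
     Requires a TaskTracker restart to take effect.
     The -Xmx value is an illustrative assumption, not a recommendation. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
```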
