Hi,
I'm having difficulties with clusterdump and clusterpp.
Experimenting with Mahout's k-means clusterer, I have successfully
obtained kmeans results and clusterdump output for small custom
datasets.
However, when I try a larger dataset with 1M data points and
1000 clusters, I run into problems with the clusterdump tool.
The kmeans step runs fine, but clusterdump fails with an out-of-memory
error:
mahout clusterdump \
  -i output/lda-vecs-1M-kmeans-k1000-x20/clusters-*-final \
  -o output/lda-vecs-1M-kmeans-k1000-x20/clusterdump \
  -dt sequencefile \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  --pointsDir output/lda-vecs-1M-kmeans-k1000-x20/clusteredPoints/
Running on hadoop, using /Users/issuu/florian/hadoop-1.0.3/bin/hadoop
and HADOOP_CONF_DIR=
MAHOUT-JOB:
/Users/issuu/florian/mahout/mahout-distribution-0.7/mahout-examples-0.7-job.jar
13/03/11 16:21:04 INFO common.AbstractJob: Command line arguments:
{--dictionaryType=[sequencefile],
--distanceMeasure=[org.apache.mahout.common.distance.EuclideanDistanceMeasure],
--endPhase=[2147483647],
--input=[output/lda-vecs-1M-kmeans-k1000-x20/clusters-*-final],
--output=[output/lda-vecs-1M-kmeans-k1000-x20/clusterdump],
--outputFormat=[TEXT],
--pointsDir=[output/lda-vecs-1M-kmeans-k1000-x20/clusteredPoints/],
--startPhase=[0], --tempDir=[temp]}
2013-03-11 16:21:04.631 java[79418:1203] Unable to load realm info
from SCDynamicStore
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:44)
    at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:39)
    at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:99)
    at org.apache.mahout.clustering.classify.WeightedVectorWritable.readFields(WeightedVectorWritable.java:56)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1809)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
    at com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
    at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
    at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:292)
    at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:245)
    at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:152)
    at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:102)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
How do I increase the heap space in a way the mahout command actually
picks up? I've set both MAHOUT_HEAPSIZE to 40000 and JAVA_HEAP_MAX to
-Xmx40g, with no effect.
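For reference, this is what I tried (a sketch; I'm assuming the bin/mahout
script reads these two environment variables, with MAHOUT_HEAPSIZE in MB):

```shell
# Raise the client JVM heap before invoking clusterdump.
# Neither variable had any visible effect on the OutOfMemoryError.
export MAHOUT_HEAPSIZE=40000
export JAVA_HEAP_MAX=-Xmx40g
# mahout clusterdump ...   (same arguments as above)
```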
I also read that this happens because clusterdump tries to read the whole
clustering input into memory, and that one should instead use clusterpp
to split the output by cluster. So I also tried the clusterpp utility as
suggested in
http://mail-archives.apache.org/mod_mbox/mahout-user/201210.mbox/%3CCAPa28_KPzBzhn5Ug0jQNstbzFOBMhADdfGTrcg2p1HCw5GXWWw%40mail.gmail.com%3E
However, that fails too, apparently because it can't determine the number of clusters.
$ mahout clusterpp \
  -i output/lda-vecs-1M-kmeans-k1000-x20/clusters-20-final/ \
  -o /user/issuu/output/lda-vecs-1M-kmeans-k1000-x20/clusterGroups
Running on hadoop, using /Users/issuu/florian/hadoop-1.0.3/bin/hadoop
and HADOOP_CONF_DIR=
MAHOUT-JOB:
/Users/issuu/florian/mahout/mahout-distribution-0.7/mahout-examples-0.7-job.jar
13/03/11 16:34:55 INFO common.AbstractJob: Command line arguments:
{--endPhase=[2147483647],
--input=[output/lda-vecs-1M-kmeans-k1000-x20/clusters-20-final/],
--method=[mapreduce],
--output=[/user/issuu/output/lda-vecs-1M-kmeans-k1000-x20/clusterGroups],
--startPhase=[0], --tempDir=[temp]}
2013-03-11 16:34:55.925 java[79653:1203] Unable to load realm info
from SCDynamicStore
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
    at org.apache.mahout.clustering.topdown.postprocessor.ClusterCountReader.getNumberOfClusters(ClusterCountReader.java:53)
    at org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver.postProcessMR(ClusterOutputPostProcessorDriver.java:150)
    at org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver.run(ClusterOutputPostProcessorDriver.java:104)
    at org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver.run(ClusterOutputPostProcessorDriver.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver.main(ClusterOutputPostProcessorDriver.java:81)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
I'm running Mahout 0.7 on Hadoop 1.0.3 in pseudo-distributed mode.
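In case it helps with diagnosing the ArrayIndexOutOfBoundsException, here
is how I'd sanity-check what clusterpp actually sees in that input
directory (standard Hadoop CLI; the path matches my run above, and the
guard is just so the snippet is a no-op where hadoop isn't on the PATH):

```shell
# List the final clusters directory to confirm it contains readable
# part files for ClusterCountReader to count.
CLUSTERS_DIR=output/lda-vecs-1M-kmeans-k1000-x20/clusters-20-final/
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -ls "$CLUSTERS_DIR"
  listed=$?
else
  echo "hadoop not on PATH; skipping check"
  listed=0
fi
```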
What am I doing wrong?
What is the recommended way to get the output of a large(-ish)
clustering job into some human-readable format?
Thanks,
Florian