Adam,

Pardon my asking again if this has already been answered: are you running against Mahout 0.8?
________________________________
From: Adam Baron <[email protected]>
To: [email protected]; Suneel Marthi <[email protected]>
Sent: Tuesday, August 6, 2013 6:56 PM
Subject: Re: How to get human-readable output for large clustering?

Suneel,

I was trying -n 25 and -b 100 when I sent my e-mail about it not working for me. Just tried -n 20 and got the same error message. Any other ideas?

Thanks,
        Adam

On Mon, Aug 5, 2013 at 7:40 PM, Suneel Marthi <[email protected]> wrote:

Adam/Florian,
>
>Could you try running the clusterdump by limiting the number of terms from
>clusterdump, by specifying -n 20 (outputs the 20 top terms)?
>
>________________________________
> From: Adam Baron <[email protected]>
>To: [email protected]
>Sent: Monday, August 5, 2013 8:03 PM
>Subject: Re: How to get human-readable output for large clustering?
>
>Florian,
>
>Any luck finding an answer over the past 5 months? I'm also dealing with
>similar out-of-memory errors when I run clusterdump. I'm using 50,000
>features and tried k=500.
>The kmeans command ran fine, but then I got the
>dreaded OutOfMemoryError with the clusterdump command:
>
>2013-08-05 18:46:01,686 FATAL org.apache.hadoop.mapred.Child: Error running
>child : java.lang.OutOfMemoryError: Java heap space
>        at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
>        at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
>        at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
>        at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:118)
>        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2114)
>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2242)
>        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
>        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
>        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>        at com.google.common.collect.Iterators$5.hasNext(Iterators.java:543)
>        at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
>        at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.getRepresentativePoints(RepresentativePointsMapper.java:103)
>        at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.getRepresentativePoints(RepresentativePointsMapper.java:97)
>        at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.setup(RepresentativePointsMapper.java:87)
>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:673)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
>        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>        at org.apache.hadoop.mapred.Child.main(Child.java:262)
>
>Thanks,
>        Adam
>
>On Mon, Mar 11, 2013 at 8:42 AM, Florian Laws <[email protected]> wrote:
>
>> Hi,
>>
>> I have difficulties with clusterdump and clusterpp.
>>
>> Experimenting with Mahout's k-means clusterer, I have successfully
>> obtained kmeans results and clusterdump output for small custom
>> datasets.
>>
>> However, when I try to use a larger dataset with 1M datapoints and
>> 1000 clusters, I start to run into problems with the clusterdump tool.
>> The kmeans step runs fine, but clusterdump fails with an out-of-memory
>> error:
>>
>> mahout clusterdump
>>   -i output/lda-vecs-1M-kmeans-k1000-x20/clusters-*-final
>>   -o output/lda-vecs-1M-kmeans-k1000-x20/clusterdump
>>   -dt sequencefile
>>   -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure
>>   --pointsDir output/lda-vecs-1M-kmeans-k1000-x20/clusteredPoints/
>>
>> Running on hadoop, using /Users/issuu/florian/hadoop-1.0.3/bin/hadoop
>> and HADOOP_CONF_DIR=
>> MAHOUT-JOB: /Users/issuu/florian/mahout/mahout-distribution-0.7/mahout-examples-0.7-job.jar
>> 13/03/11 16:21:04 INFO common.AbstractJob: Command line arguments:
>> {--dictionaryType=[sequencefile],
>> --distanceMeasure=[org.apache.mahout.common.distance.EuclideanDistanceMeasure],
>> --endPhase=[2147483647],
>> --input=[output/lda-vecs-1M-kmeans-k1000-x20/clusters-*-final],
>> --output=[output/lda-vecs-1M-kmeans-k1000-x20/clusterdump],
>> --outputFormat=[TEXT],
>> --pointsDir=[output/lda-vecs-1M-kmeans-k1000-x20/clusteredPoints/],
>> --startPhase=[0], --tempDir=[temp]}
>> 2013-03-11 16:21:04.631 java[79418:1203] Unable to load realm info
>> from SCDynamicStore
>>
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>        at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:44)
>>        at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:39)
>>        at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:99)
>>        at org.apache.mahout.clustering.classify.WeightedVectorWritable.readFields(WeightedVectorWritable.java:56)
>>        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1809)
>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
>>        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
>>        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
>>        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
>>        at com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
>>        at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
>>        at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:292)
>>        at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:245)
>>        at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:152)
>>        at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:102)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>> How do I increase the heap space in a way that the mahout command
>> accepts it? I've set both MAHOUT_HEAPSIZE to 40000 and JAVA_HEAP_MAX
>> to -Xmx40g, with no effect.
>>
>> I also read that this happens because clusterdump tries to read the
>> whole clustering input into memory, and that one should instead use
>> clusterpp to separate the clusters, so I also tried the clusterpp
>> utility as suggested in
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201210.mbox/%3CCAPa28_KPzBzhn5Ug0jQNstbzFOBMhADdfGTrcg2p1HCw5GXWWw%40mail.gmail.com%3E
>>
>> However, that fails because it apparently can't determine the number
>> of clusters.
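[For readers hitting the same wall: the heap for the clusterdump driver is controlled by environment variables read by the launcher scripts, not by JVM flags on the mahout command line. The sketch below assumes the stock bin/mahout and bin/hadoop scripts from Mahout 0.7 / Hadoop 1.x; the exact variable handling differs between distributions, so verify the names against your own scripts. The 4000 MB figure is only an example.]

```shell
# Sketch: raising the client-side heap before running clusterdump.
# bin/mahout reads MAHOUT_HEAPSIZE and interprets it as megabytes.
export MAHOUT_HEAPSIZE=4000

# When bin/mahout delegates to "hadoop jar" (i.e. MAHOUT_LOCAL is unset),
# the driver JVM is actually launched by bin/hadoop, so Hadoop's own
# client-heap settings can override what bin/mahout computed.
export HADOOP_HEAPSIZE=4000
export HADOOP_CLIENT_OPTS="-Xmx4g"

mahout clusterdump \
  -i output/lda-vecs-1M-kmeans-k1000-x20/clusters-*-final \
  -o output/lda-vecs-1M-kmeans-k1000-x20/clusterdump \
  -dt sequencefile \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  --pointsDir output/lda-vecs-1M-kmeans-k1000-x20/clusteredPoints/
```

Note that setting JAVA_HEAP_MAX in the shell has no effect because the launcher scripts recompute it internally; that would explain the "no effect" observation above.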
>>
>> $ mahout clusterpp
>>   -i output/lda-vecs-1M-kmeans-k1000-x20/clusters-20-final/
>>   -o /user/issuu/output/lda-vecs-1M-kmeans-k1000-x20/clusterGroups
>>
>> Running on hadoop, using /Users/issuu/florian/hadoop-1.0.3/bin/hadoop
>> and HADOOP_CONF_DIR=
>> MAHOUT-JOB: /Users/issuu/florian/mahout/mahout-distribution-0.7/mahout-examples-0.7-job.jar
>> 13/03/11 16:34:55 INFO common.AbstractJob: Command line arguments:
>> {--endPhase=[2147483647],
>> --input=[output/lda-vecs-1M-kmeans-k1000-x20/clusters-20-final/],
>> --method=[mapreduce],
>> --output=[/user/issuu/output/lda-vecs-1M-kmeans-k1000-x20/clusterGroups],
>> --startPhase=[0], --tempDir=[temp]}
>> 2013-03-11 16:34:55.925 java[79653:1203] Unable to load realm info
>> from SCDynamicStore
>>
>> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
>>        at org.apache.mahout.clustering.topdown.postprocessor.ClusterCountReader.getNumberOfClusters(ClusterCountReader.java:53)
>>        at org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver.postProcessMR(ClusterOutputPostProcessorDriver.java:150)
>>        at org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver.run(ClusterOutputPostProcessorDriver.java:104)
>>        at org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver.run(ClusterOutputPostProcessorDriver.java:70)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>        at org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver.main(ClusterOutputPostProcessorDriver.java:81)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>> I'm running Mahout 0.7 on Hadoop 1.0.3 in pseudo-distributed mode.
>>
>> What am I doing wrong?
>> What is the recommended way to get the output of a large(-ish)
>> clustering job into some human-readable format?
>>
>> Thanks,
>>
>> Florian
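[The -n/-b workaround Suneel suggests upthread can be written out as a concrete invocation. A sketch against Florian's paths; the output directory name here is made up for illustration. In the 0.7 clusterdump, -n/--numWords caps how many top terms are printed per cluster and -b/--substring truncates each printed vector to that many characters. This shrinks the output, though as Adam reports above it may not by itself avoid the OOM, since points are still read into memory.]

```shell
# Sketch: limiting clusterdump output size per Suneel's -n/-b suggestion.
# -n 20  -> print only the top 20 terms per cluster (--numWords)
# -b 100 -> truncate each printed vector to 100 characters (--substring)
mahout clusterdump \
  -i output/lda-vecs-1M-kmeans-k1000-x20/clusters-*-final \
  -o output/lda-vecs-1M-kmeans-k1000-x20/clusterdump-top20 \
  -dt sequencefile \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -n 20 \
  -b 100
```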
