I received a FileNotFoundException while running Mahout 0.7. It says that the kmeans function in Mahout does not have the clusterdump utility. Now I am thinking of copying the cluster output and running it in an older version of Mahout. Any help and comments would be appreciated.
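Before falling back to an older Mahout, it may be worth checking whether the clusterdump driver is visible at all in the 0.7 install. A minimal check, assuming a standard binary distribution with bin/mahout on the PATH (and that your install's driver list has not been customized):

    # Running the driver with no program name should list the programs it
    # knows about; clusterdump should be among them.
    bin/mahout
    # If the class is on the classpath, this prints the clusterdump usage:
    bin/mahout clusterdump --help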
On Oct 7, 2013 11:57 PM, "Adam Baron" <[email protected]> wrote:

I just tried using Mahout 0.8 and am still seeing the same issue. Any ideas? Is org.apache.mahout.utils.clustering.ClusterDumper working for other folks?

I get this error from the command line:

    Exception in thread "main" java.lang.IllegalStateException: Job failed!
        at org.apache.mahout.clustering.evaluation.RepresentativePointsDriver.runIterationMR(RepresentativePointsDriver.java:252)
        at org.apache.mahout.clustering.evaluation.RepresentativePointsDriver.runIteration(RepresentativePointsDriver.java:165)
        at org.apache.mahout.clustering.evaluation.RepresentativePointsDriver.run(RepresentativePointsDriver.java:127)
        at org.apache.mahout.clustering.evaluation.RepresentativePointsDriver.run(RepresentativePointsDriver.java:90)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.mahout.clustering.evaluation.RepresentativePointsDriver.main(RepresentativePointsDriver.java:67)
        at org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:198)
        at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:156)
        at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:100)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

And this is from the log file of a failed mapper:

    2013-10-07 12:51:46,773 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
    2013-10-07 12:51:47,073 WARN org.apache.hadoop.conf.Configuration: session.id is deprecated. Instead, use dfs.metrics.session-id
    2013-10-07 12:51:47,074 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
    2013-10-07 12:51:47,116 WARN org.apache.hadoop.conf.Configuration: slave.host.name is deprecated. Instead, use dfs.datanode.hostname
    2013-10-07 12:51:47,358 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
    2013-10-07 12:51:47,360 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4355d3a3
    2013-10-07 12:51:47,555 INFO org.apache.hadoop.mapred.MapTask: Processing split: <REDACTED>
    2013-10-07 12:51:47,559 INFO org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
    2013-10-07 12:51:47,562 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 256
    2013-10-07 12:51:47,670 INFO org.apache.hadoop.mapred.MapTask: data buffer = 204010960/255013696
    2013-10-07 12:51:47,670 INFO org.apache.hadoop.mapred.MapTask: record buffer = 671088/838860
    2013-10-07 12:57:03,871 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
    2013-10-07 12:57:03,873 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.HashMap.<init>(HashMap.java:209)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:652)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:706)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:535)
        at java.io.DataInputStream.readInt(DataInputStream.java:371)
        at org.apache.hadoop.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:2258)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2289)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2193)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2239)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:101)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:40)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
        at com.google.common.collect.Iterators$5.hasNext(Iterators.java:543)
        at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
        at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.getRepresentativePoints(RepresentativePointsMapper.java:103)
        at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.getRepresentativePoints(RepresentativePointsMapper.java:97)
        at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.setup(RepresentativePointsMapper.java:87)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:673)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)

Thanks,
Adam

On Wed, Aug 7, 2013 at 11:38 PM, Ted Dunning <[email protected]> wrote:

Mahout is a library. You can link against any version you like and still have a perfectly valid Hadoop program.
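A minimal sketch of what Ted describes, with hypothetical paths (the jar location, input glob, and output file are placeholders), assuming you can download the stock 0.8 release; its job jar is named following the same pattern as the 0.7 jar in the logs below and bundles Mahout with its dependencies, so nothing on the cluster needs to be upgraded:

    hadoop jar mahout-distribution-0.8/mahout-examples-0.8-job.jar \
        org.apache.mahout.utils.clustering.ClusterDumper \
        -i output/clusters-*-final \
        -o clusterdump.txt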
On Wed, Aug 7, 2013 at 11:51 AM, Adam Baron <[email protected]> wrote:

Suneel,

Unfortunately no, we're still on Mahout 0.7. My team is one of many teams which share a large, centrally administered Hadoop cluster. The admins are pretty strict about only installing official CDH releases, and I don't believe Mahout 0.8 is in an official CDH release yet. Has the ClusterDumper code changed in 0.8?

Regards,
Adam

On Tue, Aug 6, 2013 at 9:00 PM, Suneel Marthi <[email protected]> wrote:

Adam,

Pardon my asking again if this has already been answered: are you running against Mahout 0.8?

From: Adam Baron <[email protected]>
To: [email protected]; Suneel Marthi <[email protected]>
Sent: Tuesday, August 6, 2013 6:56 PM
Subject: Re: How to get human-readable output for large clustering?

Suneel,

I was trying -n 25 and -b 100 when I sent my e-mail about it not working for me. Just tried -n 20 and got the same error message. Any other ideas?

Thanks,
Adam

On Mon, Aug 5, 2013 at 7:40 PM, Suneel Marthi <[email protected]> wrote:

Adam/Florian,

Could you try limiting the number of terms clusterdump prints by specifying -n 20 (outputs the top 20 terms)?
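For concreteness, Suneel's suggestion applied to the clusterdump command Florian posts below would look like this; every flag except the added -n 20 is taken verbatim from that command:

    mahout clusterdump \
        -i output/lda-vecs-1M-kmeans-k1000-x20/clusters-*-final \
        -o output/lda-vecs-1M-kmeans-k1000-x20/clusterdump \
        -dt sequencefile \
        -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
        --pointsDir output/lda-vecs-1M-kmeans-k1000-x20/clusteredPoints/ \
        -n 20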
From: Adam Baron <[email protected]>
To: [email protected]
Sent: Monday, August 5, 2013 8:03 PM
Subject: Re: How to get human-readable output for large clustering?

Florian,

Any luck finding an answer over the past 5 months? I'm also dealing with similar out-of-memory errors when I run clusterdump. I'm using 50,000 features and tried k=500. The kmeans command ran fine, but then I got the dreaded OutOfMemoryError with the clusterdump command:

    2013-08-05 18:46:01,686 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
        at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
        at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
        at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
        at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:118)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2114)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2242)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
        at com.google.common.collect.Iterators$5.hasNext(Iterators.java:543)
        at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
        at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.getRepresentativePoints(RepresentativePointsMapper.java:103)
        at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.getRepresentativePoints(RepresentativePointsMapper.java:97)
        at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.setup(RepresentativePointsMapper.java:87)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:673)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)

Thanks,
Adam

On Mon, Mar 11, 2013 at 8:42 AM, Florian Laws <[email protected]> wrote:

Hi,

I have difficulties with clusterdump and clusterpp.

Experimenting with Mahout's k-means clusterer, I have successfully obtained kmeans results and clusterdump output for small custom datasets. However, when I try to use a larger dataset with 1M datapoints and 1000 clusters, I start to run into problems with the clusterdump tool.
The kmeans step runs fine, but clusterdump fails with an out-of-memory error:

    mahout clusterdump \
        -i output/lda-vecs-1M-kmeans-k1000-x20/clusters-*-final \
        -o output/lda-vecs-1M-kmeans-k1000-x20/clusterdump \
        -dt sequencefile \
        -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
        --pointsDir output/lda-vecs-1M-kmeans-k1000-x20/clusteredPoints/

    Running on hadoop, using /Users/issuu/florian/hadoop-1.0.3/bin/hadoop and HADOOP_CONF_DIR=
    MAHOUT-JOB: /Users/issuu/florian/mahout/mahout-distribution-0.7/mahout-examples-0.7-job.jar
    13/03/11 16:21:04 INFO common.AbstractJob: Command line arguments: {--dictionaryType=[sequencefile], --distanceMeasure=[org.apache.mahout.common.distance.EuclideanDistanceMeasure], --endPhase=[2147483647], --input=[output/lda-vecs-1M-kmeans-k1000-x20/clusters-*-final], --output=[output/lda-vecs-1M-kmeans-k1000-x20/clusterdump], --outputFormat=[TEXT], --pointsDir=[output/lda-vecs-1M-kmeans-k1000-x20/clusteredPoints/], --startPhase=[0], --tempDir=[temp]}
    2013-03-11 16:21:04.631 java[79418:1203] Unable to load realm info from SCDynamicStore

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:44)
        at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:39)
        at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:99)
        at org.apache.mahout.clustering.classify.WeightedVectorWritable.readFields(WeightedVectorWritable.java:56)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1809)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
        at com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
        at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
        at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:292)
        at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:245)
        at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:152)
        at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:102)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

How do I increase the heap space in a way that the mahout command accepts it? I've set both MAHOUT_HEAPSIZE to 40000 and JAVA_HEAP_MAX to -Xmx40g, with no effect.
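Two different heaps are in play here, which may be why MAHOUT_HEAPSIZE alone does nothing; the following is a sketch of the knobs involved, hedged because the exact behavior depends on the bin/mahout and bin/hadoop scripts shipped with each distribution. Florian's OutOfMemoryError above is in the client JVM launched via "hadoop jar" (see the "Running on hadoop" line), which on Hadoop 1.x takes its -Xmx from HADOOP_HEAPSIZE (in megabytes); Adam's earlier OutOfMemoryErrors are inside map tasks, which are governed by the task JVM options instead:

    # Client JVM for "hadoop jar" on Hadoop 1.x (value in MB):
    export HADOOP_HEAPSIZE=4096
    mahout clusterdump ...

    # Task JVMs (where the mapper OOMs happen), set in mapred-site.xml:
    #   <property>
    #     <name>mapred.child.java.opts</name>
    #     <value>-Xmx4096m</value>
    #   </property>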
I also read that this is because clusterdump tries to read the whole clustering input into memory, and that one should instead use clusterpp to separate the clusters, so I also tried the clusterpp utility as suggested in
http://mail-archives.apache.org/mod_mbox/mahout-user/201210.mbox/%3CCAPa28_KPzBzhn5Ug0jQNstbzFOBMhADdfGTrcg2p1HCw5GXWWw%40mail.gmail.com%3E

However, that fails, apparently because it can't determine the number of clusters:

    $ mahout clusterpp \
        -i output/lda-vecs-1M-kmeans-k1000-x20/clusters-20-final/ \
        -o /user/issuu/output/lda-vecs-1M-kmeans-k1000-x20/clusterGroups

    Running on hadoop, using /Users/issuu/florian/hadoop-1.0.3/bin/hadoop and HADOOP_CONF_DIR=
    MAHOUT-JOB: /Users/issuu/florian/mahout/mahout-distribution-0.7/mahout-examples-0.7-job.jar
    13/03/11 16:34:55 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[output/lda-vecs-1M-kmeans-k1000-x20/clusters-20-final/], --method=[mapreduce], --output=[/user/issuu/output/lda-vecs-1M-kmeans-k1000-x20/clusterGroups], --startPhase=[0], --tempDir=[temp]}
    2013-03-11 16:34:55.925 java[79653:1203] Unable to load realm info from SCDynamicStore

    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
        at org.apache.mahout.clustering.topdown.postprocessor.ClusterCountReader.getNumberOfClusters(ClusterCountReader.java:53)
        at org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver.postProcessMR(ClusterOutputPostProcessorDriver.java:150)
        at org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver.run(ClusterOutputPostProcessorDriver.java:104)
        at org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver.run(ClusterOutputPostProcessorDriver.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver.main(ClusterOutputPostProcessorDriver.java:81)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I'm running Mahout 0.7 on Hadoop 1.0.3 in pseudo-distributed mode.

What am I doing wrong? What is the recommended way to get the output of a large(-ish) clustering job into some human-readable format?

Thanks,

Florian
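One way to narrow down the ArrayIndexOutOfBoundsException is to read the final-clusters directory with the same sequence-file helpers the Mahout tools use and see how many clusters are actually visible. The sketch below is hypothetical (the class name and argument handling are invented; the Mahout 0.7 and Hadoop classes it calls are real). If it prints zero clusters for the exact path passed to clusterpp, then the cluster-count step is finding no files it accepts under -i; in that case it may be worth trying the parent kmeans output directory as the input instead of the clusters-20-final subdirectory, though that is an assumption to verify against your install, not a documented fix:

    // Hypothetical diagnostic: count the clusters readable under a kmeans
    // final-clusters directory, using Mahout 0.7's own iterator classes.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.clustering.Cluster;
    import org.apache.mahout.clustering.iterator.ClusterWritable;
    import org.apache.mahout.common.iterator.sequencefile.PathFilters;
    import org.apache.mahout.common.iterator.sequencefile.PathType;
    import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable;

    public final class CountClusters {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // e.g. output/lda-vecs-1M-kmeans-k1000-x20/clusters-20-final
        Path finalClusters = new Path(args[0]);
        int count = 0;
        // Iterate the part-* files the same way the Mahout tools do.
        for (ClusterWritable value :
            new SequenceFileDirValueIterable<ClusterWritable>(
                finalClusters, PathType.LIST, PathFilters.partFilter(), conf)) {
          Cluster cluster = value.getValue();
          System.out.println("cluster id = " + cluster.getId());
          count++;
        }
        // Zero here means no readable part files were found, consistent
        // with the ArrayIndexOutOfBoundsException at index 0 above.
        System.out.println(count + " clusters read");
      }
    }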
