Hi Lance, got the shell script working already, thanks :)
I'm actually still looking for a workaround to the original problem, though. If dumping takes that many resources, is there a way to do it so that it won't (or is at least less likely to) end up with an OutOfMemoryError? (I've sketched a couple of things I'm thinking of trying at the very bottom of this mail, below the quoted thread.)

Best wishes,
Jeffrey04

----- Original Message -----
> From: Lance Norskog <[email protected]>
> To: [email protected]; Jeffrey <[email protected]>
> Cc:
> Sent: Wednesday, July 27, 2011 4:15 PM
> Subject: Re: fkmeans or Cluster Dumper not working?
>
> The fix got checked in this afternoon. The problem is that a line in
> the shell script surrounds mahout-examples-*.job with quotes. This
> makes it not "glob expand the wildcard" to find the actual job file.
>
> Look in the bin/mahout shell script, around line 127.
>
> On 7/27/11, Jeffrey <[email protected]> wrote:
>> erm, is there any workaround to the problem?
>>
>>
>> ----- Original Message -----
>>> From: Jeff Eastman <[email protected]>
>>> To: "[email protected]" <[email protected]>
>>> Cc:
>>> Sent: Tuesday, July 26, 2011 1:12 PM
>>> Subject: RE: fkmeans or Cluster Dumper not working?
>>>
>>> Also makes sense that fuzzyk centroids would be completely dense, since every
>>> point is a member of every cluster. My reducer heaps are 4G.
>>>
>>> -----Original Message-----
>>> From: Jeff Eastman [mailto:[email protected]]
>>> Sent: Monday, July 25, 2011 2:32 PM
>>> To: [email protected]; Jeffrey
>>> Subject: RE: fkmeans or Cluster Dumper not working?
>>>
>>> I'm able to run fuzzyk on your data set with k=10 and k=50 without problems.
>>> I also ran it fine with k=100 just to push it a bit harder. Runs took longer as
>>> k increased as expected (39s, 2m50s, 5m57s), as did the clustering (11s, 45s, 1m11s).
>>> The cluster dumper is throwing an OME with your data points and probably also with
>>> the larger cluster volumes, suggesting it needs a larger -Xmx value since it is
>>> running locally and not influenced by the cluster vm parameters.
>>>
>>> I will try some more and keep you updated.
>>>
>>> The cluster dumper is throwing an OME trying to inhale all your data points. It
>>> is running locally.
>>>
>>> -----Original Message-----
>>> From: Jeffrey [mailto:[email protected]]
>>> Sent: Sunday, July 24, 2011 12:51 AM
>>> To: [email protected]
>>> Subject: Re: fkmeans or Cluster Dumper not working?
>>>
>>> Erm, is there any update? Is the problem reproducible?
>>>
>>> Best wishes,
>>> Jeffrey04
>>>
>>>
>>>> ________________________________
>>>> From: Jeffrey <[email protected]>
>>>> To: Jeff Eastman <[email protected]>; "[email protected]" <[email protected]>
>>>> Sent: Friday, July 22, 2011 12:40 AM
>>>> Subject: Re: fkmeans or Cluster Dumper not working?
>>>>
>>>> Hi Jeff,
>>>>
>>>> lol, this is probably my last reply before I fall asleep (GMT+8 here).
>>>>
>>>> First things first, the data file is here: http://coolsilon.com/image-tag.mvc
>>>>
>>>> Q: What is the cardinality of your vector data?
>>>> About 1000+ rows (resources) * 14 000+ columns (tags)
>>>> Q: Is it sparse or dense?
>>>> Sparse (assuming sparse = each vector contains mostly 0)
>>>> Q: How many vectors are you trying to cluster?
>>>> All of them? (1000+ rows)
>>>> Q: What is the exact error you see when fkmeans fails with k=10? With k=50?
>>>> i think i posted the exception when k=50, but will post them again > here >>>> >>>> >>>> k=10, fkmeans actually works, but cluster dumper returns exception, >>>> however, >>> if i take out --pointsDir, then it would work (output looks ok, but >>> without all >>> the points) >>>> >>>> >>>> $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output >>> sensei/clusters --clusters sensei/clusters/clusters-0 --clustering >>> --overwrite >>> --emitMostLikely false --numClusters 10 --maxIter 10 --m 5 >>>> ... >>>> $ bin/mahout clusterdump --seqFileDir > sensei/clusters/clusters-1 >>> --pointsDir sensei/clusters/clusteredPoints --output >>> image-tag-clusters.txt >>> Running on hadoop, using >>> HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0 >>>> > HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf >>>> MAHOUT-JOB: >>> > /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar >>>> 11/07/22 00:14:50 INFO common.AbstractJob: Command line > arguments: >>> {--dictionaryType=text, --endPhase=2147483647, >>> --output=image-tag-clusters.txt, >>> --pointsDir=sensei/clusters/clusteredPoints, >>> --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, > --tempDir=temp} >>>> Exception in thread "main" > java.lang.OutOfMemoryError: Java >>> heap space >>>> at java.lang.Object.clone(Native Method) >>>> at >>> org.apache.mahout.math.DenseVector.<init>(DenseVector.java:44) >>>> at >>> org.apache.mahout.math.DenseVector.<init>(DenseVector.java:39) >>>> at >>> > org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:94) >>>> at >>> > org.apache.mahout.clustering.WeightedVectorWritable.readFields(WeightedVectorWritable.java:55) >>>> at >>> > org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751) >>>> at >>> org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879) >>>> at >>> > org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95) >>>> at >>> > org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38) >>>> at >>> > com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141) >>>> at >>> > com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136) >>>> at >>> com.google.common.collect.Iterators$5.hasNext(Iterators.java:525) >>>> at >>> > com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43) >>>> at >>> > org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:255) >>>> at >>> > org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209) >>>> at >>> > org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123) >>>> at >>> > org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89) >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native >>>> Method) >>>> at >>> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >>>> at >>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>>> at java.lang.reflect.Method.invoke(Method.java:616) >>>> at >>> > org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68) >>>> at >>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139) >>>> at >>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188) >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native >>>> Method) >>>> at >>> > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >>>> at >>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>>> at java.lang.reflect.Method.invoke(Method.java:616) >>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:156) >>>> $ bin/mahout clusterdump --seqFileDir > sensei/clusters/clusters-1 >>> --output image-tag-clusters.txt Running on hadoop, using >>> HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0 >>>> > HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf >>>> MAHOUT-JOB: >>> > /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar >>>> 11/07/22 00:19:04 INFO common.AbstractJob: Command line > arguments: >>> {--dictionaryType=text, --endPhase=2147483647, >>> --output=image-tag-clusters.txt, >>> --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, > --tempDir=temp} >>>> 11/07/22 00:19:13 INFO driver.MahoutDriver: Program took 9504 > ms >>>> >>>> >>>> k=50, fkmeans shows exception after map 100% reduce 0%, and would > retry >>>> (map >>> 0% reduce 0%) after the exception >>>> >>>> >>>> $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output >>> sensei/clusters --clusters sensei/clusters/clusters-0 --clustering >>> --overwrite >>> --emitMostLikely false --numClusters 50 --maxIter 10 --m 5 >>>> Running on hadoop, using >>> HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0 >>>> > HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf >>>> MAHOUT-JOB: >>> > /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar >>>> 11/07/22 00:21:07 INFO common.AbstractJob: Command line > arguments: >>> {--clustering=null, --clusters=sensei/clusters/clusters-0, >>> --convergenceDelta=0.5, >>> > --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, >>> >>> --emitMostLikely=false, --endPhase=2147483647, >>> --input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, >>> --method=mapreduce, >>> --numClusters=50, --output=sensei/clusters, --overwrite=null, >>> --startPhase=0, >>> --tempDir=temp, --threshold=0} >>>> 11/07/22 00:21:09 INFO common.HadoopUtil: Deleting > sensei/clusters >>>> 11/07/22 00:21:09 INFO util.NativeCodeLoader: Loaded the >>>> native-hadoop >>> library >>>> 11/07/22 00:21:09 INFO zlib.ZlibFactory: Successfully loaded > & >>> initialized native-zlib library >>>> 11/07/22 00:21:09 INFO compress.CodecPool: Got brand-new > compressor >>>> 11/07/22 00:21:10 INFO compress.CodecPool: Got brand-new > decompressor >>>> 11/07/22 00:21:21 INFO kmeans.RandomSeedGenerator: Wrote 50 > vectors >>>> to >>> sensei/clusters/clusters-0/part-randomSeed >>>> 11/07/22 00:21:24 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy > K-Means >>> Iteration 1 >>>> 11/07/22 00:21:25 INFO input.FileInputFormat: Total input paths > to >>> process : 1 >>>> 11/07/22 00:21:26 INFO mapred.JobClient: Running job: >>> job_201107211512_0029 >>>> 11/07/22 00:21:27 INFO mapred.JobClient: map 0% reduce 0% >>>> 11/07/22 00:22:08 INFO mapred.JobClient: map 1% reduce 0% >>>> 11/07/22 00:22:20 INFO mapred.JobClient: map 2% reduce 0% >>>> 11/07/22 00:22:33 INFO mapred.JobClient: map 3% reduce 0% >>>> 11/07/22 00:22:42 INFO mapred.JobClient: map 4% reduce 0% >>>> 11/07/22 00:22:50 INFO mapred.JobClient: map 5% reduce 0% >>>> 11/07/22 00:23:00 INFO mapred.JobClient: map 6% reduce 0% >>>> 11/07/22 00:23:09 INFO mapred.JobClient: map 7% reduce 0% >>>> 11/07/22 00:23:18 INFO mapred.JobClient: map 8% reduce 0% >>>> 11/07/22 
00:23:27 INFO mapred.JobClient: map 9% reduce 0% >>>> 11/07/22 00:23:33 INFO mapred.JobClient: map 10% reduce 0% >>>> 11/07/22 00:23:42 INFO mapred.JobClient: map 11% reduce 0% >>>> 11/07/22 00:23:45 INFO mapred.JobClient: map 12% reduce 0% >>>> 11/07/22 00:23:54 INFO mapred.JobClient: map 13% reduce 0% >>>> 11/07/22 00:24:03 INFO mapred.JobClient: map 14% reduce 0% >>>> 11/07/22 00:24:09 INFO mapred.JobClient: map 15% reduce 0% >>>> 11/07/22 00:24:15 INFO mapred.JobClient: map 16% reduce 0% >>>> 11/07/22 00:24:24 INFO mapred.JobClient: map 17% reduce 0% >>>> 11/07/22 00:24:30 INFO mapred.JobClient: map 18% reduce 0% >>>> 11/07/22 00:24:42 INFO mapred.JobClient: map 19% reduce 0% >>>> 11/07/22 00:24:51 INFO mapred.JobClient: map 20% reduce 0% >>>> 11/07/22 00:24:57 INFO mapred.JobClient: map 21% reduce 0% >>>> 11/07/22 00:25:06 INFO mapred.JobClient: map 22% reduce 0% >>>> 11/07/22 00:25:09 INFO mapred.JobClient: map 23% reduce 0% >>>> 11/07/22 00:25:19 INFO mapred.JobClient: map 24% reduce 0% >>>> 11/07/22 00:25:25 INFO mapred.JobClient: map 25% reduce 0% >>>> 11/07/22 00:25:31 INFO mapred.JobClient: map 26% reduce 0% >>>> 11/07/22 00:25:37 INFO mapred.JobClient: map 27% reduce 0% >>>> 11/07/22 00:25:43 INFO mapred.JobClient: map 28% reduce 0% >>>> 11/07/22 00:25:51 INFO mapred.JobClient: map 29% reduce 0% >>>> 11/07/22 00:25:58 INFO mapred.JobClient: map 30% reduce 0% >>>> 11/07/22 00:26:04 INFO mapred.JobClient: map 31% reduce 0% >>>> 11/07/22 00:26:10 INFO mapred.JobClient: map 32% reduce 0% >>>> 11/07/22 00:26:19 INFO mapred.JobClient: map 33% reduce 0% >>>> 11/07/22 00:26:25 INFO mapred.JobClient: map 34% reduce 0% >>>> 11/07/22 00:26:34 INFO mapred.JobClient: map 35% reduce 0% >>>> 11/07/22 00:26:40 INFO mapred.JobClient: map 36% reduce 0% >>>> 11/07/22 00:26:49 INFO mapred.JobClient: map 37% reduce 0% >>>> 11/07/22 00:26:55 INFO mapred.JobClient: map 38% reduce 0% >>>> 11/07/22 00:27:04 INFO mapred.JobClient: map 39% reduce 0% >>>> 11/07/22 00:27:14 INFO mapred.JobClient: map 40% reduce 0% >>>> 11/07/22 00:27:23 INFO mapred.JobClient: map 41% reduce 0% >>>> 11/07/22 00:27:28 INFO mapred.JobClient: map 42% reduce 0% >>>> 11/07/22 00:27:34 INFO mapred.JobClient: map 43% reduce 0% >>>> 11/07/22 00:27:40 INFO mapred.JobClient: map 44% reduce 0% >>>> 11/07/22 00:27:49 INFO mapred.JobClient: map 45% reduce 0% >>>> 11/07/22 00:27:56 INFO mapred.JobClient: map 46% reduce 0% >>>> 11/07/22 00:28:05 INFO mapred.JobClient: map 47% reduce 0% >>>> 11/07/22 00:28:11 INFO mapred.JobClient: map 48% reduce 0% >>>> 11/07/22 00:28:20 INFO mapred.JobClient: map 49% reduce 0% >>>> 11/07/22 00:28:26 INFO mapred.JobClient: map 50% reduce 0% >>>> 11/07/22 00:28:35 INFO mapred.JobClient: map 51% reduce 0% >>>> 11/07/22 00:28:41 INFO mapred.JobClient: map 52% reduce 0% >>>> 11/07/22 00:28:47 INFO mapred.JobClient: map 53% reduce 0% >>>> 11/07/22 00:28:53 INFO mapred.JobClient: map 54% reduce 0% >>>> 11/07/22 00:29:02 INFO mapred.JobClient: map 55% reduce 0% >>>> 11/07/22 00:29:08 INFO mapred.JobClient: map 56% reduce 0% >>>> 11/07/22 00:29:17 INFO mapred.JobClient: map 57% reduce 0% >>>> 11/07/22 00:29:26 INFO mapred.JobClient: map 58% reduce 0% >>>> 11/07/22 00:29:32 INFO mapred.JobClient: map 59% reduce 0% >>>> 11/07/22 00:29:41 INFO mapred.JobClient: map 60% reduce 0% >>>> 11/07/22 00:29:50 INFO mapred.JobClient: map 61% reduce 0% >>>> 11/07/22 00:29:53 INFO mapred.JobClient: map 62% reduce 0% >>>> 11/07/22 00:29:59 INFO mapred.JobClient: map 63% reduce 0% >>>> 11/07/22 00:30:09 INFO mapred.JobClient: map 
64% reduce 0% >>>> 11/07/22 00:30:15 INFO mapred.JobClient: map 65% reduce 0% >>>> 11/07/22 00:30:23 INFO mapred.JobClient: map 66% reduce 0% >>>> 11/07/22 00:30:35 INFO mapred.JobClient: map 67% reduce 0% >>>> 11/07/22 00:30:41 INFO mapred.JobClient: map 68% reduce 0% >>>> 11/07/22 00:30:50 INFO mapred.JobClient: map 69% reduce 0% >>>> 11/07/22 00:30:56 INFO mapred.JobClient: map 70% reduce 0% >>>> 11/07/22 00:31:05 INFO mapred.JobClient: map 71% reduce 0% >>>> 11/07/22 00:31:15 INFO mapred.JobClient: map 72% reduce 0% >>>> 11/07/22 00:31:24 INFO mapred.JobClient: map 73% reduce 0% >>>> 11/07/22 00:31:30 INFO mapred.JobClient: map 74% reduce 0% >>>> 11/07/22 00:31:39 INFO mapred.JobClient: map 75% reduce 0% >>>> 11/07/22 00:31:42 INFO mapred.JobClient: map 76% reduce 0% >>>> 11/07/22 00:31:50 INFO mapred.JobClient: map 77% reduce 0% >>>> 11/07/22 00:31:59 INFO mapred.JobClient: map 78% reduce 0% >>>> 11/07/22 00:32:11 INFO mapred.JobClient: map 79% reduce 0% >>>> 11/07/22 00:32:28 INFO mapred.JobClient: map 80% reduce 0% >>>> 11/07/22 00:32:37 INFO mapred.JobClient: map 81% reduce 0% >>>> 11/07/22 00:32:40 INFO mapred.JobClient: map 82% reduce 0% >>>> 11/07/22 00:32:49 INFO mapred.JobClient: map 83% reduce 0% >>>> 11/07/22 00:32:58 INFO mapred.JobClient: map 84% reduce 0% >>>> 11/07/22 00:33:04 INFO mapred.JobClient: map 85% reduce 0% >>>> 11/07/22 00:33:13 INFO mapred.JobClient: map 86% reduce 0% >>>> 11/07/22 00:33:19 INFO mapred.JobClient: map 87% reduce 0% >>>> 11/07/22 00:33:32 INFO mapred.JobClient: map 88% reduce 0% >>>> 11/07/22 00:33:38 INFO mapred.JobClient: map 89% reduce 0% >>>> 11/07/22 00:33:47 INFO mapred.JobClient: map 90% reduce 0% >>>> 11/07/22 00:33:52 INFO mapred.JobClient: map 91% reduce 0% >>>> 11/07/22 00:34:01 INFO mapred.JobClient: map 92% reduce 0% >>>> 11/07/22 00:34:10 INFO mapred.JobClient: map 93% reduce 0% >>>> 11/07/22 00:34:13 INFO mapred.JobClient: map 94% reduce 0% >>>> 11/07/22 00:34:25 INFO mapred.JobClient: map 95% reduce 0% >>>> 11/07/22 00:34:31 INFO mapred.JobClient: map 96% reduce 0% >>>> 11/07/22 00:34:40 INFO mapred.JobClient: map 97% reduce 0% >>>> 11/07/22 00:34:47 INFO mapred.JobClient: map 98% reduce 0% >>>> 11/07/22 00:34:56 INFO mapred.JobClient: map 99% reduce 0% >>>> 11/07/22 00:35:02 INFO mapred.JobClient: map 100% reduce 0% >>>> 11/07/22 00:35:07 INFO mapred.JobClient: Task Id : >>> attempt_201107211512_0029_m_000000_0, Status : FAILED >>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could > not find >>>> >>> any valid local directory for output/file.out >>>> at >>> > org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381) >>>> at >>> > org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146) >>>> at >>> > org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127) >>>> at >>> > org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69) >>>> at >>> > org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639) >>>> at >>> > org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322) >>>> at >>> > org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698) >>>> at >>> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765) >>>> at > org.apache.hadoop.mapred.MapTask.run(MapTask.java:369) >>>> at org.apache.hadoop.mapred.Child$4.run(Child.java:259) >>>> at java.security.AccessController.doPrivileged(Native > Method) >>>> at 
javax.security.auth.Subject.doAs(Subject.java:416)
>>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>>>         at org.apache.hadoop.mapred.Child.main(Child.java:253)
>>>>
>>>> 11/07/22 00:35:09 INFO mapred.JobClient:  map 0% reduce 0%
>>>> ...
>>>>
>>>> Q: What are the Hadoop heap settings you are using for your job?
>>>> I am new to hadoop, not sure where to get those, but got these from localhost:50070, is it right?
>>>> 147 files and directories, 60 blocks = 207 total. Heap Size is 31.57 MB / 966.69 MB (3%)
>>>>
>>>> p/s: i keep forgetting to include my operating environment, sorry. I basically run this in a guest operating system (in a VirtualBox virtual machine), assigned 1 CPU core and 1.5GB of memory. Then the host operating system is OS X 10.6.8 running on alubook (MacBook late 2008 model) with 4GB of memory.
>>>>
>>>> $ cat /etc/*-release
>>>> DISTRIB_ID=Ubuntu
>>>> DISTRIB_RELEASE=11.04
>>>> DISTRIB_CODENAME=natty
>>>> DISTRIB_DESCRIPTION="Ubuntu 11.04"
>>>> $ uname -a
>>>> Linux sensei 2.6.38-10-generic #46-Ubuntu SMP Tue Jun 28 15:05:41 UTC 2011 i686 i686 i386 GNU/Linux
>>>>
>>>> Best wishes,
>>>> Jeffrey04
>>>>
>>>>> ________________________________
>>>>> From: Jeff Eastman <[email protected]>
>>>>> To: "[email protected]" <[email protected]>; Jeffrey <[email protected]>
>>>>> Sent: Thursday, July 21, 2011 11:54 PM
>>>>> Subject: RE: fkmeans or Cluster Dumper not working?
>>>>>
>>>>> Excellent, so this appears to be localized to fuzzyk. Unfortunately, the Apache mail server strips off attachments, so you'd need another mechanism (a JIRA?) to upload your data if it is not too large. Some more questions in the interim:
>>>>>
>>>>> - What is the cardinality of your vector data?
>>>>> - Is it sparse or dense?
>>>>> - How many vectors are you trying to cluster?
>>>>> - What is the exact error you see when fkmeans fails with k=10? With k=50?
>>>>> - What are the Hadoop heap settings you are using for your job?
>>>>>
>>>>> -----Original Message-----
>>>>> From: Jeffrey [mailto:[email protected]]
>>>>> Sent: Thursday, July 21, 2011 11:21 AM
>>>>> To: [email protected]
>>>>> Subject: Re: fkmeans or Cluster Dumper not working?
>>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> Q: Did you change your invocation to specify a different -c directory (e.g. clusters-0)?
>>>>> A: Yes :)
>>>>>
>>>>> Q: Did you add the -cl argument?
>>>>> A: Yes :)
>>>>>
>>>>> $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 5 --maxIter 10 --m 5
>>>>> $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>>>>> $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5
>>>>>
>>>>> Q: What is the new CLI invocation for clusterdump?
>>>>> A:
>>>>> $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-4 --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt
>>>>>
>>>>> Q: Did this work for -k 10? What happens with -k 50?
>>>>> A: works for k=5 (but i don't see the points), but not k=10; fkmeans fails when k=50, so i can't dump when k=50
>>>>>
>>>>> Q: Have you tried kmeans?
>>>>> A: Yes (all tested on 0.6-snapshot)
>>>>>
>>>>> k=5: no problem :)
>>>>> k=10: no problem :)
>>>>> k=50: no problem :)
>>>>>
>>>>> p/s: attached is the test data i used (in mvc format), let me know if you guys prefer raw data in arff format
>>>>>
>>>>> Best wishes,
>>>>> Jeffrey04
>>>>>
>>>>>> ________________________________
>>>>>> From: Jeff Eastman <[email protected]>
>>>>>> To: "[email protected]" <[email protected]>; Jeffrey <[email protected]>
>>>>>> Sent: Thursday, July 21, 2011 9:36 PM
>>>>>> Subject: RE: fkmeans or Cluster Dumper not working?
>>>>>>
>>>>>> You are correct, the wiki for fkmeans did not mention the -cl argument. I've added that just now. I think this is what Frank means in his comment, but you do *not* have to write any custom code to get the cluster dumper to do what you want; just use the -cl argument and specify clusteredPoints as the -p input to clusterdump.
>>>>>>
>>>>>> Check out TestClusterDumper.testKmeans and .testFuzzyKmeans. These show how to invoke the clustering and cluster dumper from Java at least.
>>>>>>
>>>>>> Did you change your invocation to specify a different -c directory (e.g. clusters-0)?
>>>>>> Did you add the -cl argument?
>>>>>> What is the new CLI invocation for clusterdump?
>>>>>> Did this work for -k 10? What happens with -k 50?
>>>>>> Have you tried kmeans?
>>>>>>
>>>>>> I can help you better if you will give me answers to my questions.
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Jeffrey [mailto:[email protected]]
>>>>>> Sent: Thursday, July 21, 2011 4:30 AM
>>>>>> To: [email protected]
>>>>>> Subject: Re: fkmeans or Cluster Dumper not working?
>>>>>>
>>>>>> Hi again,
>>>>>>
>>>>>> Let me update on what's working and what's not working.
>>>>>>
>>>>>> Works:
>>>>>> fkmeans clustering (10 clusters) - thanks Jeff for the --cl tip
>>>>>> fkmeans clustering (5 clusters)
>>>>>> clusterdump (5 clusters) - so points are not included in the clusterdump and I need to write a program for it?
>>>>>>
>>>>>> Not Working:
>>>>>> fkmeans clustering (50 clusters) - same error
>>>>>> clusterdump (10 clusters) - same error
>>>>>>
>>>>>> so it seems that, to attach points to the cluster dumper output like the synthetic control example does, i would have to write some code as pointed out by @Frank_Scholten? https://twitter.com/#!/Frank_Scholten/status/93617269296472064
>>>>>>
>>>>>> Best wishes,
>>>>>> Jeffrey04
>>>>>>
>>>>>>> ________________________________
>>>>>>> From: Jeff Eastman <[email protected]>
>>>>>>> To: "[email protected]" <[email protected]>; Jeffrey <[email protected]>
>>>>>>> Sent: Wednesday, July 20, 2011 11:53 PM
>>>>>>> Subject: RE: fkmeans or Cluster Dumper not working?
>>>>>>>
>>>>>>> Hi Jeffrey,
>>>>>>>
>>>>>>> It is always difficult to debug remotely, but here are some suggestions:
>>>>>>> - First, you are specifying both an input clusters directory --clusters and --numClusters clusters, so the job is sampling 10 points from your input data set and writing them to clusteredPoints as the prior clusters for the first iteration. You should pick a different name for this directory, as the clusteredPoints directory is used by the -cl (--clustering) option (which you did not supply) to write out the clustered (classified) input vectors. When you subsequently supplied clusteredPoints to the clusterdumper it was expecting a different format, and that caused the exception you saw. Change your --clusters directory (clusters-0 is good) and add a -cl argument and things should go more smoothly. The -cl option is not the default, so no clustering of the input points is performed without it (many people get caught by this and perhaps the default should be changed, but clustering can be expensive and so it is not performed without request).
>>>>>>> - If you still have problems, try again with k-means. The similarity to fkmeans is good and it will eliminate fkmeans itself if you see the same problems with k-means.
>>>>>>> - I don't see why changing the -k argument from 10 to 50 should cause any problems, unless your vectors are very large and you are getting an OME in the reducer. Since the reducer is calculating centroid vectors for the next iteration, these will become more dense and memory will increase substantially.
>>>>>>> - I can't figure out what might be causing your second exception. It is bombing inside of Hadoop file IO, and this causes me to suspect command argument problems.
>>>>>>>
>>>>>>> Hope this helps,
>>>>>>> Jeff
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Jeffrey [mailto:[email protected]]
>>>>>>> Sent: Wednesday, July 20, 2011 2:41 AM
>>>>>>> To: [email protected]
>>>>>>> Subject: fkmeans or Cluster Dumper not working?
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying to generate clusters using the fkmeans command line tool from my test data.
Not sure if this is correct, as it only runs > one >>> iteration (output from 0.6-snapshot, gotta use some workaround to some >>> weird bug >>> - >>> > http://search.lucidimagination.com/search/document/d95ff0c29ac4a8a7/bug_in_fkmeans >>> >>> ) >>>>>>> >>>>>>> $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc > --output >>> sensei/clusters --clusters sensei/clusteredPoints --maxIter 10 >>> --numClusters 10 >>> --overwrite --m 5 >>>>>>> Running on hadoop, using >>> > HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/confMAHOUT-JOB: >>> >>> > /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar11/07/20 >>> >>> 14:05:18 INFO common.AbstractJob: Command line arguments: >>> {--clusters=sensei/clusteredPoints, --convergenceDelta=0.5, >>> > --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, >>> >>> --emitMostLikely=true, --endPhase=2147483647, >>> --input=sensei/image-tag.arff.mvc, >>> --m=5, --maxIter=10, --method=mapreduce, --numClusters=10, >>> --output=sensei/clusters, --overwrite=null, --startPhase=0, >>> --tempDir=temp, >>> --threshold=0}11/07/20 14:05:20 INFO common.HadoopUtil: Deleting >>> sensei/clusters11/07/20 >>> 14:05:20 INFO common.HadoopUtil: Deleting > sensei/clusteredPoints11/07/20 >>> 14:05:20 INFO util.NativeCodeLoader: Loaded the native-hadoop >>> library11/07/20 >>> 14:05:20 INFO zlib.ZlibFactory: Successfully >>>>>>> loaded & initialized native-zlib library11/07/20 > 14:05:20 >>> INFO compress.CodecPool: Got brand-new compressor11/07/20 14:05:20 INFO >>> compress.CodecPool: Got brand-new decompressor >>>>>>> 11/07/20 14:05:29 INFO kmeans.RandomSeedGenerator: > Wrote 10 >>> vectors to sensei/clusteredPoints/part-randomSeed >>>>>>> 11/07/20 14:05:29 INFO fuzzykmeans.FuzzyKMeansDriver: > Fuzzy >>> K-Means Iteration 1 >>>>>>> 11/07/20 14:05:30 INFO input.FileInputFormat: Total > input paths >>> to process : 1 >>>>>>> 11/07/20 14:05:30 INFO mapred.JobClient: Running job: >>> job_201107201152_0021 >>>>>>> 11/07/20 14:05:31 INFO mapred.JobClient: map 0% reduce > 0% >>>>>>> 11/07/20 14:05:54 INFO mapred.JobClient: map 2% reduce > 0% >>>>>>> 11/07/20 14:05:57 INFO >>> mapred.JobClient: map 5% reduce 0% >>>>>>> 11/07/20 14:06:00 INFO mapred.JobClient: map 6% reduce > 0% >>>>>>> 11/07/20 14:06:03 INFO mapred.JobClient: map 7% reduce > 0% >>>>>>> 11/07/20 14:06:07 INFO mapred.JobClient: map 10% > reduce 0% >>>>>>> 11/07/20 14:06:10 INFO mapred.JobClient: map 13% > reduce 0% >>>>>>> 11/07/20 14:06:13 INFO mapred.JobClient: map 15% > reduce 0% >>>>>>> 11/07/20 14:06:16 INFO mapred.JobClient: map 17% > reduce 0% >>>>>>> 11/07/20 14:06:19 INFO mapred.JobClient: map 19% > reduce 0% >>>>>>> 11/07/20 14:06:22 INFO mapred.JobClient: map 23% > reduce 0% >>>>>>> 11/07/20 14:06:25 INFO mapred.JobClient: map 25% > reduce 0% >>>>>>> 11/07/20 14:06:28 INFO mapred.JobClient: map 27% > reduce 0% >>>>>>> 11/07/20 14:06:31 INFO mapred.JobClient: map 30% > reduce 0% >>>>>>> 11/07/20 14:06:34 INFO mapred.JobClient: map 33% > reduce >>> 0% >>>>>>> 11/07/20 14:06:37 INFO mapred.JobClient: map 36% > reduce 0% >>>>>>> 11/07/20 14:06:40 INFO mapred.JobClient: map 37% > reduce 0% >>>>>>> 11/07/20 14:06:43 INFO mapred.JobClient: map 40% > reduce 0% >>>>>>> 11/07/20 14:06:46 INFO mapred.JobClient: map 43% > reduce 0% >>>>>>> 11/07/20 14:06:49 INFO mapred.JobClient: map 46% > reduce 0% >>>>>>> 11/07/20 14:06:52 INFO mapred.JobClient: map 48% > reduce 0% >>>>>>> 11/07/20 
14:06:55 INFO mapred.JobClient: map 50% > reduce 0% >>>>>>> 11/07/20 14:06:57 INFO mapred.JobClient: map 53% > reduce 0% >>>>>>> 11/07/20 14:07:00 INFO mapred.JobClient: map 56% > reduce 0% >>>>>>> 11/07/20 14:07:03 INFO mapred.JobClient: map 58% > reduce 0% >>>>>>> 11/07/20 14:07:06 INFO mapred.JobClient: map 60% > reduce 0% >>>>>>> 11/07/20 14:07:09 INFO mapred.JobClient: map 63% > reduce 0% >>>>>>> 11/07/20 14:07:13 INFO >>> mapred.JobClient: map 65% reduce 0% >>>>>>> 11/07/20 14:07:16 INFO mapred.JobClient: map 67% > reduce 0% >>>>>>> 11/07/20 14:07:19 INFO mapred.JobClient: map 70% > reduce 0% >>>>>>> 11/07/20 14:07:22 INFO mapred.JobClient: map 73% > reduce 0% >>>>>>> 11/07/20 14:07:25 INFO mapred.JobClient: map 75% > reduce 0% >>>>>>> 11/07/20 14:07:28 INFO mapred.JobClient: map 77% > reduce 0% >>>>>>> 11/07/20 14:07:31 INFO mapred.JobClient: map 80% > reduce 0% >>>>>>> 11/07/20 14:07:34 INFO mapred.JobClient: map 83% > reduce 0% >>>>>>> 11/07/20 14:07:37 INFO mapred.JobClient: map 85% > reduce 0% >>>>>>> 11/07/20 14:07:40 INFO mapred.JobClient: map 87% > reduce 0% >>>>>>> 11/07/20 14:07:43 INFO mapred.JobClient: map 89% > reduce 0% >>>>>>> 11/07/20 14:07:46 INFO mapred.JobClient: map 92% > reduce 0% >>>>>>> 11/07/20 14:07:49 INFO mapred.JobClient: map 95% > reduce >>> 0% >>>>>>> 11/07/20 14:07:55 INFO mapred.JobClient: map 98% > reduce 0% >>>>>>> 11/07/20 14:07:59 INFO mapred.JobClient: map 99% > reduce 0% >>>>>>> 11/07/20 14:08:02 INFO mapred.JobClient: map 100% > reduce 0% >>>>>>> 11/07/20 14:08:23 INFO mapred.JobClient: map 100% > reduce 100% >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Job complete: >>> job_201107201152_0021 >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Counters: 26 >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Job Counters >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Launched > reduce >>> tasks=1 >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: >>> SLOTS_MILLIS_MAPS=149314 >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Total time > spent by >>> all reduces waiting after reserving slots (ms)=0 >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Total time > spent by >>> all maps waiting after >>> reserving slots (ms)=0 >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Launched > map >>> tasks=1 >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Data-local > map >>> tasks=1 >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: >>> SLOTS_MILLIS_REDUCES=15618 >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: File Output > Format >>> Counters >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Bytes >>> Written=2247222 >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Clustering >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Converged >>> Clusters=10 >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: > FileSystemCounters >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: >>> FILE_BYTES_READ=130281382 >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: >>> HDFS_BYTES_READ=254494 >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: >>> FILE_BYTES_WRITTEN=132572666 >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: >>> HDFS_BYTES_WRITTEN=2247222 >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: File Input > Format >>> Counters >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Bytes > Read=247443 >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Map-Reduce > Framework >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Reduce > input >>> groups=10 >>>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Map output >>> materialized bytes=2246233 >>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient: 
Combine > output >>> records=330 >>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient: Map input >>> records=1113 >>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient: Reduce > shuffle >>> bytes=2246233 >>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient: Reduce > output >>> records=10 >>>>>>> 11/07/20 14:08:32 INFO >>> mapred.JobClient: Spilled Records=590 >>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient: Map output >>> bytes=2499995001 >>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient: Combine > input >>> records=11450 >>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient: Map output >>> records=11130 >>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient: > SPLIT_RAW_BYTES=127 >>>>>>> 11/07/20 14:08:32 INFO mapred.JobClient: Reduce > input >>> records=10 >>>>>>> 11/07/20 14:08:32 INFO driver.MahoutDriver: Program > took 194096 >>> ms >>>>>>> >>>>>>> if I increase the --numClusters argument (e.g. 50), > then it will >>> return exception after >>>>>>> 11/07/20 14:08:02 INFO mapred.JobClient: map 100% > reduce 0% >>>>>>> >>>>>>> and would retry again (also reproducible using > 0.6-snapshot) >>>>>>> >>>>>>> ... >>>>>>> 11/07/20 14:22:25 INFO mapred.JobClient: map 100% > reduce >>> 0% >>>>>>> 11/07/20 14:22:30 INFO mapred.JobClient: Task Id : >>> attempt_201107201152_0022_m_000000_0, Status : FAILED >>>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: > Could not >>> find any valid local directory for output/file.out >>>>>>> at >>> > org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381) >>>>>>> at >>> > org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146) >>>>>>> at >>> > org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127) >>>>>>> at >>> > org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69) >>>>>>> at >>> > org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639) >>>>>>> at >>> > org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322) >>>>>>> at >>> > org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698) >>>>>>> at >>> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765) >>>>>>> at >>> org.apache.hadoop.mapred.MapTask.run(MapTask.java:369) >>>>>>> at > org.apache.hadoop.mapred.Child$4.run(Child.java:259) >>>>>>> at > java.security.AccessController.doPrivileged(Native >>> Method) >>>>>>> at > javax.security.auth.Subject.doAs(Subject.java:416) >>>>>>> at >>> > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) >>>>>>> at > org.apache.hadoop.mapred.Child.main(Child.java:253) >>>>>>> >>>>>>> 11/07/20 14:22:32 INFO >>> mapred.JobClient: map 0% reduce 0% >>>>>>> ... 
>>>>>>> >>>>>>> Then I ran cluster dumper to dump information about the >>> clusters, this command would work if I only care about the cluster >>> centroids >>> (both 0.5 release and 0.6-snapshot) >>>>>>> >>>>>>> $ bin/mahout clusterdump --seqFileDir > sensei/clusters/clusters-1 >>> --output image-tag-clusters.txt >>>>>>> Running on hadoop, using >>> HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0 >>>>>>> > HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf >>>>>>> MAHOUT-JOB: >>> > /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar >>>>>>> 11/07/20 14:33:45 INFO common.AbstractJob: Command line >>> arguments: {--dictionaryType=text, --endPhase=2147483647, >>> --output=image-tag-clusters.txt, > --seqFileDir=sensei/clusters/clusters-1, >>> --startPhase=0, --tempDir=temp} >>>>>>> 11/07/20 14:33:56 INFO driver.MahoutDriver: Program > took 11761 >>> ms >>>>>>> >>>>>>> but if I want to see the degree of membership of each > points, I >>> get another exception (yes, reproducible for both 0.5 release and >>> 0.6-snapshot) >>>>>>> >>>>>>> $ bin/mahout clusterdump --seqFileDir > sensei/clusters/clusters-1 >>> --output image-tag-clusters.txt --pointsDir sensei/clusteredPoints >>>>>>> Running on hadoop, using >>> HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0 >>>>>>> > HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf >>>>>>> MAHOUT-JOB: >>> > /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar >>>>>>> 11/07/20 14:35:08 INFO common.AbstractJob: Command line >>> arguments: {--dictionaryType=text, --endPhase=2147483647, >>> --output=image-tag-clusters.txt, --pointsDir=sensei/clusteredPoints, >>> --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, > --tempDir=temp} >>>>>>> 11/07/20 14:35:10 INFO util.NativeCodeLoader: Loaded > the >>> native-hadoop >>> library >>>>>>> 11/07/20 14:35:10 INFO zlib.ZlibFactory: Successfully > loaded >>> & initialized native-zlib library >>>>>>> 11/07/20 14:35:10 INFO compress.CodecPool: Got > brand-new >>> decompressor >>>>>>> Exception in thread "main" >>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast > to >>> org.apache.hadoop.io.IntWritable >>>>>>> at >>> > org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:261) >>>>>>> at >>> > org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209) >>>>>>> at >>> > org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123) >>>>>>> at >>> > org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89) >>>>>>> at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native >>> Method) >>>>>>> >>> at >>> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >>>>>>> at >>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>>>>>> at > java.lang.reflect.Method.invoke(Method.java:616) >>>>>>> at >>> > org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68) >>>>>>> at >>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139) >>>>>>> at >>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188) >>>>>>> at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native >>> Method) >>>>>>> at >>> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >>>>>>> at >>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>>>>>> at > 
java.lang.reflect.Method.invoke(Method.java:616)
>>>>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>>>>
>>>>>>> erm, would writing a short program to call the API (btw, can't seem to find the latest API doc?) be a better choice here? Or did I do anything wrong here (yes, Java is not my main language, and I am very new to Mahout... and Hadoop)?
>>>>>>>
>>>>>>> the data is converted from an arff file with about 1000 rows (resources) and 14k columns (tags), and it is just a subset of my data. (I actually made a mistake, so it is now generating resource clusters instead of tag clusters, but I am just doing this as a proof of concept to see whether Mahout is good enough for the task.)
>>>>>>>
>>>>>>> Best wishes,
>>>>>>> Jeffrey04
>
> --
> Lance Norskog
> [email protected]
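
P.S. Here are the workaround sketches I mentioned at the top. Both are guesses based on this thread, so please correct me if I'm reading it wrong.

1) Give the locally-run clusterdump a bigger heap. The dump OME happens while reading clusteredPoints back as dense vectors (DenseVector shows up in the stack trace), and as far as I understand --emitMostLikely false writes every point once per cluster, so for k=10 that is roughly 1113 points x 10 clusters x 14000 dims x 8 bytes, around 1.2 GB, which is more than the client JVM's heap can hold. I'm not sure whether bin/mahout reads MAHOUT_HEAPSIZE the same way bin/hadoop reads HADOOP_HEAPSIZE, so this is only a sketch; if it doesn't, I guess I can edit the -Xmx / JAVA_HEAP_MAX line in bin/mahout directly:

$ export MAHOUT_HEAPSIZE=2000   # MB; assumed to be honored by bin/mahout, otherwise edit JAVA_HEAP_MAX in the script
$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt

(With only 1.5 GB in the VirtualBox guest that heap won't fit, so I may have to give the VM more memory, or rerun fkmeans with --emitMostLikely true so each point is written only once, roughly 1113 x 14000 x 8 bytes, about 125 MB, if I can live without the full membership matrix.)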
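2) The k=50 fkmeans failure looks to me more like local disk than heap: the DiskChecker error comes from the map-output merge, and even at k=10 the job counters show Map output bytes=2499995001 (about 2.5 GB), so k=50 presumably spills roughly five times that under mapred.local.dir, which is probably more free space than my guest VM has. A sketch of what I plan to check (the property names below are the stock Hadoop 0.20 ones, set in conf/mapred-site.xml):

$ df -h /tmp   # hadoop.tmp.dir / mapred.local.dir default under /tmp/hadoop-${USER} unless overridden
# if free space is the problem, either point mapred.local.dir at a bigger partition,
# or compress the intermediate map output:
#   mapred.compress.map.output = true
#   mapred.map.output.compression.codec = org.apache.hadoop.io.compress.DefaultCodec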
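3) As a lower-memory way to peek at the clustered points (instead of writing a Java program against the API), I might try the generic sequence-file dumper, since it streams entries one at a time rather than loading everything. I'm going from memory on the option name (--seqFile in the 0.5/0.6 tools, I think), so bin/mahout seqdumper --help is the real authority here:

$ bin/mahout seqdumper --seqFile sensei/clusters/clusteredPoints/part-m-00000 | head
# point it at whichever part-* file is in clusteredPoints; keys should be cluster ids
# and values the weighted vectors written by --clustering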
