No worries :)
>________________________________ >From: Jeff Eastman <[email protected]> >To: "[email protected]" <[email protected]>; Jeffrey ><[email protected]> >Sent: Tuesday, July 26, 2011 12:30 AM >Subject: RE: fkmeans or Cluster Dumper not working? > >Sorry, I was traveling over the weekend. I will take a look at your data asap. > >-----Original Message----- >From: Jeffrey [mailto:[email protected]] >Sent: Sunday, July 24, 2011 3:51 AM >To: [email protected] >Subject: Re: fkmeans or Cluster Dumper not working? > >Erm, is there any update? Is the problem reproducible? > >Best wishes, >Jeffrey04 > > > >>________________________________ >>From: Jeffrey <[email protected]> >>To: Jeff Eastman <[email protected]>; "[email protected]" >><[email protected]> >>Sent: Friday, July 22, 2011 12:40 AM >>Subject: Re: fkmeans or Cluster Dumper not working? >> >> >>Hi Jeff, >> >> >>lol, this is probably my last reply before I fall asleep (GMT+8 here). >> >> >>First things first, the data file is here: http://coolsilon.com/image-tag.mvc >> >> >>Q: What is the cardinality of your vector data? >>About 1000+ rows (resources) * 14 000+ columns (tags) >>Q: Is it sparse or dense? >>Sparse (assuming sparse = each vector contains mostly 0) >>Q: How many vectors are you trying to cluster? >>All of them? (1000+ rows) >>Q: What is the exact error you see when fkmeans fails with k=10? With k=50? >>I think I posted the exception when k=50, but will post it again here. >> >> >>With k=10, fkmeans actually works, but the cluster dumper returns an exception; however, >>if I take out --pointsDir, then it works (output looks OK, but without >>all the points). >> >> >> $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output >>sensei/clusters --clusters sensei/clusters/clusters-0 --clustering >>--overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5 >> ... 
>> $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 >>--pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt >>Running on hadoop, using >>HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0 >> HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf >> MAHOUT-JOB: >>/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar >> 11/07/22 00:14:50 INFO common.AbstractJob: Command line arguments: >>{--dictionaryType=text, --endPhase=2147483647, >>--output=image-tag-clusters.txt, --pointsDir=sensei/clusters/clusteredPoints, >>--seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp} >> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space >> at java.lang.Object.clone(Native Method) >> at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:44) >> at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:39) >> at >>org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:94) >> at >>org.apache.mahout.clustering.WeightedVectorWritable.readFields(WeightedVectorWritable.java:55) >> at >>org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751) >> at >>org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879) >> at >>org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95) >> at >>org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38) >> at >>com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141) >> at >>com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136) >> at >>com.google.common.collect.Iterators$5.hasNext(Iterators.java:525) >> at >>com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43) >> at >>org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:255) >> at 
>>org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209) >> at >>org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123) >> at >>org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89) >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >>sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >> at >>sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> at java.lang.reflect.Method.invoke(Method.java:616) >> at >>org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68) >> at >>org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139) >> at >>org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188) >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >>sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >> at >>sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> at java.lang.reflect.Method.invoke(Method.java:616) >> at org.apache.hadoop.util.RunJar.main(RunJar.java:156) >> $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output >>image-tag-clusters.txt Running on hadoop, using >>HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0 >> HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf >> MAHOUT-JOB: >>/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar >> 11/07/22 00:19:04 INFO common.AbstractJob: Command line arguments: >>{--dictionaryType=text, --endPhase=2147483647, >>--output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, >>--startPhase=0, --tempDir=temp} >> 11/07/22 00:19:13 INFO driver.MahoutDriver: Program took 9504 ms >> >> >>k=50, fkmeans shows exception after map 100% reduce 0%, and would retry (map >>0% reduce 0%) after the exception >> >> >> $ bin/mahout fkmeans --input 
sensei/image-tag.arff.mvc --output >>sensei/clusters --clusters sensei/clusters/clusters-0 --clustering >>--overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5 >> Running on hadoop, using >>HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0 >> HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf >> MAHOUT-JOB: >>/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar >> 11/07/22 00:21:07 INFO common.AbstractJob: Command line arguments: >>{--clustering=null, --clusters=sensei/clusters/clusters-0, >>--convergenceDelta=0.5, >>--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, >> --emitMostLikely=false, --endPhase=2147483647, >>--input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, >>--numClusters=50, --output=sensei/clusters, --overwrite=null, --startPhase=0, >>--tempDir=temp, --threshold=0} >> 11/07/22 00:21:09 INFO common.HadoopUtil: Deleting sensei/clusters >> 11/07/22 00:21:09 INFO util.NativeCodeLoader: Loaded the native-hadoop >>library >> 11/07/22 00:21:09 INFO zlib.ZlibFactory: Successfully loaded & >>initialized native-zlib library >> 11/07/22 00:21:09 INFO compress.CodecPool: Got brand-new compressor >> 11/07/22 00:21:10 INFO compress.CodecPool: Got brand-new decompressor >> 11/07/22 00:21:21 INFO kmeans.RandomSeedGenerator: Wrote 50 vectors to >>sensei/clusters/clusters-0/part-randomSeed >> 11/07/22 00:21:24 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means >>Iteration 1 >> 11/07/22 00:21:25 INFO input.FileInputFormat: Total input paths to >>process : 1 >> 11/07/22 00:21:26 INFO mapred.JobClient: Running job: >>job_201107211512_0029 >> 11/07/22 00:21:27 INFO mapred.JobClient: map 0% reduce 0% >> 11/07/22 00:22:08 INFO mapred.JobClient: map 1% reduce 0% >> 11/07/22 00:22:20 INFO mapred.JobClient: map 2% reduce 0% >> 11/07/22 00:22:33 INFO mapred.JobClient: map 3% reduce 0% >> 11/07/22 00:22:42 INFO mapred.JobClient: map 4% reduce 0% 
>> 11/07/22 00:22:50 INFO mapred.JobClient: map 5% reduce 0% >> 11/07/22 00:23:00 INFO mapred.JobClient: map 6% reduce 0% >> 11/07/22 00:23:09 INFO mapred.JobClient: map 7% reduce 0% >> 11/07/22 00:23:18 INFO mapred.JobClient: map 8% reduce 0% >> 11/07/22 00:23:27 INFO mapred.JobClient: map 9% reduce 0% >> 11/07/22 00:23:33 INFO mapred.JobClient: map 10% reduce 0% >> 11/07/22 00:23:42 INFO mapred.JobClient: map 11% reduce 0% >> 11/07/22 00:23:45 INFO mapred.JobClient: map 12% reduce 0% >> 11/07/22 00:23:54 INFO mapred.JobClient: map 13% reduce 0% >> 11/07/22 00:24:03 INFO mapred.JobClient: map 14% reduce 0% >> 11/07/22 00:24:09 INFO mapred.JobClient: map 15% reduce 0% >> 11/07/22 00:24:15 INFO mapred.JobClient: map 16% reduce 0% >> 11/07/22 00:24:24 INFO mapred.JobClient: map 17% reduce 0% >> 11/07/22 00:24:30 INFO mapred.JobClient: map 18% reduce 0% >> 11/07/22 00:24:42 INFO mapred.JobClient: map 19% reduce 0% >> 11/07/22 00:24:51 INFO mapred.JobClient: map 20% reduce 0% >> 11/07/22 00:24:57 INFO mapred.JobClient: map 21% reduce 0% >> 11/07/22 00:25:06 INFO mapred.JobClient: map 22% reduce 0% >> 11/07/22 00:25:09 INFO mapred.JobClient: map 23% reduce 0% >> 11/07/22 00:25:19 INFO mapred.JobClient: map 24% reduce 0% >> 11/07/22 00:25:25 INFO mapred.JobClient: map 25% reduce 0% >> 11/07/22 00:25:31 INFO mapred.JobClient: map 26% reduce 0% >> 11/07/22 00:25:37 INFO mapred.JobClient: map 27% reduce 0% >> 11/07/22 00:25:43 INFO mapred.JobClient: map 28% reduce 0% >> 11/07/22 00:25:51 INFO mapred.JobClient: map 29% reduce 0% >> 11/07/22 00:25:58 INFO mapred.JobClient: map 30% reduce 0% >> 11/07/22 00:26:04 INFO mapred.JobClient: map 31% reduce 0% >> 11/07/22 00:26:10 INFO mapred.JobClient: map 32% reduce 0% >> 11/07/22 00:26:19 INFO mapred.JobClient: map 33% reduce 0% >> 11/07/22 00:26:25 INFO mapred.JobClient: map 34% reduce 0% >> 11/07/22 00:26:34 INFO mapred.JobClient: map 35% reduce 0% >> 11/07/22 00:26:40 INFO mapred.JobClient: map 36% reduce 0% >> 11/07/22 00:26:49 
INFO mapred.JobClient: map 37% reduce 0% >> 11/07/22 00:26:55 INFO mapred.JobClient: map 38% reduce 0% >> 11/07/22 00:27:04 INFO mapred.JobClient: map 39% reduce 0% >> 11/07/22 00:27:14 INFO mapred.JobClient: map 40% reduce 0% >> 11/07/22 00:27:23 INFO mapred.JobClient: map 41% reduce 0% >> 11/07/22 00:27:28 INFO mapred.JobClient: map 42% reduce 0% >> 11/07/22 00:27:34 INFO mapred.JobClient: map 43% reduce 0% >> 11/07/22 00:27:40 INFO mapred.JobClient: map 44% reduce 0% >> 11/07/22 00:27:49 INFO mapred.JobClient: map 45% reduce 0% >> 11/07/22 00:27:56 INFO mapred.JobClient: map 46% reduce 0% >> 11/07/22 00:28:05 INFO mapred.JobClient: map 47% reduce 0% >> 11/07/22 00:28:11 INFO mapred.JobClient: map 48% reduce 0% >> 11/07/22 00:28:20 INFO mapred.JobClient: map 49% reduce 0% >> 11/07/22 00:28:26 INFO mapred.JobClient: map 50% reduce 0% >> 11/07/22 00:28:35 INFO mapred.JobClient: map 51% reduce 0% >> 11/07/22 00:28:41 INFO mapred.JobClient: map 52% reduce 0% >> 11/07/22 00:28:47 INFO mapred.JobClient: map 53% reduce 0% >> 11/07/22 00:28:53 INFO mapred.JobClient: map 54% reduce 0% >> 11/07/22 00:29:02 INFO mapred.JobClient: map 55% reduce 0% >> 11/07/22 00:29:08 INFO mapred.JobClient: map 56% reduce 0% >> 11/07/22 00:29:17 INFO mapred.JobClient: map 57% reduce 0% >> 11/07/22 00:29:26 INFO mapred.JobClient: map 58% reduce 0% >> 11/07/22 00:29:32 INFO mapred.JobClient: map 59% reduce 0% >> 11/07/22 00:29:41 INFO mapred.JobClient: map 60% reduce 0% >> 11/07/22 00:29:50 INFO mapred.JobClient: map 61% reduce 0% >> 11/07/22 00:29:53 INFO mapred.JobClient: map 62% reduce 0% >> 11/07/22 00:29:59 INFO mapred.JobClient: map 63% reduce 0% >> 11/07/22 00:30:09 INFO mapred.JobClient: map 64% reduce 0% >> 11/07/22 00:30:15 INFO mapred.JobClient: map 65% reduce 0% >> 11/07/22 00:30:23 INFO mapred.JobClient: map 66% reduce 0% >> 11/07/22 00:30:35 INFO mapred.JobClient: map 67% reduce 0% >> 11/07/22 00:30:41 INFO mapred.JobClient: map 68% reduce 0% >> 11/07/22 00:30:50 INFO 
mapred.JobClient: map 69% reduce 0% >> 11/07/22 00:30:56 INFO mapred.JobClient: map 70% reduce 0% >> 11/07/22 00:31:05 INFO mapred.JobClient: map 71% reduce 0% >> 11/07/22 00:31:15 INFO mapred.JobClient: map 72% reduce 0% >> 11/07/22 00:31:24 INFO mapred.JobClient: map 73% reduce 0% >> 11/07/22 00:31:30 INFO mapred.JobClient: map 74% reduce 0% >> 11/07/22 00:31:39 INFO mapred.JobClient: map 75% reduce 0% >> 11/07/22 00:31:42 INFO mapred.JobClient: map 76% reduce 0% >> 11/07/22 00:31:50 INFO mapred.JobClient: map 77% reduce 0% >> 11/07/22 00:31:59 INFO mapred.JobClient: map 78% reduce 0% >> 11/07/22 00:32:11 INFO mapred.JobClient: map 79% reduce 0% >> 11/07/22 00:32:28 INFO mapred.JobClient: map 80% reduce 0% >> 11/07/22 00:32:37 INFO mapred.JobClient: map 81% reduce 0% >> 11/07/22 00:32:40 INFO mapred.JobClient: map 82% reduce 0% >> 11/07/22 00:32:49 INFO mapred.JobClient: map 83% reduce 0% >> 11/07/22 00:32:58 INFO mapred.JobClient: map 84% reduce 0% >> 11/07/22 00:33:04 INFO mapred.JobClient: map 85% reduce 0% >> 11/07/22 00:33:13 INFO mapred.JobClient: map 86% reduce 0% >> 11/07/22 00:33:19 INFO mapred.JobClient: map 87% reduce 0% >> 11/07/22 00:33:32 INFO mapred.JobClient: map 88% reduce 0% >> 11/07/22 00:33:38 INFO mapred.JobClient: map 89% reduce 0% >> 11/07/22 00:33:47 INFO mapred.JobClient: map 90% reduce 0% >> 11/07/22 00:33:52 INFO mapred.JobClient: map 91% reduce 0% >> 11/07/22 00:34:01 INFO mapred.JobClient: map 92% reduce 0% >> 11/07/22 00:34:10 INFO mapred.JobClient: map 93% reduce 0% >> 11/07/22 00:34:13 INFO mapred.JobClient: map 94% reduce 0% >> 11/07/22 00:34:25 INFO mapred.JobClient: map 95% reduce 0% >> 11/07/22 00:34:31 INFO mapred.JobClient: map 96% reduce 0% >> 11/07/22 00:34:40 INFO mapred.JobClient: map 97% reduce 0% >> 11/07/22 00:34:47 INFO mapred.JobClient: map 98% reduce 0% >> 11/07/22 00:34:56 INFO mapred.JobClient: map 99% reduce 0% >> 11/07/22 00:35:02 INFO mapred.JobClient: map 100% reduce 0% >> 11/07/22 00:35:07 INFO 
mapred.JobClient: Task Id : >>attempt_201107211512_0029_m_000000_0, Status : FAILED >> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any >>valid local directory for output/file.out >> at >>org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381) >> at >>org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146) >> at >>org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127) >> at >>org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69) >> at >>org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639) >> at >>org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322) >> at >>org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698) >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765) >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369) >> at org.apache.hadoop.mapred.Child$4.run(Child.java:259) >> at java.security.AccessController.doPrivileged(Native Method) >> at javax.security.auth.Subject.doAs(Subject.java:416) >> at >>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) >> at org.apache.hadoop.mapred.Child.main(Child.java:253) >> >> >> 11/07/22 00:35:09 INFO mapred.JobClient: map 0% reduce 0% >> ... >> >> >>Q: What are the Hadoop heap settings you are using for your job? >>I am new to Hadoop and not sure where to get those, but I got these from >>localhost:50070; is this right? >>147 files and directories, 60 blocks = 207 total. Heap Size is 31.57 MB / >>966.69 MB (3%) >> >> >>P.S. I keep forgetting to include my operating environment, sorry. I >>basically run this in a guest operating system (in a VirtualBox virtual >>machine) assigned 1 CPU core and 1.5GB of memory. The host operating >>system is OS X 10.6.8, running on an alubook (MacBook, late 2008 model) with 4GB >>of memory. 
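A note on the heap question: the "Heap Size" figure on localhost:50070 is the NameNode web UI reporting its own JVM heap, not the heap available to the job or to the clusterdump driver. Two settings usually matter; the values below are illustrative suggestions, not from the thread:

```shell
# 1) Heap for the local Mahout driver JVM -- clusterdump runs in this JVM,
#    so the OutOfMemoryError in ClusterDumper.readPoints is governed by it:
export MAHOUT_HEAPSIZE=2000      # MB; read by bin/mahout when it builds -Xmx

# 2) Heap for the Hadoop map/reduce child JVMs (tasks default to a small
#    -Xmx200m in Hadoop 0.20.x); set in conf/mapred-site.xml:
#
#    <property>
#      <name>mapred.child.java.opts</name>
#      <value>-Xmx1024m</value>
#    </property>
```

Both values would need to fit inside the 1.5GB guest VM described above.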
>> >> >> $ cat /etc/*-release >> DISTRIB_ID=Ubuntu >> DISTRIB_RELEASE=11.04 >> DISTRIB_CODENAME=natty >> DISTRIB_DESCRIPTION="Ubuntu 11.04" >> $ uname -a >> Linux sensei 2.6.38-10-generic #46-Ubuntu SMP Tue Jun 28 15:05:41 UTC >>2011 i686 i686 i386 GNU/Linux >> >> >>Best wishes, >>Jeffrey04 >> >>>________________________________ >>>From: Jeff Eastman <[email protected]> >>>To: "[email protected]" <[email protected]>; Jeffrey >>><[email protected]> >>>Sent: Thursday, July 21, 2011 11:54 PM >>>Subject: RE: fkmeans or Cluster Dumper not working? >>> >>>Excellent, so this appears to be localized to fuzzyk. Unfortunately, the >>>Apache mail server strips off attachments so you'd need another mechanism (a >>>JIRA?) to upload your data if it is not too large. Some more questions in >>>the interim: >>> >>>- What is the cardinality of your vector data? >>>- Is it sparse or dense? >>>- How many vectors are you trying to cluster? >>>- What is the exact error you see when fkmeans fails with k=10? With k=50? >>>- What are the Hadoop heap settings you are using for your job? >>> >>>-----Original Message----- >>>From: Jeffrey [mailto:[email protected]] >>>Sent: Thursday, July 21, 2011 11:21 AM >>>To: [email protected] >>>Subject: Re: fkmeans or Cluster Dumper not >working? >>> >>>Hi Jeff, >>> >>>Q: Did you change your invocation to specify a different -c directory (e.g. >>>clusters-0)? >>>A: Yes :) >>> >>>Q: Did you add the -cl argument? 
>>>A: Yes :) >>> >>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output >>>sensei/clusters --clusters sensei/clusters/clusters-0 --clustering >>>--overwrite --emitMostLikely false --numClusters 5 --maxIter 10 --m 5 >>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output >>>sensei/clusters --clusters sensei/clusters/clusters-0 --clustering >>>--overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5 >>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output >>>sensei/clusters --clusters sensei/clusters/clusters-0 --clustering >>>--overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5 >>> >>>Q: What is the new CLI invocation for clusterdump? >>>A: >>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-4 --pointsDir >sensei/clusters/clusteredPoints --output image-tag-clusters.txt >>> >>> >>>Q: Did this work for -k 10? What happens with -k 50? >>>A: works for k=5 (but i don't see the points), but not k=10, fkmeans fails >>>when k=50, so i can't dump when k=50 >>> >>>Q: Have you tried kmeans? >>>A: Yes (all tested on 0.6-snapshot) >>> >>>k=5: no problem :) >>>k=10: no problem :) >>>k=50: no problem :) >>> >>>p/s: attached with the test data i used (in mvc format), let me know if you >>>guys prefer raw data in arff format >>> >>>Best wishes, >>>Jeffrey04 >>> >>> >>> >>>>________________________________ >>>>From: Jeff Eastman <[email protected]> >>>>To: "[email protected]" <[email protected]>; Jeffrey >>>><[email protected]> >>>>Sent: Thursday, July 21, 2011 9:36 PM >>>>Subject: RE: fkmeans or Cluster Dumper not working? >>>> >>>>You are correct, the wiki for fkmeans did not mention the -cl argument. >>>>I've added that just now. I think this is what Frank means in his comment >>>>but you do *not* have to write any custom code to get the cluster dumper to >>>>do what you want, just use the -cl argument and specify clusteredPoints as >>>>the -p input to clusterdump. 
>>>> >>>>Check out TestClusterDumper.testKmeans and .testFuzzyKmeans. These show how >>>>to invoke the clustering and cluster dumper from Java, at least. >>>> >>>>Did you change your invocation to specify a different -c directory (e.g. >>>>clusters-0)? >>>>Did you add the -cl argument? >>>>What is the new CLI invocation for clusterdump? >>>>Did this work for -k 10? What happens with -k 50? >>>>Have you tried kmeans? >>>> >>>>I can help you better if you will give me answers to my questions. >>>> >>>>-----Original Message----- >>>>From: Jeffrey [mailto:[email protected]] >>>>Sent: Thursday, July 21, 2011 4:30 AM >>>>To: [email protected] >>>>Subject: Re: fkmeans or Cluster Dumper not working? >>>> >>>>Hi again, >>>> >>>>Let me update on what's working and what's not working. >>>> >>>>Works: >>>>fkmeans clustering (10 clusters) - thanks Jeff for the --cl tip >>>>fkmeans clustering (5 clusters) >>>>clusterdump (5 clusters) - so points are not included in the clusterdump, >>>>and I need to write a program for it? >>>> >>>>Not Working: >>>>fkmeans clustering (50 clusters) - same error >>>>clusterdump (10 clusters) - same error >>>> >>>> >>>>So it seems that, to attach points to the cluster dumper output like the >>>>synthetic control example does, I would have to write some code, as pointed >>>>out by @Frank_Scholten? >>>>https://twitter.com/#!/Frank_Scholten/status/93617269296472064 >>>> >>>>Best wishes, >>>>Jeffrey04 >>>> >>>>>________________________________ >>>>>From: Jeff Eastman <[email protected]> >>>>>To: "[email protected]" <[email protected]>; Jeffrey >>>>><[email protected]> >>>>>Sent: Wednesday, July 20, 2011 11:53 PM >>>>>Subject: RE: fkmeans or Cluster Dumper not working? 
>>>>> >>>>>Hi Jeffrey, >>>>> >>>>>It is always difficult to debug remotely, but here are some suggestions: >>>>>- First, you are specifying both an input clusters directory --clusters >>>>>and --numClusters clusters, so the job is sampling 10 points from your >>>>>input data set and writing them to clusteredPoints as the prior clusters >>>>>for the first iteration. You should pick a different name for this >>>>>directory, as the clusteredPoints directory is used by the -cl >>>>>(--clustering) option (which you did not supply) to write out the >>>>>clustered (classified) input vectors. When you subsequently supplied >>>>>clusteredPoints to the clusterdumper it was expecting a different format, >>>>>and that caused the exception you saw. Change your --clusters directory >>>>>(clusters-0 is good) and add a -cl argument and things should go more smoothly. The -cl option is not the default, so no clustering of the input points is performed without it (many people get caught by this, and perhaps the default should be changed, but clustering can be expensive and so it is not performed without request). >>>>>- If you still have problems, try again with k-means. The similarity to >>>>>fkmeans is good, and it will eliminate fkmeans itself if you see the same >>>>>problems with k-means. >>>>>- I don't see why changing the -k argument from 10 to 50 should cause any >>>>>problems, unless your vectors are very large and you are getting an OOM error in >>>>>the reducer. Since the reducer is calculating centroid vectors for the >>>>>next iteration, these will become more dense and memory will increase >>>>>substantially. >>>>>- I can't figure out what might be causing your second exception. It is >>>>>bombing inside of Hadoop file IO, and this causes me to suspect command >>>>>argument problems. 
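Jeff's memory point can be given rough numbers from the k=10 job counters quoted later in this thread (Map input records=1113, Map output records=11130, i.e. one record per point per cluster, and Map output bytes=2499995001). The sketch below is a back-of-envelope model only; the "two dense vectors of doubles per record" factor is an assumption chosen to match the logged byte count, not Mahout's documented wire format:

```python
# Order-of-magnitude model of fuzzy k-means map output volume.
def map_output_bytes(rows, k, dims, vectors_per_record=2, bytes_per_double=8):
    # Each input row emits one observation per cluster; assume each
    # observation carries roughly two dense vectors of `dims` doubles.
    return rows * k * dims * vectors_per_record * bytes_per_double

k10 = map_output_bytes(1113, 10, 14000)  # ~2.49e9, close to the logged 2499995001
k50 = map_output_bytes(1113, 50, 14000)  # ~1.25e10: five times the local spill space
```

If that model is roughly right, k=50 needs on the order of 12GB of temporary space under mapred.local.dir for the map-side spill/merge, which a small VirtualBox guest disk can easily lack. That would plausibly match the "Could not find any valid local directory for output/file.out" DiskErrorException, rather than a code bug.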
>>>>> >>>>>Hope this helps, >>>>>Jeff >>>>> >>>>> >>>>>-----Original Message----- >>>>>From: Jeffrey [mailto:[email protected]] >>>>>Sent: Wednesday, July 20, 2011 2:41 AM >>>>>To: [email protected] >>>>>Subject: fkmeans or Cluster Dumper not working? >>>>> >>>>>Hi, >>>>> >>>>>I am trying to generate clusters using the fkmeans command line tool from >>>>>my test data. Not sure if this is correct, as it only runs one iteration >>>>>(output from 0.6-snapshot, gotta use some workaround to some weird bug - >>>>>http://search.lucidimagination.com/search/document/d95ff0c29ac4a8a7/bug_in_fkmeans >>>>> ) >>>>> >>>>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output >>>>>sensei/clusters --clusters sensei/clusteredPoints --maxIter 10 >>>>>--numClusters 10 --overwrite --m 5 >>>>>Running on hadoop, using >>>>>HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/confMAHOUT-JOB: >>>>> >>>>>/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar11/07/20 >>>>> 14:05:18 INFO common.AbstractJob: Command line arguments: >>>>>{--clusters=sensei/clusteredPoints, --convergenceDelta=0.5, >>>>>--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, >>>>> --emitMostLikely=true, --endPhase=2147483647, >>>>>--input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, >>>>>--method=mapreduce, --numClusters=10, --output=sensei/clusters, >>>>>--overwrite=null, --startPhase=0, --tempDir=temp, --threshold=0}11/07/20 >>>>>14:05:20 INFO common.HadoopUtil: Deleting sensei/clusters11/07/20 >14:05:20 INFO common.HadoopUtil: Deleting sensei/clusteredPoints11/07/20 >14:05:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library11/07/20 >14:05:20 INFO zlib.ZlibFactory: Successfully >>>>>loaded & initialized native-zlib library11/07/20 14:05:20 INFO >>>>>compress.CodecPool: Got brand-new compressor11/07/20 14:05:20 INFO >>>>>compress.CodecPool: Got brand-new 
decompressor >>>>>11/07/20 14:05:29 INFO kmeans.RandomSeedGenerator: Wrote 10 vectors to >>>>>sensei/clusteredPoints/part-randomSeed >>>>>11/07/20 14:05:29 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means >>>>>Iteration 1 >>>>>11/07/20 14:05:30 INFO input.FileInputFormat: Total input paths to process >>>>>: 1 >>>>>11/07/20 14:05:30 INFO mapred.JobClient: Running job: job_201107201152_0021 >>>>>11/07/20 14:05:31 INFO mapred.JobClient: map 0% reduce 0% >>>>>11/07/20 14:05:54 INFO mapred.JobClient: map 2% reduce 0% >>>>>11/07/20 14:05:57 INFO >mapred.JobClient: map 5% reduce 0% >>>>>11/07/20 14:06:00 INFO mapred.JobClient: map 6% reduce 0% >>>>>11/07/20 14:06:03 INFO mapred.JobClient: map 7% reduce 0% >>>>>11/07/20 14:06:07 INFO mapred.JobClient: map 10% reduce 0% >>>>>11/07/20 14:06:10 INFO mapred.JobClient: map 13% reduce 0% >>>>>11/07/20 14:06:13 INFO mapred.JobClient: map 15% reduce 0% >>>>>11/07/20 14:06:16 INFO mapred.JobClient: map 17% reduce 0% >>>>>11/07/20 14:06:19 INFO mapred.JobClient: map 19% reduce 0% >>>>>11/07/20 14:06:22 INFO mapred.JobClient: map 23% reduce 0% >>>>>11/07/20 14:06:25 INFO mapred.JobClient: map 25% reduce 0% >>>>>11/07/20 14:06:28 INFO mapred.JobClient: map 27% reduce 0% >>>>>11/07/20 14:06:31 INFO mapred.JobClient: map 30% reduce 0% >>>>>11/07/20 14:06:34 INFO mapred.JobClient: map 33% reduce >0% >>>>>11/07/20 14:06:37 INFO mapred.JobClient: map 36% reduce 0% >>>>>11/07/20 14:06:40 INFO mapred.JobClient: map 37% reduce 0% >>>>>11/07/20 14:06:43 INFO mapred.JobClient: map 40% reduce 0% >>>>>11/07/20 14:06:46 INFO mapred.JobClient: map 43% reduce 0% >>>>>11/07/20 14:06:49 INFO mapred.JobClient: map 46% reduce 0% >>>>>11/07/20 14:06:52 INFO mapred.JobClient: map 48% reduce 0% >>>>>11/07/20 14:06:55 INFO mapred.JobClient: map 50% reduce 0% >>>>>11/07/20 14:06:57 INFO mapred.JobClient: map 53% reduce 0% >>>>>11/07/20 14:07:00 INFO mapred.JobClient: map 56% reduce 0% >>>>>11/07/20 14:07:03 INFO mapred.JobClient: map 58% reduce 0% 
>>>>>11/07/20 14:07:06 INFO mapred.JobClient: map 60% reduce 0% >>>>>11/07/20 14:07:09 INFO mapred.JobClient: map 63% reduce 0% >>>>>11/07/20 14:07:13 INFO >mapred.JobClient: map 65% reduce 0% >>>>>11/07/20 14:07:16 INFO mapred.JobClient: map 67% reduce 0% >>>>>11/07/20 14:07:19 INFO mapred.JobClient: map 70% reduce 0% >>>>>11/07/20 14:07:22 INFO mapred.JobClient: map 73% reduce 0% >>>>>11/07/20 14:07:25 INFO mapred.JobClient: map 75% reduce 0% >>>>>11/07/20 14:07:28 INFO mapred.JobClient: map 77% reduce 0% >>>>>11/07/20 14:07:31 INFO mapred.JobClient: map 80% reduce 0% >>>>>11/07/20 14:07:34 INFO mapred.JobClient: map 83% reduce 0% >>>>>11/07/20 14:07:37 INFO mapred.JobClient: map 85% reduce 0% >>>>>11/07/20 14:07:40 INFO mapred.JobClient: map 87% reduce 0% >>>>>11/07/20 14:07:43 INFO mapred.JobClient: map 89% reduce 0% >>>>>11/07/20 14:07:46 INFO mapred.JobClient: map 92% reduce 0% >>>>>11/07/20 14:07:49 INFO mapred.JobClient: map 95% reduce >0% >>>>>11/07/20 14:07:55 INFO mapred.JobClient: map 98% reduce 0% >>>>>11/07/20 14:07:59 INFO mapred.JobClient: map 99% reduce 0% >>>>>11/07/20 14:08:02 INFO mapred.JobClient: map 100% reduce 0% >>>>>11/07/20 14:08:23 INFO mapred.JobClient: map 100% reduce 100% >>>>>11/07/20 14:08:31 INFO mapred.JobClient: Job complete: >>>>>job_201107201152_0021 >>>>>11/07/20 14:08:31 INFO mapred.JobClient: Counters: 26 >>>>>11/07/20 14:08:31 INFO mapred.JobClient: Job Counters >>>>>11/07/20 14:08:31 INFO mapred.JobClient: Launched reduce tasks=1 >>>>>11/07/20 14:08:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=149314 >>>>>11/07/20 14:08:31 INFO mapred.JobClient: Total time spent by all >>>>>reduces waiting after reserving slots (ms)=0 >>>>>11/07/20 14:08:31 INFO mapred.JobClient: Total time spent by all maps >>>>>waiting after >reserving slots (ms)=0 >>>>>11/07/20 14:08:31 INFO mapred.JobClient: Launched map tasks=1 >>>>>11/07/20 14:08:31 INFO mapred.JobClient: Data-local map tasks=1 >>>>>11/07/20 14:08:31 INFO mapred.JobClient: 
SLOTS_MILLIS_REDUCES=15618 >>>>>11/07/20 14:08:31 INFO mapred.JobClient: File Output Format Counters >>>>>11/07/20 14:08:31 INFO mapred.JobClient: Bytes Written=2247222 >>>>>11/07/20 14:08:31 INFO mapred.JobClient: Clustering >>>>>11/07/20 14:08:31 INFO mapred.JobClient: Converged Clusters=10 >>>>>11/07/20 14:08:31 INFO mapred.JobClient: FileSystemCounters >>>>>11/07/20 14:08:31 INFO mapred.JobClient: FILE_BYTES_READ=130281382 >>>>>11/07/20 14:08:31 INFO mapred.JobClient: HDFS_BYTES_READ=254494 >>>>>11/07/20 14:08:31 INFO mapred.JobClient: >FILE_BYTES_WRITTEN=132572666 >>>>>11/07/20 14:08:31 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2247222 >>>>>11/07/20 14:08:31 INFO mapred.JobClient: File Input Format Counters >>>>>11/07/20 14:08:31 INFO mapred.JobClient: Bytes Read=247443 >>>>>11/07/20 14:08:31 INFO mapred.JobClient: Map-Reduce Framework >>>>>11/07/20 14:08:31 INFO mapred.JobClient: Reduce input groups=10 >>>>>11/07/20 14:08:31 INFO mapred.JobClient: Map output materialized >>>>>bytes=2246233 >>>>>11/07/20 14:08:32 INFO mapred.JobClient: Combine output records=330 >>>>>11/07/20 14:08:32 INFO mapred.JobClient: Map input records=1113 >>>>>11/07/20 14:08:32 INFO mapred.JobClient: Reduce shuffle bytes=2246233 >>>>>11/07/20 14:08:32 INFO mapred.JobClient: Reduce output records=10 >>>>>11/07/20 14:08:32 INFO >mapred.JobClient: Spilled Records=590 >>>>>11/07/20 14:08:32 INFO mapred.JobClient: Map output bytes=2499995001 >>>>>11/07/20 14:08:32 INFO mapred.JobClient: Combine input records=11450 >>>>>11/07/20 14:08:32 INFO mapred.JobClient: Map output records=11130 >>>>>11/07/20 14:08:32 INFO mapred.JobClient: SPLIT_RAW_BYTES=127 >>>>>11/07/20 14:08:32 INFO mapred.JobClient: Reduce input records=10 >>>>>11/07/20 14:08:32 INFO driver.MahoutDriver: Program took 194096 ms >>>>> >>>>>if I increase the --numClusters argument (e.g. 
50), then it will return >>>>>exception after >>>>>11/07/20 14:08:02 INFO mapred.JobClient: map 100% reduce 0% >>>>> >>>>>and would retry again (also reproducible using 0.6-snapshot) >>>>> >>>>>... >>>>>11/07/20 14:22:25 INFO mapred.JobClient: map 100% reduce >0% >>>>>11/07/20 14:22:30 INFO mapred.JobClient: Task Id : >>>>>attempt_201107201152_0022_m_000000_0, Status : FAILED >>>>>org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any >>>>>valid local directory for output/file.out >>>>> at >>>>>org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381) >>>>> at >>>>>org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146) >>>>> at >>>>>org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127) >>>>> at >>>>>org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69) >>>>> at >>>>>org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639) >>>>> at >org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322) >>>>> at >>>>>org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698) >>>>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765) >>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369) >>>>> at org.apache.hadoop.mapred.Child$4.run(Child.java:259) >>>>> at java.security.AccessController.doPrivileged(Native Method) >>>>> at javax.security.auth.Subject.doAs(Subject.java:416) >>>>> at >>>>>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) >>>>> at org.apache.hadoop.mapred.Child.main(Child.java:253) >>>>> >>>>>11/07/20 14:22:32 INFO >mapred.JobClient: map 0% reduce 0% >>>>>... 
>>>>>
>>>>>Then I ran cluster dumper to dump information about the clusters. This command works if I only care about the cluster centroids (both the 0.5 release and 0.6-snapshot):
>>>>>
>>>>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt
>>>>>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>>11/07/20 14:33:45 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>>>>11/07/20 14:33:56 INFO driver.MahoutDriver: Program took 11761 ms
>>>>>
>>>>>but if I want to see the degree of membership of each point, I get another exception (yes, reproducible for both the 0.5 release and 0.6-snapshot):
>>>>>
>>>>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt --pointsDir sensei/clusteredPoints
>>>>>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>>11/07/20 14:35:08 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --pointsDir=sensei/clusteredPoints, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>>>>11/07/20 14:35:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>>>>11/07/20 14:35:10 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>>>>11/07/20 14:35:10 INFO compress.CodecPool: Got
brand-new decompressor
>>>>>Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>>>>>        at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:261)
>>>>>        at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>>>>>        at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>>>>>        at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>        at java.lang.reflect.Method.invoke(Method.java:616)
>>>>>        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>        at java.lang.reflect.Method.invoke(Method.java:616)
>>>>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>>
>>>>>Erm, would writing a short program to call the API (by the way, I can't seem to find the latest API doc?) be a better choice here? Or did I do anything wrong here (yes, Java is not my main language, and I am very new to Mahout and Hadoop)?
>>>>>
>>>>>The data is converted from an arff file with about 1000 rows (resources) and 14k columns (tags), and it is just a subset of my data.
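[Editor's note: the short program the poster asks about might look roughly like the sketch below, which reads the clusteredPoints SequenceFile directly. It assumes the Mahout 0.5-era key/value classes implied by the stack trace (IntWritable cluster id, WeightedVectorWritable point); the part file name part-m-00000 is an assumption about the map-only clustering output and may differ. This is a sketch, not the official Mahout approach:]

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;

// Sketch: iterate over the clusteredPoints output of fuzzy k-means
// instead of going through the clusterdump CLI. Each record maps a
// cluster id to a point together with its degree of membership.
public class DumpClusteredPoints {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Assumed part file name; list the directory to find the real one.
    Path path = new Path("sensei/clusters/clusteredPoints/part-m-00000");
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(conf), path, conf);
    try {
      IntWritable clusterId = new IntWritable();
      WeightedVectorWritable point = new WeightedVectorWritable();
      while (reader.next(clusterId, point)) {
        // getWeight() is the fuzzy membership of this point in clusterId
        System.out.println(clusterId.get() + "\t" + point.getWeight());
      }
    } finally {
      reader.close();
    }
  }
}
```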
>>>>>(Actually I made a mistake, so it is now generating resource clusters instead of tag clusters, but I am just doing this as a proof of concept to see whether Mahout is good enough for the task.)
>>>>>
>>>>>Best wishes,
>>>>>Jeffrey04
