Hi Jeff,

Q: Did you change your invocation to specify a different -c directory (e.g. clusters-0)?
A: Yes :)
Q: Did you add the -cl argument?
A: Yes :)

$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 5 --maxIter 10 --m 5
$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5

Q: What is the new CLI invocation for clusterdump?
A: $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-4 --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt

Q: Did this work for -k 10? What happens with -k 50?
A: clusterdump works for k=5 (but I don't see the points in the output); it fails for k=10. fkmeans itself fails for k=50, so there is nothing to dump at k=50.

Q: Have you tried kmeans?
A: Yes (all tested on 0.6-snapshot)
k=5: no problem :)
k=10: no problem :)
k=50: no problem :)

P/S: I've attached the test data I used (in mvc format); let me know if you guys prefer the raw data in ARFF format.

Best wishes,
Jeffrey04

>________________________________
>From: Jeff Eastman <[email protected]>
>To: "[email protected]" <[email protected]>; Jeffrey <[email protected]>
>Sent: Thursday, July 21, 2011 9:36 PM
>Subject: RE: fkmeans or Cluster Dumper not working?
>
>You are correct, the wiki for fkmeans did not mention the -cl argument. I've added that just now. I think this is what Frank means in his comment, but you do *not* have to write any custom code to get the cluster dumper to do what you want; just use the -cl argument and specify clusteredPoints as the -p input to clusterdump.
>
>Check out TestClusterDumper.testKmeans and .testFuzzyKmeans. These show how to invoke the clustering and the cluster dumper from Java, at least.
>
>Did you change your invocation to specify a different -c directory (e.g. clusters-0)?
>Did you add the -cl argument?
>What is the new CLI invocation for clusterdump?
>Did this work for -k 10? What happens with -k 50?
>Have you tried kmeans?
>
>I can help you better if you give me answers to my questions.
>
>-----Original Message-----
>From: Jeffrey [mailto:[email protected]]
>Sent: Thursday, July 21, 2011 4:30 AM
>To: [email protected]
>Subject: Re: fkmeans or Cluster Dumper not working?
>
>Hi again,
>
>Let me update on what's working and what's not working.
>
>Works:
>fkmeans clustering (10 clusters) - thanks Jeff for the --cl tip
>fkmeans clustering (5 clusters)
>clusterdump (5 clusters) - but points are not included in the clusterdump output; do I need to write a program for that?
>
>Not Working:
>fkmeans clustering (50 clusters) - same error
>clusterdump (10 clusters) - same error
>
>
>So it seems that to attach points to the cluster dumper output, the way the synthetic control example does, I would have to write some code, as pointed out by @Frank_Scholten?
>https://twitter.com/#!/Frank_Scholten/status/93617269296472064
>
>Best wishes,
>Jeffrey04
>
>>________________________________
>>From: Jeff Eastman <[email protected]>
>>To: "[email protected]" <[email protected]>; Jeffrey <[email protected]>
>>Sent: Wednesday, July 20, 2011 11:53 PM
>>Subject: RE: fkmeans or Cluster Dumper not working?
>>
>>Hi Jeffrey,
>>
>>It is always difficult to debug remotely, but here are some suggestions:
>>- First, you are specifying both an input clusters directory (--clusters) and --numClusters, so the job samples 10 points from your input data set and writes them to clusteredPoints as the prior clusters for the first iteration. You should pick a different name for this directory, because the clusteredPoints directory is used by the -cl (--clustering) option (which you did not supply) to write out the clustered (classified) input vectors. When you subsequently supplied clusteredPoints to the clusterdumper, it was expecting a different format, and that caused the exception you saw. Change your --clusters directory (clusters-0 is good) and add a -cl argument, and things should go more smoothly. The -cl option is not the default, so no clustering of the input points is performed without it. (Many people get caught by this, and perhaps the default should be changed, but clustering can be expensive, so it is not performed unless requested.)
>>- If you still have problems, try again with k-means. It is similar enough to fkmeans that, if you see the same problems with k-means, you can rule out fkmeans itself as the cause.
>>- I don't see why changing the -k argument from 10 to 50 should cause any problems, unless your vectors are very large and you are getting an OOME (OutOfMemoryError) in the reducer. Since the reducer is calculating centroid vectors for the next iteration, these will become more dense and memory use will increase substantially.
>>- I can't figure out what might be causing your second exception. It is bombing inside of Hadoop file I/O, which makes me suspect command-argument problems.
>>
>>Hope this helps,
>>Jeff
>>
>>
>>-----Original Message-----
>>From: Jeffrey [mailto:[email protected]]
>>Sent: Wednesday, July 20, 2011 2:41 AM
>>To: [email protected]
>>Subject: fkmeans or Cluster Dumper not working?
>>
>>Hi,
>>
>>I am trying to generate clusters using the fkmeans command-line tool from my test data.
>>Not sure if this is correct, as it only runs one iteration (the output below is from 0.6-snapshot; I had to use a workaround for a weird bug -
>>http://search.lucidimagination.com/search/document/d95ff0c29ac4a8a7/bug_in_fkmeans )
>>
>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusteredPoints --maxIter 10 --numClusters 10 --overwrite --m 5
>>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>11/07/20 14:05:18 INFO common.AbstractJob: Command line arguments: {--clusters=sensei/clusteredPoints, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --emitMostLikely=true, --endPhase=2147483647, --input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, --numClusters=10, --output=sensei/clusters, --overwrite=null, --startPhase=0, --tempDir=temp, --threshold=0}
>>11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusters
>>11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusteredPoints
>>11/07/20 14:05:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>11/07/20 14:05:20 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new compressor
>>11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new decompressor
>>11/07/20 14:05:29 INFO kmeans.RandomSeedGenerator: Wrote 10 vectors to sensei/clusteredPoints/part-randomSeed
>>11/07/20 14:05:29 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means Iteration 1
>>11/07/20 14:05:30 INFO input.FileInputFormat: Total input paths to process : 1
>>11/07/20 14:05:30 INFO mapred.JobClient: Running job: job_201107201152_0021
>>11/07/20 14:05:31 INFO mapred.JobClient: map 0% reduce 0%
>>11/07/20 14:05:54 INFO mapred.JobClient: map 2% reduce 0%
>>11/07/20 14:05:57 INFO mapred.JobClient: map 5% reduce 0%
>>11/07/20 14:06:00 INFO mapred.JobClient: map 6% reduce 0%
>>11/07/20 14:06:03 INFO mapred.JobClient: map 7% reduce 0%
>>11/07/20 14:06:07 INFO mapred.JobClient: map 10% reduce 0%
>>11/07/20 14:06:10 INFO mapred.JobClient: map 13% reduce 0%
>>11/07/20 14:06:13 INFO mapred.JobClient: map 15% reduce 0%
>>11/07/20 14:06:16 INFO mapred.JobClient: map 17% reduce 0%
>>11/07/20 14:06:19 INFO mapred.JobClient: map 19% reduce 0%
>>11/07/20 14:06:22 INFO mapred.JobClient: map 23% reduce 0%
>>11/07/20 14:06:25 INFO mapred.JobClient: map 25% reduce 0%
>>11/07/20 14:06:28 INFO mapred.JobClient: map 27% reduce 0%
>>11/07/20 14:06:31 INFO mapred.JobClient: map 30% reduce 0%
>>11/07/20 14:06:34 INFO mapred.JobClient: map 33% reduce 0%
>>11/07/20 14:06:37 INFO mapred.JobClient: map 36% reduce 0%
>>11/07/20 14:06:40 INFO mapred.JobClient: map 37% reduce 0%
>>11/07/20 14:06:43 INFO mapred.JobClient: map 40% reduce 0%
>>11/07/20 14:06:46 INFO mapred.JobClient: map 43% reduce 0%
>>11/07/20 14:06:49 INFO mapred.JobClient: map 46% reduce 0%
>>11/07/20 14:06:52 INFO mapred.JobClient: map 48% reduce 0%
>>11/07/20 14:06:55 INFO mapred.JobClient: map 50% reduce 0%
>>11/07/20 14:06:57 INFO mapred.JobClient: map 53% reduce 0%
>>11/07/20 14:07:00 INFO mapred.JobClient: map 56% reduce 0%
>>11/07/20 14:07:03 INFO mapred.JobClient: map 58% reduce 0%
>>11/07/20 14:07:06 INFO mapred.JobClient: map 60% reduce 0%
>>11/07/20 14:07:09 INFO mapred.JobClient:
>> map 63% reduce 0%
>>11/07/20 14:07:13 INFO mapred.JobClient: map 65% reduce 0%
>>11/07/20 14:07:16 INFO mapred.JobClient: map 67% reduce 0%
>>11/07/20 14:07:19 INFO mapred.JobClient: map 70% reduce 0%
>>11/07/20 14:07:22 INFO mapred.JobClient: map 73% reduce 0%
>>11/07/20 14:07:25 INFO mapred.JobClient: map 75% reduce 0%
>>11/07/20 14:07:28 INFO mapred.JobClient: map 77% reduce 0%
>>11/07/20 14:07:31 INFO mapred.JobClient: map 80% reduce 0%
>>11/07/20 14:07:34 INFO mapred.JobClient: map 83% reduce 0%
>>11/07/20 14:07:37 INFO mapred.JobClient: map 85% reduce 0%
>>11/07/20 14:07:40 INFO mapred.JobClient: map 87% reduce 0%
>>11/07/20 14:07:43 INFO mapred.JobClient: map 89% reduce 0%
>>11/07/20 14:07:46 INFO mapred.JobClient: map 92% reduce 0%
>>11/07/20 14:07:49 INFO mapred.JobClient: map 95% reduce 0%
>>11/07/20 14:07:55 INFO mapred.JobClient: map 98% reduce 0%
>>11/07/20 14:07:59 INFO mapred.JobClient: map 99% reduce 0%
>>11/07/20 14:08:02 INFO mapred.JobClient: map 100% reduce 0%
>>11/07/20 14:08:23 INFO mapred.JobClient: map 100% reduce 100%
>>11/07/20 14:08:31 INFO mapred.JobClient: Job complete: job_201107201152_0021
>>11/07/20 14:08:31 INFO mapred.JobClient: Counters: 26
>>11/07/20 14:08:31 INFO mapred.JobClient: Job Counters
>>11/07/20 14:08:31 INFO mapred.JobClient: Launched reduce tasks=1
>>11/07/20 14:08:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=149314
>>11/07/20 14:08:31 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
>>11/07/20 14:08:31 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
>>11/07/20 14:08:31 INFO mapred.JobClient: Launched map tasks=1
>>11/07/20 14:08:31 INFO mapred.JobClient: Data-local map tasks=1
>>11/07/20 14:08:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15618
>>11/07/20 14:08:31 INFO mapred.JobClient: File Output Format Counters
>>11/07/20 14:08:31 INFO mapred.JobClient: Bytes Written=2247222
>>11/07/20 14:08:31 INFO mapred.JobClient: Clustering
>>11/07/20 14:08:31 INFO mapred.JobClient: Converged Clusters=10
>>11/07/20 14:08:31 INFO mapred.JobClient: FileSystemCounters
>>11/07/20 14:08:31 INFO mapred.JobClient: FILE_BYTES_READ=130281382
>>11/07/20 14:08:31 INFO mapred.JobClient: HDFS_BYTES_READ=254494
>>11/07/20 14:08:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=132572666
>>11/07/20 14:08:31 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2247222
>>11/07/20 14:08:31 INFO mapred.JobClient: File Input Format Counters
>>11/07/20 14:08:31 INFO mapred.JobClient: Bytes Read=247443
>>11/07/20 14:08:31 INFO mapred.JobClient: Map-Reduce Framework
>>11/07/20 14:08:31 INFO mapred.JobClient: Reduce input groups=10
>>11/07/20 14:08:31 INFO mapred.JobClient: Map output materialized bytes=2246233
>>11/07/20 14:08:32 INFO mapred.JobClient: Combine output records=330
>>11/07/20 14:08:32 INFO mapred.JobClient: Map input records=1113
>>11/07/20 14:08:32 INFO mapred.JobClient: Reduce shuffle bytes=2246233
>>11/07/20 14:08:32 INFO mapred.JobClient: Reduce output records=10
>>11/07/20 14:08:32 INFO mapred.JobClient: Spilled Records=590
>>11/07/20 14:08:32 INFO mapred.JobClient: Map output bytes=2499995001
>>11/07/20 14:08:32 INFO mapred.JobClient: Combine input records=11450
>>11/07/20 14:08:32 INFO mapred.JobClient: Map output records=11130
>>11/07/20 14:08:32 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
>>11/07/20 14:08:32 INFO mapred.JobClient: Reduce input records=10
>>11/07/20 14:08:32 INFO driver.MahoutDriver: Program took 194096 ms
>>
>>if I increase the --numClusters argument (e.g.
>>50), then it returns an exception after
>>11/07/20 14:08:02 INFO mapred.JobClient: map 100% reduce 0%
>>
>>and then retries again (also reproducible using 0.6-snapshot)
>>
>>...
>>11/07/20 14:22:25 INFO mapred.JobClient: map 100% reduce 0%
>>11/07/20 14:22:30 INFO mapred.JobClient: Task Id : attempt_201107201152_0022_m_000000_0, Status : FAILED
>>org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
>>    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>>    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>>    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>>    at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>>    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
>>    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
>>    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
>>    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>>    at java.security.AccessController.doPrivileged(Native Method)
>>    at javax.security.auth.Subject.doAs(Subject.java:416)
>>    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>    at org.apache.hadoop.mapred.Child.main(Child.java:253)
>>
>>11/07/20 14:22:32 INFO mapred.JobClient: map 0% reduce 0%
>>...
>>
>>Then I ran the cluster dumper to dump information about the clusters. This command works if I only care about the cluster centroids (both the 0.5 release and 0.6-snapshot):
>>
>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt
>>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>11/07/20 14:33:45 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>11/07/20 14:33:56 INFO driver.MahoutDriver: Program took 11761 ms
>>
>>but if I want to see the degree of membership of each point, I get another exception (yes, reproducible on both the 0.5 release and 0.6-snapshot):
>>
>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt --pointsDir sensei/clusteredPoints
>>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>11/07/20 14:35:08 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --pointsDir=sensei/clusteredPoints, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>11/07/20 14:35:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>11/07/20 14:35:10 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>11/07/20 14:35:10 INFO compress.CodecPool: Got brand-new
>>decompressor
>>Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>>    at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:261)
>>    at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>>    at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>>    at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>>    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>    at java.lang.reflect.Method.invoke(Method.java:616)
>>    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>>    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>    at java.lang.reflect.Method.invoke(Method.java:616)
>>    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>>Erm, would writing a short program that calls the API be a better choice here (by the way, I can't seem to find the latest API doc)? Or did I do something wrong (yes, Java is not my main language, and I am very new to Mahout... and Hadoop)?
>>
>>The data is converted from an ARFF file with about 1000 rows (resources) and 14k columns (tags), and it is just a subset of my data. (I actually made a mistake, so it is now generating resource clusters instead of tag clusters, but I am just doing this as a proof of concept to see whether Mahout is good enough for the task.)
>>
>>Best wishes,
>>Jeffrey04
>>
>>
>>
>
>
>
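A note on the ClassCastException above: the clusteredPoints directory written by the -cl (--clustering) option is what clusterdump's --pointsDir expects, a SequenceFile keyed by IntWritable cluster ids (in 0.5/0.6 the value should be WeightedVectorWritable), whereas the RandomSeedGenerator output that was originally sitting in sensei/clusteredPoints uses Text keys, hence "Text cannot be cast to IntWritable". If reading the memberships from your own code rather than clusterdump is still of interest, the sketch below shows one way to do it. It is only a sketch: it assumes the Mahout 0.5/0.6 WeightedVectorWritable class, the Hadoop 0.20 SequenceFile.Reader API, and a hypothetical part-file name that you would replace with whatever part files your run actually produced.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;

public class DumpClusteredPoints {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical part-file name; list the clusteredPoints directory to find the real one(s).
    Path points = new Path("sensei/clusters/clusteredPoints/part-m-00000");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, points, conf);
    try {
      IntWritable clusterId = new IntWritable();                    // key: id of the cluster the point was assigned to
      WeightedVectorWritable point = new WeightedVectorWritable();  // value: membership weight plus the vector itself
      while (reader.next(clusterId, point)) {
        System.out.println(clusterId.get() + "\t" + point.getWeight() + "\t" + point.getVector());
      }
    } finally {
      reader.close();
    }
  }
}

Compile it against the hadoop-core and mahout-core jars and run it with the same classpath as the job jar; with --emitMostLikely false a point should show up under more than one cluster, each time with its membership weight.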
