Hi Jeff,

Q: Did you change your invocation to specify a different -c directory (e.g. clusters-0)?
A: Yes :)
Q: Did you add the -cl argument?
A: Yes :)

$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 5 --maxIter 10 --m 5
$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5

Q: What is the new CLI invocation for clusterdump?
A: $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-4 --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt

Q: Did this work for -k 10? What happens with -k 50?
A: clusterdump works for k=5 (but I don't see the points in the output); it fails for k=10. fkmeans itself fails for k=50, so there is nothing to dump at k=50.

Q: Have you tried kmeans?
A: Yes (all tested on 0.6-snapshot)
k=5: no problem :)
k=10: no problem :)
k=50: no problem :)

P/S: I've attached the test data I used (in mvc format); let me know if you guys prefer the raw data in ARFF format.

Best wishes,
Jeffrey04

>________________________________
>From: Jeff Eastman <[email protected]>
>To: "[email protected]" <[email protected]>; Jeffrey <[email protected]>
>Sent: Thursday, July 21, 2011 9:36 PM
>Subject: RE: fkmeans or Cluster Dumper not working?
>
>You are correct, the wiki for fkmeans did not mention the -cl argument. I've added that just now. I think this is what Frank means in his comment, but you do *not* have to write any custom code to get the cluster dumper to do what you want; just use the -cl argument and specify clusteredPoints as the -p input to clusterdump.
>
>Check out TestClusterDumper.testKmeans and .testFuzzyKmeans. These show how to invoke the clustering and the cluster dumper from Java, at least.
>
>Did you change your invocation to specify a different -c directory (e.g. clusters-0)?
>Did you add the -cl argument?
>What is the new CLI invocation for clusterdump?
>Did this work for -k 10? What happens with -k 50?
>Have you tried kmeans?
>
>I can help you better if you give me answers to my questions.
>
>-----Original Message-----
>From: Jeffrey [mailto:[email protected]]
>Sent: Thursday, July 21, 2011 4:30 AM
>To: [email protected]
>Subject: Re: fkmeans or Cluster Dumper not working?
>
>Hi again,
>
>Let me update on what's working and what's not working.
>
>Works:
>fkmeans clustering (10 clusters) - thanks Jeff for the --cl tip
>fkmeans clustering (5 clusters)
>clusterdump (5 clusters) - but points are not included in the clusterdump output; do I need to write a program for that?
>
>Not Working:
>fkmeans clustering (50 clusters) - same error
>clusterdump (10 clusters) - same error
>
>
>So it seems that to attach points to the cluster dumper output, the way the synthetic control example does, I would have to write some code, as pointed out by @Frank_Scholten?
>https://twitter.com/#!/Frank_Scholten/status/93617269296472064
>
>Best wishes,
>Jeffrey04
>
>>________________________________
>>From: Jeff Eastman <[email protected]>
>>To: "[email protected]" <[email protected]>; Jeffrey <[email protected]>
>>Sent: Wednesday, July 20, 2011 11:53 PM
>>Subject: RE: fkmeans or Cluster Dumper not working?
>>
>>Hi Jeffrey,
>>
>>It is always difficult to debug remotely, but here are some suggestions:
>>- First, you are specifying both an input clusters directory (--clusters) and --numClusters, so the job samples 10 points from your input data set and writes them to clusteredPoints as the prior clusters for the first iteration. You should pick a different name for this directory, because the clusteredPoints directory is used by the -cl (--clustering) option (which you did not supply) to write out the clustered (classified) input vectors. When you subsequently supplied clusteredPoints to the clusterdumper, it was expecting a different format, and that caused the exception you saw. Change your --clusters directory (clusters-0 is good) and add a -cl argument, and things should go more smoothly. The -cl option is not the default, so no clustering of the input points is performed without it. (Many people get caught by this, and perhaps the default should be changed, but clustering can be expensive, so it is not performed unless requested.)
>>- If you still have problems, try again with k-means. It is similar enough to fkmeans that, if you see the same problems with k-means, you can rule out fkmeans itself as the cause.
>>- I don't see why changing the -k argument from 10 to 50 should cause any problems, unless your vectors are very large and you are getting an OOME (OutOfMemoryError) in the reducer. Since the reducer is calculating centroid vectors for the next iteration, these will become more dense and memory use will increase substantially.
>>- I can't figure out what might be causing your second exception. It is bombing inside of Hadoop file I/O, which makes me suspect command-argument problems.
>>
>>Hope this helps,
>>Jeff
>>
>>
>>-----Original Message-----
>>From: Jeffrey [mailto:[email protected]]
>>Sent: Wednesday, July 20, 2011 2:41 AM
>>To: [email protected]
>>Subject: fkmeans or Cluster Dumper not working?
>>
>>Hi,
>>
>>I am trying to generate clusters using the fkmeans command-line tool from my test data.
>>Not sure if this is correct, as it only runs one iteration (the output below is from 0.6-snapshot; I had to use a workaround for a weird bug -
>>http://search.lucidimagination.com/search/document/d95ff0c29ac4a8a7/bug_in_fkmeans )
>>
>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusteredPoints --maxIter 10 --numClusters 10 --overwrite --m 5
>>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>11/07/20 14:05:18 INFO common.AbstractJob: Command line arguments: {--clusters=sensei/clusteredPoints, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --emitMostLikely=true, --endPhase=2147483647, --input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, --numClusters=10, --output=sensei/clusters, --overwrite=null, --startPhase=0, --tempDir=temp, --threshold=0}
>>11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusters
>>11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusteredPoints
>>11/07/20 14:05:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>11/07/20 14:05:20 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new compressor
>>11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new decompressor
>>11/07/20 14:05:29 INFO kmeans.RandomSeedGenerator: Wrote 10 vectors to sensei/clusteredPoints/part-randomSeed
>>11/07/20 14:05:29 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means Iteration 1
>>11/07/20 14:05:30 INFO input.FileInputFormat: Total input paths to process : 1
>>11/07/20 14:05:30 INFO mapred.JobClient: Running job: job_201107201152_0021
>>11/07/20 14:05:31 INFO mapred.JobClient: map 0% reduce 0%
>>11/07/20 14:05:54 INFO mapred.JobClient: map 2% reduce 0%
>>11/07/20 14:05:57 INFO mapred.JobClient: map 5% reduce 0%
>>11/07/20 14:06:00 INFO mapred.JobClient: map 6% reduce 0%
>>11/07/20 14:06:03 INFO mapred.JobClient: map 7% reduce 0%
>>11/07/20 14:06:07 INFO mapred.JobClient: map 10% reduce 0%
>>11/07/20 14:06:10 INFO mapred.JobClient: map 13% reduce 0%
>>11/07/20 14:06:13 INFO mapred.JobClient: map 15% reduce 0%
>>11/07/20 14:06:16 INFO mapred.JobClient: map 17% reduce 0%
>>11/07/20 14:06:19 INFO mapred.JobClient: map 19% reduce 0%
>>11/07/20 14:06:22 INFO mapred.JobClient: map 23% reduce 0%
>>11/07/20 14:06:25 INFO mapred.JobClient: map 25% reduce 0%
>>11/07/20 14:06:28 INFO mapred.JobClient: map 27% reduce 0%
>>11/07/20 14:06:31 INFO mapred.JobClient: map 30% reduce 0%
>>11/07/20 14:06:34 INFO mapred.JobClient: map 33% reduce 0%
>>11/07/20 14:06:37 INFO mapred.JobClient: map 36% reduce 0%
>>11/07/20 14:06:40 INFO mapred.JobClient: map 37% reduce 0%
>>11/07/20 14:06:43 INFO mapred.JobClient: map 40% reduce 0%
>>11/07/20 14:06:46 INFO mapred.JobClient: map 43% reduce 0%
>>11/07/20 14:06:49 INFO mapred.JobClient: map 46% reduce 0%
>>11/07/20 14:06:52 INFO mapred.JobClient: map 48% reduce 0%
>>11/07/20 14:06:55 INFO mapred.JobClient: map 50% reduce 0%
>>11/07/20 14:06:57 INFO mapred.JobClient: map 53% reduce 0%
>>11/07/20 14:07:00 INFO mapred.JobClient: map 56% reduce 0%
>>11/07/20 14:07:03 INFO mapred.JobClient: map 58% reduce 0%
>>11/07/20 14:07:06 INFO mapred.JobClient: map 60% reduce 0%
>>11/07/20 14:07:09 INFO mapred.JobClient:
>> map 63% reduce 0%
>>11/07/20 14:07:13 INFO mapred.JobClient: map 65% reduce 0%
>>11/07/20 14:07:16 INFO mapred.JobClient: map 67% reduce 0%
>>11/07/20 14:07:19 INFO mapred.JobClient: map 70% reduce 0%
>>11/07/20 14:07:22 INFO mapred.JobClient: map 73% reduce 0%
>>11/07/20 14:07:25 INFO mapred.JobClient: map 75% reduce 0%
>>11/07/20 14:07:28 INFO mapred.JobClient: map 77% reduce 0%
>>11/07/20 14:07:31 INFO mapred.JobClient: map 80% reduce 0%
>>11/07/20 14:07:34 INFO mapred.JobClient: map 83% reduce 0%
>>11/07/20 14:07:37 INFO mapred.JobClient: map 85% reduce 0%
>>11/07/20 14:07:40 INFO mapred.JobClient: map 87% reduce 0%
>>11/07/20 14:07:43 INFO mapred.JobClient: map 89% reduce 0%
>>11/07/20 14:07:46 INFO mapred.JobClient: map 92% reduce 0%
>>11/07/20 14:07:49 INFO mapred.JobClient: map 95% reduce 0%
>>11/07/20 14:07:55 INFO mapred.JobClient: map 98% reduce 0%
>>11/07/20 14:07:59 INFO mapred.JobClient: map 99% reduce 0%
>>11/07/20 14:08:02 INFO mapred.JobClient: map 100% reduce 0%
>>11/07/20 14:08:23 INFO mapred.JobClient: map 100% reduce 100%
>>11/07/20 14:08:31 INFO mapred.JobClient: Job complete: job_201107201152_0021
>>11/07/20 14:08:31 INFO mapred.JobClient: Counters: 26
>>11/07/20 14:08:31 INFO mapred.JobClient: Job Counters
>>11/07/20 14:08:31 INFO mapred.JobClient: Launched reduce tasks=1
>>11/07/20 14:08:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=149314
>>11/07/20 14:08:31 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
>>11/07/20 14:08:31 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
>>11/07/20 14:08:31 INFO mapred.JobClient: Launched map tasks=1
>>11/07/20 14:08:31 INFO mapred.JobClient: Data-local map tasks=1
>>11/07/20 14:08:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15618
>>11/07/20 14:08:31 INFO mapred.JobClient: File Output Format Counters
>>11/07/20 14:08:31 INFO mapred.JobClient: Bytes Written=2247222
>>11/07/20 14:08:31 INFO mapred.JobClient: Clustering
>>11/07/20 14:08:31 INFO mapred.JobClient: Converged Clusters=10
>>11/07/20 14:08:31 INFO mapred.JobClient: FileSystemCounters
>>11/07/20 14:08:31 INFO mapred.JobClient: FILE_BYTES_READ=130281382
>>11/07/20 14:08:31 INFO mapred.JobClient: HDFS_BYTES_READ=254494
>>11/07/20 14:08:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=132572666
>>11/07/20 14:08:31 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2247222
>>11/07/20 14:08:31 INFO mapred.JobClient: File Input Format Counters
>>11/07/20 14:08:31 INFO mapred.JobClient: Bytes Read=247443
>>11/07/20 14:08:31 INFO mapred.JobClient: Map-Reduce Framework
>>11/07/20 14:08:31 INFO mapred.JobClient: Reduce input groups=10
>>11/07/20 14:08:31 INFO mapred.JobClient: Map output materialized bytes=2246233
>>11/07/20 14:08:32 INFO mapred.JobClient: Combine output records=330
>>11/07/20 14:08:32 INFO mapred.JobClient: Map input records=1113
>>11/07/20 14:08:32 INFO mapred.JobClient: Reduce shuffle bytes=2246233
>>11/07/20 14:08:32 INFO mapred.JobClient: Reduce output records=10
>>11/07/20 14:08:32 INFO mapred.JobClient: Spilled Records=590
>>11/07/20 14:08:32 INFO mapred.JobClient: Map output bytes=2499995001
>>11/07/20 14:08:32 INFO mapred.JobClient: Combine input records=11450
>>11/07/20 14:08:32 INFO mapred.JobClient: Map output records=11130
>>11/07/20 14:08:32 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
>>11/07/20 14:08:32 INFO mapred.JobClient: Reduce input records=10
>>11/07/20 14:08:32 INFO driver.MahoutDriver: Program took 194096 ms
>>
>>if I increase the --numClusters argument (e.g.
>>50), then it returns an exception after
>>11/07/20 14:08:02 INFO mapred.JobClient: map 100% reduce 0%
>>
>>and then retries again (also reproducible using 0.6-snapshot)
>>
>>...
>>11/07/20 14:22:25 INFO mapred.JobClient: map 100% reduce 0%
>>11/07/20 14:22:30 INFO mapred.JobClient: Task Id : attempt_201107201152_0022_m_000000_0, Status : FAILED
>>org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
>>    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>>    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>>    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>>    at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>>    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
>>    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
>>    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
>>    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>>    at java.security.AccessController.doPrivileged(Native Method)
>>    at javax.security.auth.Subject.doAs(Subject.java:416)
>>    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>    at org.apache.hadoop.mapred.Child.main(Child.java:253)
>>
>>11/07/20 14:22:32 INFO mapred.JobClient: map 0% reduce 0%
>>...
>>
>>Then I ran the cluster dumper to dump information about the clusters. This command works if I only care about the cluster centroids (both the 0.5 release and 0.6-snapshot):
>>
>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt
>>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>11/07/20 14:33:45 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>11/07/20 14:33:56 INFO driver.MahoutDriver: Program took 11761 ms
>>
>>but if I want to see the degree of membership of each point, I get another exception (yes, reproducible on both the 0.5 release and 0.6-snapshot):
>>
>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt --pointsDir sensei/clusteredPoints
>>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>11/07/20 14:35:08 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --pointsDir=sensei/clusteredPoints, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>11/07/20 14:35:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>11/07/20 14:35:10 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>11/07/20 14:35:10 INFO compress.CodecPool: Got brand-new
>>decompressor
>>Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>>    at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:261)
>>    at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>>    at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>>    at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>>    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>    at java.lang.reflect.Method.invoke(Method.java:616)
>>    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>>    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>    at java.lang.reflect.Method.invoke(Method.java:616)
>>    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>>Erm, would writing a short program that calls the API be a better choice here (by the way, I can't seem to find the latest API doc)? Or did I do something wrong (yes, Java is not my main language, and I am very new to Mahout... and Hadoop)?
>>
>>The data is converted from an ARFF file with about 1000 rows (resources) and 14k columns (tags), and it is just a subset of my data. (I actually made a mistake, so it is now generating resource clusters instead of tag clusters, but I am just doing this as a proof of concept to see whether Mahout is good enough for the task.)
>>
>>Best wishes,
>>Jeffrey04
>>
>>
>>
>
>
>
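A note on the ClassCastException above: the clusteredPoints directory written by the -cl (--clustering) option is what clusterdump's --pointsDir expects, a SequenceFile keyed by IntWritable cluster ids (in 0.5/0.6 the value should be WeightedVectorWritable), whereas the RandomSeedGenerator output that was originally sitting in sensei/clusteredPoints uses Text keys, hence "Text cannot be cast to IntWritable". If reading the memberships from your own code rather than clusterdump is still of interest, the sketch below shows one way to do it. It is only a sketch: it assumes the Mahout 0.5/0.6 WeightedVectorWritable class, the Hadoop 0.20 SequenceFile.Reader API, and a hypothetical part-file name that you would replace with whatever part files your run actually produced.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;

public class DumpClusteredPoints {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical part-file name; list the clusteredPoints directory to find the real one(s).
    Path points = new Path("sensei/clusters/clusteredPoints/part-m-00000");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, points, conf);
    try {
      IntWritable clusterId = new IntWritable();                    // key: id of the cluster the point was assigned to
      WeightedVectorWritable point = new WeightedVectorWritable();  // value: membership weight plus the vector itself
      while (reader.next(clusterId, point)) {
        System.out.println(clusterId.get() + "\t" + point.getWeight() + "\t" + point.getVector());
      }
    } finally {
      reader.close();
    }
  }
}

Compile it against the hadoop-core and mahout-core jars and run it with the same classpath as the job jar; with --emitMostLikely false a point should show up under more than one cluster, each time with its membership weight.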
