Excellent, so this appears to be localized to fuzzyk. Unfortunately, the Apache mail server strips off attachments, so you'd need another mechanism (a JIRA?) to upload your data if it is not too large. Some more questions in the interim:

- What is the cardinality of your vector data?
- Is it sparse or dense?
- How many vectors are you trying to cluster?
- What is the exact error you see when fkmeans fails with k=10? With k=50?
- What are the Hadoop heap settings you are using for your job?
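On that last question: if you haven't changed anything, Hadoop 0.20's default child-task heap is only 200 MB (mapred.child.java.opts = -Xmx200m), which can get tight once centroids go dense. A minimal, untested sketch of raising it, assuming you launch the job from your own driver (the same property can also be set once for all jobs in mapred-site.xml):

  import org.apache.hadoop.conf.Configuration;

  public class HeapSettingsExample {
    public static void main(String[] args) {
      // Give each map/reduce child JVM a 1 GB heap instead of the 0.20
      // default of -Xmx200m. The 1024m value is a hypothetical starting
      // point; size it to your data.
      Configuration conf = new Configuration();
      conf.set("mapred.child.java.opts", "-Xmx1024m");
      // ...pass this conf to the job/driver being launched...
    }
  }

Also, one back-of-the-envelope observation from your k=10 log below: Map output records=11130 is exactly 10x your Map input records=1113 (fuzzy k-means emits one record per input vector per cluster), and Map output bytes=2499995001 is already ~2.5 GB. If that scales with k, then k=50 implies roughly 12.5 GB of map-side spill, which would square with the DiskErrorException ("Could not find any valid local directory") pointing at local disk space rather than heap.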
-----Original Message-----
From: Jeffrey [mailto:[email protected]]
Sent: Thursday, July 21, 2011 11:21 AM
To: [email protected]
Subject: Re: fkmeans or Cluster Dumper not working?

Hi Jeff,

Q: Did you change your invocation to specify a different -c directory (e.g. clusters-0)?
A: Yes :)

Q: Did you add the -cl argument?
A: Yes :)

$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 5 --maxIter 10 --m 5
$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5

Q: What is the new CLI invocation for clusterdump?
A: $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-4 --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt

Q: Did this work for -k 10? What happens with -k 50?
A: clusterdump works for k=5 (but I don't see the points); it fails for k=10. fkmeans itself fails when k=50, so there is nothing to dump for k=50.

Q: Have you tried kmeans?
A: Yes (all tested on 0.6-snapshot):
k=5: no problem :)
k=10: no problem :)
k=50: no problem :)

P.S. I've attached the test data I used (in mvc format); let me know if you guys prefer the raw data in arff format.

Best wishes,
Jeffrey04

>________________________________
>From: Jeff Eastman <[email protected]>
>To: "[email protected]" <[email protected]>; Jeffrey <[email protected]>
>Sent: Thursday, July 21, 2011 9:36 PM
>Subject: RE: fkmeans or Cluster Dumper not working?
>
>You are correct, the wiki for fkmeans did not mention the -cl argument; I've added that just now. I think this is what Frank means in his comment, but you do *not* have to write any custom code to get the cluster dumper to do what you want: just use the -cl argument and specify clusteredPoints as the -p input to clusterdump.
>
>Check out TestClusterDumper.testKmeans and .testFuzzyKmeans. These show how to invoke the clustering and the cluster dumper from Java at least.
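>Roughly, the flow in those tests looks like the sketch below. It is untested, and the exact method signatures (RandomSeedGenerator.buildRandom, FuzzyKMeansDriver.run, ClusterDumper) are from memory of the 0.5/0.6-era code, so verify them against the TestClusterDumper source in your snapshot; the paths and parameters mirror your CLI commands:
>
>  import org.apache.hadoop.conf.Configuration;
>  import org.apache.hadoop.fs.Path;
>  import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver;
>  import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
>  import org.apache.mahout.common.distance.DistanceMeasure;
>  import org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure;
>  import org.apache.mahout.utils.clustering.ClusterDumper;
>
>  public class FuzzyKMeansExample {
>    public static void main(String[] args) throws Exception {
>      Configuration conf = new Configuration();
>      Path input = new Path("sensei/image-tag.arff.mvc");
>      Path clustersIn = new Path("sensei/clusters/clusters-0");
>      Path output = new Path("sensei/clusters");
>      DistanceMeasure measure = new SquaredEuclideanDistanceMeasure();
>
>      // Sample k=10 random seed clusters, as the CLI does for --numClusters.
>      RandomSeedGenerator.buildRandom(conf, input, clustersIn, 10, measure);
>
>      // runClustering=true is the Java equivalent of -cl: it writes the
>      // classified points to <output>/clusteredPoints after the last iteration.
>      FuzzyKMeansDriver.run(conf, input, clustersIn, output, measure,
>          0.5,   // convergenceDelta (the CLI default)
>          10,    // maxIterations
>          5.0f,  // m (fuzziness), as in your commands
>          true,  // runClustering (-cl)
>          false, // emitMostLikely
>          0.0,   // threshold
>          false  // runSequential
>      );
>
>      // Dump the final clusters together with their member points;
>      // clusters-4 is whichever iteration directory was written last.
>      ClusterDumper dumper = new ClusterDumper(
>          new Path(output, "clusters-4"),
>          new Path(output, "clusteredPoints"));
>      dumper.printClusters(null); // null = no term dictionary
>    }
>  }
>
>Note that the run(...) call with runClustering=true produces <output>/clusteredPoints, which is exactly what the -p/--pointsDir option of clusterdump wants.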
>
>Did you change your invocation to specify a different -c directory (e.g. clusters-0)?
>Did you add the -cl argument?
>What is the new CLI invocation for clusterdump?
>Did this work for -k 10? What happens with -k 50?
>Have you tried kmeans?
>
>I can help you better if you will give me answers to my questions.
>
>-----Original Message-----
>From: Jeffrey [mailto:[email protected]]
>Sent: Thursday, July 21, 2011 4:30 AM
>To: [email protected]
>Subject: Re: fkmeans or Cluster Dumper not working?
>
>Hi again,
>
>Let me update you on what's working and what's not.
>
>Works:
>fkmeans clustering (10 clusters) - thanks Jeff for the -cl tip
>fkmeans clustering (5 clusters)
>clusterdump (5 clusters) - so points are not included in the clusterdump and I need to write a program for it?
>
>Not Working:
>fkmeans clustering (50 clusters) - same error
>clusterdump (10 clusters) - same error
>
>So it seems that, to attach points to the cluster dumper output the way the synthetic control example does, I would have to write some code, as pointed out by @Frank_Scholten?
>https://twitter.com/#!/Frank_Scholten/status/93617269296472064
>
>Best wishes,
>Jeffrey04
>
>>________________________________
>>From: Jeff Eastman <[email protected]>
>>To: "[email protected]" <[email protected]>; Jeffrey <[email protected]>
>>Sent: Wednesday, July 20, 2011 11:53 PM
>>Subject: RE: fkmeans or Cluster Dumper not working?
>>
>>Hi Jeffrey,
>>
>>It is always difficult to debug remotely, but here are some suggestions:
>>- First, you are specifying both an input clusters directory (--clusters) and a cluster count (--numClusters), so the job is sampling 10 points from your input data set and writing them to clusteredPoints as the prior clusters for the first iteration. You should pick a different name for this directory, because the clusteredPoints directory is used by the -cl (--clustering) option (which you did not supply) to write out the clustered (classified) input vectors. When you subsequently supplied clusteredPoints to the clusterdumper, it was expecting a different format, and that caused the exception you saw (the record format that -cl writes is shown in the sketch after this list). Change your --clusters directory (clusters-0 is good) and add a -cl argument, and things should go more smoothly. The -cl option is not the default, so no clustering of the input points is performed without it. (Many people get caught by this, and perhaps the default should be changed, but clustering can be expensive and so it is not performed without request.)
>>- If you still have problems, try again with k-means. It is quite similar to fkmeans, and if you see the same problems with k-means, that will rule out fkmeans itself as the cause.
>>- I don't see why changing the -k argument from 10 to 50 should cause any problems, unless your vectors are very large and you are getting an OOME (OutOfMemoryError) in the reducer. Since the reducer is calculating centroid vectors for the next iteration, these will become more dense and memory use will increase substantially.
>>- I can't figure out what might be causing your second exception. It is bombing inside Hadoop file I/O, which causes me to suspect command-argument problems.
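>>
>>To be concrete about that format: with -cl, each record in clusteredPoints is an IntWritable cluster id mapped to a WeightedVectorWritable point, and that is what the cluster dumper's readPoints expects. (Your second stack trace below shows it getting Text keys instead, because it was handed the random-seed directory.) If you ever do want your own program, here is a minimal, untested sketch of reading that directory directly; the part file name is an assumption and may differ on your cluster:
>>
>>  import java.io.IOException;
>>  import org.apache.hadoop.conf.Configuration;
>>  import org.apache.hadoop.fs.FileSystem;
>>  import org.apache.hadoop.fs.Path;
>>  import org.apache.hadoop.io.IntWritable;
>>  import org.apache.hadoop.io.SequenceFile;
>>  import org.apache.mahout.clustering.WeightedVectorWritable;
>>
>>  public class ClusteredPointsReader {
>>    public static void main(String[] args) throws IOException {
>>      Configuration conf = new Configuration();
>>      FileSystem fs = FileSystem.get(conf);
>>      // Hypothetical part file name; list the directory to find yours.
>>      Path part = new Path("sensei/clusters/clusteredPoints/part-m-00000");
>>      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
>>      IntWritable clusterId = new IntWritable();
>>      WeightedVectorWritable point = new WeightedVectorWritable();
>>      // Each record: (cluster id, point with its membership weight).
>>      while (reader.next(clusterId, point)) {
>>        System.out.println(clusterId.get() + "\t" + point.getWeight()
>>            + "\t" + point.getVector());
>>      }
>>      reader.close();
>>    }
>>  }
>>
>>(ClusterDumper does the same read internally, so you only need something like this if you want custom output.)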
>>
>>Hope this helps,
>>Jeff
>>
>>-----Original Message-----
>>From: Jeffrey [mailto:[email protected]]
>>Sent: Wednesday, July 20, 2011 2:41 AM
>>To: [email protected]
>>Subject: fkmeans or Cluster Dumper not working?
>>
>>Hi,
>>
>>I am trying to generate clusters from my test data using the fkmeans command line tool. I am not sure if this is correct, as it only runs one iteration (output from 0.6-snapshot; I had to use a workaround for a weird bug - http://search.lucidimagination.com/search/document/d95ff0c29ac4a8a7/bug_in_fkmeans ).
>>
>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusteredPoints --maxIter 10 --numClusters 10 --overwrite --m 5
>>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>11/07/20 14:05:18 INFO common.AbstractJob: Command line arguments: {--clusters=sensei/clusteredPoints, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --emitMostLikely=true, --endPhase=2147483647, --input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, --numClusters=10, --output=sensei/clusters, --overwrite=null, --startPhase=0, --tempDir=temp, --threshold=0}
>>11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusters
>>11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusteredPoints
>>11/07/20 14:05:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>11/07/20 14:05:20 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new compressor
>>11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new decompressor
>>11/07/20 14:05:29 INFO kmeans.RandomSeedGenerator: Wrote 10 vectors to sensei/clusteredPoints/part-randomSeed
>>11/07/20 14:05:29 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means Iteration 1
>>11/07/20 14:05:30 INFO input.FileInputFormat: Total input paths to process : 1
>>11/07/20 14:05:30 INFO mapred.JobClient: Running job: job_201107201152_0021
>>11/07/20 14:05:31 INFO mapred.JobClient: map 0% reduce 0%
>>11/07/20 14:05:54 INFO mapred.JobClient: map 2% reduce 0%
>>11/07/20 14:05:57 INFO mapred.JobClient: map 5% reduce 0%
>>11/07/20 14:06:00 INFO mapred.JobClient: map 6% reduce 0%
>>11/07/20 14:06:03 INFO mapred.JobClient: map 7% reduce 0%
>>11/07/20 14:06:07 INFO mapred.JobClient: map 10% reduce 0%
>>11/07/20 14:06:10 INFO mapred.JobClient: map 13% reduce 0%
>>11/07/20 14:06:13 INFO mapred.JobClient: map 15% reduce 0%
>>11/07/20 14:06:16 INFO mapred.JobClient: map 17% reduce 0%
>>11/07/20 14:06:19 INFO mapred.JobClient: map 19% reduce 0%
>>11/07/20 14:06:22 INFO mapred.JobClient: map 23% reduce 0%
>>11/07/20 14:06:25 INFO mapred.JobClient: map 25% reduce 0%
>>11/07/20 14:06:28 INFO mapred.JobClient: map 27% reduce 0%
>>11/07/20 14:06:31 INFO mapred.JobClient: map 30% reduce 0%
>>11/07/20 14:06:34 INFO mapred.JobClient: map 33% reduce 0%
>>11/07/20 14:06:37 INFO mapred.JobClient: map 36% reduce 0%
>>11/07/20 14:06:40 INFO mapred.JobClient: map 37% reduce 0%
>>11/07/20 14:06:43 INFO mapred.JobClient: map 40% reduce 0%
>>11/07/20 14:06:46 INFO mapred.JobClient: map 43% reduce 0%
>>11/07/20 14:06:49 INFO mapred.JobClient: map 46% reduce 0%
>>11/07/20 14:06:52 INFO mapred.JobClient: map 48% reduce 0%
>>11/07/20 14:06:55 INFO mapred.JobClient: map 50% reduce 0%
>>11/07/20 14:06:57 INFO mapred.JobClient: map 53% reduce 0%
>>11/07/20 14:07:00 INFO mapred.JobClient: map 56% reduce 0%
>>11/07/20 14:07:03 INFO mapred.JobClient: map 58% reduce 0%
>>11/07/20 14:07:06 INFO mapred.JobClient: map 60% reduce 0%
>>11/07/20 14:07:09 INFO mapred.JobClient: map 63% reduce 0%
>>11/07/20 14:07:13 INFO mapred.JobClient: map 65% reduce 0%
>>11/07/20 14:07:16 INFO mapred.JobClient: map 67% reduce 0%
>>11/07/20 14:07:19 INFO mapred.JobClient: map 70% reduce 0%
>>11/07/20 14:07:22 INFO mapred.JobClient: map 73% reduce 0%
>>11/07/20 14:07:25 INFO mapred.JobClient: map 75% reduce 0%
>>11/07/20 14:07:28 INFO mapred.JobClient: map 77% reduce 0%
>>11/07/20 14:07:31 INFO mapred.JobClient: map 80% reduce 0%
>>11/07/20 14:07:34 INFO mapred.JobClient: map 83% reduce 0%
>>11/07/20 14:07:37 INFO mapred.JobClient: map 85% reduce 0%
>>11/07/20 14:07:40 INFO mapred.JobClient: map 87% reduce 0%
>>11/07/20 14:07:43 INFO mapred.JobClient: map 89% reduce 0%
>>11/07/20 14:07:46 INFO mapred.JobClient: map 92% reduce 0%
>>11/07/20 14:07:49 INFO mapred.JobClient: map 95% reduce 0%
>>11/07/20 14:07:55 INFO mapred.JobClient: map 98% reduce 0%
>>11/07/20 14:07:59 INFO mapred.JobClient: map 99% reduce 0%
>>11/07/20 14:08:02 INFO mapred.JobClient: map 100% reduce 0%
>>11/07/20 14:08:23 INFO mapred.JobClient: map 100% reduce 100%
>>11/07/20 14:08:31 INFO mapred.JobClient: Job complete: job_201107201152_0021
>>11/07/20 14:08:31 INFO mapred.JobClient: Counters: 26
>>11/07/20 14:08:31 INFO mapred.JobClient:   Job Counters
>>11/07/20 14:08:31 INFO mapred.JobClient:     Launched reduce tasks=1
>>11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=149314
>>11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
>>11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
>>11/07/20 14:08:31 INFO mapred.JobClient:     Launched map tasks=1
>>11/07/20 14:08:31 INFO mapred.JobClient:     Data-local map tasks=1
>>11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=15618
>>11/07/20 14:08:31 INFO mapred.JobClient:   File Output Format Counters
>>11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Written=2247222
>>11/07/20 14:08:31 INFO mapred.JobClient:   Clustering
>>11/07/20 14:08:31 INFO mapred.JobClient:     Converged Clusters=10
>>11/07/20 14:08:31 INFO mapred.JobClient:   FileSystemCounters
>>11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_READ=130281382
>>11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_READ=254494
>>11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=132572666
>>11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2247222
>>11/07/20 14:08:31 INFO mapred.JobClient:   File Input Format Counters
>>11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Read=247443
>>11/07/20 14:08:31 INFO mapred.JobClient:   Map-Reduce Framework
>>11/07/20 14:08:31 INFO mapred.JobClient:     Reduce input groups=10
>>11/07/20 14:08:31 INFO mapred.JobClient:     Map output materialized bytes=2246233
>>11/07/20 14:08:32 INFO mapred.JobClient:     Combine output records=330
>>11/07/20 14:08:32 INFO mapred.JobClient:     Map input records=1113
>>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce shuffle bytes=2246233
>>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce output records=10
>>11/07/20 14:08:32 INFO mapred.JobClient:     Spilled Records=590
>>11/07/20 14:08:32 INFO mapred.JobClient:     Map output bytes=2499995001
>>11/07/20 14:08:32 INFO mapred.JobClient:     Combine input records=11450
>>11/07/20 14:08:32 INFO mapred.JobClient:     Map output records=11130
>>11/07/20 14:08:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
>>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce input records=10
>>11/07/20 14:08:32 INFO driver.MahoutDriver: Program took 194096 ms
>>
>>If I increase the --numClusters argument (e.g. to 50), it throws an exception after
>>
>>11/07/20 14:08:02 INFO mapred.JobClient: map 100% reduce 0%
>>
>>and then retries (also reproducible using 0.6-snapshot):
>>
>>...
>>11/07/20 14:22:25 INFO mapred.JobClient: map 100% reduce 0%
>>11/07/20 14:22:30 INFO mapred.JobClient: Task Id : attempt_201107201152_0022_m_000000_0, Status : FAILED
>>org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
>>  at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>>  at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>>  at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>>  at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
>>  at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
>>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>  at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>>  at java.security.AccessController.doPrivileged(Native Method)
>>  at javax.security.auth.Subject.doAs(Subject.java:416)
>>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>  at org.apache.hadoop.mapred.Child.main(Child.java:253)
>>
>>11/07/20 14:22:32 INFO mapred.JobClient: map 0% reduce 0%
>>...
>>
>>Then I ran the cluster dumper to dump information about the clusters. This command works if I only care about the cluster centroids (both the 0.5 release and 0.6-snapshot):
>>
>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt
>>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>11/07/20 14:33:45 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>11/07/20 14:33:56 INFO driver.MahoutDriver: Program took 11761 ms
>>
>>But if I want to see the degree of membership of each point, I get another exception (yes, reproducible on both the 0.5 release and 0.6-snapshot):
>>
>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt --pointsDir sensei/clusteredPoints
>>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>11/07/20 14:35:08 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --pointsDir=sensei/clusteredPoints, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>11/07/20 14:35:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>11/07/20 14:35:10 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>11/07/20 14:35:10 INFO compress.CodecPool: Got brand-new decompressor
>>Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>>  at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:261)
>>  at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>>  at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>>  at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>  at java.lang.reflect.Method.invoke(Method.java:616)
>>  at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>  at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>  at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>  at java.lang.reflect.Method.invoke(Method.java:616)
>>  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>>Erm, would writing a short program to call the API (by the way, I can't seem to find the latest API docs?) be a better choice here? Or did I do something wrong (yes, Java is not my main language, and I am very new to Mahout... and Hadoop)?
>>
>>The data is converted from an arff file with about 1000 rows (resources) and 14k columns (tags), and it is just a subset of my data. (I actually made a mistake, so it is now generating resource clusters instead of tag clusters, but I am just doing this as a proof of concept to see whether Mahout is good enough for the task.)
>>
>>Best wishes,
>>Jeffrey04
