Hi Danny,

I have read a small portion of the source code. For variation 1, an initial cluster is generated using RandomSeedGenerator if none is found in the path, so I don't have to create the initial cluster myself. For variation 2, I have actually generated the initial cluster using this code:

    (RandomSeedGenerator/buildRandom hadoop_configuration input_path clusters_in_path (int 2) (new EuclideanDistanceMeasure))

I should have also mentioned that I am running my code using mahout 0.6-snapshot :)

Thanks for the reply anyway :)

best wishes,
Jeffrey04

>________________________________
>From: Danny Bickson <[email protected]>
>To: [email protected]; Jeffrey <[email protected]>
>Sent: Monday, September 12, 2011 3:31 PM
>Subject: Re: #clojure #fkmeans - Clustering of Test Data Failed
>
>
>Hi Jeffery!
>I have encountered this problem as well. The workaround, is to run one
>iteration of k-means, to create initial cluster assignment and
>then run fuzzy k-means using the output from the first iteration of k-means.
>
>Hope this helps,
>
>Danny Bickson
>
>
>On Mon, Sep 12, 2011 at 10:15 AM, Jeffrey <[email protected]> wrote:
>
>Hi,
>>
>>I have a test data that has a number of points, written to a sequence file
>>using a Clojure script as follows (I am equally just as bad in both JAVA and
>>Clojure, since I really don't like JAVA I wrote my scripts in Clojure
>>whenever possible).
>>
>>    #!./bin/clj
>>    (ns sensei.sequence.core)
>>
>>    (require 'clojure.string)
>>    (require 'clojure.java.io)
>>
>>    (import org.apache.hadoop.conf.Configuration)
>>    (import org.apache.hadoop.fs.FileSystem)
>>    (import org.apache.hadoop.fs.Path)
>>    (import org.apache.hadoop.io.SequenceFile)
>>    (import org.apache.hadoop.io.Text)
>>
>>    (import org.apache.mahout.math.VectorWritable)
>>    (import org.apache.mahout.math.SequentialAccessSparseVector)
>>
>>    (with-open [reader (clojure.java.io/reader *in*)]
>>      (let [hadoop_configuration ((fn []
>>                                    (let [conf (new Configuration)]
>>                                      (. conf set "fs.default.name" "hdfs://localhost:9000/")
>>                                      conf)))
>>            hadoop_fs (FileSystem/get hadoop_configuration)]
>>        (reduce
>>          (fn [writer [index value]]
>>            (. writer append index value)
>>            writer)
>>          (SequenceFile/createWriter
>>            hadoop_fs
>>            hadoop_configuration
>>            (new Path "test/sensei")
>>            Text
>>            VectorWritable)
>>          (map
>>            (fn [[tag row_vector]]
>>              (let [input_index (new Text tag)
>>                    input_vector (new VectorWritable)]
>>                (. input_vector set row_vector)
>>                [input_index input_vector]))
>>            (map
>>              (fn [[tag photo_list]]
>>                (let [photo_map (apply hash-map photo_list)
>>                      input_vector (new SequentialAccessSparseVector (count (vals photo_map)))]
>>                  (loop [frequency_list (vals photo_map)]
>>                    (if (zero? (count frequency_list))
>>                      [tag input_vector]
>>                      (when-not (zero? (count frequency_list))
>>                        (. input_vector set
>>                           (mod (count frequency_list) (count (vals photo_map)))
>>                           (Integer/parseInt (first frequency_list)))
>>                        (recur (rest frequency_list)))))))
>>              (reduce
>>                (fn [result next_line]
>>                  (let [[tag photo frequency] (clojure.string/split next_line #" ")]
>>                    (update-in result [tag]
>>                      #(if (nil? %)
>>                         [photo frequency]
>>                         (conj % photo frequency)))))
>>                {}
>>                (line-seq reader)))))))
>>
>>Basically the script receives input (from stdin) in this format
>>
>>    tag_uri image_uri count
>>
>>e.g.
>>
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/13980928@N03/6001200971 0
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/21207178@N07/5441742937 0
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/25845846@N06/3033371575 0
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/30366924@N08/5772100510 0
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/31343451@N00/5957189406 0
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/36662563@N00/4815218552 1
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/38583880@N00/5686968462 0
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/43335486@N00/5794673203 0
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/46857830@N03/5651576112 0
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/99996011@N00/5396566822 0
>>
>>Then turn them into sequence file with each entry represents one point (10
>>dimensions in this example) with key set to tag_uri
>><http://flickr.com/photos/tags/ísland> and value set to point described by
>>the frequency vector (0 0 0 0 0 1 0 0 0 0)
>>
>>I then use a script (available in 2 different variations) to send the data in
>>as a clustering job, however I am getting error that I don't know how this
>>can be fixed. It seems that something is wrong with the initial cluster.
>>
>>Script variation 1
>>
>>    #!./bin/clj
>>
>>    (ns sensei.clustering.fkmeans)
>>
>>    (import org.apache.hadoop.conf.Configuration)
>>    (import org.apache.hadoop.fs.Path)
>>
>>    (import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver)
>>    (import org.apache.mahout.common.distance.EuclideanDistanceMeasure)
>>    (import org.apache.mahout.clustering.kmeans.RandomSeedGenerator)
>>
>>    (let [hadoop_configuration ((fn []
>>                                  (let [conf (new Configuration)]
>>                                    (. conf set "fs.default.name" "hdfs://localhost:9000/")
>>                                    conf)))
>>          driver (new FuzzyKMeansDriver)]
>>      (. driver setConf hadoop_configuration)
>>      (. driver
>>         run
>>         (into-array String ["--input" "test/sensei"
>>                             "--output" "test/clusters"
>>                             "--clusters" "test/clusters/clusters-0"
>>                             "--clustering"
>>                             "--overwrite"
>>                             "--emitMostLikely" "false"
>>                             "--numClusters" "3"
>>                             "--maxIter" "10"
>>                             "--m" "5"])))
>>
>>Script variation 2:
>>
>>    #!./bin/clj
>>
>>    (ns sensei.clustering.fkmeans)
>>
>>    (import org.apache.hadoop.conf.Configuration)
>>    (import org.apache.hadoop.fs.Path)
>>
>>    (import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver)
>>    (import org.apache.mahout.common.distance.EuclideanDistanceMeasure)
>>    (import org.apache.mahout.clustering.kmeans.RandomSeedGenerator)
>>
>>    (let [hadoop_configuration ((fn []
>>                                  (let [conf (new Configuration)]
>>                                    (. conf set "fs.default.name" "hdfs://127.0.0.1:9000/")
>>                                    conf)))
>>          input_path (new Path "test/sensei")
>>          output_path (new Path "test/clusters")
>>          clusters_in_path (new Path "test/clusters/cluster-0")]
>>      (FuzzyKMeansDriver/run
>>        hadoop_configuration
>>        input_path
>>        (RandomSeedGenerator/buildRandom
>>          hadoop_configuration
>>          input_path
>>          clusters_in_path
>>          (int 2)
>>          (new EuclideanDistanceMeasure))
>>        output_path
>>        (new EuclideanDistanceMeasure)
>>        (double 0.5)
>>        (int 10)
>>        (float 5.0)
>>        true
>>        false
>>        (double 0.0)
>>        false)) '' runSequential
>>
>>I am getting the same error with both variations
>>
>>    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>    SLF4J: Defaulting to no-operation (NOP) logger implementation
>>    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>>    11/08/25 15:20:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>    11/08/25 15:20:16 INFO compress.CodecPool: Got brand-new compressor
>>    11/08/25 15:20:16 INFO compress.CodecPool: Got brand-new decompressor
>>    11/08/25 15:20:17 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>>    11/08/25 15:20:17 INFO input.FileInputFormat: Total input paths to process : 1
>>    11/08/25 15:20:17 INFO mapred.JobClient: Running job: job_local_0001
>>    11/08/25 15:20:17 INFO mapred.MapTask: io.sort.mb = 100
>>    11/08/25 15:20:17 INFO mapred.MapTask: data buffer = 79691776/99614720
>>    11/08/25 15:20:17 INFO mapred.MapTask: record buffer = 262144/327680
>>    11/08/25 15:20:17 WARN mapred.LocalJobRunner: job_local_0001
>>    java.lang.IllegalStateException: No clusters found. Check your -c path.
>>        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansMapper.setup(FuzzyKMeansMapper.java:62)
>>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
>>    11/08/25 15:20:18 INFO mapred.JobClient:  map 0% reduce 0%
>>    11/08/25 15:20:18 INFO mapred.JobClient: Job complete: job_local_0001
>>    11/08/25 15:20:18 INFO mapred.JobClient: Counters: 0
>>    Exception in thread "main" java.lang.RuntimeException: java.lang.InterruptedException: Fuzzy K-Means Iteration failed processing test/clusters/cluster-0/part-randomSeed
>>        at clojure.lang.Util.runtimeException(Util.java:153)
>>        at clojure.lang.Compiler.eval(Compiler.java:6417)
>>        at clojure.lang.Compiler.load(Compiler.java:6843)
>>        at clojure.lang.Compiler.loadFile(Compiler.java:6804)
>>        at clojure.main$load_script.invoke(main.clj:282)
>>        at clojure.main$script_opt.invoke(main.clj:342)
>>        at clojure.main$main.doInvoke(main.clj:426)
>>        at clojure.lang.RestFn.invoke(RestFn.java:436)
>>        at clojure.lang.Var.invoke(Var.java:409)
>>        at clojure.lang.AFn.applyToHelper(AFn.java:167)
>>        at clojure.lang.Var.applyTo(Var.java:518)
>>        at clojure.main.main(main.java:37)
>>    Caused by: java.lang.InterruptedException: Fuzzy K-Means Iteration failed processing test/clusters/cluster-0/part-randomSeed
>>        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.runIteration(FuzzyKMeansDriver.java:252)
>>        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClustersMR(FuzzyKMeansDriver.java:421)
>>        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:345)
>>        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:295)
>>        at sensei.clustering.fkmeans$eval17.invoke(fkmeans.clj:35)
>>        at clojure.lang.Compiler.eval(Compiler.java:6406)
>>        ... 10 more
>>
>>Notice there is a runSequential flag for the 2nd variation, if I set it to
>>true
>>
>>    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>    SLF4J: Defaulting to no-operation (NOP) logger implementation
>>    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>>    11/09/07 14:32:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>    11/09/07 14:32:32 INFO compress.CodecPool: Got brand-new compressor
>>    11/09/07 14:32:32 INFO compress.CodecPool: Got brand-new decompressor
>>    Exception in thread "main" java.lang.IllegalStateException: Clusters is empty!
>>        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClustersSeq(FuzzyKMeansDriver.java:361)
>>        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:343)
>>        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:295)
>>        at sensei.clustering.fkmeans$eval17.invoke(fkmeans.clj:35)
>>        at clojure.lang.Compiler.eval(Compiler.java:6465)
>>        at clojure.lang.Compiler.load(Compiler.java:6902)
>>        at clojure.lang.Compiler.loadFile(Compiler.java:6863)
>>        at clojure.main$load_script.invoke(main.clj:282)
>>        at clojure.main$script_opt.invoke(main.clj:342)
>>        at clojure.main$main.doInvoke(main.clj:426)
>>        at clojure.lang.RestFn.invoke(RestFn.java:436)
>>        at clojure.lang.Var.invoke(Var.java:409)
>>        at clojure.lang.AFn.applyToHelper(AFn.java:167)
>>        at clojure.lang.Var.applyTo(Var.java:518)
>>        at clojure.main.main(main.java:37)
>>
>>Now, if I cluster the data using the CLI tool, it will complete without error
>>
>>    $ bin/mahout fkmeans --input test/sensei --output test/clusters --clusters test/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>>
>>However, even there is this option: --clustering, I am not seeing any points
>>in the cluster dump generated with this command
>>
>>    $ ./bin/mahout clusterdump --seqFileDir test/clusters/clusters-1 --pointsDir test/clusters/clusteredPoints --output sensei.txt
>>
>>And yeah, the command completed without any error too.
>>
>>... been stuck with this problem over and over again for months, and I can't
>>still get the clustering done properly :(
>>
>>Best wishes,
>>Jeffrey04
>
>
>
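
For reference, the grouping step described in the original message (lines of "tag_uri image_uri count" turned into one frequency vector per tag) can be sketched in plain Clojure, without Hadoop or Mahout. This is only a simplified illustration, not the script quoted above: the namespace and the names parse-line and tag-vectors are made up for this sketch, and it assumes the counts are kept in input order as a plain Clojure vector rather than written to a SequentialAccessSparseVector.

```clojure
(ns sketch.vectors
  (:require [clojure.string :as string]))

;; Hypothetical helper: split one "tag_uri image_uri count" line
;; into [tag photo count], with the count parsed as an integer.
(defn parse-line
  [line]
  (let [[tag photo frequency] (string/split line #" ")]
    [tag photo (Integer/parseInt frequency)]))

;; Hypothetical helper: group lines by tag. Each tag maps to a plain
;; vector of counts in input order (an assumption of this sketch).
(defn tag-vectors
  [lines]
  (reduce (fn [result line]
            (let [[tag _ frequency] (parse-line line)]
              (update-in result [tag] (fnil conj []) frequency)))
          {}
          lines))

;; Usage: two lines share tag "t", one line has tag "u".
(tag-vectors ["t a 0" "t b 1" "u c 2"])
;; => {"t" [0 1], "u" [2]}
```

Applied to the ten sample lines above, which all share one tag, this would produce a single entry whose value is the 10-element vector [0 0 0 0 0 1 0 0 0 0].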
