Btw, does clusterdump return the points of each cluster, like in the synthetic control data example?
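For what it's worth, the point-to-cluster assignments can also be inspected directly, without clusterdump: the clusteredPoints directory is a sequence file keyed by cluster id. The sketch below is untested and assumes the Mahout 0.5/0.6 output layout (IntWritable key, WeightedVectorWritable value); the part-m-00000 filename is an assumption about what the job wrote.

```clojure
(import org.apache.hadoop.conf.Configuration)
(import org.apache.hadoop.fs.FileSystem)
(import org.apache.hadoop.fs.Path)
(import org.apache.hadoop.io.IntWritable)
(import org.apache.hadoop.io.SequenceFile$Reader)
(import org.apache.mahout.clustering.WeightedVectorWritable)

;; Untested sketch: walk one part file of clusteredPoints and print
;; "cluster-id -> point" for every clustered point.
(let [conf ((fn []
              (let [c (new Configuration)]
                (. c set "fs.default.name" "hdfs://localhost:9000/")
                c)))
      fs (FileSystem/get conf)
      part (new Path "test/clusters/clusteredPoints/part-m-00000")
      reader (new SequenceFile$Reader fs part conf)
      cluster_id (new IntWritable)
      point (new WeightedVectorWritable)]
  (while (. reader next cluster_id point)
    (println (str (. cluster_id get) " -> " point)))
  (. reader close))
```

Running `bin/mahout seqdumper` over the same directory should show the same key/value pairs, which is a quick way to check whether `--clustering` actually populated it.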
>________________________________
>From: "Choon-Siang "Jeffrey04" Lai" <[email protected]>
>To: "[email protected]" <[email protected]>
>Sent: Monday, September 12, 2011 3:49 PM
>Subject: Re: #clojure #fkmeans - Clustering of Test Data Failed
>
>Hi Danny,
>
>I have read a small portion of the source code. For variation 1, an initial
>set of clusters is generated with RandomSeedGenerator if none is found at
>the path, so I don't have to build the initial clusters myself. For
>variation 2, I have actually generated the initial clusters using this code:
>
>    (RandomSeedGenerator/buildRandom hadoop_configuration input_path
>                                     clusters_in_path (int 2)
>                                     (new EuclideanDistanceMeasure))
>
>I should also have mentioned that I am running my code against Mahout
>0.6-SNAPSHOT :)
>
>Thanks for the reply anyway :)
>
>best wishes,
>Jeffrey04
>
>
>>________________________________
>>From: Danny Bickson <[email protected]>
>>To: [email protected]; Jeffrey <[email protected]>
>>Sent: Monday, September 12, 2011 3:31 PM
>>Subject: Re: #clojure #fkmeans - Clustering of Test Data Failed
>>
>>Hi Jeffrey!
>>I have encountered this problem as well. The workaround is to run one
>>iteration of k-means to create an initial cluster assignment, and then
>>run fuzzy k-means using the output of that first k-means iteration.
>>
>>Hope this helps,
>>
>>Danny Bickson
>>
>>
>>On Mon, Sep 12, 2011 at 10:15 AM, Jeffrey <[email protected]> wrote:
>>
>>>Hi,
>>>
>>>I have test data consisting of a number of points, written to a sequence
>>>file using the Clojure script below (I am equally bad in both Java and
>>>Clojure; since I really don't like Java, I write my scripts in Clojure
>>>whenever possible).
>>>
>>>    #!./bin/clj
>>>    (ns sensei.sequence.core)
>>>
>>>    (require 'clojure.string)
>>>    (require 'clojure.java.io)
>>>
>>>    (import org.apache.hadoop.conf.Configuration)
>>>    (import org.apache.hadoop.fs.FileSystem)
>>>    (import org.apache.hadoop.fs.Path)
>>>    (import org.apache.hadoop.io.SequenceFile)
>>>    (import org.apache.hadoop.io.Text)
>>>
>>>    (import org.apache.mahout.math.VectorWritable)
>>>    (import org.apache.mahout.math.SequentialAccessSparseVector)
>>>
>>>    (with-open [reader (clojure.java.io/reader *in*)]
>>>      (let [hadoop_configuration ((fn []
>>>                                    (let [conf (new Configuration)]
>>>                                      (. conf set "fs.default.name"
>>>                                         "hdfs://localhost:9000/")
>>>                                      conf)))
>>>            hadoop_fs (FileSystem/get hadoop_configuration)]
>>>        (reduce
>>>          (fn [writer [index value]]
>>>            (. writer append index value)
>>>            writer)
>>>          (SequenceFile/createWriter
>>>            hadoop_fs
>>>            hadoop_configuration
>>>            (new Path "test/sensei")
>>>            Text
>>>            VectorWritable)
>>>          (map
>>>            (fn [[tag row_vector]]
>>>              (let [input_index (new Text tag)
>>>                    input_vector (new VectorWritable)]
>>>                (. input_vector set row_vector)
>>>                [input_index input_vector]))
>>>            (map
>>>              (fn [[tag photo_list]]
>>>                (let [photo_map (apply hash-map photo_list)
>>>                      input_vector (new SequentialAccessSparseVector
>>>                                        (count (vals photo_map)))]
>>>                  (loop [frequency_list (vals photo_map)]
>>>                    (if (zero? (count frequency_list))
>>>                      [tag input_vector]
>>>                      (do
>>>                        (. input_vector set
>>>                           (mod (count frequency_list)
>>>                                (count (vals photo_map)))
>>>                           (Integer/parseInt (first frequency_list)))
>>>                        (recur (rest frequency_list)))))))
>>>              (reduce
>>>                (fn [result next_line]
>>>                  (let [[tag photo frequency]
>>>                        (clojure.string/split next_line #" ")]
>>>                    (update-in result [tag]
>>>                      #(if (nil? %)
>>>                         [photo frequency]
>>>                         (conj % photo frequency)))))
>>>                {}
>>>                (line-seq reader)))))))
>>>
>>>Basically the script receives input (from stdin) in this format:
>>>
>>>    tag_uri image_uri count
>>>
>>>e.g.
>>>
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/13980928@N03/6001200971 0
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/21207178@N07/5441742937 0
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/25845846@N06/3033371575 0
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/30366924@N08/5772100510 0
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/31343451@N00/5957189406 0
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/36662563@N00/4815218552 1
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/38583880@N00/5686968462 0
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/43335486@N00/5794673203 0
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/46857830@N03/5651576112 0
>>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/99996011@N00/5396566822 0
>>>
>>>The script turns them into a sequence file in which each entry represents
>>>one point (10 dimensions in this example), with the key set to the tag_uri
>>>(http://flickr.com/photos/tags/ísland) and the value set to the point
>>>described by the frequency vector (0 0 0 0 0 1 0 0 0 0).
>>>
>>>I then use a script (available in 2 different variations) to submit the
>>>data as a clustering job; however, I am getting an error that I don't know
>>>how to fix. It seems that something is wrong with the initial clusters.
>>>
>>>Script variation 1:
>>>
>>>    #!./bin/clj
>>>
>>>    (ns sensei.clustering.fkmeans)
>>>
>>>    (import org.apache.hadoop.conf.Configuration)
>>>    (import org.apache.hadoop.fs.Path)
>>>
>>>    (import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver)
>>>    (import org.apache.mahout.common.distance.EuclideanDistanceMeasure)
>>>    (import org.apache.mahout.clustering.kmeans.RandomSeedGenerator)
>>>
>>>    (let [hadoop_configuration ((fn []
>>>                                  (let [conf (new Configuration)]
>>>                                    (. conf set "fs.default.name"
>>>                                       "hdfs://localhost:9000/")
>>>                                    conf)))
>>>          driver (new FuzzyKMeansDriver)]
>>>      (. driver setConf hadoop_configuration)
>>>      (. driver
>>>         run
>>>         (into-array String ["--input" "test/sensei"
>>>                             "--output" "test/clusters"
>>>                             "--clusters" "test/clusters/clusters-0"
>>>                             "--clustering"
>>>                             "--overwrite"
>>>                             "--emitMostLikely" "false"
>>>                             "--numClusters" "3"
>>>                             "--maxIter" "10"
>>>                             "--m" "5"])))
>>>
>>>Script variation 2:
>>>
>>>    #!./bin/clj
>>>
>>>    (ns sensei.clustering.fkmeans)
>>>
>>>    (import org.apache.hadoop.conf.Configuration)
>>>    (import org.apache.hadoop.fs.Path)
>>>
>>>    (import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver)
>>>    (import org.apache.mahout.common.distance.EuclideanDistanceMeasure)
>>>    (import org.apache.mahout.clustering.kmeans.RandomSeedGenerator)
>>>
>>>    (let [hadoop_configuration ((fn []
>>>                                  (let [conf (new Configuration)]
>>>                                    (. conf set "fs.default.name"
>>>                                       "hdfs://127.0.0.1:9000/")
>>>                                    conf)))
>>>          input_path (new Path "test/sensei")
>>>          output_path (new Path "test/clusters")
>>>          clusters_in_path (new Path "test/clusters/cluster-0")]
>>>      (FuzzyKMeansDriver/run
>>>        hadoop_configuration
>>>        input_path
>>>        (RandomSeedGenerator/buildRandom
>>>          hadoop_configuration
>>>          input_path
>>>          clusters_in_path
>>>          (int 2)
>>>          (new EuclideanDistanceMeasure))
>>>        output_path
>>>        (new EuclideanDistanceMeasure)
>>>        (double 0.5)
>>>        (int 10)
>>>        (float 5.0)
>>>        true
>>>        false
>>>        (double 0.0)
>>>        false))          ;; the last boolean is the runSequential flag
>>>
>>>I am getting the same error with both variations:
>>>
>>>    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>>    SLF4J: Defaulting to no-operation (NOP) logger implementation
>>>    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>>>    11/08/25 15:20:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>    11/08/25 15:20:16 INFO compress.CodecPool: Got brand-new compressor
>>>    11/08/25 15:20:16 INFO compress.CodecPool: Got brand-new decompressor
>>>    11/08/25 15:20:17 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>>>    11/08/25 15:20:17 INFO input.FileInputFormat: Total input paths to process : 1
>>>    11/08/25 15:20:17 INFO mapred.JobClient: Running job: job_local_0001
>>>    11/08/25 15:20:17 INFO mapred.MapTask: io.sort.mb = 100
>>>    11/08/25 15:20:17 INFO mapred.MapTask: data buffer = 79691776/99614720
>>>    11/08/25 15:20:17 INFO mapred.MapTask: record buffer = 262144/327680
>>>    11/08/25 15:20:17 WARN mapred.LocalJobRunner: job_local_0001
>>>    java.lang.IllegalStateException: No clusters found. Check your -c path.
>>>        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansMapper.setup(FuzzyKMeansMapper.java:62)
>>>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>>>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
>>>    11/08/25 15:20:18 INFO mapred.JobClient:  map 0% reduce 0%
>>>    11/08/25 15:20:18 INFO mapred.JobClient: Job complete: job_local_0001
>>>    11/08/25 15:20:18 INFO mapred.JobClient: Counters: 0
>>>    Exception in thread "main" java.lang.RuntimeException: java.lang.InterruptedException: Fuzzy K-Means Iteration failed processing test/clusters/cluster-0/part-randomSeed
>>>        at clojure.lang.Util.runtimeException(Util.java:153)
>>>        at clojure.lang.Compiler.eval(Compiler.java:6417)
>>>        at clojure.lang.Compiler.load(Compiler.java:6843)
>>>        at clojure.lang.Compiler.loadFile(Compiler.java:6804)
>>>        at clojure.main$load_script.invoke(main.clj:282)
>>>        at clojure.main$script_opt.invoke(main.clj:342)
>>>        at clojure.main$main.doInvoke(main.clj:426)
>>>        at clojure.lang.RestFn.invoke(RestFn.java:436)
>>>        at clojure.lang.Var.invoke(Var.java:409)
>>>        at clojure.lang.AFn.applyToHelper(AFn.java:167)
>>>        at clojure.lang.Var.applyTo(Var.java:518)
>>>        at clojure.main.main(main.java:37)
>>>    Caused by: java.lang.InterruptedException: Fuzzy K-Means Iteration failed processing test/clusters/cluster-0/part-randomSeed
>>>        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.runIteration(FuzzyKMeansDriver.java:252)
>>>        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClustersMR(FuzzyKMeansDriver.java:421)
>>>        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:345)
>>>        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:295)
>>>        at sensei.clustering.fkmeans$eval17.invoke(fkmeans.clj:35)
>>>        at clojure.lang.Compiler.eval(Compiler.java:6406)
>>>        ... 10 more
>>>
>>>Notice there is a runSequential flag in the 2nd variation; if I set it to
>>>true:
>>>
>>>    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>>    SLF4J: Defaulting to no-operation (NOP) logger implementation
>>>    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>>>    11/09/07 14:32:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>    11/09/07 14:32:32 INFO compress.CodecPool: Got brand-new compressor
>>>    11/09/07 14:32:32 INFO compress.CodecPool: Got brand-new decompressor
>>>    Exception in thread "main" java.lang.IllegalStateException: Clusters is empty!
>>>        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClustersSeq(FuzzyKMeansDriver.java:361)
>>>        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:343)
>>>        at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:295)
>>>        at sensei.clustering.fkmeans$eval17.invoke(fkmeans.clj:35)
>>>        at clojure.lang.Compiler.eval(Compiler.java:6465)
>>>        at clojure.lang.Compiler.load(Compiler.java:6902)
>>>        at clojure.lang.Compiler.loadFile(Compiler.java:6863)
>>>        at clojure.main$load_script.invoke(main.clj:282)
>>>        at clojure.main$script_opt.invoke(main.clj:342)
>>>        at clojure.main$main.doInvoke(main.clj:426)
>>>        at clojure.lang.RestFn.invoke(RestFn.java:436)
>>>        at clojure.lang.Var.invoke(Var.java:409)
>>>        at clojure.lang.AFn.applyToHelper(AFn.java:167)
>>>        at clojure.lang.Var.applyTo(Var.java:518)
>>>        at clojure.main.main(main.java:37)
>>>
>>>Now, if I cluster the data using the CLI tool instead, it completes
>>>without error:
>>>
>>>    $ bin/mahout fkmeans --input test/sensei --output test/clusters \
>>>        --clusters test/clusters/clusters-0 --clustering --overwrite \
>>>        --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>>>
>>>However, even though the --clustering option is given, I am not seeing
>>>any points in the cluster dump generated with this command:
>>>
>>>    $ ./bin/mahout clusterdump --seqFileDir test/clusters/clusters-1 \
>>>        --pointsDir test/clusters/clusteredPoints --output sensei.txt
>>>
>>>And yeah, that command completed without any error too.
>>>
>>>I have been stuck with this problem over and over again for months, and
>>>I still can't get the clustering done properly :(
>>>
>>>Best wishes,
>>>Jeffrey04
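For reference, the workaround Danny describes above (one k-means iteration to create an initial cluster assignment, then fuzzy k-means on its output) might look roughly like this in Clojure. This is an untested sketch: the driver argument lists follow the Mahout 0.5/0.6 signatures and may differ in other snapshots, the `test/seeds` and `test/kmeans` paths are made up for illustration, and the `clusters-1` directory name assumes k-means writes one `clusters-N` directory per iteration.

```clojure
(import org.apache.hadoop.conf.Configuration)
(import org.apache.hadoop.fs.Path)
(import org.apache.mahout.clustering.kmeans.KMeansDriver)
(import org.apache.mahout.clustering.kmeans.RandomSeedGenerator)
(import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver)
(import org.apache.mahout.common.distance.EuclideanDistanceMeasure)

(let [conf ((fn []
              (let [c (new Configuration)]
                (. c set "fs.default.name" "hdfs://localhost:9000/")
                c)))
      measure (new EuclideanDistanceMeasure)
      input_path (new Path "test/sensei")
      ;; seed two random clusters, as in script variation 2
      seeds (RandomSeedGenerator/buildRandom conf input_path
                                             (new Path "test/seeds")
                                             (int 2) measure)
      kmeans_output (new Path "test/kmeans")]
  ;; one k-means iteration (maxIterations = 1) to materialise a
  ;; usable initial cluster set
  (KMeansDriver/run conf input_path seeds kmeans_output measure
                    (double 0.5) (int 1) true false)
  ;; fuzzy k-means, seeded with the clusters directory k-means wrote
  (FuzzyKMeansDriver/run conf input_path
                         (new Path "test/kmeans/clusters-1")
                         (new Path "test/clusters")
                         measure
                         (double 0.5)   ;; convergenceDelta
                         (int 10)       ;; maxIterations
                         (float 5.0)    ;; m (fuzziness)
                         true           ;; runClustering
                         false          ;; emitMostLikely
                         (double 0.0)   ;; threshold
                         false))        ;; runSequential
```

If this works, `test/clusters/clusteredPoints` should end up populated, and clusterdump with `--pointsDir` pointing at it should then show the points per cluster.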
