Hi Jeffrey! I have encountered this problem as well. The workaround is to run one iteration of plain k-means to create the initial cluster assignment, and then run fuzzy k-means using the output of that first k-means iteration as its input clusters.
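For the record, with the CLI the two-stage workaround could look roughly like the following. The paths and cluster count mirror Jeffrey's example below; the exact flag set is an assumption and may differ slightly between Mahout versions, so treat this as a sketch rather than a tested recipe:

```shell
# Step 1: one iteration of plain k-means to materialize an initial cluster set.
# --numClusters samples random input points as the starting clusters.
bin/mahout kmeans --input test/sensei --output test/kmeans \
  --clusters test/kmeans/clusters-0 --numClusters 3 --maxIter 1 --overwrite

# Step 2: run fuzzy k-means, pointing --clusters at the clusters the
# k-means iteration just wrote (note: --numClusters is deliberately omitted
# here, so the existing clusters are used instead of being re-sampled).
bin/mahout fkmeans --input test/sensei --output test/clusters \
  --clusters test/kmeans/clusters-1 --clustering --overwrite \
  --emitMostLikely false --maxIter 10 --m 5
```

The point of the two stages is that fuzzy k-means then starts from a cluster directory that is guaranteed to be non-empty, which sidesteps the "No clusters found. Check your -c path." failure.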
Hope this helps,

Danny Bickson

On Mon, Sep 12, 2011 at 10:15 AM, Jeffrey <[email protected]> wrote:

> Hi,
>
> I have test data consisting of a number of points, written to a sequence file
> using a Clojure script as follows (I am equally bad in both Java and Clojure;
> since I really don't like Java, I write my scripts in Clojure whenever possible).
>
> #!./bin/clj
> (ns sensei.sequence.core)
>
> (require 'clojure.string)
> (require 'clojure.java.io)
>
> (import org.apache.hadoop.conf.Configuration)
> (import org.apache.hadoop.fs.FileSystem)
> (import org.apache.hadoop.fs.Path)
> (import org.apache.hadoop.io.SequenceFile)
> (import org.apache.hadoop.io.Text)
>
> (import org.apache.mahout.math.VectorWritable)
> (import org.apache.mahout.math.SequentialAccessSparseVector)
>
> (with-open [reader (clojure.java.io/reader *in*)]
>   (let [hadoop_configuration ((fn []
>                                 (let [conf (new Configuration)]
>                                   (. conf set "fs.default.name" "hdfs://localhost:9000/")
>                                   conf)))
>         hadoop_fs (FileSystem/get hadoop_configuration)]
>     (reduce
>       (fn [writer [index value]]
>         (. writer append index value)
>         writer)
>       (SequenceFile/createWriter
>         hadoop_fs
>         hadoop_configuration
>         (new Path "test/sensei")
>         Text
>         VectorWritable)
>       (map
>         (fn [[tag row_vector]]
>           (let [input_index (new Text tag)
>                 input_vector (new VectorWritable)]
>             (. input_vector set row_vector)
>             [input_index input_vector]))
>         (map
>           (fn [[tag photo_list]]
>             (let [photo_map (apply hash-map photo_list)
>                   input_vector (new SequentialAccessSparseVector (count (vals photo_map)))]
>               (loop [frequency_list (vals photo_map)]
>                 (if (zero? (count frequency_list))
>                   [tag input_vector]
>                   (when-not (zero? (count frequency_list))
>                     (. input_vector set
>                        (mod (count frequency_list) (count (vals photo_map)))
>                        (Integer/parseInt (first frequency_list)))
>                     (recur (rest frequency_list)))))))
>           (reduce
>             (fn [result next_line]
>               (let [[tag photo frequency] (clojure.string/split next_line #" ")]
>                 (update-in result [tag]
>                   #(if (nil? %)
>                      [photo frequency]
>                      (conj % photo frequency)))))
>             {}
>             (line-seq reader)))))))
>
> Basically the script receives input (from stdin) in this format:
>
> tag_uri image_uri count
>
> e.g.
>
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/13980928@N03/6001200971 0
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/21207178@N07/5441742937 0
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/25845846@N06/3033371575 0
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/30366924@N08/5772100510 0
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/31343451@N00/5957189406 0
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/36662563@N00/4815218552 1
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/38583880@N00/5686968462 0
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/43335486@N00/5794673203 0
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/46857830@N03/5651576112 0
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/99996011@N00/5396566822 0
>
> It then turns them into a sequence file in which each entry represents one
> point (10 dimensions in this example), with the key set to the tag_uri
> <http://flickr.com/photos/tags/ísland> and the value set to the point
> described by the frequency vector (0 0 0 0 0 1 0 0 0 0).
>
> I then use a script (available in 2 different variations) to submit the data
> as a clustering job; however, I am getting an error that I don't know how to
> fix. It seems that something is wrong with the initial clusters.
>
> Script variation 1
>
> #!./bin/clj
>
> (ns sensei.clustering.fkmeans)
>
> (import org.apache.hadoop.conf.Configuration)
> (import org.apache.hadoop.fs.Path)
>
> (import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver)
> (import org.apache.mahout.common.distance.EuclideanDistanceMeasure)
> (import org.apache.mahout.clustering.kmeans.RandomSeedGenerator)
>
> (let [hadoop_configuration ((fn []
>                               (let [conf (new Configuration)]
>                                 (. conf set "fs.default.name" "hdfs://localhost:9000/")
>                                 conf)))
>       driver (new FuzzyKMeansDriver)]
>   (. driver setConf hadoop_configuration)
>   (. driver run
>      (into-array String ["--input" "test/sensei"
>                          "--output" "test/clusters"
>                          "--clusters" "test/clusters/clusters-0"
>                          "--clustering"
>                          "--overwrite"
>                          "--emitMostLikely" "false"
>                          "--numClusters" "3"
>                          "--maxIter" "10"
>                          "--m" "5"])))
>
> Script variation 2:
>
> #!./bin/clj
>
> (ns sensei.clustering.fkmeans)
>
> (import org.apache.hadoop.conf.Configuration)
> (import org.apache.hadoop.fs.Path)
>
> (import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver)
> (import org.apache.mahout.common.distance.EuclideanDistanceMeasure)
> (import org.apache.mahout.clustering.kmeans.RandomSeedGenerator)
>
> (let [hadoop_configuration ((fn []
>                               (let [conf (new Configuration)]
>                                 (. conf set "fs.default.name" "hdfs://127.0.0.1:9000/")
>                                 conf)))
>       input_path (new Path "test/sensei")
>       output_path (new Path "test/clusters")
>       clusters_in_path (new Path "test/clusters/cluster-0")]
>   (FuzzyKMeansDriver/run
>     hadoop_configuration
>     input_path
>     (RandomSeedGenerator/buildRandom
>       hadoop_configuration
>       input_path
>       clusters_in_path
>       (int 2)
>       (new EuclideanDistanceMeasure))
>     output_path
>     (new EuclideanDistanceMeasure)
>     (double 0.5)
>     (int 10)
>     (float 5.0)
>     true
>     false
>     (double 0.0)
>     false)) ;; runSequential
>
> I am getting the same error with both variations
>
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> 11/08/25 15:20:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 11/08/25 15:20:16 INFO compress.CodecPool: Got brand-new compressor
> 11/08/25 15:20:16 INFO compress.CodecPool: Got brand-new decompressor
> 11/08/25 15:20:17 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 11/08/25 15:20:17 INFO input.FileInputFormat: Total input paths to process : 1
> 11/08/25 15:20:17 INFO mapred.JobClient: Running job: job_local_0001
> 11/08/25 15:20:17 INFO mapred.MapTask: io.sort.mb = 100
> 11/08/25 15:20:17 INFO mapred.MapTask: data buffer = 79691776/99614720
> 11/08/25 15:20:17 INFO mapred.MapTask: record buffer = 262144/327680
> 11/08/25 15:20:17 WARN mapred.LocalJobRunner: job_local_0001
> java.lang.IllegalStateException: No clusters found. Check your -c path.
>     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansMapper.setup(FuzzyKMeansMapper.java:62)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
> 11/08/25 15:20:18 INFO mapred.JobClient:  map 0% reduce 0%
> 11/08/25 15:20:18 INFO mapred.JobClient: Job complete: job_local_0001
> 11/08/25 15:20:18 INFO mapred.JobClient: Counters: 0
> Exception in thread "main" java.lang.RuntimeException: java.lang.InterruptedException: Fuzzy K-Means Iteration failed processing test/clusters/cluster-0/part-randomSeed
>     at clojure.lang.Util.runtimeException(Util.java:153)
>     at clojure.lang.Compiler.eval(Compiler.java:6417)
>     at clojure.lang.Compiler.load(Compiler.java:6843)
>     at clojure.lang.Compiler.loadFile(Compiler.java:6804)
>     at clojure.main$load_script.invoke(main.clj:282)
>     at clojure.main$script_opt.invoke(main.clj:342)
>     at clojure.main$main.doInvoke(main.clj:426)
>     at clojure.lang.RestFn.invoke(RestFn.java:436)
>     at clojure.lang.Var.invoke(Var.java:409)
>     at clojure.lang.AFn.applyToHelper(AFn.java:167)
>     at clojure.lang.Var.applyTo(Var.java:518)
>     at clojure.main.main(main.java:37)
> Caused by: java.lang.InterruptedException: Fuzzy K-Means Iteration failed processing test/clusters/cluster-0/part-randomSeed
>     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.runIteration(FuzzyKMeansDriver.java:252)
>     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClustersMR(FuzzyKMeansDriver.java:421)
>     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:345)
>     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:295)
>     at sensei.clustering.fkmeans$eval17.invoke(fkmeans.clj:35)
>     at clojure.lang.Compiler.eval(Compiler.java:6406)
>     ... 10 more
>
> Notice there is a runSequential flag in the 2nd variation; if I set it to true:
>
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> 11/09/07 14:32:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 11/09/07 14:32:32 INFO compress.CodecPool: Got brand-new compressor
> 11/09/07 14:32:32 INFO compress.CodecPool: Got brand-new decompressor
> Exception in thread "main" java.lang.IllegalStateException: Clusters is empty!
>     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClustersSeq(FuzzyKMeansDriver.java:361)
>     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:343)
>     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:295)
>     at sensei.clustering.fkmeans$eval17.invoke(fkmeans.clj:35)
>     at clojure.lang.Compiler.eval(Compiler.java:6465)
>     at clojure.lang.Compiler.load(Compiler.java:6902)
>     at clojure.lang.Compiler.loadFile(Compiler.java:6863)
>     at clojure.main$load_script.invoke(main.clj:282)
>     at clojure.main$script_opt.invoke(main.clj:342)
>     at clojure.main$main.doInvoke(main.clj:426)
>     at clojure.lang.RestFn.invoke(RestFn.java:436)
>     at clojure.lang.Var.invoke(Var.java:409)
>     at clojure.lang.AFn.applyToHelper(AFn.java:167)
>     at clojure.lang.Var.applyTo(Var.java:518)
>     at clojure.main.main(main.java:37)
>
> Now, if I cluster the data using the CLI tool, it completes without error:
>
> $ bin/mahout fkmeans --input test/sensei --output test/clusters --clusters test/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>
> However, even with the --clustering option, I am not seeing any points in the cluster dump generated with this command:
>
> $ ./bin/mahout clusterdump --seqFileDir test/clusters/clusters-1 --pointsDir test/clusters/clusteredPoints --output sensei.txt
>
> And yes, that command completed without any error too.
>
> ... I have been stuck on this problem over and over again for months, and I still can't get the clustering done properly :(
>
> Best wishes,
> Jeffrey04
