Hi Jeffrey! I have encountered this problem as well. The workaround is to run one iteration of plain k-means to create the initial cluster assignment, and then run fuzzy k-means using the output of that first k-means iteration as its input clusters.
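For the record, with the CLI the two-stage workaround could look roughly like the following. The paths and cluster count mirror Jeffrey's example below; the exact flag set is an assumption and may differ slightly between Mahout versions, so treat this as a sketch rather than a tested recipe:

```shell
# Step 1: one iteration of plain k-means to materialize an initial cluster set.
# --numClusters samples random input points as the starting clusters.
bin/mahout kmeans --input test/sensei --output test/kmeans \
  --clusters test/kmeans/clusters-0 --numClusters 3 --maxIter 1 --overwrite

# Step 2: run fuzzy k-means, pointing --clusters at the clusters the
# k-means iteration just wrote (note: --numClusters is deliberately omitted
# here, so the existing clusters are used instead of being re-sampled).
bin/mahout fkmeans --input test/sensei --output test/clusters \
  --clusters test/kmeans/clusters-1 --clustering --overwrite \
  --emitMostLikely false --maxIter 10 --m 5
```

The point of the two stages is that fuzzy k-means then starts from a cluster directory that is guaranteed to be non-empty, which sidesteps the "No clusters found. Check your -c path." failure.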
Hope this helps,

Danny Bickson

On Mon, Sep 12, 2011 at 10:15 AM, Jeffrey <[email protected]> wrote:

> Hi,
>
> I have test data consisting of a number of points, written to a sequence file
> using a Clojure script as follows (I am equally bad in both Java and Clojure;
> since I really don't like Java, I write my scripts in Clojure whenever possible).
>
> #!./bin/clj
> (ns sensei.sequence.core)
>
> (require 'clojure.string)
> (require 'clojure.java.io)
>
> (import org.apache.hadoop.conf.Configuration)
> (import org.apache.hadoop.fs.FileSystem)
> (import org.apache.hadoop.fs.Path)
> (import org.apache.hadoop.io.SequenceFile)
> (import org.apache.hadoop.io.Text)
>
> (import org.apache.mahout.math.VectorWritable)
> (import org.apache.mahout.math.SequentialAccessSparseVector)
>
> (with-open [reader (clojure.java.io/reader *in*)]
>   (let [hadoop_configuration ((fn []
>                                 (let [conf (new Configuration)]
>                                   (. conf set "fs.default.name" "hdfs://localhost:9000/")
>                                   conf)))
>         hadoop_fs (FileSystem/get hadoop_configuration)]
>     (reduce
>       (fn [writer [index value]]
>         (. writer append index value)
>         writer)
>       (SequenceFile/createWriter
>         hadoop_fs
>         hadoop_configuration
>         (new Path "test/sensei")
>         Text
>         VectorWritable)
>       (map
>         (fn [[tag row_vector]]
>           (let [input_index (new Text tag)
>                 input_vector (new VectorWritable)]
>             (. input_vector set row_vector)
>             [input_index input_vector]))
>         (map
>           (fn [[tag photo_list]]
>             (let [photo_map (apply hash-map photo_list)
>                   input_vector (new SequentialAccessSparseVector (count (vals photo_map)))]
>               (loop [frequency_list (vals photo_map)]
>                 (if (zero? (count frequency_list))
>                   [tag input_vector]
>                   (when-not (zero? (count frequency_list))
>                     (. input_vector set
>                        (mod (count frequency_list) (count (vals photo_map)))
>                        (Integer/parseInt (first frequency_list)))
>                     (recur (rest frequency_list)))))))
>           (reduce
>             (fn [result next_line]
>               (let [[tag photo frequency] (clojure.string/split next_line #" ")]
>                 (update-in result [tag]
>                   #(if (nil? %)
>                      [photo frequency]
>                      (conj % photo frequency)))))
>             {}
>             (line-seq reader)))))))
>
> Basically the script receives input (from stdin) in this format:
>
> tag_uri image_uri count
>
> e.g.
>
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/13980928@N03/6001200971 0
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/21207178@N07/5441742937 0
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/25845846@N06/3033371575 0
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/30366924@N08/5772100510 0
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/31343451@N00/5957189406 0
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/36662563@N00/4815218552 1
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/38583880@N00/5686968462 0
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/43335486@N00/5794673203 0
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/46857830@N03/5651576112 0
> http://flickr.com/photos/tags/ísland http://flickr.com/photos/99996011@N00/5396566822 0
>
> It then turns them into a sequence file in which each entry represents one
> point (10 dimensions in this example), with the key set to the tag_uri
> <http://flickr.com/photos/tags/ísland> and the value set to the point
> described by the frequency vector (0 0 0 0 0 1 0 0 0 0).
>
> I then use a script (available in 2 different variations) to submit the data
> as a clustering job; however, I am getting an error that I don't know how to
> fix. It seems that something is wrong with the initial clusters.
>
> Script variation 1
>
> #!./bin/clj
>
> (ns sensei.clustering.fkmeans)
>
> (import org.apache.hadoop.conf.Configuration)
> (import org.apache.hadoop.fs.Path)
>
> (import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver)
> (import org.apache.mahout.common.distance.EuclideanDistanceMeasure)
> (import org.apache.mahout.clustering.kmeans.RandomSeedGenerator)
>
> (let [hadoop_configuration ((fn []
>                               (let [conf (new Configuration)]
>                                 (. conf set "fs.default.name" "hdfs://localhost:9000/")
>                                 conf)))
>       driver (new FuzzyKMeansDriver)]
>   (. driver setConf hadoop_configuration)
>   (. driver run
>      (into-array String ["--input" "test/sensei"
>                          "--output" "test/clusters"
>                          "--clusters" "test/clusters/clusters-0"
>                          "--clustering"
>                          "--overwrite"
>                          "--emitMostLikely" "false"
>                          "--numClusters" "3"
>                          "--maxIter" "10"
>                          "--m" "5"])))
>
> Script variation 2:
>
> #!./bin/clj
>
> (ns sensei.clustering.fkmeans)
>
> (import org.apache.hadoop.conf.Configuration)
> (import org.apache.hadoop.fs.Path)
>
> (import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver)
> (import org.apache.mahout.common.distance.EuclideanDistanceMeasure)
> (import org.apache.mahout.clustering.kmeans.RandomSeedGenerator)
>
> (let [hadoop_configuration ((fn []
>                               (let [conf (new Configuration)]
>                                 (. conf set "fs.default.name" "hdfs://127.0.0.1:9000/")
>                                 conf)))
>       input_path (new Path "test/sensei")
>       output_path (new Path "test/clusters")
>       clusters_in_path (new Path "test/clusters/cluster-0")]
>   (FuzzyKMeansDriver/run
>     hadoop_configuration
>     input_path
>     (RandomSeedGenerator/buildRandom
>       hadoop_configuration
>       input_path
>       clusters_in_path
>       (int 2)
>       (new EuclideanDistanceMeasure))
>     output_path
>     (new EuclideanDistanceMeasure)
>     (double 0.5)
>     (int 10)
>     (float 5.0)
>     true
>     false
>     (double 0.0)
>     false)) ;; runSequential
>
> I am getting the same error with both variations
>
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> 11/08/25 15:20:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 11/08/25 15:20:16 INFO compress.CodecPool: Got brand-new compressor
> 11/08/25 15:20:16 INFO compress.CodecPool: Got brand-new decompressor
> 11/08/25 15:20:17 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 11/08/25 15:20:17 INFO input.FileInputFormat: Total input paths to process : 1
> 11/08/25 15:20:17 INFO mapred.JobClient: Running job: job_local_0001
> 11/08/25 15:20:17 INFO mapred.MapTask: io.sort.mb = 100
> 11/08/25 15:20:17 INFO mapred.MapTask: data buffer = 79691776/99614720
> 11/08/25 15:20:17 INFO mapred.MapTask: record buffer = 262144/327680
> 11/08/25 15:20:17 WARN mapred.LocalJobRunner: job_local_0001
> java.lang.IllegalStateException: No clusters found. Check your -c path.
>     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansMapper.setup(FuzzyKMeansMapper.java:62)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
> 11/08/25 15:20:18 INFO mapred.JobClient:  map 0% reduce 0%
> 11/08/25 15:20:18 INFO mapred.JobClient: Job complete: job_local_0001
> 11/08/25 15:20:18 INFO mapred.JobClient: Counters: 0
> Exception in thread "main" java.lang.RuntimeException: java.lang.InterruptedException: Fuzzy K-Means Iteration failed processing test/clusters/cluster-0/part-randomSeed
>     at clojure.lang.Util.runtimeException(Util.java:153)
>     at clojure.lang.Compiler.eval(Compiler.java:6417)
>     at clojure.lang.Compiler.load(Compiler.java:6843)
>     at clojure.lang.Compiler.loadFile(Compiler.java:6804)
>     at clojure.main$load_script.invoke(main.clj:282)
>     at clojure.main$script_opt.invoke(main.clj:342)
>     at clojure.main$main.doInvoke(main.clj:426)
>     at clojure.lang.RestFn.invoke(RestFn.java:436)
>     at clojure.lang.Var.invoke(Var.java:409)
>     at clojure.lang.AFn.applyToHelper(AFn.java:167)
>     at clojure.lang.Var.applyTo(Var.java:518)
>     at clojure.main.main(main.java:37)
> Caused by: java.lang.InterruptedException: Fuzzy K-Means Iteration failed processing test/clusters/cluster-0/part-randomSeed
>     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.runIteration(FuzzyKMeansDriver.java:252)
>     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClustersMR(FuzzyKMeansDriver.java:421)
>     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:345)
>     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:295)
>     at sensei.clustering.fkmeans$eval17.invoke(fkmeans.clj:35)
>     at clojure.lang.Compiler.eval(Compiler.java:6406)
>     ... 10 more
>
> Notice there is a runSequential flag in the 2nd variation; if I set it to true:
>
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> 11/09/07 14:32:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 11/09/07 14:32:32 INFO compress.CodecPool: Got brand-new compressor
> 11/09/07 14:32:32 INFO compress.CodecPool: Got brand-new decompressor
> Exception in thread "main" java.lang.IllegalStateException: Clusters is empty!
>     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClustersSeq(FuzzyKMeansDriver.java:361)
>     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:343)
>     at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:295)
>     at sensei.clustering.fkmeans$eval17.invoke(fkmeans.clj:35)
>     at clojure.lang.Compiler.eval(Compiler.java:6465)
>     at clojure.lang.Compiler.load(Compiler.java:6902)
>     at clojure.lang.Compiler.loadFile(Compiler.java:6863)
>     at clojure.main$load_script.invoke(main.clj:282)
>     at clojure.main$script_opt.invoke(main.clj:342)
>     at clojure.main$main.doInvoke(main.clj:426)
>     at clojure.lang.RestFn.invoke(RestFn.java:436)
>     at clojure.lang.Var.invoke(Var.java:409)
>     at clojure.lang.AFn.applyToHelper(AFn.java:167)
>     at clojure.lang.Var.applyTo(Var.java:518)
>     at clojure.main.main(main.java:37)
>
> Now, if I cluster the data using the CLI tool, it completes without error:
>
> $ bin/mahout fkmeans --input test/sensei --output test/clusters --clusters test/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>
> However, even with the --clustering option, I am not seeing any points in the cluster dump generated with this command:
>
> $ ./bin/mahout clusterdump --seqFileDir test/clusters/clusters-1 --pointsDir test/clusters/clusteredPoints --output sensei.txt
>
> And yes, that command completed without any error too.
>
> ... I have been stuck on this problem over and over again for months, and I still can't get the clustering done properly :(
>
> Best wishes,
> Jeffrey04
