Fuzzy k-means is quite sensitive to the value of m, and the resulting clusters overlap heavily. There could be a problem with the new implementation, though it was built on the old code. I wish we had a gold standard against which to test the results.

You can see this difference by running the Display examples for different values of m.
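
As a rough illustration of why m matters, here is a minimal sketch of the standard fuzzy c-means membership formula, u_i = 1 / sum_j (d_i / d_j)^(2 / (m - 1)). This is not Mahout's code: the class and method names are hypothetical, and it assumes m > 1 and strictly positive distances.

```java
import java.util.Arrays;

/** Sketch of textbook fuzzy c-means membership weights (hypothetical helper, not Mahout's API). */
class FuzzyMembershipSketch {

  /**
   * Computes u_i = 1 / sum_j (d_i / d_j)^(2 / (m - 1)) for one point.
   *
   * @param distances distances from the point to each centroid (assumed strictly positive)
   * @param m         fuzziness parameter (must be greater than 1)
   * @return membership weights, one per centroid, summing to 1
   */
  static double[] memberships(double[] distances, double m) {
    if (m <= 1.0) {
      throw new IllegalArgumentException("m must be greater than 1");
    }
    double exponent = 2.0 / (m - 1.0);
    double[] u = new double[distances.length];
    for (int i = 0; i < distances.length; i++) {
      double sum = 0.0;
      for (double dj : distances) {
        sum += Math.pow(distances[i] / dj, exponent);
      }
      u[i] = 1.0 / sum;
    }
    return u;
  }

  public static void main(String[] args) {
    // Illustrative distances only: one point that is almost equally far from three centroids.
    double[] distances = {0.90, 0.95, 0.99};
    for (double m : new double[] {1.1, 2.0, 4.0}) {
      System.out.println("m=" + m + " -> " + Arrays.toString(memberships(distances, m)));
    }
  }
}
```

When a point's distances to all centroids are close to each other, the memberships flatten toward 1/k as m grows, so every centroid gets pulled toward nearly the same weighted mean. That would be consistent with clusters that show the same top terms, though it does not rule out a bug.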

On 6/15/12 1:12 PM, Lithium Guava wrote:
Hi,

I'm doing canopy clustering followed by k-means or fuzzy k-means. When I
analyze the output using ClusterDumper, k-means seems to produce a
reasonable set of clusters, but fuzzy-k shows the same top terms for each
category; e.g.:

:SV-0{
         Top Terms:
                 realti                                  => 5.93836652796093E-4
                 qtrly                                   => 5.76913325900417E-4
                 band                                    => 5.14547893772126E-4
                 rebat                                   => 4.8166533615085237E-4
                 stoltenberg                             => 4.811479747636483E-4
:SC-1{
         Top Terms:
                 realti                                  => 5.709752304136923E-4
                 qtrly                                   => 5.551229996961265E-4
                 band                                    => 4.94554555548863E-4
                 rebat                                   => 4.633549463551762E-4
                 stoltenberg                             => 4.622723047386884E-4
:SC-10
         Top Terms:
                 realti                                  => 5.709752304136921E-4
                 qtrly                                   => 5.551229996961266E-4
                 band                                    => 4.94554555548863E-4
                 rebat                                   => 4.633549463551762E-4
                 stoltenberg                             => 4.622723047386884E-4
:SC-10
         Top Terms:
                 realti                                  => 5.709752304136925E-4
                 qtrly                                   => 5.551229996961268E-4
                 band                                    => 4.945545555488632E-4
                 rebat                                   => 4.633549463551764E-4
                 stoltenberg                             => 4.6227230473868866E-4
... etc...

(This is based on 1-gram TF-IDF over the Reuters dataset).

The only relevant post I could find is
http://comments.gmane.org/gmane.comp.apache.mahout.user/8357 - sounds like
the same issue. I get the same behaviour with 0.5, 0.6 and 0.8-SNAPSHOT
r1545 in trunk.

Could I be doing something wrong calling FuzzyKMeansDriver, or is anyone
else seeing this?

FuzzyKMeansDriver.run(conf, tfIdfVectorsPath, canopyCentroidsPath,
    fkmeansOutputPath,
    new TanimotoDistanceMeasure(), 0.01, 20,
    2.0f,   // m
    true,   // runClustering
    true,   // emitMostLikely
    0.0,    // threshold
    false); // runSequential

