Hi,

I'm running canopy clustering followed by k-means or fuzzy k-means. When I
analyze the output using ClusterDumper, k-means seems to produce a
reasonable set of clusters, but fuzzy k-means shows the same top terms for
every cluster; e.g.:

:SV-0{
        Top Terms:
                realti          => 5.93836652796093E-4
                qtrly           => 5.76913325900417E-4
                band            => 5.14547893772126E-4
                rebat           => 4.8166533615085237E-4
                stoltenberg     => 4.811479747636483E-4
:SC-1{
        Top Terms:
                realti          => 5.709752304136923E-4
                qtrly           => 5.551229996961265E-4
                band            => 4.94554555548863E-4
                rebat           => 4.633549463551762E-4
                stoltenberg     => 4.622723047386884E-4
:SC-10
        Top Terms:
                realti          => 5.709752304136921E-4
                qtrly           => 5.551229996961266E-4
                band            => 4.94554555548863E-4
                rebat           => 4.633549463551762E-4
                stoltenberg     => 4.622723047386884E-4
:SC-10
        Top Terms:
                realti          => 5.709752304136925E-4
                qtrly           => 5.551229996961268E-4
                band            => 4.945545555488632E-4
                rebat           => 4.633549463551764E-4
                stoltenberg     => 4.6227230473868866E-4
... etc...

(This is based on 1-gram TF-IDF over the Reuters dataset).

The only relevant post I could find is
http://comments.gmane.org/gmane.comp.apache.mahout.user/8357 which sounds
like the same issue. I get the same behaviour with 0.5, 0.6 and
0.8-SNAPSHOT r1545 in trunk.

Could I be doing something wrong calling FuzzyKMeansDriver, or is anyone
else seeing this?

FuzzyKMeansDriver.run(conf, tfIdfVectorsPath, canopyCentroidsPath,
    fkmeansOutputPath,
    new TanimotoDistanceMeasure(),
    0.01,   // convergenceDelta
    20,     // maxIterations
    2.0f,   // m
    true,   // runClustering
    true,   // emitMostLikely
    0.0,    // threshold
    false); // runSequential
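
For context on the m argument, here is a minimal sketch of the textbook fuzzy c-means membership weights, u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)). It is not taken from Mahout's source; the class name, method name and the distances are hypothetical, chosen only for illustration. It shows that when a point is almost equidistant from every centroid, its memberships come out almost uniform, so each centroid would be updated from nearly the same weighted average of the data.

```java
import java.util.Arrays;

final class FuzzyMembership {

    /**
     * Textbook fuzzy c-means membership of one point in each cluster,
     * given its distance to every centroid. Assumes all distances are
     * strictly positive and m > 1.
     */
    static double[] memberships(double[] distances, double m) {
        double exponent = 2.0 / (m - 1.0);
        double[] u = new double[distances.length];
        for (int i = 0; i < distances.length; i++) {
            double sum = 0.0;
            for (int k = 0; k < distances.length; k++) {
                sum += Math.pow(distances[i] / distances[k], exponent);
            }
            u[i] = 1.0 / sum;
        }
        return u;
    }

    public static void main(String[] args) {
        // Hypothetical distances from one document to three centroids.
        double[] nearlyEqual = {0.98, 0.99, 0.97};
        double[] wellSeparated = {0.10, 0.90, 0.95};

        // Nearly equal distances give nearly uniform memberships.
        System.out.println(Arrays.toString(memberships(nearlyEqual, 2.0)));
        // One clearly closer centroid dominates the memberships.
        System.out.println(Arrays.toString(memberships(wellSeparated, 2.0)));
    }
}
```

Whether this is what happens with my TF-IDF vectors and TanimotoDistanceMeasure is only a guess; printing the actual point-to-centroid distances would confirm or rule it out.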
