Hi,
I'm running canopy clustering followed by k-means or fuzzy k-means. When I
analyze the output using ClusterDumper, k-means seems to produce a
reasonable set of clusters, but fuzzy k-means shows the same top terms for
every cluster; e.g.:
:SV-0{
    Top Terms:
        realti      => 5.93836652796093E-4
        qtrly       => 5.76913325900417E-4
        band        => 5.14547893772126E-4
        rebat       => 4.8166533615085237E-4
        stoltenberg => 4.811479747636483E-4
:SC-1{
    Top Terms:
        realti      => 5.709752304136923E-4
        qtrly       => 5.551229996961265E-4
        band        => 4.94554555548863E-4
        rebat       => 4.633549463551762E-4
        stoltenberg => 4.622723047386884E-4
:SC-10
    Top Terms:
        realti      => 5.709752304136921E-4
        qtrly       => 5.551229996961266E-4
        band        => 4.94554555548863E-4
        rebat       => 4.633549463551762E-4
        stoltenberg => 4.622723047386884E-4
:SC-10
    Top Terms:
        realti      => 5.709752304136925E-4
        qtrly       => 5.551229996961268E-4
        band        => 4.945545555488632E-4
        rebat       => 4.633549463551764E-4
        stoltenberg => 4.6227230473868866E-4
... etc...
(This is based on 1-gram TF-IDF over the Reuters dataset).
The only relevant post I could find is
http://comments.gmane.org/gmane.comp.apache.mahout.user/8357, which sounds
like the same issue. I get the same behaviour with 0.5, 0.6 and 0.8-SNAPSHOT
r1545 in trunk.
Could I be doing something wrong in how I call FuzzyKMeansDriver, or is
anyone else seeing this? This is the call:
FuzzyKMeansDriver.run(conf, tfIdfVectorsPath, canopyCentroidsPath,
    fkmeansOutputPath,
    new TanimotoDistanceMeasure(),
    0.01,    // convergenceDelta
    20,      // maxIterations
    2.0f,    // m
    true,    // runClustering
    true,    // emitMostLikely
    0.0,     // threshold
    false);  // runSequential
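
In case it helps narrow this down, here is a minimal standalone sketch of the textbook fuzzy c-means membership formula that the m argument above feeds into. It is not Mahout code: the class name, method name and the distances are all made up for illustration, and I have not checked that Mahout computes memberships in exactly this way. It only shows that when a point is almost equally far from every centroid, its memberships come out nearly uniform, which would be consistent with every cluster ending up with the same top terms.

```java
import java.util.Arrays;

// Standalone sketch of the textbook fuzzy c-means membership formula.
// Not Mahout code: the class and method names are hypothetical.
class FuzzyMembershipSketch {

    // u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)), assuming m > 1 and all d > 0.
    static double[] memberships(double[] distances, double m) {
        double exponent = 2.0 / (m - 1.0);
        double[] u = new double[distances.length];
        for (int i = 0; i < distances.length; i++) {
            double sum = 0.0;
            for (int k = 0; k < distances.length; k++) {
                sum += Math.pow(distances[i] / distances[k], exponent);
            }
            u[i] = 1.0 / sum;
        }
        return u;
    }

    public static void main(String[] args) {
        // Made-up distances: a point roughly equally far from three centroids...
        System.out.println(Arrays.toString(
            memberships(new double[] {0.98, 0.99, 0.985}, 2.0)));
        // ...versus a point that is clearly closest to the first centroid.
        System.out.println(Arrays.toString(
            memberships(new double[] {0.10, 0.90, 0.95}, 2.0)));
    }
}
```

If the real distances from my TF-IDF vectors to the canopy centroids look like the first case, every point would contribute almost equally to every centroid update, but that is a guess on my part until I dump the actual distances.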