I'm hitting the same problem. I'm using movie description data to try clustering movies (the descriptive text is from freebase.com). Kmeans was working fine for me, but when I tried out fuzzy-kmeans (using trunk) I saw the same behavior as you, Paulo.
Here are the parameters I'm passing to the MahoutDriver job:

fkmeans -i movies-vectors/tfidf-vectors -o movies-clusters/fkmeans -k 10 --maxIter 10 --clusters clusters -cd 0.1 -m 2 -ow -cl -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure

(I also tried Tanimoto distance, with the same results.)

I've been running it locally so I can step through the code in Eclipse, but I can't tell whether what I'm seeing is normal. In the mapper I notice that the distances in the clusterDistanceList all come back nearly identical (nearly always 1 for Tanimoto and nearly always 1.4, i.e. sqrt(2), for Euclidean distance). My vectors are 39311 long (using trigrams with a minloglikelihood of 50) and they're all normalized with n=2.

I guess my next step will be to step through the standard kmeans code and see whether the distances come back much different there.

On Fri, Jul 1, 2011 at 4:37 PM, Paulo Magalhaes <[email protected]> wrote:

> Hi all,
>
> I believe there is something wrong with fkmeans in trunk.
>
> I am using code from trunk (last checkout 6/30/11). To recreate is very
> simple:
> 1) change examples/bin/build-reuters.sh to use fkmeans and set -m 2
> 2) run build-reuters.sh
> 3) Dump the cluster. I'm doing: ../../bin/mahout clusterdump -dt
> sequencefile -s ./mahout-work/reuters-kmeans/clusters-6 -b 100 -o
> ./reuters-clusterdump.txt -d
> ./mahout-work/reuters-out-seqdir-sparse-kmeans/dictionary.file-0
>
> if you check reuters-clusterdump.txt, you will notice that all the top terms
> are the same as well as the number of documents in the cluster.
>
> It is my first time trying to use it so, there is a good chance I'm doing
> something wrong :).
> Is it something I should report in the issue tracker ?
>
> Thanks in advance,
> Paulo.
>
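
A note on the Euclidean distances of roughly 1.4: two unit-length (n=2 normalized) vectors that share no terms are exactly sqrt(2) apart, so that value is what near-orthogonal sparse vectors would give. The sketch below is plain Python, not Mahout code. The vectors and term indices are made up for illustration, and fuzzy_memberships is the textbook fuzzy c-means membership formula, which is only assumed to be close to what fkmeans computes with -m 2. Under those assumptions it shows that near-identical distances lead to near-uniform memberships, which may be one reason every cluster ends up with the same top terms.

```python
import math

def l2_normalize(vec):
    """Scale a sparse vector (dict of index -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {i: w / norm for i, w in vec.items()}

def euclidean(a, b):
    """Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def fuzzy_memberships(distances, m=2.0):
    """Textbook fuzzy c-means membership (an assumption, not Mahout's code):
    u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)). Distances must be non-zero."""
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
            for d_i in distances]

# Two made-up vectors with no terms in common (hypothetical term indices).
doc = l2_normalize({3: 2.0, 17: 1.0})
centroid = l2_normalize({101: 4.0, 250: 3.0})
d = euclidean(doc, centroid)  # sqrt(2) for unit vectors with disjoint support

# A point equally far from all 10 centroids belongs equally to each of them.
uniform = fuzzy_memberships([d] * 10)

# A small spread in the distances changes the memberships only slightly.
near_uniform = fuzzy_memberships([1.40, 1.41, 1.42])

print(d, uniform[0], near_uniform)
```

If the memberships really are this flat, every centroid gets updated with almost the same weighted average of all points, so comparing these values against what the mapper produces might be a quick check.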
