RE: fuzzy kmeans - all cluster with the same top terms

Jeff Eastman Wed, 17 Aug 2011 11:44:15 -0700

I agree there may be something amiss with FuzzyK. If you compare the circa 0.4 
wiki photo of running DisplayFuzzyKMeans 
(https://cwiki.apache.org/confluence/display/MAHOUT/Fuzzy+K-Means) with the 
current output of that example, you will see that it is not generating clusters 
as tight as before on the same data. It could very well be that the 
distance-to-membership% calculations (computeProbWeight) have gotten bent 
during some of the refactoring which has occurred in the interim.
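
For reference, here is a minimal sketch of the textbook fuzzy c-means 
membership calculation that a distance-to-membership step is expected to 
perform. It is not copied from Mahout's computeProbWeight; the function name 
fuzzy_memberships and the eps guard against zero distances are assumptions 
made for illustration only.

```python
def fuzzy_memberships(distances, m, eps=1e-10):
    """Textbook fuzzy c-means membership of one point in each cluster.

    distances: distance from the point to each cluster center
    m:         fuzziness parameter (must be > 1)
    eps:       hypothetical guard so a zero distance does not divide by zero
    """
    exponent = 2.0 / (m - 1.0)
    safe = [max(d, eps) for d in distances]
    # u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1))
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in safe) for d_i in safe]


# Well separated distances: the nearest cluster gets most of the weight
# (roughly 0.93, 0.06, 0.01).
separated = fuzzy_memberships([0.5, 2.0, 4.0], m=2.0)

# Nearly equal distances: memberships collapse toward 1/k for every cluster.
crowded = fuzzy_memberships([1.40, 1.41, 1.42], m=2.0)

print(separated)
print(crowded)
```

If the memberships of a clearly separated point do not come out strongly 
peaked like the first case, the membership calculation is a reasonable place 
to look.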

I'm looking at that code and don't see anything obvious but more eyeballs would 
help. The display example is, by default, running an experimental version of 
the algorithm using the ClusterClassifier which does not really deal with m, so 
you will need to set the Boolean runClusterer=true to use the regular 
sequential algorithm. That example uses 2-d vectors on a small field that is 
easier to debug than the mapreduce version.

Or, it might just be the curse of dimensionality on your data that is causing 
all the distances to be about equal.
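
To illustrate the dimensionality effect, here is a small sketch that measures 
how much the distances from one point to a handful of centers differ, 
relative to the smallest distance. It uses random Gaussian vectors rather 
than real tf-idf data, and the helper name relative_spread is an assumption 
for illustration.

```python
import math
import random


def relative_spread(dim, n_centers=10, seed=0):
    """(max - min) / min of the distances from one random point to
    n_centers random centers, all drawn from a standard Gaussian."""
    rng = random.Random(seed)
    point = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dists = []
    for _ in range(n_centers):
        center = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dists.append(math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center))))
    return (max(dists) - min(dists)) / min(dists)


low = relative_spread(2)      # 2-d, like the display example
high = relative_spread(2000)  # high-dimensional, closer to text vectors

print("relative spread in 2-d:   ", low)
print("relative spread in 2000-d:", high)
```

In the high-dimensional case the spread is small, so every center looks 
about equally far away and a distance-based membership has little to 
discriminate on.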

Jeff

-----Original Message-----
From: Jeff Hansen [mailto:[email protected]] 
Sent: Wednesday, August 17, 2011 9:33 AM
To: [email protected]
Subject: Re: fuzzy kmeans - all cluster with the same top terms

I'm hitting the same problem.

I'm using movie description data to try clustering movies (the descriptive
text is from freebase.com).  Kmeans was working fine for me, but when I
tried out fuzzy-kmeans (using trunk), I got the same results as you, Paulo.

Here are the parameters I'm passing to the MahoutDriver job:
fkmeans -i movies-vectors/tfidf-vectors -o movies-clusters/fkmeans -k 10
--maxIter 10 --clusters clusters -cd 0.1 -m 2 -ow -cl -dm
org.apache.mahout.common.distance.EuclideanDistanceMeasure
(I also tried Tanimoto distance with the same results)

I've been running it locally so I can step through the code in Eclipse, but
I can't tell if what I'm seeing is normal. In the mapper I notice that the
distances in the clusterDistanceList all tend to come back very similar
(nearly always 1 for Tanimoto and nearly always 1.4 (sqrt of 2) for
Euclidean distance).  My vectors are 39311 long (using trigrams with
minloglikelihood of 50) and they're all normalized with n=2.
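
Those two values are what one would expect for L2-normalized vectors that 
share almost no terms: for unit vectors, squared Euclidean distance is 
2 - 2(a.b), and the usual Tanimoto distance is 1 - a.b / (|a|^2 + |b|^2 - a.b), 
so a dot product near zero gives sqrt(2) and 1. The sketch below checks this 
on tiny made-up vectors; the helper names and the data are assumptions for 
illustration, not Mahout code.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def tanimoto_distance(a, b):
    ab = dot(a, b)
    return 1.0 - ab / (dot(a, a) + dot(b, b) - ab)


# Two "documents" with no terms in common, and one with a tiny overlap.
doc_a = l2_normalize([3.0, 1.0, 0.0, 0.0, 0.0, 0.0])
doc_b = l2_normalize([0.0, 0.0, 2.0, 5.0, 0.0, 0.0])
doc_c = l2_normalize([0.0, 0.1, 2.0, 5.0, 0.0, 0.0])

d_eu = euclidean(doc_a, doc_b)          # sqrt(2) when nothing overlaps
d_tn = tanimoto_distance(doc_a, doc_b)  # 1 when nothing overlaps
d_eu_overlap = euclidean(doc_a, doc_c)  # only slightly below sqrt(2)

print(d_eu, d_tn, d_eu_overlap)
```

With very sparse vectors and initial centers that overlap little with most 
documents, nearly every distance lands at these ceilings.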

I guess my next step will be to step through the standard kmeans code and
see if the distances come back much different from there.

On Fri, Jul 1, 2011 at 4:37 PM, Paulo Magalhaes
<[email protected]> wrote:

> Hi all,
>
> I believe there is something wrong with fkmeans in trunk.
>
> I am using code from trunk (last checkout 6/30/11). Recreating it is very
> simple:
> 1) change examples/bin/build-reuters.sh to use fkmeans and set -m 2
> 2) run build-reuters.sh
> 3) Dump the cluster. I'm doing: ../../bin/mahout clusterdump -dt
> sequencefile -s ./mahout-work/reuters-kmeans/clusters-6 -b 100 -o
> ./reuters-clusterdump.txt  -d
> ./mahout-work/reuters-out-seqdir-sparse-kmeans/dictionary.file-0
>
> If you check reuters-clusterdump.txt, you will notice that all the top terms
> are the same, as well as the number of documents in each cluster.
>
> It is my first time trying to use it, so there is a good chance I'm doing
> something wrong :).
> Is it something I should report in the issue tracker?
>
> Thanks in advance,
> Paulo.
>
