Re: fuzzy kmeans - all cluster with the same top terms

Jeff Hansen Wed, 17 Aug 2011 09:33:04 -0700

I'm hitting the same problem.

I'm using movie description data to try clustering movies (the descriptive
text is from freebase.com).  Kmeans was working fine for me, but when I
tried out fuzzy-kmeans (using trunk) I got the same result as you, Paulo.

Here are the parameters I'm passing to the MahoutDriver job:
fkmeans -i movies-vectors/tfidf-vectors -o movies-clusters/fkmeans -k 10
--maxIter 10 --clusters clusters -cd 0.1 -m 2 -ow -cl -dm
org.apache.mahout.common.distance.EuclideanDistanceMeasure
(I also tried Tanimoto distance with the same results)

I've been running it locally so I can step through the code in Eclipse, but
I can't tell if what I'm seeing is normal. In the mapper I notice that the
distances in the clusterDistanceList nearly all come back the same (nearly
always 1 for Tanimoto and nearly always 1.4, i.e. sqrt of 2, for Euclidean
distance).  My vectors are 39311 long (using trigrams with
minloglikelihood of 50) and they're all normalized with n=2.
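
One possible reading of those numbers, sketched in plain Python rather than
Mahout code: two unit-length (n=2 normalized) sparse vectors that share no
terms are exactly sqrt(2) apart, and the textbook fuzzy c-means membership
formula then gives every cluster the same weight. The helper names and the
toy vectors below are made up for illustration, and I have not confirmed
that the fkmeans code in trunk computes memberships exactly this way.

```python
import math


def l2_normalize(vec):
    """Scale a sparse vector (dict of index -> weight) to unit L2 length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()}


def euclidean(a, b):
    """Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def fuzzy_memberships(distances, m=2.0):
    """Textbook fuzzy c-means memberships: u_i = 1 / sum_k (d_i/d_k)^(2/(m-1)).

    Assumes every distance is strictly positive.
    """
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


# Hypothetical document and centroids that share no terms at all,
# which is roughly what very sparse, high-dimensional tf-idf vectors look like.
doc = l2_normalize({0: 1.0, 1: 2.0, 2: 2.0})
centroids = [
    l2_normalize({3: 1.0}),
    l2_normalize({4: 3.0, 5: 4.0}),
    l2_normalize({6: 2.0}),
]

distances = [euclidean(doc, c) for c in centroids]
memberships = fuzzy_memberships(distances, m=2.0)
print(distances)    # every entry is sqrt(2) up to rounding
print(memberships)  # every entry is 1/3 up to rounding

# Slightly different distances still give almost identical memberships.
near_equal = fuzzy_memberships([1.40, 1.41, 1.42], m=2.0)
print(near_equal)
```

If this is what is happening, every point would pull on every centroid with
almost the same weight, which would be consistent with all the clusters
ending up with the same top terms.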

I guess my next step will be to step through the standard kmeans code and
see if the distances come back much different from there.

On Fri, Jul 1, 2011 at 4:37 PM, Paulo Magalhaes
<[email protected]> wrote:

> Hi all,
>
> I believe there is something wrong with fkmeans in trunk.
>
> I am using code from trunk (last checkout 6/30/11). Recreating it is very
> simple:
> 1) change examples/bin/build-reuters.sh to use fkmeans and set -m 2
> 2) run build-reuters.sh
> 3) Dump the cluster. I'm doing: ../../bin/mahout clusterdump -dt
> sequencefile -s ./mahout-work/reuters-kmeans/clusters-6 -b 100 -o
> ./reuters-clusterdump.txt  -d
> ./mahout-work/reuters-out-seqdir-sparse-kmeans/dictionary.file-0
>
> If you check reuters-clusterdump.txt, you will notice that all the top terms
> are the same, as is the number of documents in each cluster.
>
> It is my first time trying to use it, so there is a good chance I'm doing
> something wrong :).
> Is it something I should report in the issue tracker?
>
> Thanks in advance,
> Paulo.
>
