My 2 cents.
It is always tricky to get clustering right, especially k-means, and
even more so on a sparse dataset, which these word vectors tend to be
(any given document contains only a small subset of the words in the
whole corpus).
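
To make the sparsity point concrete, here is a rough sketch in plain
Python. It is not what Mahout does internally: the three documents are
made up, the tokenizer is a bare whitespace split, and the weighting is
a textbook tf * log(N/df), which differs from the formula seq2sparse
uses. It only shows how two notifications that differ in nothing but
the person's name can still sit far apart under cosine distance,
because the rare name terms carry most of the weight.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (toy tf-idf)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical documents, invented for illustration only.
docs = [
    "james doe commented on your photo",
    "mary major commented on your photo",
    "quarterly budget meeting moved to friday",
]

vectors = tfidf_vectors(docs)
vocabulary = {t for v in vectors for t in v}

# Fraction of the vocabulary that each document actually uses.
fill = [len(v) / len(vocabulary) for v in vectors]

d_notifications = cosine_distance(vectors[0], vectors[1])
d_unrelated = cosine_distance(vectors[0], vectors[2])

print("fill ratio per document:", fill)
print("notification vs notification:", round(d_notifications, 3))
print("notification vs unrelated mail:", round(d_unrelated, 3))
```

Even in this tiny corpus each document touches fewer than half of the
terms, and the two notifications end up closer to "unrelated" than to
"same" because the names outweigh the shared boilerplate. With a real
vocabulary the effect is stronger.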

If all you are looking for is grouping *documents* together, then I
think topic modeling might give you better results.



On Wed, Jun 5, 2013 at 10:47 PM, Jesvin Jose <[email protected]> wrote:

> I tried to cluster 1000 emails of a person using Kmeans, but clusters are
> not forming okay. For example if Facebook sends notifications about James
> Doe and 5 other people, I get 5 clusters like:
>
> :VL-858{n=7
>     Top Terms:
>         doe                                   =>  10.066998481750488
>         james                                =>  10.066998481750488
>
> Why are notifications for all 5 people not getting clustered together? I
> used variants of the commands used in Mahout in Action, Sean Owen et al as
> follows:
>
> Vectorizing uses lowercasing, stop words and length filter:
>
> bin/hadoop jar
>
> /home/jesvin/dev/hadoop/mahout-distribution-0.7/examples/target/mahout-examples-0.7-job.jar
> org.apache.mahout.driver.MahoutDriver seq2sparse -i mymail-seqfiles -o
> mymail-vectors-bigram -ow  -a mia.clustering.ch10.MyAnalyzer -chunk 200 -wt
> tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq
>
> It's for 1000 emails, but I tried 100 clusters. If I try 50, I still get
> similar results, but half the number of emails "get into" any cluster.
>
> bin/hadoop jar
>
> /home/jesvin/dev/hadoop/mahout-distribution-0.7/examples/target/mahout-examples-0.7-job.jar
> org.apache.mahout.driver.MahoutDriver kmeans -i
> mymail-vectors-bigram/tfidf-vectors -c mymail-initial-clusters -o
> mymail-kmeans-clusters-from-bigrams -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -k 100 -x
> 20 -cl
>
> --
> We dont beat the reaper by living longer. We beat the reaper by living well
> and living fully. The reaper will come for all of us. Question is, what do
> we do between the time we are born and the time he shows up? -Randy Pausch
>



-- 
Mohit

"When you want success as badly as you want the air, then you will get it.
There is no other secret of success."
-Socrates
