Check out TestClusterDumper.testDirichlet2&3 for an example of text clustering 
using DPC. It produces reasonable-looking clusters compared with k-means 
and the other algorithms, though only on a small vocabulary. Also check out 
DisplayDirichlet, which does a great job of clustering some random 2-d data. 

I'd suggest trying the default alpha of 1.0, as is done in the cluster dumper 
tests. Also, the default model is GaussianCluster, and it may not perform well 
with a large feature space. Check the pdf() function, which uses the product of 
the component pdfs to produce the composite value for each cluster. This may 
not be optimal for really large term vectors. How many elements are in your 
term vectors? You may need to create your own model and model distribution to 
make DPC perform well on your data.
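
To illustrate the concern about multiplying component pdfs, here is a minimal
Python sketch. It is not Mahout code: the function names, the 5000 dimensions,
and the unit standard deviations are all assumptions made for illustration. It
shows how a plain product of per-dimension Gaussian densities can underflow to
0.0 in a large feature space, so that competing clusters can no longer be told
apart, while a sum of log densities keeps them comparable. Whether underflow
is what actually happens in your run is something you would need to verify.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-d normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def product_pdf(xs, means, stds):
    """Composite density as a plain product of per-dimension densities."""
    p = 1.0
    for x, m, s in zip(xs, means, stds):
        p *= gaussian_pdf(x, m, s)
    return p


def log_pdf(xs, means, stds):
    """Same quantity in log space: a sum of per-dimension log densities."""
    total = 0.0
    for x, m, s in zip(xs, means, stds):
        z = (x - m) / s
        total += -0.5 * z * z - math.log(s * math.sqrt(2.0 * math.pi))
    return total


# Hypothetical setup: one point and two candidate clusters in 5000 dimensions.
dims = 5000
point = [0.0] * dims
stds = [1.0] * dims
near_means = [0.0] * dims   # cluster centred exactly on the point
far_means = [0.1] * dims    # cluster slightly off in every dimension

# The plain products collapse to the same value, so the clusters tie.
print(product_pdf(point, near_means, stds), product_pdf(point, far_means, stds))

# The log densities stay finite and still rank the nearer cluster higher.
print(log_pdf(point, near_means, stds), log_pdf(point, far_means, stds))
```

A custom model written along these lines would compare clusters by log
density rather than by the raw product. That is one possible design choice,
not a description of what GaussianCluster does internally.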

Jeff

-----Original Message-----
From: edward choi [mailto:[email protected]] 
Sent: Tuesday, October 18, 2011 7:06 AM
To: [email protected]
Subject: Dirichlet Process Clustering not working

Hi,

This is my first time using Mahout, though I have been playing with Hadoop
and HBase for over a year.

I collected several hundred thousand news articles from RSS, and I wanted to
run Dirichlet process clustering (DPC) on them.
I did as the Mahout wiki told me to do (making sequence files from normal
documents, then making them into vectors, then doing DPC, and finally
clusterdumping).
My DPC settings were: 20 clusters, 10 iterations, 2.0 alpha, clustering true,
emitMostLikely false. No modelDist, modelPrototype, or distanceMeasure was
specified.
The number of documents was 5896. (I preprocessed the docs so that they would
only contain verbs and nouns.)
The result was not what I had expected.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
C-0: GC:0{n=5896 c=[0:0.001, 0.07:0.000, 0.08:0.000, 0
.......................
    Top Terms:
        comment                                 =>0.015425061539023016
        2011                                    =>0.011413068888273332
        reserve                                 =>0.011253999429472274
        rights                                  => 0.01115527360420605
        use                                     =>0.010942002711960384
        rights reserve                          =>0.010882667414113879
        copyright                               =>0.010399572042096333
        publish                                 =>0.009924242339732702
        time                                    => 0.00988611270657134
        material                                =>0.009849842593611612

C-1: GC:1{n=0 c=[0:-0.239, 0.07:0.775, 0.08:-0.767,.....
    Top Terms:.......

C-10: GC:10{n=0 c=[0:-1.116, 0.07:-0............
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This is what the clusterdump looks like. To my understanding, this means
that all the documents were assigned to one cluster, namely C-0.
I changed the DPC settings around. I also changed the process of making
vectors a bit, but always got the same result.
I was so clueless that I tried k-means with the exact same documents and
vectors, and it worked! I don't know how I am supposed to understand
this.
I searched Google but found no definite solution, so I guess everybody
else is doing fine with DPC.

Could someone please tell me what I am doing wrong? (Oh, and I am using
Mahout in standalone mode.)

Regards,
Ed
