Check out TestClusterDumper.testDirichlet2&3 for an example of text clustering using DPC. It produces reasonable-looking clusters when compared with k-means and the other algorithms, but on a small vocabulary. Also check out DisplayDirichlet, which does a great job of clustering some random 2-d data.
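
One caveat about the default GaussianCluster model is numeric: a composite density formed as a product of many per-dimension pdfs can underflow to zero once the vectors get long. The sketch below is plain Python, not Mahout code; the function names, the 5000-dimension vector, and the unit standard deviations are all made up for illustration. It compares the plain product with the equivalent sum of log densities.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-d normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def product_pdf(xs, means, stds):
    """Composite density as a plain product of per-dimension densities."""
    p = 1.0
    for x, m, s in zip(xs, means, stds):
        p *= gaussian_pdf(x, m, s)
    return p

def log_pdf(xs, means, stds):
    """The same quantity in log space: a sum of per-dimension log densities."""
    total = 0.0
    for x, m, s in zip(xs, means, stds):
        z = (x - m) / s
        total += -0.5 * z * z - math.log(s * math.sqrt(2.0 * math.pi))
    return total

# Hypothetical data: a point sitting exactly on the cluster mean,
# i.e. the best possible match, in 5000 dimensions with unit std.
dims = 5000
xs = [0.0] * dims
means = [0.0] * dims
stds = [1.0] * dims

small = product_pdf(xs[:2], means[:2], stds[:2])  # 2 dimensions: fine
large = product_pdf(xs, means, stds)              # 5000 dimensions: underflows
large_log = log_pdf(xs, means, stds)              # stays finite

print("2-d product:      ", small)
print("5000-d product:   ", large)
print("5000-d log density:", large_log)
```

If a custom model turns out to be necessary, working in log space as log_pdf does is one common way around this. Whether underflow is actually what happens in a given run is something to verify rather than assume.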
I'd suggest trying the default alpha of 1.0, as is done in the cluster dumper tests. Also, the default model is GaussianCluster, and it may not perform well with a large feature space. Check the pdf() function, which uses the product of the component pdfs to produce the composite value for each cluster. This may not be optimal for really large term vectors. How many elements are in your term vectors? You may need to create your own model and model distribution to make DPC perform well on your data.

Jeff

-----Original Message-----
From: edward choi [mailto:[email protected]]
Sent: Tuesday, October 18, 2011 7:06 AM
To: [email protected]
Subject: Dirichlet Process Clustering not working

Hi,

This is my first time using Mahout, though I have been working with Hadoop and HBase for over a year. I collected several hundred thousand news articles from RSS, and I wanted to run Dirichlet process clustering (DPC) on them. I followed the steps on the Mahout wiki: making sequence files from the plain documents, converting them into vectors, running DPC, and finally running clusterdump.

My DPC settings were: 20 clusters, 10 iterations, alpha 2.0, clustering true, emitMostLikely false. No modelDist, modelPrototype, or distanceMeasure was specified. The number of documents was 5896. (I preprocessed the docs so that they would contain only verbs and nouns.)

The result was not what I had expected.

------------------------------------------------------------
C-0: GC:0{n=5896 c=[0:0.001, 0.07:0.000, 0.08:0.000, 0 .......................
Top Terms:
comment => 0.015425061539023016
2011 => 0.011413068888273332
reserve => 0.011253999429472274
rights => 0.01115527360420605
use => 0.010942002711960384
rights reserve => 0.010882667414113879
copyright => 0.010399572042096333
publish => 0.009924242339732702
time => 0.00988611270657134
material => 0.009849842593611612
C-1: GC:1{n=0 c=[0:-0.239, 0.07:0.775, 0.08:-0.767,.....
Top Terms:.......
C-10: GC:10{n=0 c=[0:-1.116, 0.07:-0............
------------------------------------------------------------

This is what the clusterdump output looks like. To my understanding, it means that all the documents were assigned to a single cluster, namely C-0. I changed the DPC settings around, and I also changed the vector creation process a bit, but I always got the same result.

I was at a loss, so I tried k-means with the exact same documents and vectors, and it worked! I don't know what to make of this. I searched Google but found no definite solution, so I guess everybody else is doing fine with DPC.

Could someone please tell me what I am doing wrong? (Oh, and I am using Mahout in standalone mode.)

Regards,
Ed
