I did try DPC with alpha 1.0, with no luck. Then I tried alpha 2.0, still no luck. I doubt that the problem is related to parameter settings.
I don't know the exact number, but I am pretty sure that the number of features in my document vectors is easily over 150,000. I wanted to use all numerical figures and all kinds of nouns and verbs. I normalized the nouns and verbs, but they should still exceed at least 100,000. I guess that is too large a number of features. (FYI, I set maxDFPercent to 50 when making the vectors.)

I'll give TestClusterDumper.testDirichlet a try. And I should definitely test with the Reuters document set as well, to see if there is any difference from using my own document set.

Thanks for the advice. I'll make a post when done testing.

Regards,
Ed

2011/10/19 Jeff Eastman <[email protected]>

> Check out TestClusterDumper.testDirichlet2&3 for an example of text
> clustering using DPC. It produces reasonable looking clusters when compared
> with k-means and the other algorithms, but on a small vocabulary. Also check
> out DisplayDirichlet, which does a great job of clustering some random 2-d
> data.
>
> I'd suggest trying the default 1.0 alpha as is done in the cluster dumper
> tests. Also, the default model is GaussianCluster and it may not perform
> well with a large feature space. Check the pdf() function which uses the
> product of the component pdfs to produce the composite value for each
> cluster. This may not be optimal for really large term vectors. How many
> elements are in your term vectors? You may need to create your own model and
> model distribution to make DPC perform on your data.
>
> Jeff
>
> -----Original Message-----
> From: edward choi [mailto:[email protected]]
> Sent: Tuesday, October 18, 2011 7:06 AM
> To: [email protected]
> Subject: Dirichlet Process Clustering not working
>
> Hi,
>
> This is my first time using Mahout, though it's been over a year playing
> with Hadoop and Hbase.
>
> I collected several hundred thousand news articles from RSS. And I wanted
> to do a dirichlet process clustering(DPC) with them.
> I did as the mahout wiki told me to do.
> (Making sequence files from normal
> documents, then making them into vectors, and then doing DPC, then finally
> clusterdumping)
> My DPC setting was: 20 clusters. 10 iterations. 2.0 alpha. clustering true,
> emitMostLikely false. No modelDist, modelPrototype, distanceMeasure was
> specified.
> Number of documents were 5896. (I preprocessed the docs so that they would
> only contain verbs and nouns).
> The result was not what i had expected.
>
> ------------------------------------------------------------
> C-0: GC:0{n=5896 c=[0:0.001, 0.07:0.000, 0.08:0.000, 0
> .......................
> Top Terms:
> comment =>0.015425061539023016
> 2011 =>0.011413068888273332
> reserve =>0.011253999429472274
> rights => 0.01115527360420605
> use =>0.010942002711960384
> rights reserve =>0.010882667414113879
> copyright =>0.010399572042096333
> publish =>0.009924242339732702
> time => 0.00988611270657134
> material =>0.009849842593611612
>
> C-1: GC:1{n=0 c=[0:-0.239, 0.07:0.775, 0.08:-0.767,.....
> Top Terms:.......
>
> C-10: GC:10{n=0 c=[0:-1.116, 0.07:-0............
> ------------------------------------------------------------
>
> This is what the clusterdump looks like. To my understanding, this means
> that all the documents were assigned to one cluster point, namely C-0.
> I changed the DPC settings around. I also changed the process of making
> vectors a bit, but always the same result.
> I was so out of clue, I tried Kmeans with the exact same documents and
> vectors. And they worked!!! I don't know how I am supposed to understand
> this.
> I looked up google but there was no definite solution so I guess everybody
> else is doing fine with DPC.
>
> Please could someone tell me what I am doing wrong? (oh, and I am using
> standalone mode with Mahout)
>
> Regards,
> Ed
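
A rough sketch of why a product of per-dimension pdfs can be a concern with a very large feature space, as Jeff suggests. This is not Mahout's GaussianCluster.pdf() code. It is a standalone Python illustration, and the function names, the unit standard deviation, and the all-zero test vectors are assumptions made up for the example. It only shows the numerical effect: multiplying many densities that are each below 1 drives the composite value toward zero in double precision, while summing logs stays finite.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-d normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def product_pdf(xs, means, stds):
    """Composite density as a plain product of per-dimension densities."""
    p = 1.0
    for x, m, s in zip(xs, means, stds):
        p *= gaussian_pdf(x, m, s)
    return p


def log_pdf(xs, means, stds):
    """The same composite density, accumulated as a sum of logs."""
    total = 0.0
    for x, m, s in zip(xs, means, stds):
        z = (x - m) / s
        total += -0.5 * z * z - math.log(s * math.sqrt(2.0 * math.pi))
    return total


def compare(dims):
    """Evaluate both forms for a point sitting exactly on the cluster mean.

    Hypothetical setup: every dimension has mean 0 and std 1.
    """
    xs = [0.0] * dims
    means = [0.0] * dims
    stds = [1.0] * dims
    return product_pdf(xs, means, stds), log_pdf(xs, means, stds)


if __name__ == "__main__":
    for dims in (2, 100, 1000, 150000):
        product, log_value = compare(dims)
        print(dims, product, log_value)
```

Under these assumptions the plain product is still positive at 100 dimensions but underflows to 0.0 well before 150,000, even for a point lying exactly on the mean. Whether this is what makes every document land in C-0 here is not established by the thread; it is only one way a product-based pdf can stop discriminating between clusters when the term vectors are very large.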
