I agree something is amiss here, but it could be that the model is just not suitable for this problem. Running with the Reuters dataset, I see all the points being assigned to C-0 in the very first iteration, as you do. I think the problem is with the pdf() calculations in the mapper for very wide vectors such as the ones we are using. For lower-dimensional vectors, DPC appears to be working great.
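
To make the suspicion about pdf() on wide vectors concrete, here is a minimal sketch of one way such a calculation can go wrong. It is not Mahout's code and not a confirmed diagnosis: it assumes a spherical Gaussian model whose density is multiplied out one dimension at a time, and the function names, the 50,000-dimension vector and the two cluster means are all made up for illustration. Under those assumptions the density underflows to 0.0 for every cluster, so the clusters can no longer be told apart, while the same comparison done in log space stays finite.

```python
import math

def normal_pdf(x, mean, sd):
    """Spherical Gaussian density, multiplied out one dimension at a time."""
    p = 1.0
    for xi, mi in zip(x, mean):
        z = (xi - mi) / sd
        p *= math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return p

def normal_log_pdf(x, mean, sd):
    """Same density, but summed in log space so it cannot underflow."""
    log_norm = -math.log(sd * math.sqrt(2.0 * math.pi))
    return sum(log_norm - 0.5 * ((xi - mi) / sd) ** 2
               for xi, mi in zip(x, mean))

def normalize_log_weights(log_ws):
    """Turn log densities into probabilities that sum to 1 (log-sum-exp trick)."""
    m = max(log_ws)
    ws = [math.exp(lw - m) for lw in log_ws]
    total = sum(ws)
    return [w / total for w in ws]

# Hypothetical setup: one very wide point and two candidate cluster means.
dims = 50000
x = [0.0] * dims
mean_a = [0.0] * dims    # cluster A sits exactly on the point
mean_b = [0.01] * dims   # cluster B is slightly off in every dimension

naive_a = normal_pdf(x, mean_a, 1.0)
naive_b = normal_pdf(x, mean_b, 1.0)
print(naive_a, naive_b)  # both underflow to 0.0

log_a = normal_log_pdf(x, mean_a, 1.0)
log_b = normal_log_pdf(x, mean_b, 1.0)
print(normalize_log_weights([log_a, log_b]))  # A is favoured, B is still nonzero
```

Whether the mapper's pdf() actually fails this way is something I still have to check; the sketch only shows that densities over tens of thousands of dimensions are far outside the range of a double unless they are kept in log space.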
I'm going to commit the build-reuters.sh enhancements I've added for FuzzyK and DPC so we can both use the same platform. I will report more progress as I dig in deeper today...

-----Original Message-----
From: edward choi [mailto:[email protected]]
Sent: Wednesday, October 19, 2011 8:11 AM
To: [email protected]
Subject: Re: Dirichlet Process Clustering not working

Okay, I've just tried DPC with the Reuters document set. I let 'build-reuters.sh' create the sequence files and vectors. (Judging from the dictionary generated by Mahout, the number of features seemed to be less than 100,000.) Then I used them to run DPC (15 clusters, 10 iterations, 1.0 alpha, clustering true, no additional options).

Below is the clusterdump output for clusters-10:

----------------------------------------------------------------------------------------------------------------------------
C-0: GC:0{n=15745 c=[0:0.026, 0.003:0.001, 0.01:0.004, 0.02:0.002, 0.05:0.004, 0.07:0.005, 0.07
    Top Terms:
        said => 1.6577128281476725
        mln => 1.2455441154347937
        dlrs => 1.1173752482257673
        3 => 1.042824193090437
        pct => 1.0223684722334667
        reuter => 0.9934255143959358
C-1: GC:1{n=0 c=[0:-0.595, 0.003:0.228, 0.01:-0.401, 0.02:-0.711, 0.05:1.840, 0.07:0.136, 0.077:-0.739, 0.1:-0.177, 0.10:
    Top Terms:....
C-10: GC:10{n=0 c=[0:0.090, 0.003:-1.426, 0.01:-0.472, 0.02:0.672, 0.05:0.800, 0.07:0.691, 0.077:1.037, 0.1:0
    Top Terms:....
C-11: GC:11{n=0 c=[0:-0.835, 0.003:-1.748, 0.01:-1.030, 0.02:-1.760, 0.05:-0.343, 0.07:0.286, 0.077:1.179,
    Top Terms:....
----------------------------------------------------------------------------------------------------------------------------

I guess the same thing happened again, so the document set is not the problem. Something is definitely wrong with DPC.

The interesting thing is that the center of the first cluster does not have a single negative value in it, while the centers of the remaining clusters have a lot of negative values.
So I guess this phenomenon has something to do with the first cluster hogging all the documents. Any comments on this result?

(I haven't tried TestClusterDumper.testDirichlet2&3 yet. I'll post another thread when I am done with that.)

Regards,
Ed
