The trick was switching to the distance measure model. The default Gaussian model was computing a per-element Gaussian pdf for each point, for each of the thousands of dimensions, and for each cluster, and then multiplying all the term pdfs together, which underflowed! A 100x improvement seems about right. Glad it is working so well; I figured it could be coaxed into it. I'm still concerned about your negative term weights, though. Are they coming from your dataset? (More on that at the bottom of this message.)
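To make the underflow concrete, here is a toy illustration (not the actual Mahout classes, just the arithmetic) of what happens when you multiply tens of thousands of per-term Gaussian densities, versus accumulating the same quantity in log space or scoring with a single distance per point/cluster pair:

// Toy illustration only: shows why a running product of per-dimension Gaussian
// densities underflows for wide vectors, while a log-space sum or a single
// distance-based score stays representable.
public class PdfUnderflowDemo {

  // Gaussian density at x for mean mu and standard deviation sigma.
  static double dNorm(double x, double mu, double sigma) {
    double d = (x - mu) / sigma;
    return Math.exp(-0.5 * d * d) / (sigma * Math.sqrt(2.0 * Math.PI));
  }

  public static void main(String[] args) {
    int dims = 41807;      // roughly the Reuters tf-idf width mentioned below
    double product = 1;    // running product, as in GaussianCluster.pdf()
    double logSum = 0;     // the same quantity accumulated in log space

    for (int i = 0; i < dims; i++) {
      double p = dNorm(0.1, 0.0, 1.0);  // each term is well below 1
      product *= p;
      logSum += Math.log(p);
    }

    System.out.println("product of term pdfs = " + product);  // 0.0 after a few hundred terms
    System.out.println("sum of log pdfs      = " + logSum);   // large negative, but finite

    // A distance-measure model never forms that product: one distance per
    // point/cluster pair, mapped to a score (1/(1+d) is one possible mapping).
    double d = 0.3;  // e.g. a cosine distance
    System.out.println("distance-based score = " + (1.0 / (1.0 + d)));
  }
}

One score per point/cluster pair also means no per-dimension exp() calls per document, which is presumably a big part of the speedup you are seeing on these wide tf-idf vectors.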
-----Original Message-----
From: edward choi [mailto:[email protected]]
Sent: Thursday, October 27, 2011 9:57 AM
To: [email protected]
Subject: Re: Dirichlet Process Clustering not working

I downloaded the most recent version of Mahout from the Apache SVN. Using the new arguments, I tested DPC on my own news documents (not the Reuters set).
It turns out there really were great improvements. First of all, the documents are now somewhat distributed across the 20 clusters. The total number of documents was 5896. DC-0 had 1014 documents and DC-1 had 4305. Nine clusters had zero documents, and the rest had between 1 and 214 documents each.
The quality of the clusters wasn't great, but I guess that has to do with the crude preprocessing step (raw news documents have links, ads, reader comments, etc.). I will know better when I test with build-reuters.sh.
One more thing: unfortunately there are still some negative values in the cluster points.

----------------------------------------------------------------------
DC-16 total= 0 model= DMC:16{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327, 0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792, 0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
  Top Terms:
    kodak camera => 4.5009259007672835
    player july => 4.216287519075373
    figure mix => 4.139826527167421
    department defense => 4.009974576583582
    remark wednesday => 3.9945681051149564
    counsel infection => 3.886000915158471
    jefferson county => 3.8442975919513667
    jersey say => 3.7821696224124786
    tell couple => 3.7644857721992415
    3.5 million => 3.743525174300145
DC-18 total= 0 model= DMC:18{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327, 0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792, 0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
  Top Terms:
    kodak camera => 4.5009259007672835
    player july => 4.216287519075373
    figure mix => 4.139826527167421
    department defense => 4.009974576583582
    remark wednesday => 3.9945681051149564
    counsel infection => 3.886000915158471
    jefferson county => 3.8442975919513667
    jersey say => 3.7821696224124786
    tell couple => 3.7644857721992415
    3.5 million => 3.743525174300145
DC-19 total= 0 model= DMC:19{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327, 0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792, 0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
  Top Terms:
    kodak camera => 4.5009259007672835
    player july => 4.216287519075373
    figure mix => 4.139826527167421
    department defense => 4.009974576583582
    remark wednesday => 3.9945681051149564
    counsel infection => 3.886000915158471
    jefferson county => 3.8442975919513667
    jersey say => 3.7821696224124786
    tell couple => 3.7644857721992415
    3.5 million => 3.743525174300145
----------------------------------------------------------------------

Among the nine clusters that have zero members, the three above have negative values. Interestingly, all three of them have exactly the same values and top terms. I wonder what this means.
Anyway, I'll post another thread when I have played around with the Reuters set.

Ed

P.S. The runtime has indeed been reduced significantly!
Possibly 100 times faster, as you said. Loved it!

2011/10/20 Jeff Eastman <[email protected]>

> R1186452 commits a few small changes that seem to do much better with Reuters than before:
> - fixed DistanceMeasureClusterDistribution to generate Gaussian element values in the prior clusters. Zero values in the previous implementation don't work with CosineDistanceMeasure.
> - changed the Dirichlet arguments to use DMCD and CosineDM in build-reuters.sh
> - switched -mp to DenseVector since all the prior center elements are Gaussian and generally non-zero
> - increased -a0 to 2
>
> Build-reuters now does a much better job with the wide topic vectors using the DMCD/CosineDM. And it runs maybe 100x faster too. Here are the new arguments:
>
> $MAHOUT dirichlet \
>   -i ${WORK_DIR}/reuters-out-seqdir-sparse-dirichlet/tfidf-vectors \
>   -o ${WORK_DIR}/reuters-dirichlet -k 20 -ow -x 10 -a0 2 \
>   -md org.apache.mahout.clustering.dirichlet.models.DistanceMeasureClusterDistribution \
>   -mp org.apache.mahout.math.DenseVector \
>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>
> -----Original Message-----
> From: Jeff Eastman [mailto:[email protected]]
> Sent: Wednesday, October 19, 2011 9:53 AM
> To: [email protected]
> Subject: RE: Dirichlet Process Clustering not working
>
> The pdf() implementation in GaussianCluster is pretty lame. It computes a running product of the element pdfs which, for wide input vectors (Reuters is 41,807 dimensions), always underflows and returns 0. Here's the code:
>
> public double pdf(VectorWritable vw) {
>   Vector x = vw.get();
>   // return the product of the component pdfs
>   // TODO: is this reasonable? correct? It seems to work in some cases.
>   double pdf = 1;
>   for (int i = 0; i < x.size(); i++) {
>     // small prior on stdDev to avoid numeric instability when stdDev==0
>     pdf *= UncommonDistributions.dNorm(x.getQuick(i),
>         getCenter().getQuick(i), getRadius().getQuick(i) + 0.000001);
>   }
>   return pdf;
> }
>
> -----Original Message-----
> From: Jeff Eastman [mailto:[email protected]]
> Sent: Wednesday, October 19, 2011 9:04 AM
> To: [email protected]
> Subject: RE: Dirichlet Process Clustering not working
>
> I agree something is amiss here, but it could be that the model is just not suitable for this problem. Running with the Reuters dataset, I see all the points being assigned to C-0 in the very first iteration, as you do. I think the problem is with the pdf() calculations in the mapper for very wide vectors such as we are using. For smaller-dimension vectors, DPC appears to be working great.
>
> I'm going to commit the build-reuters.sh enhancements I've added for FuzzyK and DPC so we can both use the same platform. I will report more progress as I dig in deeper today...
>
> -----Original Message-----
> From: edward choi [mailto:[email protected]]
> Sent: Wednesday, October 19, 2011 8:11 AM
> To: [email protected]
> Subject: Re: Dirichlet Process Clustering not working
>
> Okay, I've just tried DPC with the Reuters document set.
> I let build-reuters.sh create the sequence files and vectors. (From the looks of the dictionary generated by Mahout, the number of features seemed to be less than 100,000.)
> Then I used them to do DPC.
> (15 clusters, 10 iterations, alpha 1.0, clustering true, no additional options)
> Below is the result of the clusterdump of clusters-10:
>
> ----------------------------------------------------------------------
> C-0: GC:0{n=15745 c=[0:0.026, 0.003:0.001, 0.01:0.004, 0.02:0.002, 0.05:0.004, 0.07:0.005, 0.07
>   Top Terms:
>     said => 1.6577128281476725
>     mln => 1.2455441154347937
>     dlrs => 1.1173752482257673
>     3 => 1.042824193090437
>     pct => 1.0223684722334667
>     reuter => 0.9934255143959358
> C-1: GC:1{n=0 c=[0:-0.595, 0.003:0.228, 0.01:-0.401, 0.02:-0.711, 0.05:1.840, 0.07:0.136, 0.077:-0.739, 0.1:-0.177, 0.10:
>   Top Terms: ....
> C-10: GC:10{n=0 c=[0:0.090, 0.003:-1.426, 0.01:-0.472, 0.02:0.672, 0.05:0.800, 0.07:0.691, 0.077:1.037, 0.1:0
>   Top Terms: ....
> C-11: GC:11{n=0 c=[0:-0.835, 0.003:-1.748, 0.01:-1.030, 0.02:-1.760, 0.05:-0.343, 0.07:0.286, 0.077:1.179,
>   Top Terms: ....
> ----------------------------------------------------------------------
>
> I guess the same thing happened again, so the document set is not the problem. Something is definitely wrong with DPC.
> The interesting thing is that the first cluster point does not have a single negative value in it, while the rest of the cluster points have a lot of negative values. So I guess this phenomenon has something to do with the first cluster hogging all the documents.
> Any comments on this result?
> (I haven't tried TestClusterDumper.testDirichlet2&3 yet. I'll post another thread when I am done with that.)
>
> Regards,
> Ed
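One more thought on the negative term weights, before you dig into the Reuters run: since the prior clusters are now seeded with Gaussian element values (per the commit note above), a cluster that finishes with n=0 may simply still be displaying the prior center it was sampled with, and roughly half of those sampled elements will be negative. That is only a guess, and the sketch below is illustrative (the class and the seeding are my own stand-in, not the actual DistanceMeasureClusterDistribution code):

import java.util.Arrays;
import java.util.Random;

// Stand-in sketch: if a prior cluster center is drawn elementwise from a
// standard normal, about half of its entries come out negative. A cluster
// that never absorbs a point (n=0) would still show those prior values
// in the cluster dump.
public class PriorCenterSketch {
  public static void main(String[] args) {
    Random rng = new Random(42);
    int dims = 10;                       // tiny, for readability
    double[] center = new double[dims];
    int negatives = 0;
    for (int i = 0; i < dims; i++) {
      center[i] = rng.nextGaussian();    // elementwise Gaussian prior value
      if (center[i] < 0) {
        negatives++;
      }
    }
    System.out.println(Arrays.toString(center));
    System.out.println(negatives + " of " + dims + " prior elements are negative");
  }
}

That three of your empty clusters came out with exactly the same values is a separate puzzle; if they were sampled independently they should differ, so the prior sampling is worth checking as well.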
