It would be interesting to figure out why it is still returning negative element values.
On October 28, 2011 at 12:39 PM jdog <[email protected]> wrote:
> Glad it's working better now. The results are about what I would expect. Some
> empty clusters indicates k was set high enough to capture the important
> models, given the alpha0 setting. The large number of documents in DC-0
> suggests adjusting a0, while increasing k, could find more subtle structure
> within your data.
>
> On October 28, 2011 at 1:29 AM edward choi <[email protected]> wrote:
>
> > Okay, I have tested with Reuters set and the result was much better than
> > testing with my news documents.
> >
> > I downloaded Reuters set, made it into sequence file. Then turned it into
> > sparse vector with following arguments:
> > --minDF 2 --maxDFPercent 50 --weight TFIDF --norm 2 -ng 2 -nv
> > Then I did DPC with the same arguments you told me.
> >
> > The total number of documents was 21578.
> > DC-0 had 11187 documents.
> > Seven clusters had zero docs.
> > Rest of the clusters had from 1 to 1189 docs.
> >
> > Very interesting thing is, DC-16,18, 19 have the exact same negative points
> > as before when I did DPC with my own document set.
> > ------------------------------------------------------------------------
> > DC-16 total= 0 model= DMC:16{n=0 c=[0:-0.411, 0.003:-0.061, 0.01:1.685,
> > 0.02:-0.560, 0.025:-0.147, 0.03:-0.675, 0.04:-0.234, 0.05:-0.430,
> > 0.06:0.451, 0.07:0.186, 0.073:-0.799, 0.077:0.724, 0.1:0.731, 0.10:2.274,
> > 0.11:-0.739, 0.12:0.660, 0.127:1.546, 0.13:0.907, 0.139:0.839, 0.14:-0.060,
> > 0.15:0.006, 0.16:0.294, 0.163:-0.458, 0.17:0.057, 0.18:0.173, 0.185:0.938,
> > 0.19:-1.340, 0.194:-0.597, 0.2:0.311, 0.20:-0.318, 0.206:-0.053,
> > 0.21:-0.198, 0.2125:-1.851, 0.22:-0.604,................
> > Top Terms:
> > jersey based => 5.055564881106928
> > withdrew offer => 4.160793145890344
> > although said => 4.1069074456260966
> > confirmed iraqi => 4.016748531705415
> > force administration => 3.995899196620034
> > 24.6 => 3.9719147317695596
> > due mostly => 3.9125799367453267
> > unit british => 3.9048586110602286
> > trade source => 3.892495010521945
> > stevens => 3.7816279439782554
> > DC-18 total= 0 model= DMC:18{n=0 c=[0:-0.411, 0.003:-0.061, 0.01:1.685,
> > 0.02:-0.560, 0.025:-0.147, 0.03:-0.675, 0.04:-0.234, 0.05:-0.430,
> > 0.06:0.451, 0.07:0.186, 0.073:-0.799, 0.077:0.724, 0.1:0.731, 0.10:2.274,
> > 0.11:-0.739, 0.12:0.660, 0.127:1.546, 0.13:0.907, 0.139:0.839, 0.14:-0.060,
> > 0.15:0.006, 0.16:0.294, 0.163:-0.458, 0.17:0.057, 0.18:0.173, 0.185:0.938,
> > 0.19:-1.340, 0.194:-0.597, 0.2:0.311, 0.20:-0.318, 0.206:-0.053,
> > 0.21:-0.198, 0.2125:-1.851, 0.22:-0.604,..............
> > Top Terms:
> > jersey based => 5.055564881106928
> > withdrew offer => 4.160793145890344
> > although said => 4.1069074456260966
> > confirmed iraqi => 4.016748531705415
> > force administration => 3.995899196620034
> > 24.6 => 3.9719147317695596
> > due mostly => 3.9125799367453267
> > unit british => 3.9048586110602286
> > trade source => 3.892495010521945
> > stevens => 3.7816279439782554
> > DC-19 total= 0 model= DMC:19{n=0 c=[0:-0.411, 0.003:-0.061, 0.01:1.685,
> > 0.02:-0.560, 0.025:-0.147, 0.03:-0.675, 0.04:-0.234, 0.05:-0.430,
> > 0.06:0.451, 0.07:0.186, 0.073:-0.799, 0.077:0.724, 0.1:0.731, 0.10:2.274,
> > 0.11:-0.739, 0.12:0.660, 0.127:1.546, 0.13:0.907, 0.139:0.839, 0.14:-0.060,
> > 0.15:0.006, 0.16:0.294, 0.163:-0.458, 0.17:0.057, 0.18:0.173, 0.185:0.938,
> > 0.19:-1.340, 0.194:-0.597, 0.2:0.311, 0.20:-0.318, 0.206:-0.053,
> > 0.21:-0.198, 0.2125:-1.851, 0.22:-0.604,...........
> > Top Terms:
> > jersey based => 5.055564881106928
> > withdrew offer => 4.160793145890344
> > although said => 4.1069074456260966
> > confirmed iraqi => 4.016748531705415
> > force administration => 3.995899196620034
> > 24.6 => 3.9719147317695596
> > due mostly => 3.9125799367453267
> > unit british => 3.9048586110602286
> > trade source => 3.892495010521945
> > stevens => 3.7816279439782554
> > ------------------------------------------------------------------------
> >
> > So I'm guessing there is some kind of algorithmic problem since the test
> > sets were different but the same DC-16,18,19 have the same values?
> >
> > Regards,
> > Ed
> >
> > 2011/10/28 edward choi <[email protected]>
> >
> > > I downloaded the most recent version of Mahout from apache SVN.
> > > Using the new arguments, I have tested DPC on my own news documents. (not
> > > reuters set)
> > >
> > > Turns out, it really had great improvements. First of all, documents are
> > > somewhat distributed across 20 clusters.
> > > The total number of documents were 5896.
> > > DC-0 had 1014 documents. DC-1 had 4305 documents.
> > > Nine clusters had zero documents. Rest of the clusters had from 1 to 214
> > > documents each.
> > >
> > > The quality of the clusters weren't so pretty but I guess that has got to
> > > do with the crude preprocessing step. (raw news documents have links, ads,
> > > reader comments, etc. etc. etc)
> > > I will know better when I test with build-reuters.sh
> > >
> > > One more thing. Unfortunately there are still some negative values in the
> > > cluster points.
> > >
> > > ------------------------------------------------------------------------
> > > DC-16 total= 0 model= DMC:16{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327,
> > > 0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792,
> > > 0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
> > > Top Terms:
> > > kodak camera => 4.5009259007672835
> > > player july => 4.216287519075373
> > > figure mix => 4.139826527167421
> > > department defense => 4.009974576583582
> > > remark wednesday => 3.9945681051149564
> > > counsel infection => 3.886000915158471
> > > jefferson county => 3.8442975919513667
> > > jersey say => 3.7821696224124786
> > > tell couple => 3.7644857721992415
> > > 3.5 million => 3.743525174300145
> > > DC-18 total= 0 model= DMC:18{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327,
> > > 0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792,
> > > 0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
> > > Top Terms:
> > > kodak camera => 4.5009259007672835
> > > player july => 4.216287519075373
> > > figure mix => 4.139826527167421
> > > department defense => 4.009974576583582
> > > remark wednesday => 3.9945681051149564
> > > counsel infection => 3.886000915158471
> > > jefferson county => 3.8442975919513667
> > > jersey say => 3.7821696224124786
> > > tell couple => 3.7644857721992415
> > > 3.5 million => 3.743525174300145
> > > DC-19 total= 0 model= DMC:19{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327,
> > > 0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792,
> > > 0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
> > > Top Terms:
> > > kodak camera => 4.5009259007672835
> > > player july => 4.216287519075373
> > > figure mix => 4.139826527167421
> > > department defense => 4.009974576583582
> > > remark wednesday => 3.9945681051149564
> > > counsel infection => 3.886000915158471
> > > jefferson county => 3.8442975919513667
> > > jersey say => 3.7821696224124786
> > > tell couple => 3.7644857721992415
> > > 3.5 million => 3.743525174300145
> > > ------------------------------------------------------------------------
> > > Among nine clusters which have zero members, above three have negative
> > > values.
> > > Interestingly, all three of them have the exact same values and top terms.
> > > I wonder what this means.
> > >
> > > Anyway I'll post another thread when I have played around with Reuters
> > > set.
> > >
> > > Ed
> > >
> > > ps. The runtime has indeed reduced significantly!!! Possibly 100 times
> > > faster as you said. Loved it!!
> > >
> > > 2011/10/20 Jeff Eastman <[email protected]>
> > >
> > >> R1186452 commits two small changes that seem to do much better with
> > >> Reuters than before:
> > >> - fixed DistanceMeasureClusterDistribution to generate Gaussian element
> > >>   values in the prior clusters. Zero values in previous implementation
> > >>   don't work with CosineDistanceMeasure.
> > >> - changed Dirichlet arguments to use DMCD and CosineDM in
> > >>   build-reuters.sh
> > >> - switched -mp to DenseVector since all the prior center elements are
> > >>   Gaussian and generally non-zero
> > >> - increased -a0 to 2
> > >>
> > >> Build-reuters now does a much better job with the wide topic vectors
> > >> using the DMCD/CosineDM. And it runs maybe 100x faster too.
> > >> Here are the new arguments:
> > >>
> > >> $MAHOUT dirichlet \
> > >>   -i ${WORK_DIR}/reuters-out-seqdir-sparse-dirichlet/tfidf-vectors \
> > >>   -o ${WORK_DIR}/reuters-dirichlet -k 20 -ow -x 10 -a0 2 \
> > >>   -md org.apache.mahout.clustering.dirichlet.models.DistanceMeasureClusterDistribution \
> > >>   -mp org.apache.mahout.math.DenseVector \
> > >>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure
> > >>
> > >> -----Original Message-----
> > >> From: Jeff Eastman [mailto:[email protected]]
> > >> Sent: Wednesday, October 19, 2011 9:53 AM
> > >> To: [email protected]
> > >> Subject: RE: Dirichlet Process Clustering not working
> > >>
> > >> The pdf() implementation in GaussianCluster is pretty lame. It is
> > >> computing a running product of the element pdfs which, for wide input
> > >> vectors (Reuters is 41,807), always underflows and returns 0. Here's the
> > >> code:
> > >>
> > >> public double pdf(VectorWritable vw) {
> > >>   Vector x = vw.get();
> > >>   // return the product of the component pdfs
> > >>   // TODO: is this reasonable? correct? It seems to work in some cases.
> > >>   double pdf = 1;
> > >>   for (int i = 0; i < x.size(); i++) {
> > >>     // small prior on stdDev to avoid numeric instability when stdDev==0
> > >>     pdf *= UncommonDistributions.dNorm(x.getQuick(i),
> > >>         getCenter().getQuick(i), getRadius().getQuick(i) + 0.000001);
> > >>   }
> > >>   return pdf;
> > >> }
> > >>
> > >> -----Original Message-----
> > >> From: Jeff Eastman [mailto:[email protected]]
> > >> Sent: Wednesday, October 19, 2011 9:04 AM
> > >> To: [email protected]
> > >> Subject: RE: Dirichlet Process Clustering not working
> > >>
> > >> I agree something is amiss here, but it could be the model is just not
> > >> suitable for this problem. Running with the Reuters dataset, I see all
> > >> the points being assigned to C-0 in the very first iteration as you do.
> > >> I think the problem is with the pdf() calculations in the mapper for
> > >> very wide vectors such as we are using. For smaller dimension vectors,
> > >> DPC appears to be working great.
> > >>
> > >> I'm going to commit the build-reuters.sh enhancements I've added for
> > >> FuzzyK and DPC so we can both use the same platform. I will report more
> > >> progress as I dig in deeper today...
> > >>
> > >> -----Original Message-----
> > >> From: edward choi [mailto:[email protected]]
> > >> Sent: Wednesday, October 19, 2011 8:11 AM
> > >> To: [email protected]
> > >> Subject: Re: Dirichlet Process Clustering not working
> > >>
> > >> Okay, I've just tried DPC with reuters document set.
> > >> I let the 'build-reuters.sh' create the sequence files and vectors. (From
> > >> the looks of its dictionary generated by mahout, the number of features
> > >> seemed to be less than 100,000)
> > >> Then I used them to do DPC. (15 clusters, 10 iterations, 1.0 alpha,
> > >> clustering true, no additional options)
> > >> Below is the result of the clusterdump of clusters-10
> > >>
> > >> ------------------------------------------------------------------------
> > >> C-0: GC:0{n=15745 c=[0:0.026, 0.003:0.001, 0.01:0.004, 0.02:0.002,
> > >> 0.05:0.004, 0.07:0.005, 0.07
> > >> Top Terms:
> > >> said => 1.6577128281476725
> > >> mln => 1.2455441154347937
> > >> dlrs => 1.1173752482257673
> > >> 3 => 1.042824193090437
> > >> pct => 1.0223684722334667
> > >> reuter => 0.9934255143959358
> > >> C-1: GC:1{n=0 c=[0:-0.595, 0.003:0.228, 0.01:-0.401, 0.02:-0.711,
> > >> 0.05:1.840, 0.07:0.136, 0.077:-0.739, 0.1:-0.177, 0.10:
> > >> Top Terms:....
> > >> C-10: GC:10{n=0 c=[0:0.090, 0.003:-1.426, 0.01:-0.472, 0.02:0.672,
> > >> 0.05:0.800, 0.07:0.691, 0.077:1.037, 0.1:0
> > >> Top Terms:....
> > >> C-11: GC:11{n=0 c=[0:-0.835, 0.003:-1.748, 0.01:-1.030, 0.02:-1.760,
> > >> 0.05:-0.343, 0.07:0.286, 0.077:1.179,
> > >> Top Terms:....
> > >>
> > >> ------------------------------------------------------------------------
> > >> I guess the same thing happened again. So the document set is not the
> > >> problem. Something is definitely wrong with DPC.
> > >> Interesting thing is that the first cluster point does not have a single
> > >> negative value in it.
> > >> Rest of the cluster points have a lot of negative values. So I guess this
> > >> phenomenon has something to do with the first cluster hogging all the
> > >> documents.
> > >> Any comments on this result?
> > >> (I haven't tried TestClusterDumper.testDirichlet2&3 yet. I'll post
> > >> another thread when I am done with that).
> > >>
> > >> Regards,
> > >> Ed
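
The underflow described in the pdf() message above can be reproduced outside Mahout. What follows is a minimal Java sketch, not the change that was committed (R1186452 moved build-reuters.sh to DistanceMeasureClusterDistribution with CosineDistanceMeasure instead). The class LogPdfSketch and the helpers dNorm and logDNorm are hypothetical stand-ins written for illustration; they are not Mahout APIs. The sketch compares the running product of per-element normal densities with the equivalent sum of log densities on a vector as wide as the Reuters one.

```java
import java.util.Arrays;

public class LogPdfSketch {

    // Hypothetical stand-in for a normal density helper (not Mahout's UncommonDistributions).
    public static double dNorm(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    // Log of the same normal density, computed without exponentiating.
    public static double logDNorm(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return -0.5 * z * z - Math.log(sd) - 0.5 * Math.log(2.0 * Math.PI);
    }

    // Same shape as the quoted pdf(): a running product of element densities.
    public static double productPdf(double[] x, double[] center, double[] radius) {
        double pdf = 1.0;
        for (int i = 0; i < x.length; i++) {
            // small prior on stdDev, as in the quoted code
            pdf *= dNorm(x[i], center[i], radius[i] + 0.000001);
        }
        return pdf;
    }

    // Log-space variant: sum the log densities instead of multiplying densities.
    public static double logPdf(double[] x, double[] center, double[] radius) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += logDNorm(x[i], center[i], radius[i] + 0.000001);
        }
        return sum;
    }

    public static void main(String[] args) {
        int width = 41807; // vector width quoted for Reuters in the thread
        double[] x = new double[width];      // all zeros
        double[] center = new double[width]; // all zeros, so x sits exactly on the center
        double[] radius = new double[width];
        Arrays.fill(radius, 1.0);

        // Each factor is below 1, so the product underflows to 0.0 on a vector this wide,
        // even for a point at the cluster center.
        System.out.println("product pdf = " + productPdf(x, center, radius));
        // The log-space sum stays a finite negative number.
        System.out.println("log pdf     = " + logPdf(x, center, radius));
    }
}
```

If log densities were used this way, the cluster assignment step would also have to normalize in log space (for example with log-sum-exp) rather than dividing raw densities; that part is not shown here.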
