I downloaded the most recent version of Mahout from Apache SVN.
Using the new arguments, I have tested DPC on my own news documents (not
the Reuters set).
It turns out the changes really are a big improvement. First of all, the
documents are now somewhat distributed across the 20 clusters.
There were 5896 documents in total.
DC-0 had 1014 documents and DC-1 had 4305. Nine clusters had zero
documents, and the rest had between 1 and 214 documents each.
The quality of the clusters wasn't great, but I suspect that has to do
with my crude preprocessing step (raw news documents contain links, ads,
reader comments, and so on).
I will know better when I test with build-reuters.sh.
One more thing: unfortunately, there are still some negative values in the
cluster points.
------------------------------------------------------------------------------
DC-16 total= 0 model= DMC:16{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327,
0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792,
0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
Top Terms:
kodak camera => 4.5009259007672835
player july => 4.216287519075373
figure mix => 4.139826527167421
department defense => 4.009974576583582
remark wednesday => 3.9945681051149564
counsel infection => 3.886000915158471
jefferson county => 3.8442975919513667
jersey say => 3.7821696224124786
tell couple => 3.7644857721992415
3.5 million => 3.743525174300145
DC-18 total= 0 model= DMC:18{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327,
0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792,
0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
Top Terms:
kodak camera => 4.5009259007672835
player july => 4.216287519075373
figure mix => 4.139826527167421
department defense => 4.009974576583582
remark wednesday => 3.9945681051149564
counsel infection => 3.886000915158471
jefferson county => 3.8442975919513667
jersey say => 3.7821696224124786
tell couple => 3.7644857721992415
3.5 million => 3.743525174300145
DC-19 total= 0 model= DMC:19{n=0 c=[0:-1.093, 0.07:-0.891, 0.08:1.327,
0.1:0.504, 0.18:-0.705, 0.2:0.318, 0.25:1.824, 0.3:0.273, 0.32:-0.792,
0.4:0.390, 0.41:-1.314, 0.5:0.727, 0.7:0.734, 0.70:-0.973,
Top Terms:
kodak camera => 4.5009259007672835
player july => 4.216287519075373
figure mix => 4.139826527167421
department defense => 4.009974576583582
remark wednesday => 3.9945681051149564
counsel infection => 3.886000915158471
jefferson county => 3.8442975919513667
jersey say => 3.7821696224124786
tell couple => 3.7644857721992415
3.5 million => 3.743525174300145
------------------------------------------------------------------------------
Among the nine clusters that have zero members, the three above have
negative values.
Interestingly, all three of them have exactly the same values and top
terms. I wonder what that means.
Anyway, I'll post another thread when I have played around with the Reuters set.
Ed
P.S. The runtime has indeed been reduced significantly! Possibly 100 times
faster, as you said. Loved it!
2011/10/20 Jeff Eastman <[email protected]>
> R1186452 commits a few small changes that seem to do much better with
> Reuters than before:
> - fixed DistanceMeasureClusterDistribution to generate Gaussian element
> values in the prior clusters; the zero values in the previous implementation
> don't work with CosineDistanceMeasure (see the note after this list)
> - changed Dirichlet arguments to use DMCD and CosineDM in build-reuters.sh
> - switched -mp to DenseVector since all the prior center elements are
> Gaussian and generally non-zero
> - increased -a0 to 2
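> (Note on the first item: CosineDistanceMeasure requires both vectors to
> have nonzero norm. For a zero prior center c, the cosine similarity
> dot(x, c) / (||x|| * ||c||) evaluates to 0/0, so the distance is undefined.
> Gaussian-sampled elements make ||c|| essentially never zero, which avoids
> the problem.)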
>
> Build-reuters now does a much better job with the wide topic vectors using
> the DMCD/CosineDM. And it runs maybe 100x faster too. Here are the new
> arguments:
>
> $MAHOUT dirichlet \
>   -i ${WORK_DIR}/reuters-out-seqdir-sparse-dirichlet/tfidf-vectors \
>   -o ${WORK_DIR}/reuters-dirichlet -k 20 -ow -x 10 -a0 2 \
>   -md org.apache.mahout.clustering.dirichlet.models.DistanceMeasureClusterDistribution \
>   -mp org.apache.mahout.math.DenseVector \
>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure
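>
> # Flag summary (as used above; descriptions inferred from this thread):
> #   -k   number of prior models to sample
> #   -x   maximum number of iterations
> #   -a0  the Dirichlet alpha_0 parameter
> #   -md  the ModelDistribution class
> #   -mp  the Vector class used for the prior model prototype
> #   -dm  the DistanceMeasure class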
>
>
> -----Original Message-----
> From: Jeff Eastman [mailto:[email protected]]
> Sent: Wednesday, October 19, 2011 9:53 AM
> To: [email protected]
> Subject: RE: Dirichlet Process Clustering not working
>
> The pdf() implementation in GaussianCluster is pretty lame. It computes a
> running product of the per-element pdfs which, for wide input vectors
> (the Reuters vectors have 41,807 dimensions), always underflows and
> returns 0. Here's the code:
>
> public double pdf(VectorWritable vw) {
>   Vector x = vw.get();
>   // return the product of the component pdfs
>   // TODO: is this reasonable? correct? It seems to work in some cases.
>   double pdf = 1;
>   for (int i = 0; i < x.size(); i++) {
>     // small prior on stdDev to avoid numeric instability when stdDev == 0
>     pdf *= UncommonDistributions.dNorm(x.getQuick(i),
>         getCenter().getQuick(i), getRadius().getQuick(i) + 0.000001);
>   }
>   return pdf;
> }
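>
> To see the scale of the problem: even if every element pdf were 0.9, the
> product over 41,807 elements would be on the order of 10^-1913, far below
> the smallest positive double (about 4.9e-324), so it underflows to exactly
> 0. The standard workaround is to accumulate log densities instead. A rough
> sketch of that idea (logPdf is a hypothetical method, not the committed
> fix; callers would have to compare and normalize in log space):
>
> public double logPdf(VectorWritable vw) {
>   Vector x = vw.get();
>   // sum the logs of the component pdfs instead of multiplying the
>   // pdfs themselves, so the result cannot underflow to 0
>   double logPdf = 0;
>   for (int i = 0; i < x.size(); i++) {
>     double mean = getCenter().getQuick(i);
>     // same small prior on stdDev as the original code
>     double sd = getRadius().getQuick(i) + 0.000001;
>     // log of the Gaussian density, expanded directly rather than
>     // taking Math.log(dNorm(...)), which would underflow first
>     logPdf += -0.5 * Math.log(2 * Math.PI) - Math.log(sd)
>         - 0.5 * Math.pow((x.getQuick(i) - mean) / sd, 2);
>   }
>   return logPdf;
> }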
>
> -----Original Message-----
> From: Jeff Eastman [mailto:[email protected]]
> Sent: Wednesday, October 19, 2011 9:04 AM
> To: [email protected]
> Subject: RE: Dirichlet Process Clustering not working
>
> I agree something is amiss here, but it could be that the model is just
> not suitable for this problem. Running with the Reuters dataset, I see all
> the points being assigned to C-0 in the very first iteration, as you do. I
> think the problem is with the pdf() calculations in the mapper for the very
> wide vectors we are using. For lower-dimensional vectors, DPC appears to
> be working great.
>
> I'm going to commit the build-reuters.sh enhancements I've added for FuzzyK
> and DPC so we can both use the same platform. I will report more progress as
> I dig in deeper today...
>
> -----Original Message-----
> From: edward choi [mailto:[email protected]]
> Sent: Wednesday, October 19, 2011 8:11 AM
> To: [email protected]
> Subject: Re: Dirichlet Process Clustering not working
>
> Okay, I've just tried DPC with the Reuters document set.
> I let build-reuters.sh create the sequence files and vectors. (Judging
> from the dictionary Mahout generated, the number of features seems to be
> less than 100,000.)
> Then I ran DPC on them (15 clusters, 10 iterations, alpha 1.0, clustering
> true, no additional options).
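> For reference, that run corresponds roughly to the invocation below. This
> is a reconstruction, not the command actually used: the paths are
> illustrative, and the clustering flag is assumed to be -cl as in Mahout's
> other clustering drivers.
>
> $MAHOUT dirichlet \
>   -i ${WORK_DIR}/reuters-out-seqdir-sparse-dirichlet/tfidf-vectors \
>   -o ${WORK_DIR}/reuters-dirichlet -k 15 -x 10 -a0 1.0 -ow -cl
>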
> Below is the result of the clusterdump of clusters-10
>
> ----------------------------------------------------------------------------
> C-0: GC:0{n=15745 c=[0:0.026, 0.003:0.001, 0.01:0.004, 0.02:0.002,
> 0.05:0.004, 0.07:0.005, 0.07
> Top Terms:
> said => 1.6577128281476725
> mln => 1.2455441154347937
> dlrs => 1.1173752482257673
> 3 => 1.042824193090437
> pct => 1.0223684722334667
> reuter => 0.9934255143959358
> C-1: GC:1{n=0 c=[0:-0.595, 0.003:0.228, 0.01:-0.401, 0.02:-0.711,
> 0.05:1.840, 0.07:0.136, 0.077:-0.739, 0.1:-0.177, 0.10:
> Top Terms:....
> C-10: GC:10{n=0 c=[0:0.090, 0.003:-1.426, 0.01:-0.472, 0.02:0.672,
> 0.05:0.800, 0.07:0.691, 0.077:1.037, 0.1:0
> Top Terms:....
> C-11: GC:11{n=0 c=[0:-0.835, 0.003:-1.748, 0.01:-1.030, 0.02:-1.760,
> 0.05:-0.343, 0.07:0.286, 0.077:1.179,
> Top Terms:....
>
> ----------------------------------------------------------------------------
> I guess the same thing happened again. So the document set is not the
> problem; something is definitely wrong with DPC.
> The interesting thing is that the first cluster point does not have a
> single negative value in it, while the rest of the cluster points have a
> lot of negative values. So I guess this phenomenon has something to do
> with the first cluster hogging all the documents.
> Any comments on this result?
> (I haven't tried TestClusterDumper.testDirichlet2&3 yet. I'll post another
> thread when I'm done with that.)
>
> Regards,
> Ed