R1186452 commits two small changes that seem to do much better with Reuters
than before:
- fixed DistanceMeasureClusterDistribution (DMCD) to generate Gaussian element
values in the prior clusters. The zero values in the previous implementation
don't work with CosineDistanceMeasure.
- changed the Dirichlet arguments in build-reuters.sh to use DMCD and CosineDM:
  - switched -mp to DenseVector, since all the prior center elements are
    Gaussian and generally non-zero
  - increased -a0 to 2
build-reuters.sh now does a much better job with the wide topic vectors using
DMCD/CosineDM. And it runs maybe 100x faster too. Here are the new arguments:
$MAHOUT dirichlet \
  -i ${WORK_DIR}/reuters-out-seqdir-sparse-dirichlet/tfidf-vectors \
  -o ${WORK_DIR}/reuters-dirichlet -k 20 -ow -x 10 -a0 2 \
  -md org.apache.mahout.clustering.dirichlet.models.DistanceMeasureClusterDistribution \
  -mp org.apache.mahout.math.DenseVector \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure
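
To illustrate why all-zero prior centers are a problem for a cosine measure, here is a minimal sketch. It uses the plain textbook formula, not Mahout's CosineDistanceMeasure (which may treat the zero-vector case differently), and the class name, helper names, vector width, and seed are all made up for the example.

```java
import java.util.Random;

public class CosinePriorSketch {
    // Textbook cosine distance: 1 - (a . b) / (|a| * |b|).
    // Assumption: this is NOT Mahout's CosineDistanceMeasure, only the bare formula.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Hypothetical prior center with standard Gaussian element values.
    static double[] gaussianCenter(int width, long seed) {
        Random rnd = new Random(seed);
        double[] center = new double[width];
        for (int i = 0; i < width; i++) {
            center[i] = rnd.nextGaussian();
        }
        return center;
    }

    public static void main(String[] args) {
        int width = 1000;
        // Toy non-negative "document" vector standing in for a tf-idf vector.
        double[] doc = new double[width];
        for (int i = 0; i < width; i += 7) {
            doc[i] = 1.0;
        }
        double[] zeroCenter = new double[width]; // old-style prior: all zeros
        System.out.println("zero center:     " + cosineDistance(zeroCenter, doc));
        System.out.println("gaussian center: " + cosineDistance(gaussianCenter(width, 42L), doc));
    }
}
```

With a zero center the norm is 0, so the bare formula divides 0 by 0 and the distance comes out as NaN; every zero prior is equally meaningless to the measure. A Gaussian center has a non-zero norm, so each prior cluster gets a well-defined and distinct distance to the document.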
-----Original Message-----
From: Jeff Eastman [mailto:[email protected]]
Sent: Wednesday, October 19, 2011 9:53 AM
To: [email protected]
Subject: RE: Dirichlet Process Clustering not working
The pdf() implementation in GaussianCluster is pretty lame. It computes a
running product of the element pdfs which, for wide input vectors (Reuters
vectors are 41,807 wide), always underflows and returns 0. Here's the code:
  public double pdf(VectorWritable vw) {
    Vector x = vw.get();
    // return the product of the component pdfs
    // TODO: is this reasonable? correct? It seems to work in some cases.
    double pdf = 1;
    for (int i = 0; i < x.size(); i++) {
      // small prior on stdDev to avoid numeric instability when stdDev==0
      pdf *= UncommonDistributions.dNorm(x.getQuick(i),
          getCenter().getQuick(i), getRadius().getQuick(i) + 0.000001);
    }
    return pdf;
  }
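
The underflow is easy to reproduce outside Mahout. The sketch below is not a proposed patch: it uses plain double[] arrays instead of Mahout vectors, a local dNorm as a stand-in for UncommonDistributions.dNorm (assumed here to be the ordinary normal density), and made-up class and method names. It compares the running product with a sum of log densities at the Reuters width.

```java
import java.util.Arrays;

public class LogPdfSketch {
    // Stand-in for UncommonDistributions.dNorm: density of N(mean, sd^2) at x.
    static double dNorm(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    // Log of the same density, computed directly so nothing is exponentiated.
    static double logDNorm(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return -0.5 * z * z - Math.log(sd) - 0.5 * Math.log(2.0 * Math.PI);
    }

    // Same shape as the quoted pdf() loop: a running product of element densities.
    static double productPdf(double[] x, double[] center, double[] radius) {
        double pdf = 1.0;
        for (int i = 0; i < x.length; i++) {
            pdf *= dNorm(x[i], center[i], radius[i] + 0.000001);
        }
        return pdf;
    }

    // Alternative: accumulate log densities, which cannot underflow this way.
    static double logPdf(double[] x, double[] center, double[] radius) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += logDNorm(x[i], center[i], radius[i] + 0.000001);
        }
        return sum;
    }

    public static void main(String[] args) {
        int width = 41807; // vector width quoted for Reuters
        double[] x = new double[width];      // point sitting exactly on the center
        double[] center = new double[width];
        double[] radius = new double[width];
        Arrays.fill(radius, 1.0);
        System.out.println("product pdf = " + productPdf(x, center, radius));
        System.out.println("log pdf     = " + logPdf(x, center, radius));
    }
}
```

Even in this best case, where the point coincides with the center, each factor is about 0.4, so the product of 41,807 of them underflows to 0.0, while the log density stays finite (roughly -3.8e4). Any code that consumes log densities would have to normalize across clusters in log space, for example by subtracting the maximum before exponentiating.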
-----Original Message-----
From: Jeff Eastman [mailto:[email protected]]
Sent: Wednesday, October 19, 2011 9:04 AM
To: [email protected]
Subject: RE: Dirichlet Process Clustering not working
I agree something is amiss here, but it could be the model is just not suitable
for this problem. Running with the Reuters dataset, I see all the points being
assigned to C-0 in the very first iteration as you do. I think the problem is
with the pdf() calculations in the mapper for very wide vectors such as we are
using. For smaller dimension vectors, DPC appears to be working great.
I'm going to commit the build-reuters.sh enhancements I've added for FuzzyK and
DPC so we can both use the same platform. I will report more progress as I dig
in deeper today...
-----Original Message-----
From: edward choi [mailto:[email protected]]
Sent: Wednesday, October 19, 2011 8:11 AM
To: [email protected]
Subject: Re: Dirichlet Process Clustering not working
Okay, I've just tried DPC with the Reuters document set.
I let 'build-reuters.sh' create the sequence files and vectors. (From
the looks of the dictionary generated by mahout, the number of features
seemed to be less than 100,000.)
Then I used them to do DPC. (15 clusters, 10 iterations, 1.0 alpha,
clustering true, no additional options)
Below is the result of the clusterdump of clusters-10:
----------------------------------------------------------------------------------------------------------------------------
C-0: GC:0{n=15745 c=[0:0.026, 0.003:0.001, 0.01:0.004, 0.02:0.002,
0.05:0.004, 0.07:0.005, 0.07
Top Terms:
said => 1.6577128281476725
mln => 1.2455441154347937
dlrs => 1.1173752482257673
3 => 1.042824193090437
pct => 1.0223684722334667
reuter => 0.9934255143959358
C-1: GC:1{n=0 c=[0:-0.595, 0.003:0.228, 0.01:-0.401, 0.02:-0.711,
0.05:1.840, 0.07:0.136, 0.077:-0.739, 0.1:-0.177, 0.10:
Top Terms:....
C-10: GC:10{n=0 c=[0:0.090, 0.003:-1.426, 0.01:-0.472, 0.02:0.672,
0.05:0.800, 0.07:0.691, 0.077:1.037, 0.1:0
Top Terms:....
C-11: GC:11{n=0 c=[0:-0.835, 0.003:-1.748, 0.01:-1.030, 0.02:-1.760,
0.05:-0.343, 0.07:0.286, 0.077:1.179,
Top Terms:....
----------------------------------------------------------------------------------------------------------------------------
I guess the same thing happened again. So the document set is not the
problem. Something is definitely wrong with DPC.
The interesting thing is that the first cluster center does not have a single
negative value in it.
The rest of the cluster centers have a lot of negative values. So I guess this
phenomenon has something to do with the first cluster hogging all the
documents.
Any comments on this result?
(I haven't tried TestClusterDumper.testDirichlet2&3 yet. I'll post another
thread when I am done with that.)
Regards,
Ed