R1186452 commits two small changes that seem to do much better with Reuters 
than before:
- fixed DistanceMeasureClusterDistribution (DMCD) to generate Gaussian element 
values in the prior clusters. The zero values in the previous implementation 
don't work with CosineDistanceMeasure.
- changed the Dirichlet arguments in build-reuters.sh to use DMCD and CosineDM:
  - switched -mp to DenseVector, since all the prior center elements are 
    Gaussian and generally non-zero
  - increased -a0 to 2
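
As a rough illustration of why an all-zero prior center is a problem for a 
cosine measure, here is a plain-Java sketch. It uses the textbook cosine 
distance, 1 - (a.b)/(|a||b|), and is not Mahout's CosineDistanceMeasure, which 
may special-case zero-length vectors. The class and method names are made up 
for this example.

```java
import java.util.Random;

// Illustration only: textbook cosine distance, not Mahout's CosineDistanceMeasure.
final class CosineZeroCenterDemo {

  // 1 - cos(angle between a and b); NaN if either vector has zero length.
  static double cosineDistance(double[] a, double[] b) {
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  // Hypothetical helper: a prior center with Gaussian element values.
  static double[] gaussianCenter(int size, long seed) {
    Random random = new Random(seed);
    double[] center = new double[size];
    for (int i = 0; i < size; i++) {
      center[i] = random.nextGaussian();
    }
    return center;
  }

  public static void main(String[] args) {
    double[] doc = {0.5, 0.0, 1.2, 0.0, 0.3};
    // All-zero center: the ratio is 0/0, so the distance is NaN.
    System.out.println(cosineDistance(doc, new double[doc.length]));
    // Gaussian center: non-zero length, so the distance is well defined.
    System.out.println(cosineDistance(doc, gaussianCenter(doc.length, 42L)));
  }
}
```

With a zero center every document gets the same undefined distance, so the 
measure cannot tell clusters apart. A Gaussian center gives each prior cluster 
its own direction.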

Build-reuters now does a much better job with the wide topic vectors using the 
DMCD/CosineDM. And it runs maybe 100x faster too. Here are the new arguments:

  $MAHOUT dirichlet \
    -i ${WORK_DIR}/reuters-out-seqdir-sparse-dirichlet/tfidf-vectors \
    -o ${WORK_DIR}/reuters-dirichlet -k 20 -ow -x 10 -a0 2 \
    -md org.apache.mahout.clustering.dirichlet.models.DistanceMeasureClusterDistribution \
    -mp org.apache.mahout.math.DenseVector \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure


-----Original Message-----
From: Jeff Eastman [mailto:[email protected]] 
Sent: Wednesday, October 19, 2011 9:53 AM
To: [email protected]
Subject: RE: Dirichlet Process Clustering not working

The pdf() implementation in GaussianCluster is pretty lame. It computes a 
running product of the element pdfs which, for wide input vectors (the Reuters 
vectors have 41,807 elements), always underflows and returns 0. Here's the code:

  public double pdf(VectorWritable vw) {
    Vector x = vw.get();
    // return the product of the component pdfs
    // TODO: is this reasonable? correct? It seems to work in some cases.
    double pdf = 1;
    for (int i = 0; i < x.size(); i++) {
      // small prior on stdDev to avoid numeric instability when stdDev==0
      pdf *= UncommonDistributions.dNorm(x.getQuick(i),
          getCenter().getQuick(i), getRadius().getQuick(i) + 0.000001);
    }
    return pdf;
  }
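
To make the underflow concrete, here is a small standalone sketch. It is not 
Mahout code: normalPdf() stands in for UncommonDistributions.dNorm, and the 
class and method names are invented for the example. It multiplies 41,807 
element pdfs the way the loop above does, then shows one common alternative, 
summing log densities. That is only a possible direction, not what was 
committed.

```java
// Illustration only: standalone stand-ins, not Mahout's GaussianCluster.
final class PdfUnderflowDemo {

  // Normal density; plays the role of UncommonDistributions.dNorm here.
  static double normalPdf(double x, double mean, double sd) {
    double z = (x - mean) / sd;
    return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
  }

  // Log of the normal density, computed directly without calling exp().
  static double logNormalPdf(double x, double mean, double sd) {
    double z = (x - mean) / sd;
    return -0.5 * z * z - Math.log(sd) - 0.5 * Math.log(2.0 * Math.PI);
  }

  // Running product of element pdfs, as in the pdf() loop quoted above.
  static double productPdf(double[] x, double[] mean, double[] sd) {
    double pdf = 1.0;
    for (int i = 0; i < x.length; i++) {
      pdf *= normalPdf(x[i], mean[i], sd[i]);
    }
    return pdf;
  }

  // Sum of element log pdfs; stays finite where the product underflows.
  static double logPdf(double[] x, double[] mean, double[] sd) {
    double sum = 0.0;
    for (int i = 0; i < x.length; i++) {
      sum += logNormalPdf(x[i], mean[i], sd[i]);
    }
    return sum;
  }

  static double[] filled(int size, double value) {
    double[] v = new double[size];
    java.util.Arrays.fill(v, value);
    return v;
  }

  public static void main(String[] args) {
    int width = 41807; // width of the Reuters vectors mentioned above
    double[] x = filled(width, 0.0);
    double[] mean = filled(width, 0.0); // best case: every element sits on the mean
    double[] sd = filled(width, 1.0);
    System.out.println(productPdf(x, mean, sd)); // underflows to 0.0
    System.out.println(logPdf(x, mean, sd));     // large negative, but finite
  }
}
```

Even in this best case each factor is about 0.4, so the product drops below the 
smallest representable double after several hundred elements and stays at 0. 
The log sum keeps enough information to compare clusters, but anything that 
consumes it would need to normalize in log space as well.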

-----Original Message-----
From: Jeff Eastman [mailto:[email protected]] 
Sent: Wednesday, October 19, 2011 9:04 AM
To: [email protected]
Subject: RE: Dirichlet Process Clustering not working

I agree something is amiss here, but it could be the model is just not suitable 
for this problem. Running with the Reuters dataset, I see all the points being 
assigned to C-0 in the very first iteration as you do. I think the problem is 
with the pdf() calculations in the mapper for very wide vectors such as we are 
using. For smaller dimension vectors, DPC appears to be working great. 

I'm going to commit the build-reuters.sh enhancements I've added for FuzzyK and 
DPC so we can both use the same platform. I will report more progress as I dig 
in deeper today...

-----Original Message-----
From: edward choi [mailto:[email protected]] 
Sent: Wednesday, October 19, 2011 8:11 AM
To: [email protected]
Subject: Re: Dirichlet Process Clustering not working

Okay, I've just tried DPC with the Reuters document set.
I let 'build-reuters.sh' create the sequence files and vectors. (Judging from
the dictionary generated by Mahout, the number of features seems to be less
than 100,000.)
Then I used them to run DPC (15 clusters, 10 iterations, 1.0 alpha,
clustering true, no additional options).
Below is the clusterdump output for clusters-10:
----------------------------------------------------------------------------------------------------------------------------
C-0: GC:0{n=15745 c=[0:0.026, 0.003:0.001, 0.01:0.004, 0.02:0.002,
0.05:0.004, 0.07:0.005, 0.07
    Top Terms:
        said                                    =>  1.6577128281476725
        mln                                     =>  1.2455441154347937
        dlrs                                    =>  1.1173752482257673
        3                                       =>   1.042824193090437
        pct                                     =>  1.0223684722334667
        reuter                                  =>  0.9934255143959358
C-1: GC:1{n=0 c=[0:-0.595, 0.003:0.228, 0.01:-0.401, 0.02:-0.711,
0.05:1.840, 0.07:0.136, 0.077:-0.739, 0.1:-0.177, 0.10:
    Top Terms:....
C-10: GC:10{n=0 c=[0:0.090, 0.003:-1.426, 0.01:-0.472, 0.02:0.672,
0.05:0.800, 0.07:0.691, 0.077:1.037, 0.1:0
    Top Terms:....
C-11: GC:11{n=0 c=[0:-0.835, 0.003:-1.748, 0.01:-1.030, 0.02:-1.760,
0.05:-0.343, 0.07:0.286, 0.077:1.179,
    Top Terms:....
----------------------------------------------------------------------------------------------------------------------------
I guess the same thing happened again. So the document set is not the
problem. Something is definitely wrong with DPC.
One interesting thing is that the first cluster point does not have a single
negative value in it, while the rest of the cluster points have a lot of
negative values. So I guess this has something to do with the first cluster
hogging all the documents.
Any comments on this result?
(I haven't tried TestClusterDumper.testDirichlet2&3 yet. I'll post another
thread when I am done with that).

Regards,
Ed

