RE: Dirichlet Process Clustering not working

Jeff Eastman Wed, 19 Oct 2011 09:04:51 -0700

I agree something is amiss here, but it could be that the model is just not 
suitable for this problem. Running with the Reuters dataset, I see all the 
points being assigned to C-0 in the very first iteration, as you do. I think the 
problem is with the pdf() calculations in the mapper for very wide vectors such 
as the ones we are using. For lower-dimensional vectors, DPC appears to be 
working great.
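
A minimal sketch of why a pdf() over very wide vectors can misbehave. It 
assumes, hypothetically, that the density is computed as a product of 
per-dimension normal densities. This is not the Mahout code, and the function 
names below are made up for illustration. With enough dimensions the product 
underflows to 0.0 in double precision, while the same quantity computed as a 
sum of logs stays finite.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def product_pdf(point, center, sd):
    """Hypothetical pdf(): product of per-dimension densities.

    Each factor is below 1 when sd is 1.0, so the product shrinks
    geometrically with the number of dimensions.
    """
    p = 1.0
    for x, m in zip(point, center):
        p *= normal_pdf(x, m, sd)
    return p

def log_pdf(point, center, sd):
    """The same quantity in log space: a sum instead of a product."""
    log_norm = -math.log(sd * math.sqrt(2.0 * math.pi))
    return sum(log_norm - 0.5 * ((x - m) / sd) ** 2
               for x, m in zip(point, center))

# Best case for the density: the point sits exactly on the center.
for dims in (10, 1000, 100000):
    point = [0.0] * dims
    center = [0.0] * dims
    print(dims, product_pdf(point, center, 1.0), log_pdf(point, center, 1.0))
```

If every cluster's pdf underflows to zero for a given point, the normalized 
assignment probabilities are no longer meaningful, which would be consistent 
with the small-vector cases working and the wide-vector cases failing.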


I'm going to commit the build-reuters.sh enhancements I've added for FuzzyK and 
DPC so we can both use the same platform. I will report more progress as I dig 
in deeper today...

-----Original Message-----
From: edward choi [mailto:[email protected]] 
Sent: Wednesday, October 19, 2011 8:11 AM
To: [email protected]
Subject: Re: Dirichlet Process Clustering not working

Okay, I've just tried DPC with the Reuters document set.
I let 'build-reuters.sh' create the sequence files and vectors. (From
the looks of the dictionary generated by mahout, the number of features
seemed to be less than 100,000.)
Then I used them to do DPC. (15 clusters, 10 iterations, 1.0 alpha,
clustering true, no additional options.)
Below is the result of the clusterdump of clusters-10:
----------------------------------------------------------------------------------------------------------------------------
C-0: GC:0{n=15745 c=[0:0.026, 0.003:0.001, 0.01:0.004, 0.02:0.002,
0.05:0.004, 0.07:0.005, 0.07
    Top Terms:
        said                                    =>  1.6577128281476725
        mln                                     =>  1.2455441154347937
        dlrs                                    =>  1.1173752482257673
        3                                       =>   1.042824193090437
        pct                                     =>  1.0223684722334667
        reuter                                  =>  0.9934255143959358
C-1: GC:1{n=0 c=[0:-0.595, 0.003:0.228, 0.01:-0.401, 0.02:-0.711,
0.05:1.840, 0.07:0.136, 0.077:-0.739, 0.1:-0.177, 0.10:
    Top Terms:....
C-10: GC:10{n=0 c=[0:0.090, 0.003:-1.426, 0.01:-0.472, 0.02:0.672,
0.05:0.800, 0.07:0.691, 0.077:1.037, 0.1:0
    Top Terms:....
C-11: GC:11{n=0 c=[0:-0.835, 0.003:-1.748, 0.01:-1.030, 0.02:-1.760,
0.05:-0.343, 0.07:0.286, 0.077:1.179,
    Top Terms:....
----------------------------------------------------------------------------------------------------------------------------
I guess the same thing happened again. So the document set is not the
problem. Something is definitely wrong with DPC.
The interesting thing is that the center of the first cluster does not have
a single negative value in it, while the centers of the rest of the clusters
have a lot of negative values. So I guess this phenomenon has something to do
with the first cluster hogging all the documents.
Any comments on this result?
(I haven't tried TestClusterDumper.testDirichlet2&3 yet. I'll post another
thread when I am done with that.)
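
The guess about negative values can be illustrated with a small sketch. It
rests on two assumptions that the dump suggests but does not confirm: that the
document vectors are sparse and non-negative (as term weights usually are), and
that the centers of the empty clusters are random draws of roughly unit
variance. All names and sizes below are hypothetical.

```python
import random

random.seed(0)
DIMS = 2000          # hypothetical vocabulary size, kept small for speed
NONZERO_TERMS = 20   # hypothetical number of distinct terms in one document

def squared_distance(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# A sparse, non-negative document vector.
doc = [0.0] * DIMS
for i in random.sample(range(DIMS), NONZERO_TERMS):
    doc[i] = random.uniform(0.5, 2.0)

# A center with small non-negative values, like C-0 in the dump.
near_zero_center = [0.01] * DIMS

# A center with unit-variance random values of both signs, like C-1 onward.
random_center = [random.gauss(0.0, 1.0) for _ in range(DIMS)]

d_first = squared_distance(doc, near_zero_center)
d_random = squared_distance(doc, random_center)
print("distance to near-zero center:", d_first)
print("distance to random center:   ", d_random)
```

Under these assumptions the random center is far from the document in nearly
every dimension where the document is zero, so the distance gap grows with the
number of features and the near-zero center wins for every document.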

Regards,
Ed
