Okay, I've just tried DPC with the Reuters document set.
I let 'build-reuters.sh' create the sequence files and vectors. (From the
looks of the dictionary generated by Mahout, the number of features seemed
to be less than 100,000.)
Then I used them to do DPC (15 clusters, 10 iterations, 1.0 alpha,
clustering true, no additional options).
Below is the result of running clusterdump on clusters-10:
----------------------------------------------------------------------------------------------------------------------------
C-0: GC:0{n=15745 c=[0:0.026, 0.003:0.001, 0.01:0.004, 0.02:0.002,
0.05:0.004, 0.07:0.005, 0.07
    Top Terms:
        said                                    =>  1.6577128281476725
        mln                                     =>  1.2455441154347937
        dlrs                                    =>  1.1173752482257673
        3                                       =>   1.042824193090437
        pct                                     =>  1.0223684722334667
        reuter                                  =>  0.9934255143959358
C-1: GC:1{n=0 c=[0:-0.595, 0.003:0.228, 0.01:-0.401, 0.02:-0.711,
0.05:1.840, 0.07:0.136, 0.077:-0.739, 0.1:-0.177, 0.10:
    Top Terms:....
C-10: GC:10{n=0 c=[0:0.090, 0.003:-1.426, 0.01:-0.472, 0.02:0.672,
0.05:0.800, 0.07:0.691, 0.077:1.037, 0.1:0
    Top Terms:....
C-11: GC:11{n=0 c=[0:-0.835, 0.003:-1.748, 0.01:-1.030, 0.02:-1.760,
0.05:-0.343, 0.07:0.286, 0.077:1.179,
    Top Terms:....
----------------------------------------------------------------------------------------------------------------------------
I guess the same thing happened again, so the document set is not the
problem. Something is definitely going wrong with DPC.
The interesting thing is that the first cluster's center does not have a
single negative value in it, while the rest of the cluster centers have a
lot of negative values. So I guess this phenomenon has something to do with
the first cluster hogging all the documents.
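(To illustrate what Jeff suggests below about the composite pdf, here is a
minimal, self-contained sketch. It is not Mahout code, and the 0.9
per-dimension density is just a made-up placeholder. The point is that
multiplying on the order of 100,000 per-dimension densities underflows to
0.0 in double precision, so every cluster would report a zero pdf and the
sampler would have nothing meaningful left to distinguish the clusters,
which could be why one cluster ends up with everything.)

    // Minimal sketch (not Mahout code): a product of ~100k per-dimension
    // densities underflows in double precision; a log-space sum does not.
    public class PdfUnderflowDemo {
        public static void main(String[] args) {
            int dims = 100000;        // roughly the dictionary size mentioned above
            double perDimPdf = 0.9;   // made-up per-dimension density, close to 1

            double product = 1.0;     // naive composite pdf: product of components
            double logSum = 0.0;      // log-space equivalent
            for (int i = 0; i < dims; i++) {
                product *= perDimPdf;
                logSum += Math.log(perDimPdf);
            }

            System.out.println("naive product = " + product); // prints 0.0 (underflow)
            System.out.println("log-space sum = " + logSum);  // about -10536, still usable
        }
    }

If that is really what the composite pdf does with term vectors of this
size, then something along the lines of a log-space model (or the custom
model/model distribution Jeff mentions below) might be needed.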
Any comments on this result?
(I haven't tried TestClusterDumper.testDirichlet2&3 yet; I'll start another
thread when I'm done with that.)

Regards,
Ed

2011/10/19 edward choi <[email protected]>

>
> I did do DPC with alpha 1.0 with no luck. Then I tried with alpha 2.0,
> still no luck. I doubt that it is a problem related to parameter setting.
>
> I don't know the exact number, but I am pretty sure that the number of
> features in my document vectors is easily over 150,000.
> I wanted to use all numerical figures and all kinds of nouns and verbs. I
> normalized the nouns and verbs, but the count should still exceed 100,000.
> I guess that is too large a number of features. (FYI, I set maxDFPercent to
> 50 when making the vectors.)
>
> I'll give TestClusterDumper.testDirichlet a try. And I should definitely
> test with the Reuters document set as well, to see if there is any
> difference from using my own document set.
> Thanks for the advice. I'll post again when I'm done testing.
>
> Regards,
> Ed
>
> 2011/10/19 Jeff Eastman <[email protected]>
>
>> Check out TestClusterDumper.testDirichlet2&3 for an example of text
>> clustering using DPC. It produces reasonable looking clusters when compared
>> with k-means and the other algorithms, but on a small vocabulary. Also check
>> out DisplayDirichlet, which does a great job of clustering some random 2-d
>> data.
>>
>> I'd suggest trying the default 1.0 alpha as is done in the cluster dumper
>> tests. Also, the default model is GaussianCluster and it may not perform
>> well with a large feature space. Check the pdf() function which uses the
>> product of the component pdfs to produce the composite value for each
>> cluster. This may not be optimal for really large term vectors. How many
>> elements are in your term vectors? You may need to create your own model and
>> model distribution to make DPC perform on your data.
>>
>> Jeff
>>
>> -----Original Message-----
>> From: edward choi [mailto:[email protected]]
>> Sent: Tuesday, October 18, 2011 7:06 AM
>> To: [email protected]
>> Subject: Dirichlet Process Clustering not working
>>
>> Hi,
>>
>> This is my first time using Mahout, though I've been playing with Hadoop
>> and HBase for over a year.
>>
>> I collected several hundred thousand news articles from RSS, and I wanted
>> to do Dirichlet process clustering (DPC) on them.
>> I did as the Mahout wiki told me to: making sequence files from the plain
>> documents, turning them into vectors, running DPC, and finally running
>> clusterdump.
>> My DPC settings were: 20 clusters, 10 iterations, 2.0 alpha, clustering
>> true, emitMostLikely false. No modelDist, modelPrototype, or
>> distanceMeasure was specified.
>> The number of documents was 5896. (I preprocessed the docs so that they
>> would only contain verbs and nouns.)
>> The result was not what I had expected.
>>
>>
>> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> C-0: GC:0{n=5896 c=[0:0.001, 0.07:0.000, 0.08:0.000, 0
>> .......................
>>    Top Terms:
>>        comment                                 =>0.015425061539023016
>>        2011                                    =>0.011413068888273332
>>        reserve                                 =>0.011253999429472274
>>        rights                                  => 0.01115527360420605
>>        use                                     =>0.010942002711960384
>>        rights reserve                          =>0.010882667414113879
>>        copyright                               =>0.010399572042096333
>>        publish                                 =>0.009924242339732702
>>        time                                    => 0.00988611270657134
>>        material                                =>0.009849842593611612
>>
>> C-1: GC:1{n=0 c=[0:-0.239, 0.07:0.775, 0.08:-0.767,.....
>>    Top Terms:.......
>>
>> C-10: GC:10{n=0 c=[0:-1.116, 0.07:-0............
>>
>> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> This is what the clusterdump looks like. To my understanding, this means
>> that all the documents were assigned to one cluster, namely C-0.
>> I changed the DPC settings around and also changed the vector-creation
>> process a bit, but I always got the same result.
>> Out of ideas, I tried k-means with the exact same documents and vectors,
>> and it worked! I don't know how I am supposed to interpret this.
>> I searched Google but found no definite solution, so I guess everybody
>> else is doing fine with DPC.
>>
>> Could someone please tell me what I am doing wrong? (Oh, and I am running
>> Mahout in standalone mode.)
>>
>> Regards,
>> Ed
>>
>
>
