My data is from ~4500 Wikipedia articles. I stripped out the wiki markup,
ran them through seq2sparse, and then reduced to 100 dimensions with ssvd
before running kmeans.

I re-ran my test with some slightly tweaked parameters to see if I could
improve the clustering. My pdf values for the most likely clusters improved
a little bit, but not dramatically.

Taking the most likely cluster's pdf value for each point, I got a minimum
pdf of 0.0215, a maximum pdf of 0.0377, and a mean pdf value of 0.0282.

Looking at all 50 pdf values for each point, I got a minimum pdf of
0.0174 and a mean pdf value of 0.0200.
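
For reference, here is a minimal sketch of how summary numbers like these
can be computed and compared against the uniform baseline of 1/k. This is
not the Mahout code: the function name summarize_pdfs and the tiny input
are made up for illustration, and it assumes each point's pdf values are
normalized so that they sum to 1.

```python
from statistics import mean


def summarize_pdfs(pdfs):
    """Summarize per-point cluster pdf values.

    pdfs: list of rows, one row per point, each row holding one pdf value
    per cluster (assumed here to be normalized to sum to 1).
    """
    k = len(pdfs[0])
    uniform = 1.0 / k  # value every cluster would get with no preference
    best = [max(row) for row in pdfs]  # most likely cluster per point
    all_values = [v for row in pdfs for v in row]
    return {
        "uniform_baseline": uniform,
        "best_min": min(best),
        "best_max": max(best),
        "best_mean": mean(best),
        "all_min": min(all_values),
        "all_mean": mean(all_values),
        # How far the typical best cluster stands out from "no preference".
        "best_mean_over_uniform": mean(best) / uniform,
    }


# Made-up input with 2 points and k = 4 clusters (not my real data).
example = [
    [0.30, 0.20, 0.25, 0.25],
    [0.25, 0.25, 0.25, 0.25],
]
summary = summarize_pdfs(example)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

With k = 50 the uniform baseline is 1/50 = 0.02, so a mean best value of
0.0282 is only about 1.4 times what a completely uninformative assignment
would give. A mean of 0.0200 over all 50 values is what you would see if
the values for each point are normalized to sum to 1.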

Do these pdf values say anything about the fit or quality of my cluster
results?


On Fri, Mar 1, 2013 at 2:56 AM, Ted Dunning <[email protected]> wrote:

> How high is the dimension?
>
> How is your data generated?
>
>
>
> On Wed, Feb 27, 2013 at 1:38 PM, Matt Molek <[email protected]> wrote:
>
> > I made a small modification to the KMeansDriver to call the
> > ClusterClassificationDriver with an emitMostLikely value of false so
> > that I could see what the pdf values of my points were for all k of
> > my clusters.
> >
> > I was expecting the most likely cluster to have a much higher pdf than
> > the other clusters in most cases, but in my results, all the values are
> > pretty close to 1/(number of clusters).
> >
> > For example, when I ran with 50 clusters, most of my points had a pdf
> > value of 0.02xx for nearly every cluster.
> >
> > I understand that to mean that for most of my points, none of my clusters
> > are a good fit. Is that right? Or is it common for the most likely
> > cluster to only deviate a tiny bit from all the others? (I wouldn't think
> > so.)
> >
> > Thanks for the advice,
> > Matt
> >
>
