My data is from ~4500 Wikipedia articles. I stripped out the wiki markup, ran them through seq2sparse, and then reduced to 100 dimensions with ssvd before running kmeans.
I re-ran my test with slightly tweaked parameters to see if I could improve the clustering. My pdf values for the most likely clusters improved a little, but not dramatically.

Taking the most likely cluster's pdf value for each point, I got a minimum pdf of 0.0215, a maximum pdf of 0.0377, and a mean pdf of 0.0282. Looking at all 50 pdf values for each point, I got a minimum pdf of 0.0174 and a mean pdf of 0.0200.

Do these pdf values say anything about the fit or quality of my cluster results?

On Fri, Mar 1, 2013 at 2:56 AM, Ted Dunning wrote:
> How high is the dimension?
>
> How is your data generated?
>
> On Wed, Feb 27, 2013 at 1:38 PM, Matt Molek wrote:
> > I made a small modification to the KMeansDriver to call the
> > ClusterClassificationDriver with an emitMostLikely value of false so that I
> > could see what the pdf values of my points were for all k of my clusters.
> >
> > I was expecting the most likely cluster to have a much higher pdf than the
> > other clusters in most cases, but in my results, all the values are pretty
> > close to 1/(number of clusters).
> >
> > For example, when I ran with 50 clusters, most of my points had a pdf value
> > of 0.02xx for nearly every cluster.
> >
> > I understand that to mean that for most of my points, none of my clusters
> > are a good fit. Is that right? Or is it common for the most likely
> > cluster to deviate only a tiny bit from all the others? (I wouldn't think so.)
> >
> > Thanks for the advice,
> > Matt
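One way to put a number on "close to 1/(number of clusters)" is to compare a point's per-cluster values against a uniform assignment. The following is only a minimal sketch, assuming the values emitted for a single point behave like weights that sum to roughly 1. The function name summarize_memberships and the two example vectors are hypothetical: only the 0.0377 maximum comes from the run described above, and the remaining entries are filled in evenly for illustration.

```python
import math


def summarize_memberships(pdfs):
    """Compare one point's per-cluster weights with the uniform value 1/k.

    Hypothetical helper: assumes `pdfs` holds the k values emitted for a
    single point and that they behave like membership weights.
    """
    k = len(pdfs)
    total = sum(pdfs)
    probs = [p / total for p in pdfs]  # renormalize so the weights sum to 1
    uniform = 1.0 / k
    # Shannon entropy of the weights; equals log(k) when all weights are equal.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return {
        "uniform": uniform,
        # 1.0 means the best cluster is no better than a uniform guess.
        "max_over_uniform": max(probs) / uniform,
        # 1.0 means completely undecided, values near 0 mean one clear winner.
        "normalized_entropy": entropy / math.log(k),
    }


# Illustrative point: best cluster at the reported maximum of 0.0377,
# with the remaining weight spread evenly over the other 49 clusters
# (the even spread is an assumption, not measured data).
flat_point = [0.0377] + [(1 - 0.0377) / 49] * 49

# Made-up contrast: a point that clearly belongs to one cluster.
sharp_point = [0.9] + [0.1 / 49] * 49

for name, point in (("flat", flat_point), ("sharp", sharp_point)):
    print(name, summarize_memberships(point))
```

Under these assumptions, even the best value reported above is less than twice the uniform value of 0.02, and the normalized entropy stays very close to 1, which is the numeric form of "no cluster stands out". Whether that reflects poor clusters or just how the pdf is computed in 100 dimensions is a separate question that this sketch does not answer.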
