As a side issue: DirichletCluster.getRadius() can include negative values. What kind of radius is that?
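For concreteness, here is a minimal sketch of the check that surfaced this, assuming Mahout's Vector API (size()/get(int)); the helper name and the printing are illustrative, not anything in Mahout itself:

    import org.apache.mahout.math.Vector;

    // Sketch: report any negative components in a cluster's radius vector.
    // 'radius' is assumed to be the value returned by DirichletCluster.getRadius().
    static void reportNegativeRadii(Vector radius) {
      for (int i = 0; i < radius.size(); i++) {
        double r = radius.get(i);
        if (r < 0.0) {
          // A per-dimension radius should be a spread, hence non-negative.
          System.out.printf("dimension %d has negative radius %f%n", i, r);
        }
      }
    }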
On Sun, Nov 7, 2010 at 5:29 PM, Lance Norskog <[email protected]> wrote:
> Snerk! Given that I don't know what I'm doing, that's not a surprise.
>
>> How did you come up with a single radius here?
> I made 7 vectors with Canopy, and fed them to k-means. The only output
> from k-means that I can see is the center, centroid, and radius
> vectors. It does not seem to have a list of the data vectors that it
> contains. But I have 7 radius vectors, not one.
>
> Anyway, I've made progress. I went off to KNIME and discovered A) a
> k-means that will assign test data points to the k-means partition
> created from the training set, and B) an MDS that lets me visualize
> the 'clusterability'.
>
> MDS is a dimension-reduction algorithm for reducing multi-dimensional
> vectors to 2 or 3 dimensions. KNIME lets me visualize:
> * the k-means centers for the training data
> * the k-means partition from the training data applied to the test data
> * the partition applied to random data
>
> The test data and random data plots looked the same. The training data
> created a beautiful 2D normal distribution plot, centered at the
> center. The test and random data both created a more random plot, a
> recognizably normal distribution, but centered off to the side. This
> follows your advice that test and training data should have different
> distributions.
>
> This whole exercise has confirmed that my vector generation does a good job.
>
> KNIME is great for this. www.knime.org - I cannot recommend it highly
> enough for data mining, especially for beginners.
>
> On Sat, Nov 6, 2010 at 6:49 PM, Ted Dunning <[email protected]> wrote:
>> This has several things that make my spidey sense tingle.
>>
>> On Sat, Nov 6, 2010 at 5:29 PM, Lance Norskog <[email protected]> wrote:
>>
>>> I have a dataset of vectors in 150 dimensions. I'm playing with clustering.
>>> The vectors should be correlated in some way and so should be somewhat
>>> clusterable. The numerical space is 0.0 <= x <= 1.0 in all directions. The
>>> norm2 for the space is 1/sqrt(dimensions).
>>>
>>
>> What does "norm2 for the space" mean? Normally a norm is applied to a
>> vector and, as a side effect, to a matrix.
>>
>>
>>> KMeans/FuzzyKMeans did not work at all.
>>
>>
>> That seems odd and somewhat unusual. 150 dimensions is more than this
>> kind of clustering handles well, but it seems like k-means should have
>> given some kind of result. What did you observe?
>>
>>
>>> Dirichlet works with an AsymmetricSampledNormalDistribution. It stops after
>>> 24 iterations but will give as many clusters as requested. (I don't know if
>>> this is expected.)
>>>
>>
>> Giving the number of clusters you specify is, I think, normal here.
>>
>>
>>> To evaluate these clusters, I am examining the radius of each cluster. The
>>> radius is a vector of per-dimension distances for the cluster. I
>>> normalize these to the 0 -> 1 space with the above norm2. I do this for my
>>> own limited mathematical intuitions.
>>>
>>
>> This is a little unusual.
>>
>> For k-means the normal things to look at are:
>> 0) the distribution of distances between randomly distributed synthetic points,
>> 1) the distribution of distances between randomly selected data points,
>> 2) the distribution of distances between a point and a randomly selected
>> centroid, and
>> 3) the distribution of distances to the nearest centroid.
>> Looking at these for the training data and for held-out data is ideal.
>> All of these distances should be computed without any normalization.
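As a concrete illustration of those four diagnostics, here is a minimal sketch in plain Java. Everything in it is an assumption made for illustration: it uses raw double[] vectors and Euclidean distance rather than Mahout's Vector types, and the data and centroid arrays are taken as given.

    import java.util.Random;

    // Sketch of the four diagnostic distance distributions described above.
    public class DistanceDiagnostics {

      static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
          double d = a[i] - b[i];
          sum += d * d;
        }
        return Math.sqrt(sum);
      }

      static double[] randomPoint(int dim, Random rnd) {
        double[] p = new double[dim];
        for (int i = 0; i < dim; i++) {
          p[i] = rnd.nextDouble(); // uniform in [0, 1), matching the 0..1 space
        }
        return p;
      }

      // 0) distances between randomly distributed synthetic points
      static double[] syntheticPairDistances(int samples, int dim, Random rnd) {
        double[] out = new double[samples];
        for (int s = 0; s < samples; s++) {
          out[s] = euclidean(randomPoint(dim, rnd), randomPoint(dim, rnd));
        }
        return out;
      }

      // 1) distances between randomly selected pairs of actual data points
      static double[] dataPairDistances(double[][] data, int samples, Random rnd) {
        double[] out = new double[samples];
        for (int s = 0; s < samples; s++) {
          out[s] = euclidean(data[rnd.nextInt(data.length)],
                             data[rnd.nextInt(data.length)]);
        }
        return out;
      }

      // 2) distance from each point to a randomly selected centroid
      static double[] randomCentroidDistances(double[][] data, double[][] centroids,
                                              Random rnd) {
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) {
          out[i] = euclidean(data[i], centroids[rnd.nextInt(centroids.length)]);
        }
        return out;
      }

      // 3) distance from each point to its nearest centroid
      static double[] nearestCentroidDistances(double[][] data, double[][] centroids) {
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) {
          double best = Double.POSITIVE_INFINITY;
          for (double[] c : centroids) {
            best = Math.min(best, euclidean(data[i], c));
          }
          out[i] = best;
        }
        return out;
      }
    }

Histogramming each returned array, once for the training data and once for the held-out data, gives exactly the comparisons described next.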
>>
>> What you should look at includes:
>>
>> - whether distribution 0 and distribution 1 are radically different.
>> Different is actually kind of good here because it means that your
>> points aren't just spread out all over
>>
>> - how different 2 and 3 are, and how different distribution 3 is for
>> training data and held-out data. 2 and 3 should be distinctly different,
>> and distribution 3 should be pretty similar for held-out data.
>>
>> For any clustering at all, I like to compare the number of points that
>> are clustered into the different clusters for held-out data versus
>> for the training data. The proportions should be about the same.
>>
>> The results:
>>> These radii, both in Canopy and Dirichlet, are all less than 1.0. Good
>>> first step. Since KMeans doesn't work, that means the clusters are probably
>>> asymmetric. The radii all have different norms. The 7 Canopy radii have, in
>>> order, 5 roughly equal radii, one small and one near-zero, showing how
>>> Canopy closes in. The Dirichlet output is a different kettle of fish. First,
>>> all of the radii have several negative values. I had assumed that the radius
>>> values would all be positive. I assume this is a loose end in the Dirichlet
>>> implementation. I normalized them by subtracting the lowest (negative) value,
>>> and this is why all have a minimum value of 0.0.
>>>
>>
>> I can't help with expectations for what these should look like. The
>> normalization makes it very hard to understand. How did you compute
>> distance to a Dirichlet cluster?
>>
>>
>>> Here are the Canopy and Dirichlet radius summaries.
>>
>>
>> How did you come up with a single radius here?
>>
>
> --
> Lance Norskog
> [email protected]

--
Lance Norskog
[email protected]
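Ted's suggestion above, comparing how points distribute across clusters for training versus held-out data, can be sketched in the same plain-Java style. Again, the euclidean() helper (from the earlier sketch) and the nearest-centroid assignment rule are illustrative assumptions, not Mahout API:

    // Sketch: fraction of points assigned to each cluster, for comparing
    // training vs. held-out data. Reuses euclidean() from the sketch above.
    static double[] clusterProportions(double[][] points, double[][] centroids) {
      int[] counts = new int[centroids.length];
      for (double[] p : points) {
        int nearest = 0;
        double best = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
          double d = euclidean(p, centroids[c]);
          if (d < best) {
            best = d;
            nearest = c;
          }
        }
        counts[nearest]++;
      }
      double[] fractions = new double[centroids.length];
      for (int c = 0; c < centroids.length; c++) {
        fractions[c] = counts[c] / (double) points.length;
      }
      return fractions;
    }

If the fraction arrays for the training set and the held-out set differ substantially, the clustering is probably fitting noise in the training data rather than real structure.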
