As a side issue: DirichletCluster.getRadius() can include negative
values. What kind of radius is that?

On Sun, Nov 7, 2010 at 5:29 PM, Lance Norskog <[email protected]> wrote:
> Snerk! Given that I don't know what I'm doing, that's not a surprise.
>
>> How did you come up with a single radius here?
> I made 7 vectors with Canopy and fed them to k-means. The only
> outputs from k-means that I can see are the center, centroid, and
> radius vectors. It does not seem to have a list of the data vectors
> it contains. But I have 7 radius vectors, not one.
>
> Anyway, I've made progress. I went off to KNIME and discovered A) a
> k-means that will assign test data points to the k-means partition
> created from the training set, and B) an MDS that lets me visualize
> the 'clusterability'.
>
> MDS is a dimension reduction algorithm for reducing multi-dimensional
> vectors to 2 or 3 dimensions (a toy sketch of classical MDS follows
> the list below). KNIME lets me visualize:
> * the k-means centers for the training data
> * the k-means partition from the training data applied to the test data
> * the partition applied to random data
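>
> In case it helps anyone reading along, classical MDS itself is simple
> enough to sketch. This is just my own toy illustration in plain Java,
> not KNIME's implementation: double-center the squared-distance matrix
> and pull out the top two eigenvectors with power iteration.
>
>   // Toy classical MDS: embed n points into 2 dimensions.
>   public class MdsSketch {
>     static double dist2(double[] a, double[] b) {
>       double s = 0;
>       for (int i = 0; i < a.length; i++) {
>         double diff = a[i] - b[i];
>         s += diff * diff;
>       }
>       return s;
>     }
>
>     /** Returns n x 2 embedded coordinates for the given points. */
>     static double[][] mds2d(double[][] pts) {
>       int n = pts.length;
>       double[][] d2 = new double[n][n];
>       for (int i = 0; i < n; i++)
>         for (int j = 0; j < n; j++)
>           d2[i][j] = dist2(pts[i], pts[j]);
>
>       // Double-center: B = -1/2 * J * D2 * J with J = I - 11'/n.
>       double[] mean = new double[n];
>       double grand = 0;
>       for (int i = 0; i < n; i++) {
>         for (int j = 0; j < n; j++) mean[i] += d2[i][j] / n;
>         grand += mean[i] / n;
>       }
>       double[][] b = new double[n][n];
>       for (int i = 0; i < n; i++)
>         for (int j = 0; j < n; j++)
>           b[i][j] = -0.5 * (d2[i][j] - mean[i] - mean[j] + grand);
>
>       // Top-2 eigenpairs by power iteration with deflation; B is
>       // positive semi-definite for Euclidean distances.
>       double[][] coords = new double[n][2];
>       java.util.Random rnd = new java.util.Random(42);
>       for (int k = 0; k < 2; k++) {
>         double[] v = new double[n];
>         for (int i = 0; i < n; i++) v[i] = rnd.nextGaussian();
>         double lambda = 0;
>         for (int iter = 0; iter < 200; iter++) { // fixed count, no tolerance
>           double[] bv = new double[n];
>           for (int i = 0; i < n; i++)
>             for (int j = 0; j < n; j++) bv[i] += b[i][j] * v[j];
>           double norm = 0;
>           for (int i = 0; i < n; i++) norm += bv[i] * bv[i];
>           lambda = Math.sqrt(norm);
>           for (int i = 0; i < n; i++) v[i] = bv[i] / lambda;
>         }
>         for (int i = 0; i < n; i++) coords[i][k] = v[i] * Math.sqrt(lambda);
>         // Deflate so the next pass finds the second eigenvector.
>         for (int i = 0; i < n; i++)
>           for (int j = 0; j < n; j++) b[i][j] -= lambda * v[i] * v[j];
>       }
>       return coords;
>     }
>   }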
>
> The test data and random data plots looked the same. The training
> data created a beautiful 2D normal distribution plot, centered at the
> center. The test and random data each created a more scattered plot,
> still a recognizably normal distribution, but centered off to the
> side. This follows your advice that test and training data should
> have different distributions.
>
> This whole exercise has confirmed that my vector generation does a good job.
>
> KNIME is great for this: www.knime.org. I cannot recommend it highly
> enough for data mining, especially for beginners.
>
> On Sat, Nov 6, 2010 at 6:49 PM, Ted Dunning <[email protected]> wrote:
>> This has several things that make my spidey sense tingle.
>>
>> On Sat, Nov 6, 2010 at 5:29 PM, Lance Norskog <[email protected]> wrote:
>>
>>> I have a dataset of vectors in 150 dimensions. I'm playing with clustering.
>>> The vectors should be correlated in some way and so should be somewhat
>>> clusterable. The numerical space is 0.0 <= x <= 1.0 in all directions. The
>>> norm2 for the space is 1/sqrt(dimensions).
>>>
>>
>> What does "norm2 for the space" mean?  Normally a norm is applied to a
>> vector and as a side effect to a matrix.
>>
>>
>>> KMeans/FuzzyKMeans did not work at all.
>>
>>
>> That seems odd and somewhat unusual.  150 dimensions is more than
>> this kind of clustering usually handles well, but it seems like
>> k-means should have given some kind of result.  What did you observe?
>>
>>
>>> Dirichlet works with an AsymmetricSampledNormalDistribution. It stops after
>>> 24 iterations but will give as many clusters as requested. (I don't know if
>>> this is expected.)
>>>
>>
>> Giving the number of clusters you specify is, I think, normal here.
>>
>>
>>> To evaluate these clusters, I am examining the radius of each cluster.
>>> The radius is a vector with one distance per dimension for the cluster
>>> vector. I normalize these to the 0 -> 1 space with the above norm2. I
>>> do this for the sake of my own limited mathematical intuition.
>>>
>>
>> This is a little unusual.
>>
>> For k-means the normal things to look at are 0) the distribution of
>> distances between randomly distributed synthetic points, 1) the
>> distribution of distances between randomly selected data points, 2) the
>> distribution of distances between a point and a randomly selected
>> centroid, and 3) the distribution of distances to the nearest centroid.
>> Looking at these for the training data and for held-out data is ideal.
>> All of these distances should be computed without any normalization.
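>>
>> In code, those four distributions are just loops over points and
>> centroids. A rough, untested sketch in plain Java -- the names and
>> the bare double[] representation are made up for illustration, not
>> Mahout's API:
>>
>>   import java.util.ArrayList;
>>   import java.util.List;
>>   import java.util.Random;
>>
>>   public class ClusterDiagnostics {
>>     static final Random RND = new Random();
>>
>>     // Plain Euclidean distance, no normalization.
>>     static double distance(double[] a, double[] b) {
>>       double s = 0;
>>       for (int i = 0; i < a.length; i++) {
>>         double diff = a[i] - b[i];
>>         s += diff * diff;
>>       }
>>       return Math.sqrt(s);
>>     }
>>
>>     static double[] randomPoint(int dim) {  // uniform in [0,1]^dim
>>       double[] p = new double[dim];
>>       for (int i = 0; i < dim; i++) p[i] = RND.nextDouble();
>>       return p;
>>     }
>>
>>     // 0) distances between randomly distributed synthetic points
>>     static List<Double> syntheticPairs(int dim, int samples) {
>>       List<Double> out = new ArrayList<Double>();
>>       for (int s = 0; s < samples; s++)
>>         out.add(distance(randomPoint(dim), randomPoint(dim)));
>>       return out;
>>     }
>>
>>     // 1) distances between randomly selected pairs of data points
>>     static List<Double> dataPairs(double[][] data, int samples) {
>>       List<Double> out = new ArrayList<Double>();
>>       for (int s = 0; s < samples; s++)
>>         out.add(distance(data[RND.nextInt(data.length)],
>>                          data[RND.nextInt(data.length)]));
>>       return out;
>>     }
>>
>>     // 2) distance from each point to a randomly selected centroid
>>     static List<Double> toRandomCentroid(double[][] data, double[][] cents) {
>>       List<Double> out = new ArrayList<Double>();
>>       for (double[] x : data)
>>         out.add(distance(x, cents[RND.nextInt(cents.length)]));
>>       return out;
>>     }
>>
>>     // 3) distance from each point to its nearest centroid
>>     static List<Double> toNearestCentroid(double[][] data, double[][] cents) {
>>       List<Double> out = new ArrayList<Double>();
>>       for (double[] x : data) {
>>         double best = Double.POSITIVE_INFINITY;
>>         for (double[] c : cents) best = Math.min(best, distance(x, c));
>>         out.add(best);
>>       }
>>       return out;
>>     }
>>   }
>>
>> Run each of those on the training data and again on held-out data,
>> then compare histograms.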
>>
>> What you should look at includes:
>>
>> - whether distribution 0 and distribution 1 are radically different.
>> Different is actually kind of good here because it means that your
>> points aren't just spread out all over.
>>
>> - how different distributions 2 and 3 are, and how different
>> distribution 3 is for training data versus held-out data.  2 and 3
>> should be distinctly different, and distribution 3 should be pretty
>> similar for held-out data.
>>
>> For any clustering at all, I like to compare the number of points that
>> are clustered into the different clusters for held-out data versus for
>> the training data.  The proportions should be about the same.
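>>
>> A sketch of that check, reusing distance() from the sketch above and
>> the same made-up double[] representation (nearest-centroid assignment,
>> not Mahout's API):
>>
>>   // Fraction of points landing in each cluster; compare the arrays
>>   // returned for training data and for held-out data.
>>   static double[] clusterProportions(double[][] data, double[][] cents) {
>>     double[] frac = new double[cents.length];
>>     for (double[] x : data) {
>>       int best = 0;
>>       double bestD = Double.POSITIVE_INFINITY;
>>       for (int k = 0; k < cents.length; k++) {
>>         double d = ClusterDiagnostics.distance(x, cents[k]);
>>         if (d < bestD) { bestD = d; best = k; }
>>       }
>>       frac[best]++;
>>     }
>>     for (int k = 0; k < frac.length; k++) frac[k] /= data.length;
>>     return frac;
>>   }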
>>
>>> The results: these radii, both in Canopy and Dirichlet, are all less
>>> than 1.0. Good first step. Since KMeans doesn't work, that means the
>>> clusters are probably asymmetric. The radii all have different norms.
>>> The 7 Canopy radii have, in order, 5 roughly equal radii, one small,
>>> and one near-zero, showing how Canopy closes in. The Dirichlet output
>>> is a different kettle of fish. First, all of the radii have several
>>> negative values. I had assumed that the radius values would all be
>>> positive. I assume this is a loose end in the Dirichlet
>>> implementation. I normalized them by subtracting the lowest
>>> (negative) value, which is why they all have a minimum value of 0.0.
>>>
>>
>> I can't help with expectations for what these should look like.  The
>> normalization makes it very hard to understand.  How did you compute
>> distance to a Dirichlet cluster?
>>
>>
>>> Here are the Canopy and Dirichlet radius summaries.
>>
>>
>> How did you come up with a single radius here?
>>
>>
>>>
>>
>
>
>
> --
> Lance Norskog
> [email protected]
>



-- 
Lance Norskog
[email protected]
