I have a dataset of vectors in 150 dimensions. I'm playing with clustering. The vectors should be correlated in some way and so should be somewhat clusterable. The numerical space is 0.0 <= x <= 1.0 in all directions. The norm2 scaling factor for the space is 1/sqrt(dimensions), so that the longest possible vector (all ones) has a scaled norm of 1.0.

KMeans/FuzzyKMeans did not work at all. Dirichlet works with an AsymmetricSampledNormalDistribution. It stops after 24 iterations but will give as many clusters as requested. (I don't know if this is expected.)

To evaluate these clusters, I am examining the radius of each cluster. The radius is a vector holding one distance per dimension for the cluster. I normalize these to the 0 -> 1 space with the above norm2 factor. I do this to suit my own limited mathematical intuition.

The results:
These radii, for both Canopy and Dirichlet, are all less than 1.0. Good first step. Since KMeans doesn't work, the clusters are probably asymmetric. The radii all have different norms. The 7 Canopy radii are, in order, 5 roughly equal radii, one small and one near-zero, showing how Canopy closes in. The Dirichlet output is a different kettle of fish. First, all of the radii have several negative values. I had assumed that the radius values would all be positive, so I assume this is a loose end in the Dirichlet implementation. I normalized each radius by shifting it up by its lowest (most negative) value, which is why all of them have a minimum value of 0.0.
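To make the shift applied to the Dirichlet radii concrete, here is a minimal Python sketch. It is not the code I ran and not Mahout API. The function name shift_to_nonnegative is hypothetical, and it assumes a radius is simply a list of floats.

```python
def shift_to_nonnegative(radius):
    """Shift a radius vector so that its smallest entry becomes 0.0.

    Hypothetical helper: subtracts the most negative entry from every
    entry. Vectors that are already non-negative are returned unchanged.
    """
    lowest = min(radius)
    if lowest >= 0.0:
        return list(radius)
    return [x - lowest for x in radius]


if __name__ == "__main__":
    print(shift_to_nonnegative([-0.2, 0.1, 0.3]))
```

Note that a constant shift like this leaves the stddev of the entries unchanged but does change the max and the norm2, so those two figures describe the shifted vectors rather than the raw Dirichlet output.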

Here are the Canopy and Dirichlet radius summaries. Min/Max/Norm come from the Vector implementation functions. Stddev is from the StandardDeviation class. Min/Max show the maximum skew of the radius oval, and the norm2 is a measure of the N-dimensional size of the oval.
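For anyone who wants to reproduce the summary figures outside Mahout, the following is a rough Python sketch of the four statistics. The function name summarize_radius is hypothetical. It assumes the norm2 is the Euclidean norm scaled by 1/sqrt(dimensions) as described above, and that the stddev is the sample (n-1) standard deviation, which I believe is what the StandardDeviation class computes by default.

```python
import math
import statistics


def summarize_radius(radius):
    """Return min, max, scaled norm2 and stddev of a radius vector.

    Hypothetical helper. The norm2 is divided by sqrt(len(radius)) so
    that an all-ones vector has a scaled norm of exactly 1.0.
    """
    dims = len(radius)
    euclidean = math.sqrt(sum(x * x for x in radius))
    return {
        "min": min(radius),
        "max": max(radius),
        "norm2": euclidean / math.sqrt(dims),
        # Sample standard deviation (n-1 denominator); an assumption.
        "stddev": statistics.stdev(radius),
    }


if __name__ == "__main__":
    print(summarize_radius([0.0, 0.25, 0.5, 0.25]))
```

With this scaling the norm2 is the root-mean-square of the entries, which is why it always lands between 0 and 1 for vectors in this space.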

Interpretation of Canopy: the norm2 values of 0.07 to 0.7 indicate very small to very large ovals. The stddev values indicate a similarly wide range, from rounded to extreme ovals.

Interpretation of Dirichlet: the norm2 values run from about 0.20 to 0.28. The stddev values fall in a similarly narrow range (about 0.075 to 0.089). Thus, Dirichlet was much better at finding good clusters.

Here are the raw data:

Canopies:
Stopped at 7 iterations. This is probably a function of my control values, but I don't understand them.

radius min: 0.127848, max: 0.82675, norm2: 0.10244, stddev: 0.97049228
radius min: 0.428303, max: 0.14688, norm2: 0.200042, stddev: 0.4691054
radius min: 0.953668, max: 0.037329, norm2: 0.076551, stddev: 0.1969004
radius min: 0.66706, max: 0.177616, norm2: 0.143568, stddev: 0.533347
radius min: 0.3834656, max: 0.093145, norm2: 0.771413, stddev: 0.2727437
radius min: 1.97559E-4, max: 0.26654, norm2: 0.476613, stddev: 0.883297
radius min: 2.72014E-308, max: 2.72014E-308, norm2: 0.0, stddev: 0.0



Dirichlet Clusters:
Allowed 50 iterations. Stopped at 24.
Length of cluster: 20
radius: min: 0.0, max: 0.4351731317613352, norm2: 0.23724015155870853, stddev: 0.08589605994209457
radius: min: 0.0, max: 0.43454768778264424, norm2: 0.2182967655938718, stddev: 0.07820922525108506
radius: min: 0.0, max: 0.4257544561005417, norm2: 0.2347278105757725, stddev: 0.08638504183578334
radius: min: 0.0, max: 0.3898861055534767, norm2: 0.19936157038662167, stddev: 0.07993022684323048
radius: min: 0.0, max: 0.4185431782190273, norm2: 0.23324067341509705, stddev: 0.08034437465659874
radius: min: 0.0, max: 0.48882417838386466, norm2: 0.2787118963689076, stddev: 0.08671269891849095
radius: min: 0.0, max: 0.4090508499677522, norm2: 0.22883939621598232, stddev: 0.07748583078136284
radius: min: 0.0, max: 0.4325558610552059, norm2: 0.2603226102820014, stddev: 0.07554530388950208
radius: min: 0.0, max: 0.39777198040477896, norm2: 0.24684110862692255, stddev: 0.08672454426749251
radius: min: 0.0, max: 0.531677760146581, norm2: 0.28330820569569837, stddev: 0.08448515053243884
radius: min: 0.0, max: 0.42377556269801, norm2: 0.22890581071124907, stddev: 0.08540251357878932
radius: min: 0.0, max: 0.4472174697924406, norm2: 0.20354417891408141, stddev: 0.08067777317911734
radius: min: 0.0, max: 0.3774646209477964, norm2: 0.2016034439565245, stddev: 0.08120738045804161
radius: min: 0.0, max: 0.41582209335459364, norm2: 0.25225877921586715, stddev: 0.08871816315297622
radius: min: 0.0, max: 0.4879159014228414, norm2: 0.21942117011373538, stddev: 0.08855141098098554
radius: min: 0.0, max: 0.4270525201114075, norm2: 0.20018637560733332, stddev: 0.08140121090799231
radius: min: 0.0, max: 0.4722323927707502, norm2: 0.27442792099816604, stddev: 0.08298142189530944
radius: min: 0.0, max: 0.37702578805324927, norm2: 0.23873257664491646, stddev: 0.0792203837674309
radius: min: 0.0, max: 0.3704620593571561, norm2: 0.19700023808010425, stddev: 0.08031515132089997
radius: min: 0.0, max: 0.46505290711258623, norm2: 0.26738097650066134, stddev: 0.07475626794172856