Okay, please disregard the previous e-mail. That hypothesis is toast; clustering works just fine with ball k-means.
So, the problem lies in streaming k-means somewhere. On Thu, Dec 6, 2012 at 12:06 AM, Dan Filimon <[email protected]> wrote: > Hi, > > One of the most basic tests for streaming k-means (and k-means in > general) is whether it works well for points that are multi-normally > distributed around the vertices of a unit cube. > > So, for a cube, there'd be 8 vertices in 3d space. Generating > thousands of points should cluster them in those 8 clusters and they > should be relatively close to the means of these multinormal > distributions. > > I decided to generalize it to more than 3 dimensions, and see how it > works for hypercubes with n dimensions and 2^n vertices. > > Not well it turns out. > > The clusters become less balanced as the number of dimensions increases. > I'm not sure if this is to be expected. I understand that in high > dimensional spaces, it becomes more likely for distances to be equal > and vectors to be orthogonal, but I'm seeing issues starting at 5 > dimensions and this doesn't seem like a particularly high number of > dimension to me. > > Is this normal? > Should the hypercube no longer have all sides equal to 1? The variance > of the multinormals is also 1. > > Thanks!
