How many clusters are you talking about?

If you pick a modest number then streaming k-means should work well if it
has several times more surrogate points than there are clusters.

Also, typically a hyper-cube test works with very small cluster radius.
 Try 0.1 or 0.01.  Otherwise, your clusters overlap and the theoretical
guarantees go out the window.  Without the guarantees, it is hard to
interpret test results.  With small radii, and a modest number of clusters,
what should happen is that the threshold in streaming k-means quickly
adapts but stays << 1 which is the minimum distance between clusters.  That
guarantees that we will have at least 1 surrogate in each real cluster.

Failure modes I can imagine could include:

a) threshold gets very big and the number of surrogates drops to 1 due to a
bug.

b) unit test has exponentially many clusters (all corners = 2^d).  This
will cause the threshold to be increased to 1 or larger and will cause us
to try to cover many clusters with a single surrogate.

c) something else (always possible)

On Wed, Dec 5, 2012 at 11:38 PM, Dan Filimon <[email protected]>wrote:

> Okay, please disregard the previous e-mail.
> That hypothesis is toast; clustering works just fine with ball k-means.
>
> So, the problem lies in streaming k-means somewhere.
>
> On Thu, Dec 6, 2012 at 12:06 AM, Dan Filimon
> <[email protected]> wrote:
> > Hi,
> >
> > One of the most basic tests for streaming k-means (and k-means in
> > general) is whether it works well for points that are multi-normally
> > distributed around the vertices of a unit cube.
> >
> > So, for a cube, there'd be 8 vertices in 3d space. Generating
> > thousands of points should cluster them in those 8 clusters and they
> > should be relatively close to the means of these multinormal
> > distributions.
> >
> > I decided to generalize it to more than 3 dimensions, and see how it
> > works for hypercubes with n dimensions and 2^n vertices.
> >
> > Not well it turns out.
> >
> > The clusters become less balanced as the number of dimensions increases.
> > I'm not sure if this is to be expected. I understand that in high
> > dimensional spaces, it becomes more likely for distances to be equal
> > and vectors to be orthogonal, but I'm seeing issues starting at 5
> > dimensions and this doesn't seem like a particularly high number of
> > dimension to me.
> >
> > Is this normal?
> > Should the hypercube no longer have all sides equal to 1? The variance
> > of the multinormals is also 1.
> >
> > Thanks!
>

Reply via email to