Still not that odd if several clusters are getting squashed.  This can
happen if the threshold increases too high or if the searcher is unable to
resolve the cube properly.  By its nature, the cube is hard to reduce to a
smaller dimension.

On Thu, Dec 6, 2012 at 12:36 AM, Dan Filimon <[email protected]>wrote:

> But the weight referred to is the distance between a centroid and the
> mean of a distribution (a cube vertice).
> This should still be very small (also BallKMeans gets it right).
>
> On Thu, Dec 6, 2012 at 1:32 AM, Ted Dunning <[email protected]> wrote:
> > IN order to succeed here, SKM will need to have maxClusters set to
> 20,000 or
> > so.
> >
> > The maximum distance between clusters on a 10d hypercube is sqrt(10) =
> 3.1
> > or so.  If three clusters get smashed together, then you have a
> threshold of
> > 1.4 or so.
> >
> >
> > On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon <
> [email protected]>
> > wrote:
> >>
> >> I wanted there to be 2^d clusters. I was wrong and didn't check: the
> >> radius is in fact 0.01.
> >>
> >> What's happening is that for 10 dimension, I was expecting ~1024
> >> clusters (or at least have small distances) but StreamingKMeans fails
> >> on both accounts.
> >> BallKMeans does in fact get the clusters.
> >>
> >> So, yes, it's probably a bug of some kind since I end up with anywhere
> >> between 400 and 1000 clusters (based on the searcher used) but the
> >> distances are still wrong.
> >>
> >> Here's how many clusters I get and the searchers I get them with [1].
> >> As you can see, the number of clusters is all over the place.
> >>
> >> The distance too is also super huge. The assert said that all
> >> distances should be less than 0.05.
> >> Here is where it fails [2].
> >> And here is the corresponding GitHub issue (no info yet) [3].
> >>
> >> [1] https://gist.github.com/4220406
> >> [2]
> >>
> https://github.com/dfilimon/knn/blob/d224eb7ca7bd6870eaef2e355012cac3aa59f051/src/test/java/org/apache/mahout/knn/cluster/StreamingKMeansTest.java#L104
> >> [3] https://github.com/dfilimon/knn/issues/1
> >>
> >> On Thu, Dec 6, 2012 at 1:03 AM, Ted Dunning <[email protected]>
> wrote:
> >> > How many clusters are you talking about?
> >> >
> >> > If you pick a modest number then streaming k-means should work well if
> >> > it
> >> > has several times more surrogate points than there are clusters.
> >> >
> >> > Also, typically a hyper-cube test works with very small cluster
> radius.
> >> > Try
> >> > 0.1 or 0.01.  Otherwise, your clusters overlap and the theoretical
> >> > guarantees go out the window.  Without the guarantees, it is hard to
> >> > interpret test results.  With small radii, and a modest number of
> >> > clusters,
> >> > what should happen is that the threshold in streaming k-means quickly
> >> > adapts
> >> > but stays << 1 which is the minimum distance between clusters.  That
> >> > guarantees that we will have at least 1 surrogate in each real
> cluster.
> >> >
> >> > Failure modes I can imagine could include:
> >> >
> >> > a) threshold gets very big and the number of surrogates drops to 1 due
> >> > to a
> >> > bug.
> >> >
> >> > b) unit test has exponentially many clusters (all corners = 2^d).
>  This
> >> > will
> >> > cause the threshold to be increased to 1 or larger and will cause us
> to
> >> > try
> >> > to cover many clusters with a single surrogate.
> >> >
> >> > c) something else (always possible)
> >> >
> >> >
> >> > On Wed, Dec 5, 2012 at 11:38 PM, Dan Filimon
> >> > <[email protected]>
> >> > wrote:
> >> >>
> >> >> Okay, please disregard the previous e-mail.
> >> >> That hypothesis is toast; clustering works just fine with ball
> k-means.
> >> >>
> >> >> So, the problem lies in streaming k-means somewhere.
> >> >>
> >> >> On Thu, Dec 6, 2012 at 12:06 AM, Dan Filimon
> >> >> <[email protected]> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > One of the most basic tests for streaming k-means (and k-means in
> >> >> > general) is whether it works well for points that are
> multi-normally
> >> >> > distributed around the vertices of a unit cube.
> >> >> >
> >> >> > So, for a cube, there'd be 8 vertices in 3d space. Generating
> >> >> > thousands of points should cluster them in those 8 clusters and
> they
> >> >> > should be relatively close to the means of these multinormal
> >> >> > distributions.
> >> >> >
> >> >> > I decided to generalize it to more than 3 dimensions, and see how
> it
> >> >> > works for hypercubes with n dimensions and 2^n vertices.
> >> >> >
> >> >> > Not well it turns out.
> >> >> >
> >> >> > The clusters become less balanced as the number of dimensions
> >> >> > increases.
> >> >> > I'm not sure if this is to be expected. I understand that in high
> >> >> > dimensional spaces, it becomes more likely for distances to be
> equal
> >> >> > and vectors to be orthogonal, but I'm seeing issues starting at 5
> >> >> > dimensions and this doesn't seem like a particularly high number of
> >> >> > dimension to me.
> >> >> >
> >> >> > Is this normal?
> >> >> > Should the hypercube no longer have all sides equal to 1? The
> >> >> > variance
> >> >> > of the multinormals is also 1.
> >> >> >
> >> >> > Thanks!
> >> >
> >> >
> >
> >
>

Reply via email to