IN order to succeed here, SKM will need to have maxClusters set to 20,000 or so.
The maximum distance between clusters on a 10d hypercube is sqrt(10) = 3.1 or so. If three clusters get smashed together, then you have a threshold of 1.4 or so. On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon <[email protected]>wrote: > I wanted there to be 2^d clusters. I was wrong and didn't check: the > radius is in fact 0.01. > > What's happening is that for 10 dimension, I was expecting ~1024 > clusters (or at least have small distances) but StreamingKMeans fails > on both accounts. > BallKMeans does in fact get the clusters. > > So, yes, it's probably a bug of some kind since I end up with anywhere > between 400 and 1000 clusters (based on the searcher used) but the > distances are still wrong. > > Here's how many clusters I get and the searchers I get them with [1]. > As you can see, the number of clusters is all over the place. > > The distance too is also super huge. The assert said that all > distances should be less than 0.05. > Here is where it fails [2]. > And here is the corresponding GitHub issue (no info yet) [3]. > > [1] https://gist.github.com/4220406 > [2] > https://github.com/dfilimon/knn/blob/d224eb7ca7bd6870eaef2e355012cac3aa59f051/src/test/java/org/apache/mahout/knn/cluster/StreamingKMeansTest.java#L104 > [3] https://github.com/dfilimon/knn/issues/1 > > On Thu, Dec 6, 2012 at 1:03 AM, Ted Dunning <[email protected]> wrote: > > How many clusters are you talking about? > > > > If you pick a modest number then streaming k-means should work well if it > > has several times more surrogate points than there are clusters. > > > > Also, typically a hyper-cube test works with very small cluster radius. > Try > > 0.1 or 0.01. Otherwise, your clusters overlap and the theoretical > > guarantees go out the window. Without the guarantees, it is hard to > > interpret test results. With small radii, and a modest number of > clusters, > > what should happen is that the threshold in streaming k-means quickly > adapts > > but stays << 1 which is the minimum distance between clusters. That > > guarantees that we will have at least 1 surrogate in each real cluster. > > > > Failure modes I can imagine could include: > > > > a) threshold gets very big and the number of surrogates drops to 1 due > to a > > bug. > > > > b) unit test has exponentially many clusters (all corners = 2^d). This > will > > cause the threshold to be increased to 1 or larger and will cause us to > try > > to cover many clusters with a single surrogate. > > > > c) something else (always possible) > > > > > > On Wed, Dec 5, 2012 at 11:38 PM, Dan Filimon < > [email protected]> > > wrote: > >> > >> Okay, please disregard the previous e-mail. > >> That hypothesis is toast; clustering works just fine with ball k-means. > >> > >> So, the problem lies in streaming k-means somewhere. > >> > >> On Thu, Dec 6, 2012 at 12:06 AM, Dan Filimon > >> <[email protected]> wrote: > >> > Hi, > >> > > >> > One of the most basic tests for streaming k-means (and k-means in > >> > general) is whether it works well for points that are multi-normally > >> > distributed around the vertices of a unit cube. > >> > > >> > So, for a cube, there'd be 8 vertices in 3d space. Generating > >> > thousands of points should cluster them in those 8 clusters and they > >> > should be relatively close to the means of these multinormal > >> > distributions. > >> > > >> > I decided to generalize it to more than 3 dimensions, and see how it > >> > works for hypercubes with n dimensions and 2^n vertices. > >> > > >> > Not well it turns out. > >> > > >> > The clusters become less balanced as the number of dimensions > increases. > >> > I'm not sure if this is to be expected. I understand that in high > >> > dimensional spaces, it becomes more likely for distances to be equal > >> > and vectors to be orthogonal, but I'm seeing issues starting at 5 > >> > dimensions and this doesn't seem like a particularly high number of > >> > dimension to me. > >> > > >> > Is this normal? > >> > Should the hypercube no longer have all sides equal to 1? The variance > >> > of the multinormals is also 1. > >> > > >> > Thanks! > > > > >
