But the weight referred to is the distance between a centroid and the
mean of a distribution (a cube vertex).
This should still be very small (and BallKMeans does get it right).
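
To be concrete, the check I have in mind is roughly this (a sketch, not
the actual test code; the centroid is just a double[] here):

  // Distance from a centroid to the nearest hypercube vertex, found by
  // rounding each coordinate to 0 or 1.  With a cluster radius of 0.01,
  // this should come out well below 0.05.
  static double distanceToNearestVertex(double[] centroid) {
    double sumOfSquares = 0;
    for (double x : centroid) {
      double nearest = x < 0.5 ? 0 : 1;
      sumOfSquares += (x - nearest) * (x - nearest);
    }
    return Math.sqrt(sumOfSquares);
  }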

On Thu, Dec 6, 2012 at 1:32 AM, Ted Dunning <[email protected]> wrote:
> In order to succeed here, SKM will need to have maxClusters set to 20,000 or
> so.
>
> The maximum distance between clusters on a 10d hypercube is sqrt(10) = 3.1
> or so.  If three clusters get smashed together, then you have a threshold of
> 1.4 or so.
>
>
> On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon <[email protected]>
> wrote:
>>
>> I wanted there to be 2^d clusters. I was wrong and didn't check: the
>> radius is in fact 0.01.
>>
>> What's happening is that for 10 dimensions, I was expecting ~1024
>> clusters (or at least small distances), but StreamingKMeans fails
>> on both counts.
>> BallKMeans does in fact get the clusters.
>>
>> So, yes, it's probably a bug of some kind since I end up with anywhere
>> between 400 and 1000 clusters (depending on the searcher used), but the
>> distances are still wrong.
>>
>> Here's how many clusters I get and the searchers I get them with [1].
>> As you can see, the number of clusters is all over the place.
>>
>> The distances are also way too large. The assert says that all
>> distances should be less than 0.05.
>> Here is where it fails [2].
>> And here is the corresponding GitHub issue (no info yet) [3].
>>
>> [1] https://gist.github.com/4220406
>> [2]
>> https://github.com/dfilimon/knn/blob/d224eb7ca7bd6870eaef2e355012cac3aa59f051/src/test/java/org/apache/mahout/knn/cluster/StreamingKMeansTest.java#L104
>> [3] https://github.com/dfilimon/knn/issues/1
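>>
>> For reference, the data I generate looks roughly like this (a sketch of
>> the setup, not the actual test code in [2]; the class and method names
>> are made up):
>>
>>   import java.util.ArrayList;
>>   import java.util.List;
>>   import java.util.Random;
>>
>>   class CubeData {
>>     static final double RADIUS = 0.01;
>>
>>     // numPoints points, each sampled around a random corner of the unit
>>     // d-dimensional hypercube with N(0, RADIUS^2) noise per coordinate.
>>     static List<double[]> generate(int d, int numPoints, Random random) {
>>       List<double[]> points = new ArrayList<double[]>();
>>       for (int i = 0; i < numPoints; i++) {
>>         int corner = random.nextInt(1 << d);   // one of the 2^d vertices
>>         double[] point = new double[d];
>>         for (int j = 0; j < d; j++) {
>>           point[j] = ((corner >> j) & 1) + RADIUS * random.nextGaussian();
>>         }
>>         points.add(point);
>>       }
>>       return points;
>>     }
>>   }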
>>
>> On Thu, Dec 6, 2012 at 1:03 AM, Ted Dunning <[email protected]> wrote:
>> > How many clusters are you talking about?
>> >
>> > If you pick a modest number then streaming k-means should work well if
>> > it has several times more surrogate points than there are clusters.
>> >
>> > Also, typically a hyper-cube test works with a very small cluster radius.
>> > Try 0.1 or 0.01.  Otherwise, your clusters overlap and the theoretical
>> > guarantees go out the window.  Without the guarantees, it is hard to
>> > interpret test results.  With small radii and a modest number of clusters,
>> > what should happen is that the threshold in streaming k-means quickly
>> > adapts but stays << 1, which is the minimum distance between clusters.
>> > That guarantees that we will have at least 1 surrogate in each real
>> > cluster.
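>> >
>> > To make that concrete, here is a toy version of the mechanism I mean
>> > (plain Java on double[], not the actual StreamingKMeans code; the
>> > structure and the 1.5 growth factor are made up for illustration, and
>> > the real probabilistic assignment and centroid updates are left out):
>> >
>> >   import java.util.ArrayList;
>> >   import java.util.List;
>> >
>> >   class ThresholdSketch {
>> >     // A point farther than the cutoff from every surrogate starts a new
>> >     // surrogate; when the surrogate count exceeds the cap, the cutoff
>> >     // grows and nearby surrogates are merged.
>> >     static List<double[]> cluster(List<double[]> points, int maxSurrogates) {
>> >       List<double[]> surrogates = new ArrayList<double[]>();
>> >       double cutoff = 1e-6;
>> >       for (double[] point : points) {
>> >         double best = Double.MAX_VALUE;
>> >         for (double[] s : surrogates) {
>> >           best = Math.min(best, distance(s, point));
>> >         }
>> >         if (best > cutoff) {
>> >           surrogates.add(point.clone());
>> >         }
>> >         if (surrogates.size() > maxSurrogates) {
>> >           cutoff *= 1.5;
>> >           surrogates = collapse(surrogates, cutoff);
>> >         }
>> >       }
>> >       return surrogates;
>> >     }
>> >
>> >     // Greedy collapse: keep a surrogate only if it is farther than the
>> >     // cutoff from every surrogate already kept.
>> >     static List<double[]> collapse(List<double[]> surrogates, double cutoff) {
>> >       List<double[]> kept = new ArrayList<double[]>();
>> >       for (double[] s : surrogates) {
>> >         boolean covered = false;
>> >         for (double[] k : kept) {
>> >           if (distance(k, s) <= cutoff) {
>> >             covered = true;
>> >             break;
>> >           }
>> >         }
>> >         if (!covered) {
>> >           kept.add(s);
>> >         }
>> >       }
>> >       return kept;
>> >     }
>> >
>> >     static double distance(double[] a, double[] b) {
>> >       double sum = 0;
>> >       for (int i = 0; i < a.length; i++) {
>> >         sum += (a[i] - b[i]) * (a[i] - b[i]);
>> >       }
>> >       return Math.sqrt(sum);
>> >     }
>> >   }
>> >
>> > With maxSurrogates several times 2^d and a radius of 0.01, the cutoff
>> > should settle well below 1; if it is forced past 1, corners start
>> > sharing surrogates.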
>> >
>> > Failure modes I can imagine could include:
>> >
>> > a) threshold gets very big and the number of surrogates drops to 1 due to
>> > a bug.
>> >
>> > b) unit test has exponentially many clusters (all corners = 2^d).  This
>> > will cause the threshold to be increased to 1 or larger and will cause
>> > us to try to cover many clusters with a single surrogate.
>> >
>> > c) something else (always possible)
>> >
>> >
>> > On Wed, Dec 5, 2012 at 11:38 PM, Dan Filimon
>> > <[email protected]>
>> > wrote:
>> >>
>> >> Okay, please disregard the previous e-mail.
>> >> That hypothesis is toast; clustering works just fine with ball k-means.
>> >>
>> >> So, the problem lies in streaming k-means somewhere.
>> >>
>> >> On Thu, Dec 6, 2012 at 12:06 AM, Dan Filimon
>> >> <[email protected]> wrote:
>> >> > Hi,
>> >> >
>> >> > One of the most basic tests for streaming k-means (and k-means in
>> >> > general) is whether it works well for points that are multi-normally
>> >> > distributed around the vertices of a unit cube.
>> >> >
>> >> > So, for a cube, there'd be 8 vertices in 3D space. Generating
>> >> > thousands of points around them and then clustering should produce
>> >> > those 8 clusters, with centroids relatively close to the means of
>> >> > the multinormal distributions.
>> >> >
>> >> > I decided to generalize it to more than 3 dimensions, and see how it
>> >> > works for hypercubes with n dimensions and 2^n vertices.
>> >> >
>> >> > Not well, it turns out.
>> >> >
>> >> > The clusters become less balanced as the number of dimensions
>> >> > increases.
>> >> > I'm not sure if this is to be expected. I understand that in
>> >> > high-dimensional spaces distances tend to become more nearly equal
>> >> > and vectors more nearly orthogonal, but I'm seeing issues starting
>> >> > at 5 dimensions, and that doesn't seem like a particularly high
>> >> > number of dimensions to me.
>> >> >
>> >> > Is this normal?
>> >> > Should the hypercube no longer have all sides equal to 1? The
>> >> > variance of the multinormals is also 1.
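>> >> > (With variance 1 per coordinate, a sample lands roughly sqrt(d) away
>> >> > from its own vertex on average, which is about the length of the
>> >> > whole cube diagonal, so maybe the corners just blur together?)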
>> >> >
>> >> > Thanks!
>> >
>> >
>
>
