Also, if only a single point is assigned to the cluster, the radius of the
cluster is, by definition, zero.

That isn't a bug.


On Wed, May 15, 2013 at 10:15 AM, Jeff Eastman <
[email protected]> wrote:

>  What you have observed is correct. During the final iteration, points are
> observed by each cluster and these observations are used to calculate the
> new cluster center and radius. As that center moves less than the
> convergence delta from the previous center, the iterations stop. During the
> subsequent classification phase, each point is assigned to its most likely
> cluster and this assignment may not always be to the same cluster due to
> this final cluster center movement.
>
> Decreasing the convergence delta and thus running more iterations may help
> to resolve this problem; however, there are situations with KMeans where
> the end state can oscillate between two or more very similar clusterings. I
> think the only way to predictably use a post-classification radius is to
> recalculate it at the end.
>
>
>
>
> On 5/14/13 2:19 PM, Erinn Schorsch wrote:
>
>  Thanks Jeff.****
>
> ** **
>
> After some additional investigation on our side, we find the
> math/std-deviation calculation to be correct, and that our data does have a
> radius of 0 (at KMeans Cluster Identification time)… all points were of the
> same value.****
>
> ** **
>
> The problem however is that we run the KMeans Classification process
> subsequently,… and it returns a set of vectors classified to the cluster in
> question, which have different values than the set from
> ClusterIdentification time. These points are not all the same value. The
> reason for this, is that at the end of ClusterIdentification, the
> center/radius are calculated… using this new center for the Classification
> run, the measurements for each vector vary from the last iteration of
> ClusterIdentification, and produce a different categorization of the data,…
> so the radius from the final iteration of ClusterIdentification does not
> represent the std-deviation of the classification results.****
>
> ** **
>
> ** **
>
> ** **
>
> *From:* Jeff Eastman 
> [mailto:[email protected]<[email protected]>]
>
> *Sent:* Tuesday, May 14, 2013 11:10 AM
> *To:* Erinn Schorsch
> *Cc:* [email protected]
> *Subject:* Re: Problems with KMeans Clustering - Radius calculation
> returns incorrect ZERO value in some cases.****
>
> ** **
>
> Hi Erinn,
>
> The radius calculation in KMeans and other clustering algorithms uses a
> running sums algorithm (see RunningSumsGaussianAccumulator) and the radius
> is really the standard deviation produced by this method. In this method
> (as you likely know) s0 is the number of points observed, s1 is the sum of
> those points and s2 is the sum of the squares of those points. This
> algorithm has some documented roundoff issues but your problem does not
> look like roundoff. You have not included the points in your example, but
> if they are all the same value for a cluster then I would expect their std
> and radius to be zero.
>
> Jeff
>
> ****
>
> On 5/9/13 9:17 PM, Erinn Schorsch wrote:****
>
> I am working on an application using mahout KMeans clustering and
> classification. We use Canopy clusters to seed KMeans although I don’t
> believe this to be relevant for this issue.****
>
>  ****
>
> Has anyone else experienced this issue? (details follow).  Does anyone
> have any insight on whether radius=0 will affect if KMeans convergence is
> arrived at? (and therefore drive premature convergence when miscalculated
> to be 0).****
>
>  ****
>
> We have discovered what appears to me a defect in how the radius value is
> calculated for the clusters that Mahout/Kmeans generates. Generally, we are
> expecting that when a cluster includes data points (observations) which
> vary from the center-point, then radius should be some non-zero value. Our
> understanding is that a bigger radius, means a larger range of values,… **
> **
>
>  ****
>
> We found cases where several data points where part of a cluster, however
> the radius is returned as 0. We use the radius to evaluate how
> usable/relevant each cluster is to our use case, so getting accurate radius
> is important in our case.****
>
>  ****
>
> We are using Mahout version 0.7****
>
>  ****
>
>  ****
>
> Details:****
>
>  ****
>
> When calculating the KMeans clusters, the radius is calculated by the
> method: AbstractCluster.computeParamters()****
>
>  ****
>
>   @Override****
>
>   public void computeParameters() {****
>
>     if (getS0() == 0) {****
>
>       return;****
>
>     }****
>
>     setNumObservations((long) getS0());****
>
>     setTotalObservations(getTotalObservations() + getNumObservations());**
> **
>
>     setCenter(getS1().divide(getS0()));****
>
>     // compute the component stds****
>
>     if (getS0() > 1) {****
>
> *
> setRadius(getS2().times(getS0()).minus(getS1().times(getS1())).assign(new
> SquareRootFunction()).divide(getS0()));*****
>
>     }****
>
>     setS0(0);****
>
>     setS1(center.like());****
>
>     setS2(center.like());****
>
>   }****
>
>  ****
>
> The important bit:*
> setRadius(getS2().times(getS0()).minus(getS1().times(getS1())).assign(new
> SquareRootFunction()).divide(getS0()));*****
>
>  ****
>
> Or, simplified/paraphrased:  (S2 * S0) minus (S1 * S1)… then divide using
> a SquareRootFunction… (this last I don’t think affects our scenario).****
>
>  ****
>
>  ****
>
> Data in our case:****
>
>  ****
>
> S0=6.0****
>
> S1={1:150.0,0:54.0}****
>
> S2={1:3750.0,0:486.0}****
>
>  ****
>
> getS2().times(getS0())={1:22500.0,0:2916.0}****
>
> getS1().times(getS1())={1:22500.0,0:2916.0}****
>
>  ****
>
> And follows:  ({1:22500.0,0:2916.0}).minus({1:22500.0,0:2916.0}) ={}****
>
>  ****
>
> … Thus, we have a ZERO value for radius on this point/cluster…. And
> clearly this should not be the case. S1 and S2 represent 2 of the 6
> “observations” (S0 is number of observations)… and S1 and S2 are different
> data points,… so radius must be non-zero?****
>
>  ****
>
> Is this a defect? Is there a flaw in our understanding/expectation of
> radius?****
>
>  ****
>
> It seems the algorithm for radius here equates basically to: (A * B) – ( C
> * C)****
>
> It is easy to imagine many mathematical combinations where this equation
> will compute to ZERO. Can this be correct behavior?****
>
>  ****
>
> Thanks for any thoughts / input!****
>
> ** **
>
>
>

Reply via email to