Thanks Jeff.

After some additional investigation on our side, we find the math/std-deviation 
calculation to be correct, and that our data does have a radius of 0 (at KMeans 
Cluster Identification time)... all points were of the same value.

The problem however is that we run the KMeans Classification process 
subsequently,... and it returns a set of vectors classified to the cluster in 
question, which have different values than the set from ClusterIdentification 
time. These points are not all the same value. The reason for this, is that at 
the end of ClusterIdentification, the center/radius are calculated... using 
this new center for the Classification run, the measurements for each vector 
vary from the last iteration of ClusterIdentification, and produce a different 
categorization of the data,... so the radius from the final iteration of 
ClusterIdentification does not represent the std-deviation of the 
classification results.



From: Jeff Eastman [mailto:[email protected]]
Sent: Tuesday, May 14, 2013 11:10 AM
To: Erinn Schorsch
Cc: [email protected]
Subject: Re: Problems with KMeans Clustering - Radius calculation returns 
incorrect ZERO value in some cases.

Hi Erinn,

The radius calculation in KMeans and other clustering algorithms uses a running 
sums algorithm (see RunningSumsGaussianAccumulator) and the radius is really 
the standard deviation produced by this method. In this method (as you likely 
know) s0 is the number of points observed, s1 is the sum of those points and s2 
is the sum of the squares of those points. This algorithm has some documented 
roundoff issues but your problem does not look like roundoff. You have not 
included the points in your example, but if they are all the same value for a 
cluster then I would expect their std and radius to be zero.

Jeff

On 5/9/13 9:17 PM, Erinn Schorsch wrote:
I am working on an application using mahout KMeans clustering and 
classification. We use Canopy clusters to seed KMeans although I don't believe 
this to be relevant for this issue.

Has anyone else experienced this issue? (details follow).  Does anyone have any 
insight on whether radius=0 will affect if KMeans convergence is arrived at? 
(and therefore drive premature convergence when miscalculated to be 0).

We have discovered what appears to me a defect in how the radius value is 
calculated for the clusters that Mahout/Kmeans generates. Generally, we are 
expecting that when a cluster includes data points (observations) which vary 
from the center-point, then radius should be some non-zero value. Our 
understanding is that a bigger radius, means a larger range of values,...

We found cases where several data points where part of a cluster, however the 
radius is returned as 0. We use the radius to evaluate how usable/relevant each 
cluster is to our use case, so getting accurate radius is important in our case.

We are using Mahout version 0.7


Details:

When calculating the KMeans clusters, the radius is calculated by the method: 
AbstractCluster.computeParamters()

  @Override
  public void computeParameters() {
    if (getS0() == 0) {
      return;
    }
    setNumObservations((long) getS0());
    setTotalObservations(getTotalObservations() + getNumObservations());
    setCenter(getS1().divide(getS0()));
    // compute the component stds
    if (getS0() > 1) {
      setRadius(getS2().times(getS0()).minus(getS1().times(getS1())).assign(new 
SquareRootFunction()).divide(getS0()));
    }
    setS0(0);
    setS1(center.like());
    setS2(center.like());
  }

The important bit:      
setRadius(getS2().times(getS0()).minus(getS1().times(getS1())).assign(new 
SquareRootFunction()).divide(getS0()));

Or, simplified/paraphrased:  (S2 * S0) minus (S1 * S1)... then divide using  a 
SquareRootFunction... (this last I don't think affects our scenario).


Data in our case:

S0=6.0
S1={1:150.0,0:54.0}
S2={1:3750.0,0:486.0}

getS2().times(getS0())={1:22500.0,0:2916.0}
getS1().times(getS1())={1:22500.0,0:2916.0}

And follows:  ({1:22500.0,0:2916.0}).minus({1:22500.0,0:2916.0}) ={}

... Thus, we have a ZERO value for radius on this point/cluster.... And clearly 
this should not be the case. S1 and S2 represent 2 of the 6 "observations" (S0 
is number of observations)... and S1 and S2 are different data points,... so 
radius must be non-zero?

Is this a defect? Is there a flaw in our understanding/expectation of radius?

It seems the algorithm for radius here equates basically to: (A * B) - ( C * C)
It is easy to imagine many mathematical combinations where this equation will 
compute to ZERO. Can this be correct behavior?

Thanks for any thoughts / input!

Reply via email to