Re: Problems with KMeans Clustering - Radius calculation returns incorrect ZERO value in some cases.

Jeff Eastman Tue, 14 May 2013 10:10:31 -0700

Hi Erinn,

The radius calculation in KMeans and other clustering algorithms uses arunning sums algorithm (see RunningSumsGaussianAccumulator) and theradius is really the standard deviation produced by this method. In thismethod (as you likely know) s0 is the number of points observed, s1 isthe sum of those points and s2 is the sum of the squares of thosepoints. This algorithm has some documented roundoff issues but yourproblem does not look like roundoff. You have not included the points inyour example, but if they are all the same value for a cluster then Iwould expect their std and radius to be zero.


Jeff


On 5/9/13 9:17 PM, Erinn Schorsch wrote:

I am working on an application using mahout KMeans clustering andclassification. We use Canopy clusters to seed KMeans although I don'tbelieve this to be relevant for this issue.
Has anyone else experienced this issue? (details follow). Does anyonehave any insight on whether radius=0 will affect if KMeans convergenceis arrived at? (and therefore drive premature convergence whenmiscalculated to be 0).
We have discovered what appears to me a defect in how the radius valueis calculated for the clusters that Mahout/Kmeans generates.Generally, we are expecting that when a cluster includes data points(observations) which vary from the center-point, then radius should besome non-zero value. Our understanding is that a bigger radius, meansa larger range of values,...
We found cases where several data points where part of a cluster,however the radius is returned as 0. We use the radius to evaluate howusable/relevant each cluster is to our use case, so getting accurateradius is important in our case.
We are using Mahout version 0.7

Details:
When calculating the KMeans clusters, the radius is calculated by themethod: AbstractCluster.computeParamters()
  @Override

  public void computeParameters() {

    if (getS0() == 0) {

      return;

    }

    setNumObservations((long) getS0());

setTotalObservations(getTotalObservations() + getNumObservations());

setCenter(getS1().divide(getS0()));

    // compute the component stds

    if (getS0() > 1) {
*setRadius(getS2().times(getS0()).minus(getS1().times(getS1())).assign(newSquareRootFunction()).divide(getS0()));*
    }

    setS0(0);

    setS1(center.like());

    setS2(center.like());

  }
The importantbit:*setRadius(getS2().times(getS0()).minus(getS1().times(getS1())).assign(newSquareRootFunction()).divide(getS0()));*
Or, simplified/paraphrased: (S2 * S0) minus (S1 * S1)... then divideusing a SquareRootFunction... (this last I don't think affects ourscenario).
Data in our case:

S0=6.0

S1={1:150.0,0:54.0}

S2={1:3750.0,0:486.0}

getS2().times(getS0())={1:22500.0,0:2916.0}

getS1().times(getS1())={1:22500.0,0:2916.0}

And follows:  ({1:22500.0,0:2916.0}).minus({1:22500.0,0:2916.0}) ={}
... Thus, we have a ZERO value for radius on this point/cluster....And clearly this should not be the case. S1 and S2 represent 2 of the6 "observations" (S0 is number of observations)... and S1 and S2 aredifferent data points,... so radius must be non-zero?
Is this a defect? Is there a flaw in our understanding/expectation ofradius?
It seems the algorithm for radius here equates basically to: (A * B)-- ( C * C)
It is easy to imagine many mathematical combinations where thisequation will compute to ZERO. Can this be correct behavior?
Thanks for any thoughts / input!

Re: Problems with KMeans Clustering - Radius calculation returns incorrect ZERO value in some cases.

Reply via email to