Re: Problems with KMeans Clustering - Radius calculation returns incorrect ZERO value in some cases.

Jeff Eastman Wed, 15 May 2013 10:16:04 -0700

What you have observed is correct. During the final iteration, points are observed by each cluster and these observations are used to calculate the new cluster center and radius. As that center moves less than the convergence delta from the previous center, the iterations stop. During the subsequent classification phase, each point is assigned to its most likely cluster and this assignment may not always be to the same cluster due to this final cluster center movement.
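
To make the effect concrete, here is a minimal 1-D sketch. It is not Mahout code: the class name, the two centers, the point and the convergence delta are all made-up values chosen only to show how a center movement smaller than the delta can still flip the assignment of a point near the boundary.

```java
/** Hypothetical 1-D illustration, not Mahout code. */
final class AssignmentFlipSketch {

  /** Index of the center closest to x. */
  static int nearest(double x, double[] centers) {
    int best = 0;
    for (int c = 1; c < centers.length; c++) {
      if (Math.abs(x - centers[c]) < Math.abs(x - centers[best])) {
        best = c;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    double convergenceDelta = 0.5;   // assumed value for illustration
    double[] before = {0.0, 10.0};   // centers the points were observed against
    double[] after = {0.0, 10.25};   // centers computed from those observations
    double x = 5.0625;               // a point close to the cluster boundary

    // The second center moved by 0.25, which is below the delta, so iteration stops.
    boolean converged = Math.abs(after[1] - before[1]) < convergenceDelta;
    System.out.println("converged: " + converged);
    System.out.println("cluster in last iteration: " + nearest(x, before));
    System.out.println("cluster in classification: " + nearest(x, after));
  }
}
```

With these values the point is observed by cluster 1 in the last iteration but classified to cluster 0 afterwards, so the radius stored for either cluster no longer describes the points finally assigned to it.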

Decreasing the convergence delta and thus running more iterations may help to resolve this problem; however, there are situations with KMeans where the end state can oscillate between two or more very similar clusterings. I think the only way to predictably use a post-classification radius is to recalculate it at the end.
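
One way such a recalculation could look is sketched below. This is plain Java over arrays, not the Mahout API: the class and method names are hypothetical, nearest-center assignment by squared Euclidean distance is an assumption standing in for whatever distance measure the classification step actually uses, and the radius is taken as the per-component population standard deviation of the points finally assigned to each cluster (measured around their own mean, which may differ slightly from the stored center).

```java
import java.util.Arrays;

/** Hypothetical helper, not Mahout API: recompute radii from final assignments. */
final class PostClassificationRadius {

  /** Index of the center nearest to p by squared Euclidean distance (assumed measure). */
  static int nearest(double[] p, double[][] centers) {
    int best = 0;
    double bestDist = Double.POSITIVE_INFINITY;
    for (int c = 0; c < centers.length; c++) {
      double d = 0;
      for (int i = 0; i < p.length; i++) {
        double diff = p[i] - centers[c][i];
        d += diff * diff;
      }
      if (d < bestDist) {
        bestDist = d;
        best = c;
      }
    }
    return best;
  }

  /** radii[c][i] = std of component i over the points assigned to cluster c. */
  static double[][] radii(double[][] points, double[][] centers) {
    int k = centers.length;
    int dims = centers[0].length;
    double[] s0 = new double[k];
    double[][] s1 = new double[k][dims];
    double[][] s2 = new double[k][dims];
    for (double[] p : points) {
      int c = nearest(p, centers);
      s0[c] += 1;
      for (int i = 0; i < dims; i++) {
        s1[c][i] += p[i];
        s2[c][i] += p[i] * p[i];
      }
    }
    double[][] radii = new double[k][dims];
    for (int c = 0; c < k; c++) {
      if (s0[c] < 2) {
        continue; // radius stays 0 for empty or single-point clusters
      }
      for (int i = 0; i < dims; i++) {
        // Clamp at zero because roundoff can make the difference slightly negative.
        double v = Math.max(0.0, s0[c] * s2[c][i] - s1[c][i] * s1[c][i]);
        radii[c][i] = Math.sqrt(v) / s0[c];
      }
    }
    return radii;
  }

  public static void main(String[] args) {
    double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
    double[][] points = {{-1.0, 0.0}, {1.0, 0.0}, {10.0, 10.0}, {10.0, 10.0}};
    System.out.println(Arrays.deepToString(radii(points, centers)));
  }
}
```

In this toy input the first cluster gets a radius of 1 in its first component, while the second cluster, whose two points are identical, gets a radius of zero in both components.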





On 5/14/13 2:19 PM, Erinn Schorsch wrote:

Thanks Jeff.
After some additional investigation on our side, we find the math/std-deviation calculation to be correct, and that our data does have a radius of 0 (at KMeans Cluster Identification time)... all points were of the same value.
The problem, however, is that we subsequently run the KMeans Classification process, and it returns a set of vectors classified to the cluster in question which have different values than the set from ClusterIdentification time. These points are not all the same value. The reason is that the center and radius are calculated at the end of ClusterIdentification. Using this new center for the Classification run, the measurements for each vector vary from the last iteration of ClusterIdentification and produce a different categorization of the data, so the radius from the final iteration of ClusterIdentification does not represent the std-deviation of the classification results.
*From:* Jeff Eastman [mailto:[email protected]]
*Sent:* Tuesday, May 14, 2013 11:10 AM
*To:* Erinn Schorsch
*Cc:* [email protected]
*Subject:* Re: Problems with KMeans Clustering - Radius calculation returns incorrect ZERO value in some cases.
Hi Erinn,
The radius calculation in KMeans and other clustering algorithms uses a running sums algorithm (see RunningSumsGaussianAccumulator) and the radius is really the standard deviation produced by this method. In this method (as you likely know) s0 is the number of points observed, s1 is the sum of those points and s2 is the sum of the squares of those points. This algorithm has some documented roundoff issues but your problem does not look like roundoff. You have not included the points in your example, but if they are all the same value for a cluster then I would expect their std and radius to be zero.
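
As an illustration, the running sums calculation can be sketched in plain Java. This is not the Mahout implementation: the class and method names are hypothetical and arrays stand in for Mahout vectors. It computes sqrt(s0*s2 - s1*s1) / s0 per component, which equals the population standard deviation because s2/s0 - (s1/s0)^2 is the variance. The six identical points (9, 25) used in main are an assumption inferred from the sums quoted in this thread; they reproduce s0 = 6, s1 = {54, 150} and s2 = {486, 3750}.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Mahout: per-component std from running sums. */
final class RunningSumsSketch {

  /** Returns sqrt(s0*s2 - s1*s1) / s0 for each component of the given points. */
  static double[] std(double[][] points) {
    int dims = points[0].length;
    double s0 = 0;                    // number of points observed
    double[] s1 = new double[dims];   // component-wise sum of the points
    double[] s2 = new double[dims];   // component-wise sum of the squares
    for (double[] p : points) {
      s0 += 1;
      for (int i = 0; i < dims; i++) {
        s1[i] += p[i];
        s2[i] += p[i] * p[i];
      }
    }
    double[] std = new double[dims];
    for (int i = 0; i < dims; i++) {
      // Clamp at zero because roundoff can make the difference slightly negative.
      double v = Math.max(0.0, s0 * s2[i] - s1[i] * s1[i]);
      std[i] = Math.sqrt(v) / s0;
    }
    return std;
  }

  public static void main(String[] args) {
    // Six identical points: every component has zero spread.
    double[][] same = new double[6][];
    Arrays.fill(same, new double[] {9.0, 25.0});
    System.out.println(Arrays.toString(std(same)));

    // Two points that differ only in the first component.
    double[][] varied = {{8.0, 25.0}, {10.0, 25.0}};
    System.out.println(Arrays.toString(std(varied)));
  }
}
```

For the identical points, s0*s2 and s1*s1 are equal in every component, so the result is zero, matching the numbers in the original report. The second input gives a non-zero value only in the component where the points actually differ.
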
Jeff

On 5/9/13 9:17 PM, Erinn Schorsch wrote:

    I am working on an application using mahout KMeans clustering and
    classification. We use Canopy clusters to seed KMeans although I
    don't believe this to be relevant for this issue.

    Has anyone else experienced this issue (details follow)? Does
    anyone have any insight on whether radius=0 affects whether KMeans
    convergence is reached (and could therefore drive premature
    convergence when miscalculated as 0)?

    We have discovered what appears to be a defect in how the radius
    value is calculated for the clusters that Mahout/Kmeans generates.
    Generally, we are expecting that when a cluster includes data
    points (observations) which vary from the center-point, then
    radius should be some non-zero value. Our understanding is that a
    bigger radius means a larger range of values.

    We found cases where several data points were part of a cluster,
    however the radius is returned as 0. We use the radius to evaluate
    how usable/relevant each cluster is to our use case, so getting
    accurate radius is important in our case.

    We are using Mahout version 0.7

    Details:

    When calculating the KMeans clusters, the radius is calculated by
    the method: AbstractCluster.computeParameters()

      @Override
      public void computeParameters() {
        if (getS0() == 0) {
          return;
        }
        setNumObservations((long) getS0());
        setTotalObservations(getTotalObservations() + getNumObservations());
        setCenter(getS1().divide(getS0()));
        // compute the component stds
        if (getS0() > 1) {
          setRadius(getS2().times(getS0()).minus(getS1().times(getS1()))
              .assign(new SquareRootFunction()).divide(getS0()));
        }
        setS0(0);
        setS1(center.like());
        setS2(center.like());
      }

    The important bit:

      setRadius(getS2().times(getS0()).minus(getS1().times(getS1()))
          .assign(new SquareRootFunction()).divide(getS0()));

    Or, simplified/paraphrased:  (S2 * S0) minus (S1 * S1)... then
    apply a SquareRootFunction and divide by S0... (this last I don't
    think affects our scenario).

    Data in our case:

    S0=6.0

    S1={1:150.0,0:54.0}

    S2={1:3750.0,0:486.0}

    getS2().times(getS0())={1:22500.0,0:2916.0}

    getS1().times(getS1())={1:22500.0,0:2916.0}

    And follows:  ({1:22500.0,0:2916.0}).minus({1:22500.0,0:2916.0}) ={}

    ... Thus, we have a ZERO value for radius on this
    point/cluster.... And clearly this should not be the case. S1 and
    S2 represent 2 of the 6 "observations" (S0 is number of
    observations)... and S1 and S2 are different data points,... so
    radius must be non-zero?

    Is this a defect? Is there a flaw in our understanding/expectation
    of radius?

    It seems the algorithm for radius here equates basically to:
    (A * B) - (C * C)

    It is easy to imagine many mathematical combinations where this
    equation will compute to ZERO. Can this be correct behavior?

    Thanks for any thoughts / input!
