Hi Erinn,
The radius calculation in KMeans and other clustering algorithms uses a
running sums algorithm (see RunningSumsGaussianAccumulator) and the
radius is really the standard deviation produced by this method. In this
method (as you likely know) s0 is the number of points observed, s1 is
the sum of those points and s2 is the sum of the squares of those
points. This algorithm has some documented roundoff issues but your
problem does not look like roundoff. You have not included the points in
your example, but if they are all the same value for a cluster then I
would expect their std and radius to be zero.
Jeff
On 5/9/13 9:17 PM, Erinn Schorsch wrote:
I am working on an application using mahout KMeans clustering and
classification. We use Canopy clusters to seed KMeans although I don't
believe this to be relevant for this issue.
Has anyone else experienced this issue? (details follow). Does anyone
have any insight on whether radius=0 will affect if KMeans convergence
is arrived at? (and therefore drive premature convergence when
miscalculated to be 0).
We have discovered what appears to me a defect in how the radius value
is calculated for the clusters that Mahout/Kmeans generates.
Generally, we are expecting that when a cluster includes data points
(observations) which vary from the center-point, then radius should be
some non-zero value. Our understanding is that a bigger radius, means
a larger range of values,...
We found cases where several data points where part of a cluster,
however the radius is returned as 0. We use the radius to evaluate how
usable/relevant each cluster is to our use case, so getting accurate
radius is important in our case.
We are using Mahout version 0.7
Details:
When calculating the KMeans clusters, the radius is calculated by the
method: AbstractCluster.computeParamters()
@Override
public void computeParameters() {
if (getS0() == 0) {
return;
}
setNumObservations((long) getS0());
setTotalObservations(getTotalObservations() + getNumObservations());
setCenter(getS1().divide(getS0()));
// compute the component stds
if (getS0() > 1) {
*setRadius(getS2().times(getS0()).minus(getS1().times(getS1())).assign(new
SquareRootFunction()).divide(getS0()));*
}
setS0(0);
setS1(center.like());
setS2(center.like());
}
The important
bit:*setRadius(getS2().times(getS0()).minus(getS1().times(getS1())).assign(new
SquareRootFunction()).divide(getS0()));*
Or, simplified/paraphrased: (S2 * S0) minus (S1 * S1)... then divide
using a SquareRootFunction... (this last I don't think affects our
scenario).
Data in our case:
S0=6.0
S1={1:150.0,0:54.0}
S2={1:3750.0,0:486.0}
getS2().times(getS0())={1:22500.0,0:2916.0}
getS1().times(getS1())={1:22500.0,0:2916.0}
And follows: ({1:22500.0,0:2916.0}).minus({1:22500.0,0:2916.0}) ={}
... Thus, we have a ZERO value for radius on this point/cluster....
And clearly this should not be the case. S1 and S2 represent 2 of the
6 "observations" (S0 is number of observations)... and S1 and S2 are
different data points,... so radius must be non-zero?
Is this a defect? Is there a flaw in our understanding/expectation of
radius?
It seems the algorithm for radius here equates basically to: (A * B)
-- ( C * C)
It is easy to imagine many mathematical combinations where this
equation will compute to ZERO. Can this be correct behavior?
Thanks for any thoughts / input!