Also, if only a single point is assigned to the cluster, the radius of the cluster is, by definition, zero.
That isn't a bug. On Wed, May 15, 2013 at 10:15 AM, Jeff Eastman < [email protected]> wrote: > What you have observed is correct. During the final iteration, points are > observed by each cluster and these observations are used to calculate the > new cluster center and radius. As that center moves less than the > convergence delta from the previous center, the iterations stop. During the > subsequent classification phase, each point is assigned to its most likely > cluster and this assignment may not always be to the same cluster due to > this final cluster center movement. > > Decreasing the convergence delta and thus running more iterations may help > to resolve this problem; however, there are situations with KMeans where > the end state can oscillate between two or more very similar clusterings. I > think the only way to predictably use a post-classification radius is to > recalculate it at the end. > > > > > On 5/14/13 2:19 PM, Erinn Schorsch wrote: > > Thanks Jeff.**** > > ** ** > > After some additional investigation on our side, we find the > math/std-deviation calculation to be correct, and that our data does have a > radius of 0 (at KMeans Cluster Identification time)… all points were of the > same value.**** > > ** ** > > The problem however is that we run the KMeans Classification process > subsequently,… and it returns a set of vectors classified to the cluster in > question, which have different values than the set from > ClusterIdentification time. These points are not all the same value. The > reason for this, is that at the end of ClusterIdentification, the > center/radius are calculated… using this new center for the Classification > run, the measurements for each vector vary from the last iteration of > ClusterIdentification, and produce a different categorization of the data,… > so the radius from the final iteration of ClusterIdentification does not > represent the std-deviation of the classification results.**** > > ** ** > > ** ** > > ** ** > > *From:* Jeff Eastman > [mailto:[email protected]<[email protected]>] > > *Sent:* Tuesday, May 14, 2013 11:10 AM > *To:* Erinn Schorsch > *Cc:* [email protected] > *Subject:* Re: Problems with KMeans Clustering - Radius calculation > returns incorrect ZERO value in some cases.**** > > ** ** > > Hi Erinn, > > The radius calculation in KMeans and other clustering algorithms uses a > running sums algorithm (see RunningSumsGaussianAccumulator) and the radius > is really the standard deviation produced by this method. In this method > (as you likely know) s0 is the number of points observed, s1 is the sum of > those points and s2 is the sum of the squares of those points. This > algorithm has some documented roundoff issues but your problem does not > look like roundoff. You have not included the points in your example, but > if they are all the same value for a cluster then I would expect their std > and radius to be zero. > > Jeff > > **** > > On 5/9/13 9:17 PM, Erinn Schorsch wrote:**** > > I am working on an application using mahout KMeans clustering and > classification. We use Canopy clusters to seed KMeans although I don’t > believe this to be relevant for this issue.**** > > **** > > Has anyone else experienced this issue? (details follow). Does anyone > have any insight on whether radius=0 will affect if KMeans convergence is > arrived at? (and therefore drive premature convergence when miscalculated > to be 0).**** > > **** > > We have discovered what appears to me a defect in how the radius value is > calculated for the clusters that Mahout/Kmeans generates. Generally, we are > expecting that when a cluster includes data points (observations) which > vary from the center-point, then radius should be some non-zero value. Our > understanding is that a bigger radius, means a larger range of values,… ** > ** > > **** > > We found cases where several data points where part of a cluster, however > the radius is returned as 0. We use the radius to evaluate how > usable/relevant each cluster is to our use case, so getting accurate radius > is important in our case.**** > > **** > > We are using Mahout version 0.7**** > > **** > > **** > > Details:**** > > **** > > When calculating the KMeans clusters, the radius is calculated by the > method: AbstractCluster.computeParamters()**** > > **** > > @Override**** > > public void computeParameters() {**** > > if (getS0() == 0) {**** > > return;**** > > }**** > > setNumObservations((long) getS0());**** > > setTotalObservations(getTotalObservations() + getNumObservations());** > ** > > setCenter(getS1().divide(getS0()));**** > > // compute the component stds**** > > if (getS0() > 1) {**** > > * > setRadius(getS2().times(getS0()).minus(getS1().times(getS1())).assign(new > SquareRootFunction()).divide(getS0()));***** > > }**** > > setS0(0);**** > > setS1(center.like());**** > > setS2(center.like());**** > > }**** > > **** > > The important bit:* > setRadius(getS2().times(getS0()).minus(getS1().times(getS1())).assign(new > SquareRootFunction()).divide(getS0()));***** > > **** > > Or, simplified/paraphrased: (S2 * S0) minus (S1 * S1)… then divide using > a SquareRootFunction… (this last I don’t think affects our scenario).**** > > **** > > **** > > Data in our case:**** > > **** > > S0=6.0**** > > S1={1:150.0,0:54.0}**** > > S2={1:3750.0,0:486.0}**** > > **** > > getS2().times(getS0())={1:22500.0,0:2916.0}**** > > getS1().times(getS1())={1:22500.0,0:2916.0}**** > > **** > > And follows: ({1:22500.0,0:2916.0}).minus({1:22500.0,0:2916.0}) ={}**** > > **** > > … Thus, we have a ZERO value for radius on this point/cluster…. And > clearly this should not be the case. S1 and S2 represent 2 of the 6 > “observations” (S0 is number of observations)… and S1 and S2 are different > data points,… so radius must be non-zero?**** > > **** > > Is this a defect? Is there a flaw in our understanding/expectation of > radius?**** > > **** > > It seems the algorithm for radius here equates basically to: (A * B) – ( C > * C)**** > > It is easy to imagine many mathematical combinations where this equation > will compute to ZERO. Can this be correct behavior?**** > > **** > > Thanks for any thoughts / input!**** > > ** ** > > >
