You are correct that T2 is the prime driver in Canopy, but T1 is important in determining how many nearby points will be considered in the final centroid computation. If you look at the CanopyReducer, you will note that it calls canopy.computeParameters() before writing out the canopy. This has the effect of recomputing the posterior statistics over all the points that were within T1 of the canopy's original center, and setting the center to the mean thereof. Note also that CanopyClusterer.addPointToCanopies() observes a point in a canopy's posterior statistics based upon the T1 distance.
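Roughly, the per-point logic works like the sketch below. This is only a simplified, self-contained illustration of the behavior described above, not the actual CanopyClusterer/CanopyReducer code; the class and method names are illustrative.

// Simplified sketch of the T1/T2 behavior (not Mahout's actual code).
import java.util.ArrayList;
import java.util.List;

public class CanopySketch {

    static class Canopy {
        double[] center;   // the point that seeded the canopy
        double[] sum;      // running sum of all points observed within T1
        int numObserved;   // count of points observed within T1

        Canopy(double[] seed) {
            center = seed.clone();
            sum = new double[seed.length];
        }

        // Analogous to observing a point in the canopy's posterior statistics.
        void observe(double[] point) {
            for (int i = 0; i < point.length; i++) {
                sum[i] += point[i];
            }
            numObserved++;
        }

        // Analogous to computeParameters(): the center becomes the mean of all
        // points that fell within T1, so T1 directly shapes the final centroid.
        void computeParameters() {
            for (int i = 0; i < sum.length; i++) {
                center[i] = sum[i] / numObserved;
            }
        }
    }

    static double distance(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
            d += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(d);
    }

    // T2 < T1. T2 decides which points may still seed new canopies;
    // T1 decides which points contribute to each canopy's statistics.
    static List<Canopy> cluster(List<double[]> points, double t1, double t2) {
        List<Canopy> canopies = new ArrayList<>();
        for (double[] p : points) {
            boolean coveredByT2 = false;
            for (Canopy c : canopies) {
                double d = distance(p, c.center);
                if (d < t1) {
                    c.observe(p);        // T1: contributes to this canopy's mean
                }
                if (d < t2) {
                    coveredByT2 = true;  // T2: point can no longer seed a new canopy
                }
            }
            if (!coveredByT2) {
                Canopy c = new Canopy(p);
                c.observe(p);
                canopies.add(c);
            }
        }
        for (Canopy c : canopies) {
            c.computeParameters();       // recenter each canopy on its T1-neighborhood mean
        }
        return canopies;
    }
}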

The hard clustering step you describe uses the same distance measure and the canopy centers to assign points to clusters. A different/better distance measure can only be used when calling the driver methods from Java; the CLI uses the same distance measure for both phases.
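Conceptually, that second phase amounts to something like the following. This is not the Mahout driver API, just an illustration of what "using a better metric for the hard assignment" means; the Metric interface and method names here are hypothetical.

// Sketch of the two-phase assignment idea: canopies are formed with a cheap metric,
// then each point is hard-assigned with a better metric, considering only canopy
// centers within T1 of the point under the cheap metric.
import java.util.List;

public class TwoPhaseAssignment {

    interface Metric {
        double distance(double[] a, double[] b);
    }

    static int assign(double[] point, List<double[]> centers,
                      Metric cheap, Metric better, double t1) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centers.size(); i++) {
            if (cheap.distance(point, centers.get(i)) >= t1) {
                continue;                    // T1 prunes the candidate set
            }
            double d = better.distance(point, centers.get(i));
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;                         // -1 if no canopy center is within T1
    }
}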

You are correct that the clustered points from canopy are not needed if you are feeding the canopies to k-means as initial cluster centers. But again, those centers are influenced by T1, and by T2 of course, so both thresholds matter for different reasons. I think you are mostly on the right track, so I hope this helps.
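To make the k-means connection concrete, here is a minimal (non-Mahout) sketch of a single k-means iteration seeded with those canopy centers; since the seeds are the T1-neighborhood means produced above, T1 influences where k-means starts. The names are again illustrative only.

// One Lloyd iteration: assign each point to its nearest seed, then recompute means.
import java.util.List;

public class SeedKMeans {

    static double[][] lloydStep(List<double[]> points, double[][] seeds) {
        int k = seeds.length;
        int dim = seeds[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];

        for (double[] p : points) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double d = 0;
                for (int i = 0; i < dim; i++) {
                    d += (p[i] - seeds[c][i]) * (p[i] - seeds[c][i]);
                }
                if (d < bestDist) {
                    bestDist = d;
                    best = c;
                }
            }
            counts[best]++;
            for (int i = 0; i < dim; i++) {
                sums[best][i] += p[i];
            }
        }

        double[][] next = new double[k][dim];
        for (int c = 0; c < k; c++) {
            for (int i = 0; i < dim; i++) {
                next[c][i] = counts[c] > 0 ? sums[c][i] / counts[c] : seeds[c][i];
            }
        }
        return next;
    }
}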


On 10/14/10 10:03 PM, gabeweb wrote:
I've been using canopy clustering to generate initial centroids for k-means
clustering.  But in this case, is t1 actually doing anything?  Because
canopy clustering keeps going until all points are within t2 of some canopy
center.  If a point is within t1 of a canopy center, it's placed in that
canopy, sure, but it's also kept in the list of "not yet within t2 of some
canopy center" points, so this step seems vacuous.

It seems to me that the case in which t1 is not vacuous is when performing
the second step of canopy clustering as described by McCallum et al., which
is the hard clustering using a better distance metric.  In this case, the
canopies generated by t1 are used:  when points are assigned to single
clusters, only the clusters whose centers are within t1 of a point are
considered.

However, when using canopy clustering to generate centroids for k-means, we
don't perform the second step of canopy clustering because we don't need the
hard decisions for each point; we just need the canopy centers.  K-means
clustering determines the actual cluster membership (along with changing the
cluster centers, obviously).

So am I right about t1 not being relevant in the case of canopy clustering
for centroids only?  Or if that's not right, where am I going wrong?
Thanks.
