You are correct that T2 is the prime driver in Canopy, but T1 is important in determining how many nearby points will be considered in the final centroid computation. If you look at the CanopyReducer, you will note that it calls canopy.computeParameters() before writing out the canopy. This has the effect of recomputing the posterior statistics over all the points that were within T1 of the canopy's original center, and setting the center to the mean thereof. Note also that CanopyClusterer.addPointToCanopies() observes a point in a canopy's posterior statistics based upon the T1 distance.
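Roughly, the per-point logic works like the sketch below. This is only a simplified, self-contained illustration of the behavior described above, not the actual CanopyClusterer/CanopyReducer code; the class and method names are illustrative.

// Simplified sketch of the T1/T2 behavior (not Mahout's actual code).
import java.util.ArrayList;
import java.util.List;

public class CanopySketch {

    static class Canopy {
        double[] center;   // the point that seeded the canopy
        double[] sum;      // running sum of all points observed within T1
        int numObserved;   // count of points observed within T1

        Canopy(double[] seed) {
            center = seed.clone();
            sum = new double[seed.length];
        }

        // Analogous to observing a point in the canopy's posterior statistics.
        void observe(double[] point) {
            for (int i = 0; i < point.length; i++) {
                sum[i] += point[i];
            }
            numObserved++;
        }

        // Analogous to computeParameters(): the center becomes the mean of all
        // points that fell within T1, so T1 directly shapes the final centroid.
        void computeParameters() {
            for (int i = 0; i < sum.length; i++) {
                center[i] = sum[i] / numObserved;
            }
        }
    }

    static double distance(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
            d += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(d);
    }

    // T2 < T1. T2 decides which points may still seed new canopies;
    // T1 decides which points contribute to each canopy's statistics.
    static List<Canopy> cluster(List<double[]> points, double t1, double t2) {
        List<Canopy> canopies = new ArrayList<>();
        for (double[] p : points) {
            boolean coveredByT2 = false;
            for (Canopy c : canopies) {
                double d = distance(p, c.center);
                if (d < t1) {
                    c.observe(p);        // T1: contributes to this canopy's mean
                }
                if (d < t2) {
                    coveredByT2 = true;  // T2: point can no longer seed a new canopy
                }
            }
            if (!coveredByT2) {
                Canopy c = new Canopy(p);
                c.observe(p);
                canopies.add(c);
            }
        }
        for (Canopy c : canopies) {
            c.computeParameters();       // recenter each canopy on its T1-neighborhood mean
        }
        return canopies;
    }
}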

The hard clustering step you describe uses the same distance measure and the canopy centers to assign points to clusters. A different/better distance measure can only be used when calling the driver methods from Java; the CLI uses the same distance measure for both phases.
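Conceptually, that second phase amounts to something like the following. This is not the Mahout driver API, just an illustration of what "using a better metric for the hard assignment" means; the Metric interface and method names here are hypothetical.

// Sketch of the two-phase assignment idea: canopies are formed with a cheap metric,
// then each point is hard-assigned with a better metric, considering only canopy
// centers within T1 of the point under the cheap metric.
import java.util.List;

public class TwoPhaseAssignment {

    interface Metric {
        double distance(double[] a, double[] b);
    }

    static int assign(double[] point, List<double[]> centers,
                      Metric cheap, Metric better, double t1) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centers.size(); i++) {
            if (cheap.distance(point, centers.get(i)) >= t1) {
                continue;                    // T1 prunes the candidate set
            }
            double d = better.distance(point, centers.get(i));
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;                         // -1 if no canopy center is within T1
    }
}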

You are correct that the clustered points from canopy are not needed if you are feeding the canopies to k-means as initial cluster centers. But again, those centers are influenced by T1, and by T2 of course, so both thresholds matter for different reasons. I think you are mostly on the right track, so I hope this helps.
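To make the k-means connection concrete, here is a minimal (non-Mahout) sketch of a single k-means iteration seeded with those canopy centers; since the seeds are the T1-neighborhood means produced above, T1 influences where k-means starts. The names are again illustrative only.

// One Lloyd iteration: assign each point to its nearest seed, then recompute means.
import java.util.List;

public class SeedKMeans {

    static double[][] lloydStep(List<double[]> points, double[][] seeds) {
        int k = seeds.length;
        int dim = seeds[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];

        for (double[] p : points) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double d = 0;
                for (int i = 0; i < dim; i++) {
                    d += (p[i] - seeds[c][i]) * (p[i] - seeds[c][i]);
                }
                if (d < bestDist) {
                    bestDist = d;
                    best = c;
                }
            }
            counts[best]++;
            for (int i = 0; i < dim; i++) {
                sums[best][i] += p[i];
            }
        }

        double[][] next = new double[k][dim];
        for (int c = 0; c < k; c++) {
            for (int i = 0; i < dim; i++) {
                next[c][i] = counts[c] > 0 ? sums[c][i] / counts[c] : seeds[c][i];
            }
        }
        return next;
    }
}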


On 10/14/10 10:03 PM, gabeweb wrote:
I've been using canopy clustering to generate initial centroids for k-means
clustering.  But in this case, is t1 actually doing anything?  Because
canopy clustering keeps going until all points are within t2 of some canopy
center.  If a point is within t1 of a canopy center, it's placed in that
canopy, sure, but it's also kept in the list of "not yet within t2 of some
canopy center" points, so this step seems vacuous.

It seems to me that the case in which t1 is not vacuous is when performing
the second step of canopy clustering as described by McCallum et al., which
is the hard clustering using a better distance metric.  In this case, the
canopies generated by t1 are used:  when points are assigned to single
clusters, only the clusters whose centers are within t1 of a point are
considered.

However, when using canopy clustering to generate centroids for k-means, we
don't perform the second step of canopy clustering because we don't need the
hard decisions for each point; we just need the canopy centers.  K-means
clustering determines the actual cluster membership (along with changing the
cluster centers, obviously).

So am I right about t1 not being relevant in the case of canopy clustering
for centroids only?  Or if that's not right, where am I going wrong?
Thanks.
