Yes.  That is the idea.

But I would drop the average of averages.  Use squared distance.

Just average the distances to the nearest centroid over all of the points
(or over enough of them to get an estimate).
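
As a rough sketch of that computation, assuming NumPy, Euclidean distance,
and points and centroids that are already dense arrays (the function name
and the sample_size/seed parameters are made up for illustration, they are
not from Mahout):

```python
import numpy as np


def mean_squared_distance_to_nearest(points, centroids, sample_size=None, seed=0):
    """Average squared Euclidean distance from each point to its nearest centroid.

    Hypothetical helper for illustration. If sample_size is given, only a
    random subset of the points is used to estimate the average.
    """
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)

    # Optionally subsample: "enough to get an estimate".
    if sample_size is not None and sample_size < len(points):
        rng = np.random.default_rng(seed)
        keep = rng.choice(len(points), size=sample_size, replace=False)
        points = points[keep]

    # Pairwise squared distances, shape (n_points, n_centroids).
    diffs = points[:, None, :] - centroids[None, :, :]
    sq_dists = np.einsum("ijk,ijk->ij", diffs, diffs)

    # Keep only the distance to the closest centroid, then average over points.
    return float(sq_dists.min(axis=1).mean())


if __name__ == "__main__":
    held_out = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]]
    centroids = [[0.0, 0.0], [10.0, 0.0]]
    print(mean_squared_distance_to_nearest(held_out, centroids))
```

For the held-out version, the centroids would come from the training data
and the points passed in would be the ones that were held out.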

This is proportional to the negative log-likelihood (with an offset) for the
mixture-of-Gaussians model that underlies k-means clustering.
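
To spell that out under one common set of assumptions (mine, for
illustration): k spherical Gaussians in d dimensions with a shared variance
\sigma^2, equal mixing weights 1/k, and each point x_i hard-assigned to its
nearest centroid c_{z_i}. With D the mean squared distance to the nearest
centroid over n points:

```latex
\begin{align}
\log L
  &= \sum_{i=1}^{n} \log\left[ \frac{1}{k}\,(2\pi\sigma^2)^{-d/2}
     \exp\left( -\frac{\lVert x_i - c_{z_i} \rVert^2}{2\sigma^2} \right) \right] \\
  &= -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \min_{j} \lVert x_i - c_j \rVert^2
     \;-\; n \left[ \log k + \frac{d}{2} \log(2\pi\sigma^2) \right] \\
\frac{1}{n} \log L
  &= -\frac{D}{2\sigma^2} - \left[ \log k + \frac{d}{2} \log(2\pi\sigma^2) \right],
\qquad
D = \frac{1}{n} \sum_{i=1}^{n} \min_{j} \lVert x_i - c_j \rVert^2
\end{align}
```

So a smaller D means a higher likelihood. Note that with these assumptions
the offset depends on k through the log k term.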

See this paper for a use of mean squared distance to nearest cluster.


On Fri, May 24, 2013 at 9:46 AM, Pat Ferrel <[email protected]> wrote:

> I'm trying to automate something like a hierarchical clustering and so
> looking for a good quality metric. I can see no way to automate from the
> numbers I just got but it's a start. It was for a very small data set.
>
> You mention looking at intra-cluster average distance with held out data.
> Held-out, I assume, means it was not used to calculate centroids or in
> determining cluster membership. Are you proposing remeasuring the average
> distance from the closest centroid for these held-out docs? Averaging
> together the ones that are closest to the same centroid, then averaging the
> averages for all clusters?
>
> I don't think I've heard of this before. Seems interesting. Is there a
> paper?
>
> On May 21, 2013, at 9:53 PM, Ted Dunning <[email protected]> wrote:
>
> On Tue, May 21, 2013 at 8:47 PM, Pat Ferrel <[email protected]> wrote:
>
> > For this sample it looks like about 20-40 clusters is "best"? Looking at
> > the results for k=40 by eyeball they do seem pretty good.
>
>
> It is really hard to tell with these numbers.  In spite of their heritage,
> these scaled average distances are kind of debatable as things to compare,
> if only because they are scaled differently.
>
> My own tendency is to prefer to use unscaled intra-cluster average
> distance.  This should monotonically decrease as k increases.  The
> interesting question (for me) is what the same average is for held-out
> data.
>
> This measure of quality is focused around the use of clustering as a
> feature for downstream modeling, not necessarily for human consumption.
>
>
