The best way to evaluate a clustering really depends on what your purpose is.

My own purpose is typically to use the clustering as a description of the
probability distribution of data.

For that purpose, the best evaluation is the distance from held-out data
points to their nearest centroids.  The use of held-out data is critical
here since otherwise you could just put a single cluster at every data point
and get zero distance for the original data.  On held-out data that trick
does not help, because new points will not generally coincide with any of
those centroids.
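
To make that concrete, here is a minimal sketch in plain Python.  It is not
Mahout code; the function names and the toy points are made up purely for
illustration, and Euclidean distance is an assumption (use whatever distance
measure you clustered with).

```python
import math

def nearest_centroid_distance(point, centroids):
    """Euclidean distance from one point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)

def mean_distance_to_centroids(points, centroids):
    """Average distance-to-nearest-centroid over a set of points."""
    total = sum(nearest_centroid_distance(p, centroids) for p in points)
    return total / len(points)

# Toy data, invented for this example.
train = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
held_out = [(0.0, 0.4), (10.0, 10.6)]

# Degenerate "clustering": one centroid sitting on every training point.
memorized = list(train)
# A more honest two-centroid summary of the same training data.
two_clusters = [(0.0, 0.5), (10.0, 10.5)]

train_score_memorized = mean_distance_to_centroids(train, memorized)
held_score_memorized = mean_distance_to_centroids(held_out, memorized)
held_score_two = mean_distance_to_centroids(held_out, two_clusters)
```

On this toy data the memorizing clustering scores exactly zero on the
training points, yet on the held-out points it does worse than the
two-centroid summary.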

This view of things is very good from the standpoint of machine learning
and data compression, but might be less useful for certain purposes that
have to do with explanation of data in human readable form.  My experience
is that it is common for a clustering algorithm to be very good as a
probability distribution description but quite bad for human inspection.

My own tendency would be to adapt the outline you gave to work on held-out
data instead of the original training data.
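
A rough sketch of what that adaptation might look like, again in plain
Python rather than Mahout: assign each held-out point to its nearest
centroid, then compute the within-cluster pairwise distances and flag
outliers on those held-out members.  The helper names, the toy points and
the outlier threshold of 3.0 are all hypothetical choices for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations

def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    groups = defaultdict(list)
    for p in points:
        idx = min(range(len(centroids)),
                  key=lambda i: math.dist(p, centroids[i]))
        groups[idx].append(p)
    return dict(groups)

def mean_pairwise_distance(members):
    """Average distance over all pairs of points in one cluster."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0  # a single-member cluster has no pairs
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def outliers(members, centroid, threshold):
    """Members farther from their centroid than an assumed threshold."""
    return [p for p in members if math.dist(p, centroid) > threshold]

# Centroids would come from clustering the training data; these are invented.
centroids = [(0.0, 0.0), (10.0, 10.0)]
held_out = [(0.0, 1.0), (1.0, 0.0), (9.0, 10.0), (10.0, 14.0)]

groups = assign(held_out, centroids)
spread = {i: mean_pairwise_distance(m) for i, m in groups.items()}
flagged = {i: outliers(m, centroids[i], threshold=3.0)
           for i, m in groups.items()}
```

Note that the all-pairs computation is quadratic in cluster size, so for
large clusters you would probably want to sample pairs.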

On Mon, Feb 25, 2013 at 4:27 AM, Chris Harrington <[email protected]> wrote:

> Hi all,
>
> I want to find all the vectors within a cluster and then find the distance
> between them and every other vector within a cluster, in hopes this will
> give me a good idea of how similar each vector within a cluster is as well
> as identify outlier vectors.
>
> So there are 2 things I want to ask.
>
> 1. Is this a sensible approach to evaluating the cluster quality?
>
> 2. Is the correct file to get this info from the
> clusteredPoints/parts-m-00000 file?
>
> Thanks,
> Chris
