Claudia,

This gets into "goodness metrics": measures of how good a cluster is. The radius is effectively the max distance metric, i.e. the maximum distance from any vector in the cluster to its cluster mean. It is less common but still useful. The most commonly used metric is the average distance to the cluster mean.
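To make the two metrics concrete, here is a rough sketch in plain Python. It assumes vectors are lists of floats and uses cosine distance to match the CosineDistanceMeasure mentioned in this thread. The function names (`cosine_distance`, `centroid`, `goodness_metrics`) and the toy data are made up for illustration and are not Mahout APIs.

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 - cos(angle between a and b).
    Assumes neither vector is all zeros (that would divide by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of the cluster's vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def goodness_metrics(vectors, distance=cosine_distance):
    """Return (max distance, average distance) from members to the cluster mean."""
    mean = centroid(vectors)
    dists = [distance(v, mean) for v in vectors]
    return max(dists), sum(dists) / len(dists)

# Hypothetical toy clusters: one tight, one loose.
tight = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
loose = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print("tight (max, avg):", goodness_metrics(tight))
print("loose (max, avg):", goodness_metrics(loose))
```

Both numbers come out smaller for the tight cluster. The max is driven by the single farthest member, so it is more sensitive to one stray document than the average is.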
Daniel.

On Wed, Apr 20, 2011 at 10:35 AM, Claudia Grieco <[email protected]> wrote:

> Thanks again.
>
> Does the radius of the cluster give information on the tightness of the
> cluster?
>
> From: Ted Dunning [mailto:[email protected]]
> Sent: Tuesday, April 19, 2011 18:57
> To: [email protected]
> Cc: Claudia Grieco
> Subject: Re: Identify "less similar" documents
>
> Yes. This makes sense.
>
> I think you might want to qualify X according to which cluster is closest.
> Define a function that estimates the percentile distance for members of each
> cluster. There will be one function per cluster.
>
> Then define a function for each new point that is the percentile score based
> on the distance to the nearest cluster. The issue with what you suggest is
> that some clusters are very tight and others very loose.
>
> On Tue, Apr 19, 2011 at 2:55 AM, Claudia Grieco <[email protected]> wrote:
>
> Thanks for the suggestion, I'm currently trying this hack:
> I take the documents of the training set and put in each cluster all the docs
> of a certain category.
> I compute the centroid for each category cluster
> I compute the distance of each new document to all centroids (I'm using
> CosineDistanceMeasure) and I identify as "outlier" the ones who have distance
> more than X
>
> Do you think this makes sense?
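For reference, the per-cluster percentile idea from the quoted message could be sketched roughly as follows. This is only one reading of that suggestion, in plain Python: `ClusterProfile`, `percentile_score`, the empirical-percentile definition (fraction of training members strictly closer to the mean) and the toy data are all assumptions made for illustration, not Mahout code.

```python
import math
from bisect import bisect_left

def cosine_distance(a, b):
    """Cosine distance: 1 - cos(angle between a and b). Assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of the cluster's vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

class ClusterProfile:
    """One percentile function per cluster, built from its training members."""

    def __init__(self, vectors):
        self.mean = centroid(vectors)
        self.sorted_dists = sorted(cosine_distance(v, self.mean) for v in vectors)

    def percentile(self, d):
        """Fraction of training members strictly closer to the mean than d."""
        return bisect_left(self.sorted_dists, d) / len(self.sorted_dists)

def percentile_score(doc, profiles):
    """Find the nearest cluster, then score doc's distance within that cluster."""
    label, profile = min(
        profiles.items(), key=lambda kv: cosine_distance(doc, kv[1].mean)
    )
    return label, profile.percentile(cosine_distance(doc, profile.mean))

# Hypothetical training data: one tight category, one loose category.
profiles = {
    "tight": ClusterProfile([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.1, 0.0]]),
    "loose": ClusterProfile([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]),
}

print(percentile_score([1.0, 0.3, 0.0], profiles))  # near the tight cluster
print(percentile_score([0.0, 1.0, 0.8], profiles))  # near the loose cluster
```

The cutoff then becomes a percentile (say, flag anything above 0.95, a value picked here only as an example) rather than a single absolute distance X, so a small distance can still count as unusual for a tight cluster while a larger one stays ordinary for a loose cluster.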
