Re: Identify "less similar" documents

Daniel McEnnis Wed, 20 Apr 2011 09:03:23 -0700

Claudia,

This gets into "goodness metrics": measures of how good a cluster is.
The radius is effectively the max distance metric, i.e. the maximum
distance from a vector to its cluster mean. It is a less common but
still useful metric. The most commonly used is the average distance to
the cluster mean.
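
As a rough illustration of the two metrics named above, here is a minimal
sketch in plain Python. The names mean_vector, euclidean and tightness are
made up for this example and are not Mahout APIs; Euclidean distance is
only an assumption, and any other distance function can be passed in.

```python
import math

def mean_vector(vectors):
    # Component-wise mean of the member vectors (the cluster mean).
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def euclidean(a, b):
    # Assumed distance for this sketch; swap in another measure if needed.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def tightness(vectors, distance=euclidean):
    # Returns both goodness metrics: the max distance and the average
    # distance from a member vector to the cluster mean.
    center = mean_vector(vectors)
    dists = [distance(v, center) for v in vectors]
    return {"max": max(dists), "average": sum(dists) / len(dists)}

# Four nearby points plus one far-away member.
cluster = [[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0], [6.0, 1.0]]
scores = tightness(cluster)
print(scores)
```

With this data the max distance is set entirely by the single far-away
member, while the average is pulled up only partly, which is one reason
the average is often the preferred summary.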


Daniel.

On Wed, Apr 20, 2011 at 10:35 AM, Claudia Grieco <[email protected]> wrote:
> Thanks again.
>
> Does the radius of the cluster give information on the tightness of the
> cluster?
>
> From: Ted Dunning [mailto:[email protected]]
> Sent: Tuesday, 19 April 2011 18:57
> To: [email protected]
> Cc: Claudia Grieco
> Subject: Re: Identify "less similar" documents
>
> Yes.  This makes sense.
>
> I think you might want to qualify X according to which cluster is closest.
> Define a function that estimates the percentile distance for members of each
> cluster.  There will be one function per cluster.
>
> Then define a function for each new point that is the percentile score based
> on the distance to the nearest cluster.   The issue with what you suggest is
> that some clusters are very tight and others very loose.
>
> On Tue, Apr 19, 2011 at 2:55 AM, Claudia Grieco <[email protected]> wrote:
>
> Thanks for the suggestion. I'm currently trying this hack:
> I take the documents of the training set and put all the docs of a given
> category into one cluster.
> I compute the centroid of each category cluster.
> I compute the distance from each new document to all centroids (I'm using
> CosineDistanceMeasure) and I flag as "outlier" those whose distance is
> more than X.
>
> Do you think this makes sense?
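
The procedure described in this message could be sketched roughly as
follows. This is plain Python, not Mahout code: cosine_distance is assumed
to mirror CosineDistanceMeasure as 1 minus the cosine similarity, and the
helper names, the toy vectors and the threshold value are all invented for
illustration. It also assumes a document counts as an outlier when even
its nearest centroid is farther away than X.

```python
import math

def centroid(vectors):
    # Component-wise mean of all training documents in one category.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_distance(a, b):
    # 1 - cosine similarity. Treating a zero vector as maximally distant
    # is a choice made for this sketch.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)

def nearest_centroid(doc, centroids):
    # Returns (category label, distance) for the closest centroid.
    return min(
        ((label, cosine_distance(doc, c)) for label, c in centroids.items()),
        key=lambda pair: pair[1],
    )

def is_outlier(doc, centroids, x):
    # Outlier if the distance to the nearest centroid exceeds X.
    _, dist = nearest_centroid(doc, centroids)
    return dist > x

# Toy term-weight vectors, grouped by category (hypothetical data).
training = {
    "sports": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "politics": [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
}
centroids = {label: centroid(docs) for label, docs in training.items()}

X = 0.3  # arbitrary threshold for the example
typical_doc = [1.0, 0.0, 0.0]
odd_doc = [0.0, 0.0, 1.0]
print(is_outlier(typical_doc, centroids, X), is_outlier(odd_doc, centroids, X))
```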
>
>
>
>
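
Ted's per-cluster percentile suggestion, quoted above, might look
something like the sketch below. It is only one possible reading: the
percentile is estimated empirically as the fraction of training members
whose distance to their own centroid is less than or equal to the given
distance, Euclidean distance is used to keep the example small, and every
name and data point here is hypothetical.

```python
import bisect
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def make_percentile_fn(member_distances):
    # One function per cluster: maps a distance to the fraction of that
    # cluster's members lying at that distance or closer.
    ordered = sorted(member_distances)
    def percentile(d):
        return bisect.bisect_right(ordered, d) / len(ordered)
    return percentile

def fit(clusters, distance=euclidean):
    # For each cluster, keep its centroid and its own percentile function.
    model = {}
    for label, members in clusters.items():
        c = centroid(members)
        model[label] = (c, make_percentile_fn([distance(m, c) for m in members]))
    return model

def score(point, model, distance=euclidean):
    # Percentile score of a new point relative to its nearest cluster.
    label, (c, pct) = min(
        model.items(), key=lambda item: distance(point, item[1][0])
    )
    return label, pct(distance(point, c))

clusters = {
    "tight": [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]],
    "loose": [[10.0, 0.0], [12.0, 0.0], [8.0, 0.0], [10.0, 2.0], [10.0, -2.0]],
}
model = fit(clusters)

# Both points sit at raw distance 1.0 from their nearest centroid.
near_tight = score([1.0, 0.0], model)
near_loose = score([11.0, 0.0], model)
print(near_tight, near_loose)
```

The same raw distance lands beyond every member of the tight cluster but
well inside the loose one, which is the problem with a single global X
that the percentile score is meant to address.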
