Thanks again.

Does the radius of the cluster give information on the tightness of the cluster?

 

 

From: Ted Dunning [mailto:[email protected]]
Sent: Tuesday, 19 April 2011 18:57
To: [email protected]
Cc: Claudia Grieco
Subject: Re: Identify "less similar" documents

 

Yes.  This makes sense.

 

I think you might want to make X depend on which cluster is closest.
For each cluster, define a function that maps a distance from the centroid to
its percentile among the distances of that cluster's members.  There will be
one function per cluster.

 

Then score each new point by the percentile of its distance to the nearest
cluster, using that cluster's function.  The issue with a single X is that
some clusters are very tight and others very loose, so the same distance can
be typical for one cluster and extreme for another.
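
A minimal sketch of the per-cluster percentile idea in plain Python, under
stated assumptions: documents are dense lists of floats, cosine distance is
taken as 1 minus cosine similarity, and the percentile is the empirical
fraction of cluster members at or below a given distance. The names
make_percentile_fn and percentile_score are hypothetical helpers for
illustration, not Mahout APIs.

```python
from bisect import bisect_right
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length, non-zero dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def make_percentile_fn(members, center):
    """Return f(d): fraction of cluster members whose distance to center is <= d."""
    dists = sorted(cosine_distance(m, center) for m in members)
    return lambda d: bisect_right(dists, d) / len(dists)


def build_clusters(members_by_name):
    """Map each cluster name to (centroid, percentile function)."""
    clusters = {}
    for name, members in members_by_name.items():
        center = centroid(members)
        clusters[name] = (center, make_percentile_fn(members, center))
    return clusters


def percentile_score(point, clusters):
    """Return (nearest cluster name, percentile of the point's distance in it)."""
    name, (center, pct) = min(
        clusters.items(),
        key=lambda kv: cosine_distance(point, kv[1][0]),
    )
    return name, pct(cosine_distance(point, center))
```

A point would then be flagged when its score exceeds some cutoff such as 0.95
(the cutoff value is an assumption), which adapts automatically to how tight
or loose the nearest cluster is.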

On Tue, Apr 19, 2011 at 2:55 AM, Claudia Grieco <[email protected]> wrote:

Thanks for the suggestion. I'm currently trying this hack:
I take the documents of the training set and put all the docs of a given
category into one cluster.
I compute the centroid of each category cluster.
I compute the distance from each new document to every centroid (I'm using
CosineDistanceMeasure) and I flag as "outliers" the ones whose distance is
more than X.

Do you think this makes sense?
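
The steps above can be sketched in plain Python as follows. This is only an
illustration under assumptions: vectors are dense lists of floats,
cosine_distance is a stand-in for CosineDistanceMeasure (assumed here to be
1 minus cosine similarity), and "distance more than X" is read as "farther
than X from the nearest centroid". The function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length, non-zero dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def category_centroids(training_docs):
    """training_docs: iterable of (category, vector) pairs.

    Groups vectors by category and returns {category: mean vector}.
    """
    groups = {}
    for category, vector in training_docs:
        groups.setdefault(category, []).append(vector)
    return {
        category: [sum(col) / len(vectors) for col in zip(*vectors)]
        for category, vectors in groups.items()
    }


def is_outlier(doc, centroids, x):
    """True if doc is farther than x from its nearest category centroid."""
    nearest = min(cosine_distance(doc, c) for c in centroids.values())
    return nearest > x
```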

 
