Re: Identify "less similar" documents

Ted Dunning Tue, 19 Apr 2011 09:57:23 -0700

Yes.  This makes sense.

I think you might want to make X depend on which cluster is closest.
For each cluster, define a function that maps a distance from the centroid
to its percentile among the distances of that cluster's own members.  There
will be one such function per cluster.

Then score each new point by its percentile under the function for its
nearest cluster.  The issue with a single threshold X is that some clusters
are very tight and others very loose, so the same distance can be typical
for one cluster and extreme for another.
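
As a rough illustration of the per-cluster percentile idea, here is a minimal
Python sketch.  It is not Mahout code: cosine_distance stands in for
CosineDistanceMeasure, and the names fit and percentile_score, as well as the
dense list-of-floats vector representation, are assumptions made for this
example only.

```python
import bisect
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes dense, non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit(training):
    """training: dict mapping category label -> list of document vectors.

    For each category, keep its centroid and the sorted distances of its
    own training documents to that centroid (the empirical distribution).
    """
    model = {}
    for label, docs in training.items():
        c = centroid(docs)
        dists = sorted(cosine_distance(d, c) for d in docs)
        model[label] = (c, dists)
    return model


def percentile_score(model, doc):
    """Return (nearest label, fraction of that cluster's training members
    whose distance to the centroid is <= the new document's distance)."""
    label, (c, dists) = min(
        model.items(), key=lambda kv: cosine_distance(doc, kv[1][0])
    )
    d = cosine_distance(doc, c)
    return label, bisect.bisect_right(dists, d) / len(dists)
```

A document would then be flagged as an outlier when its score exceeds some
cutoff such as 0.95 (that value is only an example), so the effective
distance threshold adapts to how tight or loose the nearest cluster is.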

On Tue, Apr 19, 2011 at 2:55 AM, Claudia Grieco <[email protected]> wrote:

> Thanks for the suggestion, I'm currently trying this hack:
> I take the documents of the training set and put in each cluster all the
> docs of a certain category.
> I compute the centroid for each category cluster
> I compute the distance of each new document to all centroids (I'm using
> CosineDistanceMeasure) and I identify as "outliers" those whose distance
> is greater than X
>
> Do you think this makes sense?
>
