Just use a frequency-weighted cosine distance and index words and
anomalously common co-occurrences.  That gives you pretty much all you are
asking for.
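
If a concrete sketch helps, this is roughly the idea in Python with
scikit-learn (the toy corpus and the library choice are just illustration,
not a prescription):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus, purely illustrative.
    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs are animals"]

    # TF-IDF down-weights terms that are common across the whole corpus,
    # so terms that are anomalously common in a document dominate its vector.
    X = TfidfVectorizer().fit_transform(docs)   # rows are L2-normalized by default

    # With L2-normalized rows, cosine similarity is a plain dot product,
    # and cosine distance is 1 - similarity.
    sims = (X @ X.T).toarray()
    dists = 1.0 - sims
    print(dists)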

Also, your progressive increase approach sounds a lot like k-means.  You
might take a look to see if that could help.
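
A quick way to see whether k-means behaves like your progressive-increase
scheme is a minimal sketch on the same kind of TF-IDF vectors (again
scikit-learn, purely for illustration; n_clusters is something you would
have to pick):

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus, purely illustrative.
    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs are animals",
            "stocks fell sharply today"]

    X = TfidfVectorizer().fit_transform(docs)

    # k-means on L2-normalized TF-IDF vectors is a common stand-in for
    # cosine-based clustering ("spherical" k-means in spirit).
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)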

On Wed, Jul 20, 2011 at 11:33 AM, Marco Turchi <[email protected]> wrote:

> The problem with using the cosine similarity, which is essentially a
> Euclidean distance, is, if I'm not wrong, that it groups the documents
> inside a sphere, so I was wondering whether to use something like the
> Mahalanobis distance, because it allows me to group the documents inside an
> ellipsoid.  This means computing the covariance matrix of the seed
> documents.
> I looked around for other approaches, but at the moment I'm stalled on the
> Mahalanobis distance and its covariance matrix.
> Any suggestions are very welcome.
>
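
For what it's worth, the Mahalanobis idea in the question boils down to a
covariance-weighted distance from the seed centroid.  A minimal numpy
sketch (the seed vectors and the small ridge term are my own illustration,
not anyone's actual setup):

    import numpy as np

    # Toy seed matrix: rows are dense document vectors, purely illustrative.
    seeds = np.array([[1.0, 0.2, 0.0],
                      [0.8, 0.1, 0.1],
                      [0.9, 0.3, 0.0]])

    mu = seeds.mean(axis=0)
    # Covariance of the seed documents; a small ridge keeps it invertible
    # when there are fewer seeds than dimensions.
    cov = np.cov(seeds, rowvar=False) + 1e-6 * np.eye(seeds.shape[1])
    cov_inv = np.linalg.inv(cov)

    def mahalanobis(x):
        d = x - mu
        return float(np.sqrt(d @ cov_inv @ d))   # sqrt((x-mu)^T S^-1 (x-mu))

    print(mahalanobis(np.array([1.0, 0.2, 0.05])))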
