Just use a frequency-weighted cosine distance and index words and anomalously common co-occurrences. That gives you pretty much all you are asking for.
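
Roughly, something like this is what I mean by weighting before taking the cosine (a plain numpy sketch, not Mahout code; I'm using TF-IDF as the frequency weighting, and the function names are just placeholders):

    import numpy as np

    def tfidf(counts):
        # counts: (n_docs, n_terms) matrix of raw term counts
        tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
        df = np.count_nonzero(counts, axis=0)
        idf = np.log((1.0 + counts.shape[0]) / (1.0 + df)) + 1.0
        return tf * idf

    def cosine_to_seeds(seed_counts, doc_counts):
        # Weight seeds and candidates together, then score each candidate
        # by cosine similarity against the centroid of the seed documents.
        X = tfidf(np.vstack([seed_counts, doc_counts]).astype(float))
        seed = X[: len(seed_counts)].mean(axis=0)
        cand = X[len(seed_counts):]
        norms = np.linalg.norm(cand, axis=1) * np.linalg.norm(seed)
        return cand @ seed / np.maximum(norms, 1e-12)
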
Also, your progressive-increase approach sounds a lot like k-means. You might take a look to see if that could help.

On Wed, Jul 20, 2011 at 11:33 AM, Marco Turchi <[email protected]> wrote:

> The problem with using cosine similarity, which is essentially a Euclidean
> distance if I'm not wrong, is that it groups the documents inside a
> sphere, so I was wondering about using the Mahalanobis distance, because
> it allows me to group the documents inside an ellipsoid. This means
> computing the covariance matrix of the seed documents.
> I looked around searching for other approaches, but at the moment I'm
> stalled on the Mahalanobis distance and its covariance matrix.
> Any suggestions are very welcome.
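
For what it's worth, the Mahalanobis idea you describe comes down to something like this (again a plain numpy sketch with invented names, not an existing API; note the ridge term, which is an assumption I'm adding because the covariance of a few seed documents in term space is singular):

    import numpy as np

    def mahalanobis_to_seeds(seed_vectors, doc_vector, ridge=1e-3):
        # seed_vectors: (n_seeds, n_terms); doc_vector: (n_terms,)
        mu = seed_vectors.mean(axis=0)
        cov = np.cov(seed_vectors, rowvar=False)
        # Ridge regularization: with a handful of seeds and thousands of
        # terms the sample covariance is singular, so plain inversion fails.
        cov = cov + ridge * np.eye(cov.shape[0])
        diff = doc_vector - mu
        return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

That singularity is the practical catch with a full covariance matrix in high-dimensional term space, and part of why I'd lean toward the weighted cosine above.
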
