I just read this paper and it is very nicely written up. There are a few unfortunate omissions:
1) Cosine is equivalent to Euclidean distance with the addition of document
and centroid normalization (quick numerical check at the bottom of this
message).

2) The entropy measure given appears to be an ad hoc partial derivation of
mutual information, but this is not mentioned, nor are the differences
examined.

3) The tf-idf weighting uses straight tf. It is usually better to use
log(tf) or sqrt(tf). This is not examined (see the second sketch below).

4) The same number of clusters as target categories is used. Commonly,
clustering is used as a feature for classification, and in that case there
is no rationale for the number of clusters to match the number of target
categories.

5) If (4) is accepted, then mutual information is immediately better than
the entropy measure shown, since it normalizes away the number of clusters
(toy example below).

On Fri, May 23, 2014 at 9:39 PM, David Noel <[email protected]> wrote:
> I found an interesting paper that I thought someone here might find
> helpful.
>
> http://www.milanmirkovic.com/wp-content/uploads/2012/10/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf
>
> ABSTRACT: ... A wide variety of distance functions and similarity
> measures have been used for clustering, such as squared Euclidean
> distance, cosine similarity, and relative entropy. In this paper, we
> compare and analyze the effectiveness of these measures in partitional
> clustering for text document datasets. Our experiments utilize the
> standard K-means algorithm and we report results on seven text
> document datasets and five distance/similarity measures that have been
> most commonly used in text clustering.
>
> TL;DR: For text documents, favor Cosine, Jaccard/Tanimoto, or Pearson
> over Euclidean distance measures.
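For (1), a quick sanity check (my own sketch with numpy, not from the
paper): after L2-normalizing both the documents and the centroids, squared
Euclidean distance is just 2 - 2*cosine, so K-means with Euclidean distance
on normalized vectors ranks pairs exactly as cosine does.

import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.random(1000)
x /= np.linalg.norm(x)   # normalize the "document"
y /= np.linalg.norm(y)   # normalize the "centroid"

cos = x @ y                       # cosine similarity of unit vectors
sq_euclid = np.sum((x - y) ** 2)  # squared Euclidean distance

print(sq_euclid, 2 - 2 * cos)     # identical up to floating-point error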
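For (3), one easy way to try sublinear tf is scikit-learn's
sublinear_tf=True, which replaces raw tf with 1 + log(tf); the toy documents
here are made up for illustration, and sqrt(tf) would need a small custom
transformer.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["clustering of text documents",
        "text text text text clustering"]   # repeated term to show damping

raw_tfidf = TfidfVectorizer()                   # straight tf * idf
sub_tfidf = TfidfVectorizer(sublinear_tf=True)  # (1 + log(tf)) * idf

print(raw_tfidf.fit_transform(docs).toarray())
print(sub_tfidf.fit_transform(docs).toarray())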
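For (2) and (5), a toy comparison (made-up labels, not the paper's data)
using scikit-learn's mutual information scores; the normalized variant stays
on a fixed 0..1 scale, so it remains comparable when the number of clusters
does not match the number of target categories.

from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

gold     = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # target categories
clusters = [0, 0, 1, 1, 2, 2, 3, 3, 4]   # k = 5 clusters vs. 3 categories

print(mutual_info_score(gold, clusters))             # raw MI, in nats
print(normalized_mutual_info_score(gold, clusters))  # scaled to 0..1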
