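Note that cosine distance *is* Euclidean distance with the addition of document length normalization: for unit-length vectors, ||a - b||^2 = 2 (1 - cos(a, b)), so the two measures rank pairs of documents identically.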
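
A quick way to check this numerically is sketched below. The term-count vectors and helper names are made up for illustration (they are not from the paper), and the script uses only the standard library. It compares squared Euclidean distance before and after L2 normalization against cosine distance.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical term-count vectors. doc_b is doc_a concatenated three
# times: same word proportions, three times the length.
doc_a = [2.0, 1.0, 0.0, 3.0]
doc_b = [6.0, 3.0, 0.0, 9.0]
doc_c = [0.0, 4.0, 1.0, 1.0]

for name, other in (("b", doc_b), ("c", doc_c)):
    cos_d = cosine_distance(doc_a, other)
    raw = squared_euclidean(doc_a, other)
    normed = squared_euclidean(l2_normalize(doc_a), l2_normalize(other))
    print(f"a vs {name}: raw sq. Euclidean={raw:.4f}  "
          f"normalized sq. Euclidean={normed:.4f}  2*cosine={2 * cos_d:.4f}")
```

On raw counts, doc_a and doc_b look far apart simply because doc_b is longer. After normalization the squared Euclidean distance matches twice the cosine distance for both pairs, which is the sense in which the two measures coincide.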

On Fri, May 23, 2014 at 9:39 PM, David Noel <[email protected]> wrote:

> I found an interesting paper that I thought someone here might find
> helpful.
>
> http://www.milanmirkovic.com/wp-content/uploads/2012/10/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf
>
> ABSTRACT: ... A wide variety of distance functions and similarity
> measures have been used for clustering, such as squared Euclidean
> distance, cosine similarity, and relative entropy. In this paper, we
> compare and analyze the effectiveness of these measures in partitional
> clustering for text document datasets. Our experiments utilize the
> standard K-means algorithm and we report results on seven text
> document datasets and five distance/similarity measures that have been
> most commonly used in text clustering.
>
> TL;DR: For text documents, favor Cosine, Jaccard/Tanimoto, or Pearson
> over Euclidean distance measures.
