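Note that cosine distance *is* Euclidean distance with the addition of
document length normalization: once each document vector is scaled to
unit length, squared Euclidean distance equals 2 * (1 - cosine
similarity), so the two measures rank document pairs identically.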
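
A quick way to check this is the minimal Python sketch below. The toy
term-count vectors and the helper names (l2_normalize,
cosine_similarity, squared_euclidean) are made up for illustration and
are not taken from the paper; it only assumes plain lists of floats and
the standard library.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit Euclidean length (document length normalization).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    # Sum of squared coordinate differences.
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical term-count vectors for two documents (illustrative only).
doc_a = [3.0, 0.0, 1.0, 2.0]
doc_b = [1.0, 4.0, 0.0, 2.0]

cos_dist = 1.0 - cosine_similarity(doc_a, doc_b)
sq_euc = squared_euclidean(l2_normalize(doc_a), l2_normalize(doc_b))

# On unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b) = 2 * cos_dist
print(cos_dist, sq_euc, math.isclose(sq_euc, 2.0 * cos_dist))
```

Without the normalization step, a long document and a short one on the
same topic can be far apart in Euclidean terms while their cosine
distance stays near zero, which is presumably why raw Euclidean does
poorly on text.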




On Fri, May 23, 2014 at 9:39 PM, David Noel <[email protected]> wrote:

> I found an interesting paper that I thought someone here might find
> helpful.
>
>
> http://www.milanmirkovic.com/wp-content/uploads/2012/10/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf
>
> ABSTRACT: ... A wide variety of distance functions and similarity
> measures have been used for clustering, such as squared Euclidean
> distance, cosine similarity, and relative entropy. In this paper, we
> compare and analyze the effectiveness of these measures in partitional
> clustering for text document datasets. Our experiments utilize the
> standard K-means algorithm and we report results on seven text
> document datasets and five distance/similarity measures that have been
> most commonly used in text clustering.
>
> TL;DR: For text documents, favor Cosine, Jaccard/Tanimoto, or Pearson
> over Euclidean distance measures.
>
