In my text clustering project I used Euclidean distance as the distance measure. I wrote a method that calculates the mean distance between all pairs of vectors (documents) and used this mean as T2; for T1 I used 2 * mean. This approach worked really well for me, giving a reasonable number of clusters across various corpora.
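
A rough sketch of that heuristic, written in Python purely for illustration. The function names `mean_pairwise_distance` and `canopy_thresholds` are made up here, the sample vectors are toy data, and this is not the original implementation. It assumes dense vectors of equal length and Euclidean distance.

```python
import math
from itertools import combinations


def mean_pairwise_distance(vectors):
    """Mean Euclidean distance over all unordered pairs of vectors."""
    total = 0.0
    count = 0
    for a, b in combinations(vectors, 2):
        total += math.dist(a, b)  # Euclidean distance (Python 3.8+)
        count += 1
    if count == 0:
        raise ValueError("need at least two vectors")
    return total / count


def canopy_thresholds(vectors):
    """Return (t1, t2) with t2 = mean pairwise distance and t1 = 2 * t2."""
    t2 = mean_pairwise_distance(vectors)
    t1 = 2.0 * t2
    return t1, t2


# Toy document vectors (hypothetical data, for illustration only).
docs = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
t1, t2 = canopy_thresholds(docs)
print(f"t1={t1:.3f} t2={t2:.3f}")
```

Note that this visits every pair, so the cost grows quadratically with the number of documents; on a large corpus, computing the mean over a random sample of documents may be a workable approximation.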

On Tue, May 15, 2012 at 10:45 AM, Robert Stewart <[email protected]> wrote:

> I am trying to run canopy clustering on vectors extracted from lucene
> index. I want to use CosineDistanceMeasure. How do I know what
> appropriate values to use for t1 and t2 distance threshold? I would assume
> that Cosine distance measure would return "distances" as a range from 0.0
> to 1.0 but that seems not the case, so how do I know what the potential
> distance ranges are to pick t1 and t2 (other than many trial and errors)?
>
> Thanks
> Bob
