All,

I have two questions related to "Quick tour of text analysis using the Mahout command line".
1. Metrics: When working through the cluster analysis, one can choose among many different distance metrics. The tour uses the cosine metric. Are there any problems that can arise from using the cosine metric to define the clusters but Tanimoto or Euclidean to dump them? So far I have stayed consistent: once I start with cosine, I use cosine all the way through. When does it make sense not to do that? To be clear, the current version of the tour does NOT specify a metric when dumping the clusters, so the default, "Euclid", is used.

2. Parameters for canopy clustering: What are parameters t3 and t4? I understand that they are optional and apply to the reducer, and that t1 and t2 are used in their place if t3 and t4 are not specified.

https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering

There is lots of discussion about t1 and t2 there, but t3 and t4 are not covered in MiA either. Are these parameters that I should ignore for now?

SCott
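
P.S. To make question 1 concrete, here is a minimal sketch in plain Python (not Mahout code) of the textbook definitions of the three distances as I understand them. The function names and the two sample vectors are made up for illustration, and I am assuming Mahout's measures follow these same formulas.

```python
import math

def cosine_distance(a, b):
    # 1 - cos(angle between a and b); ignores vector length
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def tanimoto_distance(a, b):
    # 1 - a.b / (|a|^2 + |b|^2 - a.b); sensitive to both angle and length
    dot = sum(x * y for x, y in zip(a, b))
    sq_a = sum(x * x for x in a)
    sq_b = sum(y * y for y in b)
    return 1.0 - dot / (sq_a + sq_b - dot)

def euclidean_distance(a, b):
    # straight-line distance; dominated by length differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical term vectors pointing in the same direction,
# one twice as long as the other (e.g. a document repeated twice).
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]

for name, fn in [("cosine", cosine_distance),
                 ("tanimoto", tanimoto_distance),
                 ("euclidean", euclidean_distance)]:
    print(name, fn(doc_a, doc_b))
```

For these two vectors the cosine distance is essentially zero, the Tanimoto distance is 1/3, and the Euclidean distance is sqrt(5), so points that cosine treats as identical can look far apart under the other two. That is the kind of mismatch I am wondering about when the clustering and the dump use different metrics.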
