Hi,

I want to run canopy clustering before k-means over a set of strings, so that the number of clusters is decided automatically. How do I run canopy clustering over a set of strings?
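For illustration, here is a minimal sketch of the canopy idea in plain Python. It is not Mahout code: the function names, the whitespace tokenization, the sample strings, and the thresholds are all assumptions made for this example. Each string becomes a set of tokens, Tanimoto distance is computed over those sets, and the number of canopies found is what would be passed to k-means as the number of clusters. The sketch assumes the usual convention that T1 (loose threshold) is greater than T2 (tight threshold).

```python
def tanimoto_distance(a, b):
    """Tanimoto (Jaccard) distance between two token sets: 1 - |A & B| / |A | B|."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def canopy_clusters(strings, t1, t2):
    """Return a list of (center, members) pairs.

    Assumes t1 > t2: points closer than t1 to a center join its canopy,
    and points closer than t2 are removed from further consideration.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")

    # Hypothetical tokenization: lowercase and split on whitespace.
    remaining = [(s, set(s.lower().split())) for s in strings]
    canopies = []

    while remaining:
        # Take the next unassigned point as a new canopy center.
        center, center_tokens = remaining.pop(0)
        members = [center]
        still_remaining = []
        for s, tokens in remaining:
            d = tanimoto_distance(center_tokens, tokens)
            if d < t1:
                members.append(s)                    # loosely inside this canopy
            if d >= t2:
                still_remaining.append((s, tokens))  # may still seed or join another canopy
        remaining = still_remaining
        canopies.append((center, members))

    return canopies


# Made-up sample data and thresholds, for illustration only.
docs = [
    "apache mahout clustering",
    "mahout clustering apache hadoop",
    "stock market news",
    "market news stock prices",
]
canopies = canopy_clusters(docs, t1=0.8, t2=0.5)
k = len(canopies)  # candidate number of clusters for k-means
```

The canopy centers would then serve as the initial centroids for k-means, which is the role the canopy output directory plays in the command-line workflow.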
The Reuters clustering example gives the steps for running k-means on such documents, but how do I run canopy before k-means?

https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html

The steps I followed are:

mahout seqdirectory --input /user/hadoop/train-set --output /user/hadoop/train-set-seqfiles

mahout seq2sparse \
  -i /user/hadoop/train-set-seqfiles/ \
  -o /user/hadoop/train-set-vectors/ \
  -ow -chunk 100 \
  -x 90 \
  -seq \
  -ml 50 \
  -n 2 \
  -nv

mahout canopy \
  -i /user/hadoop/train-set-vectors/ \
  -o /user/hadoop/train-set-canopy-centroids \
  -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure \
  -t1 0.001 \
  -t2 0.002

The last command fails with:

Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://hadoop-master:54310/user/hadoop/train-set-vectors/df-count/data

It seems I am missing some of the steps needed to produce the input format that canopy expects. I appreciate your help!

Thanks,
Rajesh
