Hi,

I want to run canopy clustering before k-means over a set of strings, so that
the number of clusters is decided automatically.
How do I run canopy clustering over a set of strings?

The Reuters clustering example gives the steps to run k-means on such
documents, but how do I run canopy before k-means?

https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html

The steps I followed are:

mahout seqdirectory \
   --input /user/hadoop/train-set \
   --output /user/hadoop/train-set-seqfiles

mahout seq2sparse \
   -i /user/hadoop/train-set-seqfiles/ \
   -o /user/hadoop/train-set-vectors/ \
   -ow -chunk 100 \
   -x 90 \
   -seq \
   -ml 50 \
   -n 2 \
   -nv

mahout canopy \
    -i /user/hadoop/train-set-vectors/ \
    -o /user/hadoop/train-set-canopy-centroids \
    -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure \
    -t1 0.001 \
    -t2 0.002


Exception in thread "main" java.io.FileNotFoundException: File does not exist:
hdfs://hadoop-master:54310/user/hadoop/train-set-vectors/df-count/data

It seems I am missing a step that produces the input format canopy expects.
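
My current guess, which I have not verified, is that canopy should read only
the TF-IDF vectors rather than the whole seq2sparse output directory, which
also holds non-vector data such as df-count. The sketch below assumes that
seq2sparse wrote the vectors to a tfidf-vectors subdirectory and that canopy
writes its centroids to a clusters-0-final subdirectory (the name may differ
between Mahout versions). The train-set-kmeans output path, the -x 10
iteration limit, the swapped thresholds (as far as I understand, canopy
expects T1 > T2), and the run helper are all my own assumptions:

```shell
# Sketch only: the subdirectory names are assumptions, check them with
# `hadoop fs -ls /user/hadoop/train-set-vectors` before running.
VECTORS=/user/hadoop/train-set-vectors
CANOPY_INPUT="$VECTORS/tfidf-vectors"              # assumed subdirectory name
CANOPY_OUTPUT=/user/hadoop/train-set-canopy-centroids
CANOPY_CLUSTERS="$CANOPY_OUTPUT/clusters-0-final"  # assumed; varies by version
KMEANS_OUTPUT=/user/hadoop/train-set-kmeans        # hypothetical output path
DISTANCE=org.apache.mahout.common.distance.TanimotoDistanceMeasure

# Hypothetical helper: run the command if mahout is installed, else print it.
run() {
    if command -v mahout >/dev/null 2>&1; then
        "$@"
    else
        echo "$@"
    fi
}

# Canopy over the TF-IDF vectors only; T1 is set larger than T2.
run mahout canopy \
    -i "$CANOPY_INPUT" \
    -o "$CANOPY_OUTPUT" \
    -dm "$DISTANCE" \
    -t1 0.002 \
    -t2 0.001

# k-means seeded with the canopy centroids via -c, so no -k is given.
run mahout kmeans \
    -i "$CANOPY_INPUT" \
    -c "$CANOPY_CLUSTERS" \
    -o "$KMEANS_OUTPUT" \
    -dm "$DISTANCE" \
    -x 10 \
    -cl
```

If the directory listing shows different subdirectory names, only the
variables at the top should need changing.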

I appreciate your help!

Thanks,
Rajesh
