I am trying to cluster documents stored in a Lucene index using the Mahout command-line tools. How can I map the clustering output back to the original document IDs?

Here is the sequence of commands I am using:

./mahout lucene.vector --dir $index_path --output /tmp/mahout/vector --field content --dictOut /tmp/mahout/dict --idField _uid -md 2 -w TFIDF -x 70

./mahout canopy -i /tmp/mahout/vector -o /tmp/mahout_canopy -dm org.apache.mahout.common.distance.CosineDistanceMeasure --t1 10 --t2 5

./mahout kmeans -i /tmp/mahout/vector -c /tmp/mahout_canopy/clusters-0-final/part-r-00000 -o /tmp/mahout_kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -k 20 -x 20 -cd 0.1

./mahout clusterdump -dt text -d /tmp/mahout/dict -s /tmp/mahout_kmeans/clusters-1-final/ -b 20 -n 20


A similar question was asked in this thread [1], but I did not see a resolution. Thanks in advance for your help!

- Ben


[1] http://mail-archives.apache.org/mod_mbox/mahout-user/201204.mbox/%3cca+y9ocwgs2se7doqqrse3p+qe5gvxct8xutucfdzvgkjkpo...@mail.gmail.com%3E
