I am trying to cluster documents stored in a Lucene index using the
Mahout command-line tools. How can I obtain the original document IDs
from the clustering output?
Here is the sequence of commands I am using:
./mahout lucene.vector --dir $index_path --output /tmp/mahout/vector \
  --field content --dictOut /tmp/mahout/dict --idField _uid \
  -md 2 -w TFIDF -x 70

./mahout canopy -i /tmp/mahout/vector -o /tmp/mahout_canopy \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  --t1 10 --t2 5

./mahout kmeans -i /tmp/mahout/vector \
  -c /tmp/mahout_canopy/clusters-0-final/part-r-00000 \
  -o /tmp/mahout_kmeans \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -k 20 -x 20 -cd 0.1

./mahout clusterdump -dt text -d /tmp/mahout/dict \
  -s /tmp/mahout_kmeans/clusters-1-final/ -b 20 -n 20

A similar question was asked in this thread [1], but I did not see a
resolution. Thanks in advance for your help!
- Ben
[1] http://mail-archives.apache.org/mod_mbox/mahout-user/201204.mbox/%3cca+y9ocwgs2se7doqqrse3p+qe5gvxct8xutucfdzvgkjkpo...@mail.gmail.com%3E