Re: clusterdump lucene document ID

Grant Ingersoll Mon, 11 Jun 2012 08:07:30 -0700

It should be creating a NamedVector whose name is taken from the field passed 
in as the idField, in your case _uid.  That field must be stored.  If that 
field is null, then it uses the internal Lucene id.  Those named vectors should 
be preserved across all operations.  What does the output from your last step 
look like?
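
To make the naming rule above concrete, here is a minimal sketch of the 
fallback as described in this message. It is an illustration only, not Mahout's 
actual code: the function name vector_name and its parameters are hypothetical, 
and it assumes the internal Lucene id is simply rendered as a string.

```python
from typing import Optional


def vector_name(stored_id: Optional[str], lucene_doc_id: int) -> str:
    """Pick the name attached to a document's vector (illustrative only)."""
    # Use the stored idField value (e.g. _uid) when one is available...
    if stored_id is not None:
        return stored_id
    # ...otherwise fall back to the internal Lucene document number.
    # Assumption: the internal id is converted to its string form.
    return str(lucene_doc_id)


if __name__ == "__main__":
    print(vector_name("doc-42", 7))  # document with a stored id value
    print(vector_name(None, 7))      # document without one
```

If the names seen after clustering are plain integers rather than _uid values, 
that would suggest the fallback branch was taken, for example because the field 
is not stored.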



On May 11, 2012, at 12:30 AM, Benjamin Busjaeger wrote:

> I am trying to cluster documents stored in a lucene index using the command 
> line tools. How can I obtain the original document IDs from the clustering 
> output?
> 
> 
> Here is the sequence of commands I am using:
> 
> ./mahout lucene.vector --dir $index_path --output /tmp/mahout/vector \
>   --field content --dictOut /tmp/mahout/dict --idField _uid \
>   -md 2 -w TFIDF -x 70
> 
> ./mahout canopy -i /tmp/mahout/vector -o /tmp/mahout_canopy \
>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
>   --t1 10 --t2 5
> 
> ./mahout kmeans -i /tmp/mahout/vector \
>   -c /tmp/mahout_canopy/clusters-0-final/part-r-00000 \
>   -o /tmp/mahout_kmeans \
>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
>   -k 20 -x 20 -cd 0.1
> 
> ./mahout clusterdump -dt text -d /tmp/mahout/dict \
>   -s /tmp/mahout_kmeans/clusters-1-final/ -b 20 -n 20
> 
> 
> A similar question was asked on this thread [1], but I did not see a 
> resolution. Thanks in advance for your help!
> 
> - Ben
> 
> 
> [1] 
> http://mail-archives.apache.org/mod_mbox/mahout-user/201204.mbox/%3cca+y9ocwgs2se7doqqrse3p+qe5gvxct8xutucfdzvgkjkpo...@mail.gmail.com%3E
>  

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
