You'll want to add the -nv option to seq2sparse to get NamedVectors out and add the -cl argument to k-means to get the clustered documents. Then the clusterdump should give you what you are seeking.
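
In case it helps, here is a rough sketch of those three steps as small shell wrappers. Only -nv and -cl come from the advice above; the function names, the iteration count, and the choice of -s as the clusterdump input flag are my assumptions (some releases take -i instead), so check each tool's --help output for your version.

```shell
# Sketch only: function names and defaults here are illustrative assumptions.
# clusterdump input flag: -s (--seqFileDir) is assumed; some releases use -i.
CLUSTERDUMP_INPUT_FLAG="${CLUSTERDUMP_INPUT_FLAG:--s}"

# $1 = SequenceFile directory of documents, $2 = output directory
vectorize() {
  # -nv writes NamedVectors, so each vector keeps its document name
  mahout seq2sparse -i "$1" -o "$2" -nv
}

# $1 = vectors, $2 = initial centroids dir, $3 = output dir, $4 = k
cluster() {
  # -cl also assigns every document to a cluster after the iterations;
  # -x 10 (max iterations) is an arbitrary value chosen for illustration
  mahout kmeans -i "$1" -c "$2" -o "$3" -k "$4" -x 10 -cl
}

# $1 = final clusters dir, $2 = clustered points dir,
# $3 = dictionary file, $4 = output text file
dump_clusters() {
  # -p points clusterdump at the per-document assignments written by -cl
  mahout clusterdump "$CLUSTERDUMP_INPUT_FLAG" "$1" -p "$2" \
    -d "$3" -dt sequencefile -o "$4"
}
```

You would call these in order with your own paths, for example vectorize docs-seq docs-vectors, then cluster on the tfidf-vectors directory under docs-vectors, then dump_clusters with the last clusters directory and the clusteredPoints directory that -cl creates under the k-means output. All of those path names are placeholders. With NamedVectors in place, the points listed under each cluster in the dump should carry the document names.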

-----Original Message-----
From: Yosep Kim [mailto:[email protected]]
Sent: Thursday, August 11, 2011 3:43 PM
To: [email protected]
Subject: How to convert

Hello, Everyone!

This is Yosep Kim, and I just started playing with Mahout. I successfully installed it on my box and got an example data set clustered using the K-Means clustering algorithm. My input data was all text documents (i.e. new articles). When I run the clusterdump command, I get some cool information. However, I was not able to find a way to translate this back to the original documents. It looks like the algorithm created clusters based on all the words inside the documents. Did I understand this correctly? How can I create clusters based on documents so I can see that "document1.txt and document2.txt are in Cluster 1"?

I'd appreciate your help!! Thanks.

:CL-16397{n=1032 c=[0:0.125, 0.5:0.019, 0.8m:0.014, 00:0.096, 0000:0.008, 001:0.015, 00139:0.014, 001
Top Terms:
c => 2.458502088406289
software => 2.375095306671867
java => 2.2093305677868598
project => 1.989917316871096
application => 1.957329582567363
using => 1.916300386652466
web => 1.9046723985856817
development => 1.8707247066867443

By the way, Mahout is way cool, and I can't wait to be part of this "movement".

Yosep
