Hi, I created a sequence file from a directory of text documents using the 'seqdirectory' command in Mahout. From the sequence file, I created a Mahout vector file using the 'seq2sparse' command. I then clustered the data with k-means. I am running the programs on a Hadoop cluster with the HADOOP_HOME and HADOOP_CONF environment variables set. The command used is as follows.

./mahout kmeans -i /home/exthadoop1/mahout-vector/tfidf-vectors -o /home/exthadoop1/output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 5 -cl

To read and analyze the output, I use the clusterdump utility, invoked with the following options.

./mahout clusterdump --seqFileDir /home/exthadoop1/output/clusters-1 --pointsDir /home/exthadoop1/output/clusteredPoints --output cluster.txt

In my cluster.txt file, I get, for each cluster, the cluster name, the number of points in it, the coordinates of the centroid, the radius of the cluster, and the weights and set of documents in the cluster. The problem is that each document is represented only as a point in an n-dimensional space. Is there any way to make clusterdump output the unique document ID along with the coordinates of the document? That would make it easier for me to see which documents are in each cluster.

I am also attaching my cluster.txt output.

Regards,
Murugaprabu Marimuthu
