The 'data' folder in the output directory contains the synthetic control dataset converted to Mahout Vector format; it is the input to whichever clustering algorithm you select in the script. After clustering, the script runs ClusterDumper on the clusters-5 models (the final iteration in your listing) and pulls in the 'clusteredPoints' from the same output directory. ldatopics only works with LDA output; for canopy, kmeans, fuzzyk, dirichlet and mean shift, use ClusterDumper instead.
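If you want to run ClusterDumper by hand rather than through the example script, here is a rough sketch of how the pieces fit together. It only picks the highest-numbered clusters-N directory and assembles a command line; it does not run Mahout. The helper names are hypothetical, and the --seqFileDir and --pointsDir flags are an assumption based on the clusterdump options in releases from around this time (later releases may name the input flag differently), so check `bin/mahout clusterdump --help` on your build.

```python
import re
from pathlib import Path

# Matches directory names such as "clusters-0" or "clusters-5".
CLUSTERS_DIR = re.compile(r"clusters-(\d+)")


def final_clusters_dir(output_dir):
    """Return the clusters-N subdirectory with the largest N, or None."""
    best = None
    for child in Path(output_dir).iterdir():
        match = CLUSTERS_DIR.fullmatch(child.name)
        if match and child.is_dir():
            iteration = int(match.group(1))
            if best is None or iteration > best[0]:
                best = (iteration, child)
    return best[1] if best else None


def clusterdump_args(output_dir, mahout="bin/mahout"):
    """Build (but do not run) a clusterdump command for the final iteration.

    Assumption: the flag names below match your Mahout version.
    """
    clusters = final_clusters_dir(output_dir)
    if clusters is None:
        raise FileNotFoundError(f"no clusters-N directory under {output_dir}")
    points = Path(output_dir) / "clusteredPoints"
    return [
        mahout, "clusterdump",
        "--seqFileDir", str(clusters),   # final cluster models
        "--pointsDir", str(points),      # point-to-cluster assignments
    ]
```

Calling `clusterdump_args("examples/output")` on a local output directory like yours returns the argument list, which you can join with spaces and paste into a shell from $MAHOUT_HOME.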
-----Original Message-----
From: wine lover [mailto:[email protected]]
Sent: Friday, June 24, 2011 12:24 PM
To: [email protected]
Subject: How to read/analyze the clustered result

Hello Everyone,

I just installed Mahout and Hadoop, and began to run the listed examples. I followed the example of "clustering of synthetic control data" ( https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data#FootnoteMarker3). I chose to use the dirichlet clustering algorithm. It seems to me that every step works fine and the clustering results have been generated. The output files are listed as follows:

~/workspaceMahout/mahout/trunk/examples/output% ls
clusteredPoints clusters-0 clusters-1 clusters-2 clusters-3 clusters-4 clusters-5 data

Currently, I have several questions on how to analyze these data.

1) What does the "data" folder stand for in the output directory?

2) I tried to use ldatopics to obtain the result. For the "input vector directory", should I set it as -i ./examples/output/clusters-5

3) What does the input dictionary file mean? During my clustering process ($MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job), I was not asked to give any dictionary file.

Thank you very much for the help.

wenyia
