This is from way back in old brain cells that may be suspect. The dictionary is created in the text pipeline to map tokens to Mahout IDs. It allows clusterdump to tell you what the frequent terms are in clusters instead of the numbers Mahout uses internally as IDs.
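For reference, here is a minimal sketch of what that dictionary looks like on disk, assuming the layout the text pipeline (seq2sparse) writes: a Hadoop SequenceFile with Text term keys and IntWritable id values. The path and the second term below are made up for illustration; reading the file back and inverting it gives the id => token map clusterdump uses to print terms.

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DictionarySketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path dictPath = new Path("/tmp/dictionary.file-0"); // hypothetical path

    // Write a tiny dictionary: external token -> internal Mahout id.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, dictPath, Text.class, IntWritable.class);
    try {
      writer.append(new Text("ampule"), new IntWritable(23));
      writer.append(new Text("beaker"), new IntWritable(24)); // illustrative entry
    } finally {
      writer.close();
    }

    // Read it back and reverse it: internal id -> external token.
    Map<Integer, String> idToToken = new HashMap<Integer, String>();
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, dictPath, conf);
    try {
      Text token = new Text();
      IntWritable id = new IntWritable();
      while (reader.next(token, id)) {
        idToToken.put(id.get(), token.toString());
      }
    } finally {
      reader.close();
    }

    System.out.println(idToToken.get(23)); // prints "ampule"
  }
}

If you already have a token => id map in memory from your own vectorization, you can skip the file entirely and just invert that map.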
You must have some mapping yourself that you originally used to vectorize your data? Something like “ampule” => 23 or the like for the other data types? I wouldn’t try to make the dictionary work. Just reverse the mapping from the internal Mahout IDs to your external IDs: 23 => “ampule”. Don’t give clusterdump a dictionary; it's optional. I use it on data with an external dictionary all the time.

On Mar 23, 2014, at 4:25 PM, Bob Morris <[email protected]> wrote:

I'm a Mahout novice trying to do some semantic data clustering with Canopy clustering on some low-dimensional SequenceFiles that I vectorized with ad-hoc Java code. (Some features are strings vectorized by the Levenshtein distance from a constant, some are DateTime objects vectorized as milliseconds from the Unix epoch, some are georeferences, etc.) The results look promising, but I want to get more detail out of the clusters than I understand how to get from ClusterDumper alone. In particular, it seems that CSVClusterWriter should get me what I need (for each cluster, the center and the list of vectors ordered by distance).

When I vectorized, I never explicitly built a Dictionary, which is, I suppose, why I get a runtime ClassCastException when I invoke ClusterDumper.readPoints(...), despite telling the ClusterDumper run method that the dictionary type is "sequencefile" while having no sequencefile to offer.

So I have these questions:

1. Am I right that the Exception in the dumper is caused by not having a Dictionary file?

2. Where can I find documentation for the correct form of a sequencefile Dictionary, and are there any convenience methods for building it? (I start with a CSV file for the data, together with a Map that associates column header names with a private type name that specifies the algorithm to be applied to the vectorization.) I can send the vectorization code if helpful.

Thanks in advance;
--Bob

Here's the dumper code, with the point of the ClassCastException indicated:

public void test() throws Exception {
  String datasetDir = "Lichen/";        // bbg, Rubiaeceae/ or fungi/ for now
  String inputFile = "/tmp/vectors";    // inputDir + "vectors"; // input,
  String canopyOutput = "/tmp/clusters";
  String dumperInput = canopyOutput + "/clusters-0-final";
  String dumperOutput = "/tmp/clusters.txt";
  String clusterInput = dumperInput + "/" + "part-r-00000";
  String clusterOutput = "/tmp/clusterDetail.txt";
  boolean runSequential = true;
  try {
    String[] args = {"-i", inputFile, "-o", canopyOutput,
                     "-t1", ".00000002", "-t2", ".00000001", "-ow"};
    CanopyDriver driver = new CanopyDriver();
    driver.run(args);

    // must need Path to the sequence file here also?
    String[] dumpArgs = {"-i", dumperInput, "-o", dumperOutput, "-dt", "sequencefile"};
    ClusterDumper dumper = new ClusterDumper();
    dumper.run(dumpArgs);

    PrintWriter writer = new PrintWriter(new File(clusterOutput));
    Path pointsPathDir = new Path(dumperInput);
    Configuration conf = new Configuration();

    ////// Line below throws runtime ////
    ////// java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable ////
    // Presumably need a Dictionary to pass via -d to ClusterDumper
    Map<Integer, List<WeightedPropertyVectorWritable>> clusterIdToPoints =
        ClusterDumper.readPoints(pointsPathDir, 10000, conf);

    // TODO: iterate over Map and output with
    CSVClusterWriter csvClusterWriter =
        new CSVClusterWriter(writer, clusterIdToPoints, measure);
  } catch (Exception e) {
    System.out.println("test caught Exception");
    e.printStackTrace(System.out);
  }
}

--
Robert A. Morris
Emeritus Professor of Computer Science
UMASS-Boston
100 Morrissey Blvd
Boston, MA 02125-3390

Filtered Push Project
Harvard University Herbaria
Harvard University

email: [email protected]
web: http://efg.cs.umb.edu/
web: http://wiki.filteredpush.org
http://www.cs.umb.edu/~ram

===
The content of this communication is made entirely on my own behalf and in no way should be deemed to express official positions of The University of Massachusetts at Boston or Harvard University.
