I'm quite new to Mahout and also about 5 years rusty in Java. The code below holds a mystery for me. It's an adaptation of a KMeans run from an example whose origin I forget, followed by an input step adapted from an early clustering example in the MiA book.
The mystery is that, unlike the MiA example, there is no file name, just a directory path, offered to the KMeansDriver. Yet the SequenceFile.Reader needs to know the full file path, the last part of which I only discovered by examining the output directory. So I suppose that KMeansDriver must have some default output file name and construct the path from that if the output Path is offered only as a directory. But I couldn't find any documentation about what I imagine is going on.

Basically, I don't care what the cluster output file is named, but if possible, my applications will need to know where to find the clusters knowing at most the directory containing them. (Most likely with the CanopyDriver, but maybe this is really all about AbstractJob?)

The code is executed under Eclipse Kepler with JUnit 4 and a Maven build specifying Mahout 0.8, running on Ubuntu 13.04 Linux.

Thanks for any advice.
Bob Morris

    public void test() throws IOException, ClassNotFoundException, InterruptedException {
        String datasetDir = "Rubiaeceae/"; // / or fungi/ for now
        String vectorFile = "occurrenceVectors/" + datasetDir + "vectors"; // input, populated e.g. by SequenceFileTest
        String clusterOutputDir = "clusterOutput/";

        Path samples = new Path(vectorFile);
        Configuration conf = new Configuration();
        Path outputDir = new Path(clusterOutputDir + datasetDir);
        HadoopUtil.delete(conf, outputDir);

        EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
        Path clustersIn = new Path(outputDir, "random-seeds");
        RandomSeedGenerator.buildRandom(conf, samples, clustersIn, 300, measure);
        KMeansDriver.run(samples, clustersIn, outputDir, measure, 0.01, 10, true, 0.0, true);

        // System.out.println(KMeansDriver.getOutputFile());
        // Java claims getOutputFile() not visible, even if set up with a naughty non-static call

        // Following MiA p. 124
        FileSystem fs = FileSystem.get(conf);
        String clusterOutputFile = clusterOutputDir + datasetDir + Cluster.CLUSTERED_POINTS_DIR + "/part-m-0";
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(clusterOutputFile), conf);
        IntWritable key = new IntWritable();
        WeightedVectorWritable value = new WeightedVectorWritable();
        while (reader.next(key, value)) {
            System.out.println(value.toString() + " in cluster " + key.toString());
        }
        reader.close();
    }
    }

--
Robert A. Morris

Emeritus Professor of Computer Science
UMASS-Boston
100 Morrissey Blvd
Boston, MA 02125-3390

Filtered Push Project
Harvard University Herbaria
Harvard University

email: morris....@gmail.com
web: http://efg.cs.umb.edu/
web: http://wiki.filteredpush.org
http://www.cs.umb.edu/~ram
===
The content of this communication is made entirely on my own behalf and in no way should be deemed to express official positions of The University of Massachusetts at Boston or Harvard University.
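
P.S. To make concrete what I mean by "knowing at most the directory": the sketch below is the kind of lookup I would like to do instead of hard-coding "part-m-0". It uses plain java.nio on a local directory, not Hadoop's FileSystem, so the Path here is java.nio.file.Path and not org.apache.hadoop.fs.Path. PartFileFinder and findPartFiles are names I made up for illustration, and the "part-" prefix is only what I observed in my own output directory, not something I found documented. I assume something similar could be done against Hadoop with FileSystem.globStatus on a "part-*" pattern, but I have not checked that against Mahout 0.8.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Mahout or Hadoop.
class PartFileFinder {

    // Returns the regular files in dir whose names start with "part-",
    // sorted by path. The "part-" prefix is an assumption based on the
    // file name observed in the output directory.
    static List<Path> findPartFiles(Path dir) {
        List<Path> parts = new ArrayList<>();
        // The glob is matched against file names only, so hidden
        // checksum files such as ".part-m-0.crc" are not picked up.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "part-*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    parts.add(p);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) {
        // Directory to scan; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : findPartFiles(dir)) {
            System.out.println(p);
        }
    }
}
```

With something like this, the reading loop would iterate over whatever part files are found rather than depend on one fixed name, which would also cover the case where a non-sequential run writes more than one file.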