Mystery about filename of clustering output.

Bob Morris Tue, 21 Jan 2014 17:26:59 -0800

I'm quite new to Mahout and also about 5 years rusty in Java.  The
code below holds a mystery for me. It's an adaptation of a KMeans run
from an example whose origin I forget, followed by an input adapted
from an early clustering example from the MiA book.


The mystery is that, unlike the MiA example, there is no file name,
just a directory path, offered to the KMeansDriver. Yet the
SequenceFile.Reader needs to know the full file path, the last part of
which I only discovered by examining the output directory. So I
suppose that KMeansDriver must have some default output file name and
construct the path from that if the output Path is offered only as a
directory. But I couldn't find any documentation about what (I imagine)
is going on.

Basically, I don't care what the cluster output file is named, but if
possible, my applications will need to know where to find the clusters
knowing at most the directory containing them. (Most likely with the
CanopyDriver, but maybe this is really all about AbstractJob ??? )
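
For concreteness, the kind of thing I have in mind is sketched below: given
only the directory, list whatever part-* files it contains instead of
hard-coding the name. This is plain java.nio, not a Mahout or Hadoop API;
the class and method names are made up, and the "part-*" pattern is an
assumption based on what I saw in the output directory. It also assumes the
output sits on the local filesystem. On HDFS I would presumably use Hadoop's
FileSystem.globStatus with a similar pattern. Note that java.nio.file.Path
is not the Hadoop Path class used in the test further down.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Mahout or Hadoop.
class PartFileFinder {

    /** Returns the regular files named part-* directly inside dir, sorted by name. */
    static List<Path> findPartFiles(Path dir) throws IOException {
        List<Path> parts = new ArrayList<>();
        // The glob is matched against each entry's file name only.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "part-*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    parts.add(p);
                }
            }
        }
        Collections.sort(parts);
        return parts;
    }
}
```

Each path returned could then be handed to a reader in turn, so the
application would only need to know the directory.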

The code is executed under Eclipse Kepler with JUnit 4 and a Maven build
specifying Mahout 0.8, running on Ubuntu 13.04 Linux.

Thanks for any advice.
Bob Morris


public void test() throws IOException, ClassNotFoundException, InterruptedException {

        String datasetDir = "Rubiaeceae/"; // or fungi/ for now
        // input, populated e.g. by SequenceFileTest
        String vectorFile = "occurrenceVectors/" + datasetDir + "vectors";
        String clusterOutputDir = "clusterOutput/";
        Path samples = new Path(vectorFile);
        Configuration conf = new Configuration();
        Path outputDir = new Path(clusterOutputDir + datasetDir);
        HadoopUtil.delete(conf, outputDir);
        EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
        Path clustersIn = new Path(outputDir, "random-seeds");

        // Seed 300 random clusters, then run k-means.
        // Only directories are passed here, never an output file name.
        RandomSeedGenerator.buildRandom(conf, samples, clustersIn, 300, measure);
        KMeansDriver.run(samples, clustersIn, outputDir, measure, 0.01, 10, true,
                0.0, true);
        // System.out.println(KMeansDriver.getOutputFile());
        // Java claims getOutputFile() is not visible, even if set up with a
        // naughty non-static call

        // Following MiA p. 124
        FileSystem fs = FileSystem.get(conf);
        // "part-m-0" is the name I found by examining the output directory
        String clusterOutputFile =
                clusterOutputDir + datasetDir + Cluster.CLUSTERED_POINTS_DIR + "/part-m-0";
        SequenceFile.Reader reader = new SequenceFile.Reader(fs,
                new Path(clusterOutputFile), conf);
        IntWritable key = new IntWritable();
        WeightedVectorWritable value = new WeightedVectorWritable();
        while (reader.next(key, value)) {
            System.out.println(value.toString() + " in cluster " + key.toString());
        }

        reader.close();

    }

}

-- 
Robert A. Morris

Emeritus Professor  of Computer Science
UMASS-Boston
100 Morrissey Blvd
Boston, MA 02125-3390


Filtered Push Project
Harvard University Herbaria
Harvard University

email: morris....@gmail.com
web: http://efg.cs.umb.edu/
web: http://wiki.filteredpush.org
http://www.cs.umb.edu/~ram
===
The content of this communication is made entirely on my
own behalf and in no way should be deemed to express
official positions of The University of Massachusetts at Boston or
Harvard University.
