Hello,

I have been trying to use Mahout k-means to cluster some synthetic data (the A3 set <https://cs.joensuu.fi/sipu/datasets/>) as a test reference, but it seems I'm missing something. I'm quite a beginner, so please bear with me. Before the explanations, what I want to know is:

A) Is there a better way to cluster a series of numeric points while keeping their ID values?
B) What am I doing wrong with this approach?

-Thanks for your patience



This is a sample of the data:

point1,53920,42968
point2,52019,42206
point3,52570,42476
point4,54220,42081
point5,54268,43420
point6,52288,42408
point7,54436,39727
point8,52391,44323
point9,54995,43655
point10,53761,43403

As you can see, it is in the format id_point,value1,value2 (the comma doesn't matter; I could use tab separators instead).
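
To make the format concrete, here is a minimal sketch of splitting one such line into its ID and its numeric values. It is plain Java with no Mahout calls; the class name PointLine and its parse method are hypothetical names chosen for illustration, and the separator is assumed to be a literal string such as "," or a tab.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Mahout): one "id,value1,value2" input line.
final class PointLine {
    final String id;
    final double[] coordinates;

    private PointLine(String id, double[] coordinates) {
        this.id = id;
        this.coordinates = coordinates;
    }

    // Splits the line on a literal separator; the first token is the ID,
    // every remaining token is parsed as a double.
    static PointLine parse(String line, String separator) {
        String[] tokens = line.split(Pattern.quote(separator));
        if (tokens.length < 2) {
            throw new IllegalArgumentException("Expected an id and at least one value: " + line);
        }
        double[] coordinates = new double[tokens.length - 1];
        for (int i = 1; i < tokens.length; i++) {
            coordinates[i - 1] = Double.parseDouble(tokens[i].trim());
        }
        return new PointLine(tokens[0].trim(), coordinates);
    }

    public static void main(String[] args) {
        PointLine p = PointLine.parse("point1,53920,42968", ",");
        // prints: point1 -> [53920.0, 42968.0]
        System.out.println(p.id + " -> " + Arrays.toString(p.coordinates));
    }
}
```

Quoting the separator keeps characters such as "|" or "." from being interpreted as regular-expression syntax by String.split.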

None of the test data sets with numeric data that I have seen (i.e. not the Reuters example) has an ID column to identify each point after the clusterdump; they contain only the values. Is there a way to map the points to an ID from the command line?
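
For illustration only, this is the kind of mapping meant here, done outside Mahout: a lookup from coordinates back to the ID. It is a sketch under two assumptions that may not hold for real data: every coordinate pair is unique, and the values survive clustering unchanged so that exact comparison of doubles works. The class PointIdLookup is a hypothetical name, not a Mahout API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (plain Java, no Mahout): remembers which ID owns which coordinates.
final class PointIdLookup {
    // List<Double> compares by value, so it can serve as a map key.
    private final Map<List<Double>, String> idsByCoordinates = new HashMap<>();

    // Assumption: coordinates are unique; a duplicate overwrites the earlier ID.
    void put(String id, double... coordinates) {
        idsByCoordinates.put(key(coordinates), id);
    }

    // Returns the ID stored for exactly these coordinates, or null if there is none.
    String idFor(double... coordinates) {
        return idsByCoordinates.get(key(coordinates));
    }

    private static List<Double> key(double[] coordinates) {
        List<Double> key = new ArrayList<>(coordinates.length);
        for (double c : coordinates) {
            key.add(c);
        }
        return key;
    }

    public static void main(String[] args) {
        PointIdLookup lookup = new PointIdLookup();
        lookup.put("point1", 53920, 42968);
        lookup.put("point2", 52019, 42206);
        // prints: point2
        System.out.println(lookup.idFor(52019, 42206));
    }
}
```

Such a join after the fact is fragile compared with having the ID travel with the vector, which is why a command-line option would be preferable.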

On Stack Overflow <https://stackoverflow.com/questions/8785392/how-to-perform-k-means-clustering-in-mahout-with-vector-data-stored-as-csv> I saw a code sample that writes its own sequence file using NamedVectors to capture the ID. I wrote a similar version (code at the end), and it creates a SequenceFile in HDFS.

sample:

hadoop fs -text  /tmp/mahout-another/input/testSequence/ | head
point1  point1:{0:53920.0,1:42968.0}
point2  point2:{0:52019.0,1:42206.0}
point3  point3:{0:52570.0,1:42476.0}
point4  point4:{0:54220.0,1:42081.0}
point5  point5:{0:54268.0,1:43420.0}
point6  point6:{0:52288.0,1:42408.0}
point7  point7:{0:54436.0,1:39727.0}
point8  point8:{0:52391.0,1:44323.0}
point9  point9:{0:54995.0,1:43655.0}
point10 point10:{0:53761.0,1:43403.0}


Then I cluster with:

mahout kmeans -k 50 -i /tmp/mahout-another/input/testSequence/ -o /tmp/mahout-another/output -c /tmp/mahout-initial-clusters --maxIter 10

After completion, I run clusterdump with:

mahout clusterdump -i /tmp/mahout-another/output/clusters-10-final -o output.txt -p /tmp/mahout-another/input/testSequence

It throws:

Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable

I'm guessing that the problem lies in the SequenceFile writer (code below), since it requires two class parameters and, following the Stack Overflow example, I write the name of the vector as the key (Text.class) and the vector itself as the value. clusterdump then tries to read the points, finds a Text key where it expects an IntWritable, and throws the exception. If that's the case, how should I write the file instead?




Code for the sequence writer with NamedVectors:


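import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

main:

BufferedReader br = new BufferedReader(new FileReader(args[0]));
String line;
List<NamedVector> vectors = new ArrayList<NamedVector>();
while ((line = br.readLine()) != null) {
    String[] tokenized = line.split(args[2]);
    String id = tokenized[INDEX_ID]; // INDEX_ID: class constant for the ID column (not shown)
    double[] coordinates = new double[tokenized.length - 1];
    for (int i = 1; i < tokenized.length; i++) {
        coordinates[i - 1] = Double.parseDouble(tokenized[i]);
    }
    vectors.add(createNamedVector(coordinates, id));
}
writeSequenceToPath(args[1], vectors);
br.close();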


public static NamedVector createNamedVector(double[] points, String id){
        return new NamedVector(new DenseVector(points),id);
    }

public static void writeSequenceToPath(String directory, List<NamedVector> listOfVectors) throws IOException{
        Configuration config = new Configuration();
        FileSystem fs = FileSystem.get(config);

        Path path = new Path(directory);
        // Write a SequenceFile from the vectors.
        // This is the line I suspect: the key class is Text, the value class is VectorWritable.
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, config, path, Text.class, VectorWritable.class);
        VectorWritable vec = new VectorWritable();
        for(NamedVector v:listOfVectors){
            vec.set(v);
            writer.append(new Text(v.getName()), vec);
        }
        writer.close();

    }




--
*Ramiro Manso*
/Data Analyst & BI consultant/

Telf.: +34 917 680 490
Fax: +34 913 833 301
C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain

_http://www.bidoop.es_
