KMeansDriver with non-numeric values

Aleksander Sadecki Tue, 17 Jun 2014 05:05:21 -0700

Hi,

I am using KMeansDriver to create a few clusters for my input data. For numeric 
data it works perfectly, but I have no idea how to make it work with more 
complex, non-numeric values. Let's assume that we have the following example:
  
  | car     | color      | year of production | price
--|-----------------------------------------------------
1 | Audi A1 | dark red   | 2009               | 65.000 £
2 | Audi A2 | green      | 2011               | 75.000 £
3 | Audi A3 | red        | 2012               | 80.000 £
4 | BMW X3  | black      | 2009               | 72.000 £
5 | BMW X4  | black      | 2013               | 82.000 £
6 | BMW X5  | white      | 2014               | 89.000 £


I have no idea how to feed this data into KMeansDriver.

I have a few requirements for my clusters:

i) Audi A1, A2, and A3 are really similar, so these 3 could form one cluster. 
The same goes for BMW X3, X4, and X5.

ii) For colors, there are 4 groups: white, black, green, and red (including 
dark red)

iii) Year of production: e.g. 2 groups, before and after 2010

iv) Price: e.g. 3 clusters: 60-70, 70-80, and 80+ thousand £
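
A minimal sketch of one way the four requirements above might be turned into 
numbers, using plain Java with no Mahout dependency. The class 
CarFeatureEncoder, its methods, the category lists, the "2010 or later" cutoff, 
and the price scale are all hypothetical choices made for illustration. Brand 
and color group are one-hot encoded, the year becomes a 0/1 flag, and the price 
is scaled so that it does not dominate the distance:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CarFeatureEncoder {
    // Hypothetical category lists, taken from the example table.
    static final List<String> BRANDS = Arrays.asList("audi", "bmw");
    static final List<String> COLOR_GROUPS = Arrays.asList("white", "black", "green", "red");

    // Maps a model name such as "Audi A1" to its brand index (first word), or -1 if unknown.
    static int brandIndex(String car) {
        String brand = car.trim().split("\\s+")[0].toLowerCase(Locale.ROOT);
        return BRANDS.indexOf(brand);
    }

    // Maps a color such as "dark red" to its group by its last word, or -1 if unknown.
    static int colorGroupIndex(String color) {
        String[] words = color.trim().toLowerCase(Locale.ROOT).split("\\s+");
        return COLOR_GROUPS.indexOf(words[words.length - 1]);
    }

    // Parses a price such as "65.000 GBP", assuming '.' is a thousands separator.
    static double parsePrice(String price) {
        return Double.parseDouble(price.replaceAll("[^0-9]", ""));
    }

    // Layout: [brand one-hot | color group one-hot | year flag | scaled price]
    static double[] encode(String car, String color, int year, String price) {
        double[] f = new double[BRANDS.size() + COLOR_GROUPS.size() + 2];
        int b = brandIndex(car);
        if (b >= 0) {
            f[b] = 1.0;
        }
        int c = colorGroupIndex(color);
        if (c >= 0) {
            f[BRANDS.size() + c] = 1.0;
        }
        int next = BRANDS.size() + COLOR_GROUPS.size();
        f[next] = year >= 2010 ? 1.0 : 0.0;           // assumed cutoff: 2010 or later
        f[next + 1] = parsePrice(price) / 100000.0;   // assumed scale: 65000 becomes 0.65
        return f;
    }

    public static void main(String[] args) {
        double[] f = encode("Audi A1", "dark red", 2009, "65.000 \u00a3");
        System.out.println(Arrays.toString(f));
    }
}
```

Each resulting double[] could then be passed to vec.assign(...) in place of the 
values parsed from the CSV file. Note that k-means run on the combined vector 
yields a single clustering over all features at once, so the relative scaling 
of the features affects the result.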

Thank you in advance
Alex

Here is the relevant piece of my code:

    List<Vector> vectors = getPointsFromFile(filename);

    // Create the local input directory (and its parent) if it is missing.
    File testData = new File("testdata/points");
    if (!testData.exists()) {
        testData.mkdirs();
    }

    // Write the input points to a file.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    ClusterHelper.writePointsToFile(vectors, conf, new Path("testdata/points/file1"));

    // Seed the initial clusters with the first k points.
    Path path = new Path("testdata/clusters/part-00000");
    SequenceFile.Writer writer =
            new SequenceFile.Writer(fs, conf, path, Text.class, Kluster.class);
    for (int i = 0; i < k; i++) {
        Vector vec = vectors.get(i);
        Kluster cluster = new Kluster(vec, i, new EuclideanDistanceMeasure());
        writer.append(new Text(cluster.getIdentifier()), cluster);
    }
    writer.close();

    // Clear any previous output, then run k-means.
    Path output = new Path("output");
    HadoopUtil.delete(conf, output);

    KMeansDriver.run(conf, new Path("testdata/points"), new Path("testdata/clusters"),
            output, 0.001, 10, true, 0.0, false);


where the getPointsFromFile(filename) method looks like this:


    public static List<Vector> getPointsFromFile(String file) {
        List<Vector> points = new ArrayList<Vector>();
        // try-with-resources closes the reader even if reading or parsing fails.
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String sCurrentLine;
            while ((sCurrentLine = br.readLine()) != null) {
                // Each line is expected to be a comma-separated list of numbers.
                String[] splitted = sCurrentLine.split(",");
                double[] fr = new double[splitted.length];
                for (int i = 0; i < splitted.length; i++) {
                    // Throws NumberFormatException for a field such as "dark red".
                    fr[i] = Double.parseDouble(splitted[i]);
                }
                Vector vec = new RandomAccessSparseVector(fr.length);
                vec.assign(fr);
                points.add(vec);
            }
        } catch (IOException e) {
            // Also covers FileNotFoundException, which is a subclass of IOException.
            e.printStackTrace();
        }

        return points;
    }
